Rule Type Description

Last updated: 2022-05-19 10:50:52

TMP provides preset alert rule templates for TKE clusters, covering the master components, the kubelet, resource usage, workloads, and nodes.

Kubernetes Master Components

The following alert rules are provided for self-deployed (non-managed) clusters:

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Error with client access to APIServer | `(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id) / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)) > 0.01` | 15m | The error rate of client access to the APIServer is above 1% |
| Imminent expiration of the client certificate for APIServer access | `apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (cluster_id, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400` | None | The client certificate for APIServer access will expire within 24 hours |
| Aggregated API error | `sum by (cluster_id, name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2` | None | The aggregated API reported errors in the last 5 minutes |
| Low aggregated API availability | `(1 - max by (name, namespace, cluster_id) (avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90` | 5m | The availability of the aggregated API service in the last 5 minutes was below 90% |
| APIServer fault | `absent(sum(up{job="apiserver"}) by (cluster_id) > 0)` | 5m | The APIServer disappeared from the collection targets |
| Scheduler fault | `absent(sum(up{job="kube-scheduler"}) by (cluster_id) > 0)` | 15m | The scheduler disappeared from the collection targets |
| Controller manager fault | `absent(sum(up{job="kube-controller-manager"}) by (cluster_id) > 0)` | 15m | The controller manager disappeared from the collection targets |
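
These presets behave like ordinary Prometheus alerting rules: the "Rule Expression" column is the PromQL condition, and the "Duration" column is the `for:` clause, i.e., how long the condition must hold before the alert fires. As a minimal sketch, assuming a standard Prometheus rule file (the group name, alert name, and severity label below are illustrative, not the exact identifiers TMP uses), the first rule above corresponds to:

```yaml
groups:
  - name: apiserver.rules                      # illustrative group name
    rules:
      - alert: APIServerClientErrorRateHigh    # illustrative alert name
        expr: |
          (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id)
            / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)) > 0.01
        for: 15m                               # the "Duration" column
        labels:
          severity: warning                    # illustrative severity
        annotations:
          description: The error rate of client access to the APIServer is above 1%.
```

Rules whose duration is "None" simply omit the `for:` clause and fire as soon as the expression returns a result.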

Kubelet

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Exceptional node status | `kube_node_status_condition{job=~".*kube-state-metrics",condition="Ready",status="true"} == 0` | 15m | The node has not been in the `Ready` status for over 15 minutes |
| Unreachable node | `kube_node_spec_taint{job=~".*kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1` | 15m | The node is unreachable, and its workloads will be rescheduled |
| Too many Pods running on node | `count by (cluster_id, node) ((kube_pod_status_phase{job=~".*kube-state-metrics",phase="Running"} == 1) * on (instance, pod, namespace, cluster_id) group_left(node) topk by (instance, pod, namespace, cluster_id) (1, kube_pod_info{job=~".*kube-state-metrics"})) / max by (cluster_id, node) (kube_node_status_capacity_pods{job=~".*kube-state-metrics"} != 1) > 0.95` | 15m | The number of Pods running on the node is close to the upper limit |
| Node status fluctuation | `sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_id, node) > 2` | 15m | The node status fluctuates between normal and exceptional |
| Imminent expiration of the kubelet client certificate | `kubelet_certificate_manager_client_ttl_seconds < 86400` | None | The kubelet client certificate will expire within 24 hours |
| Imminent expiration of the kubelet server certificate | `kubelet_certificate_manager_server_ttl_seconds < 86400` | None | The kubelet server certificate will expire within 24 hours |
| Kubelet client certificate renewal error | `increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0` | 15m | An error occurred while renewing the kubelet client certificate |
| Kubelet server certificate renewal error | `increase(kubelet_server_expiration_renew_errors[5m]) > 0` | 15m | An error occurred while renewing the kubelet server certificate |
| Time-consuming PLEG | `histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (cluster_id, instance, le) * on (instance, cluster_id) group_left(node) kubelet_node_name{job="kubelet"}) >= 10` | 5m | The 99th percentile of PLEG operation duration exceeds 10 seconds |
| Time-consuming Pod start | `histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m])) by (cluster_id, instance, le)) * on (cluster_id, instance) group_left(node) kubelet_node_name{job="kubelet"} > 60` | 15m | The 99th percentile of Pod start duration exceeds 60 seconds |
| Kubelet fault | `absent(sum(up{job="kubelet"}) by (cluster_id) > 0)` | 15m | The kubelet disappeared from the collection targets |
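
The "fault" rules here and in the previous table share an absence-detection pattern that is worth spelling out: a plain `up{job="kubelet"} == 0` check cannot fire if the target disappears from the scrape configuration entirely, because the `up` series then stops existing. A minimal sketch of the pattern as a standard rule file (the group and alert names are illustrative):

```yaml
groups:
  - name: kubelet.rules        # illustrative group name
    rules:
      # absent(v) returns no data while v has at least one series, and a
      # single 1-valued series once v is completely empty. The inner
      # expression keeps only clusters with at least one healthy target,
      # so the alert fires when no healthy kubelet target remains among
      # the collection targets.
      - alert: KubeletAbsent   # illustrative alert name
        expr: absent(sum(up{job="kubelet"}) by (cluster_id) > 0)
        for: 15m
```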

Kubernetes Resource Use

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Cluster CPU resource overload | `sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"}) * on (cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending\|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > (count by (cluster_id) (kube_node_status_allocatable_cpu_cores) - 1) / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)` | 5m | Pods in the cluster have requested too many CPU cores, and the cluster can no longer tolerate a node failure (see the note after this table) |
| Cluster memory resource overload | `sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job=~".*kube-state-metrics"}) * on (cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending\|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > (count by (cluster_id) (kube_node_status_allocatable_memory_bytes) - 1) / count by (cluster_id) (kube_node_status_allocatable_memory_bytes)` | 5m | Pods in the cluster have requested too much memory, and the cluster can no longer tolerate a node failure |
| Cluster CPU quota overload | `sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="cpu"}) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > 1.5` | 5m | The CPU quota in the cluster exceeds 1.5 times the total allocatable CPU cores |
| Cluster memory quota overload | `sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="memory"}) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > 1.5` | 5m | The memory quota in the cluster exceeds 1.5 times the total allocatable memory |
| Imminent runout of quota resources | `sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="used"}) / sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="hard"} > 0) >= 0.9` | 15m | The quota resource utilization exceeds 90% |
| High proportion of throttled CPU periods | `sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster_id, container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (cluster_id, container, pod, namespace) > (25 / 100)` | 15m | More than 25% of the container's CPU scheduling periods were throttled |
| High Pod CPU utilization | `sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75` | 15m | The Pod CPU utilization exceeds 75% of the limit |
| High Pod memory utilization | `sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_memory_bytes) by (cluster_id, namespace, pod, container) > 0.75` | 15m | The Pod memory utilization exceeds 75% of the limit |
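
The threshold `(count by (cluster_id) (...) - 1) / count by (cluster_id) (...)` in the first two rules encodes the ability to survive a single node failure: with N nodes, the alert fires once the Pods' total requests exceed the fraction (N - 1)/N of the cluster's allocatable capacity. For example, in a cluster of 4 equally sized nodes the threshold is 3/4, so the alert fires when requested CPU exceeds 75% of the cluster total, which is exactly the point at which the remaining 3 nodes could no longer accommodate all requests after one node fails.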

Kubernetes Workload

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Frequent Pod restarts | `increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0` | 15m | The Pod was restarted repeatedly in the last 5 minutes |
| Exceptional Pod status | `sum by (namespace, pod, cluster_id) (max by (namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kube-state-metrics", phase=~"Pending\|Unknown"}) * on (namespace, pod, cluster_id) group_left(owner_kind) topk by (namespace, pod) (1, max by (namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0` | 15m | The Pod has been in the `NotReady` status for over 15 minutes |
| Exceptional container status | `sum by (namespace, pod, container, cluster_id) (kube_pod_container_status_waiting_reason{job=~".*kube-state-metrics"}) > 0` | 1h | The container has been in the `Waiting` status for a long time |
| Deployment version mismatch | `kube_deployment_status_observed_generation{job=~".*kube-state-metrics"} != kube_deployment_metadata_generation{job=~".*kube-state-metrics"}` | 15m | The observed Deployment version differs from the configured version, indicating that the Deployment change hasn't taken effect (see the note after this table) |
| Deployment replica quantity mismatch | `(kube_deployment_spec_replicas{job=~".*kube-state-metrics"} != kube_deployment_status_replicas_available{job=~".*kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The actual number of replicas differs from the configured number |
| StatefulSet version mismatch | `kube_statefulset_status_observed_generation{job=~".*kube-state-metrics"} != kube_statefulset_metadata_generation{job=~".*kube-state-metrics"}` | 15m | The observed StatefulSet version differs from the configured version, indicating that the StatefulSet change hasn't taken effect |
| StatefulSet replica quantity mismatch | `(kube_statefulset_status_replicas_ready{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas{job=~".*kube-state-metrics"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The actual number of replicas differs from the configured number |
| Ineffective StatefulSet update | `(max without(revision) (kube_statefulset_status_current_revision{job=~".*kube-state-metrics"} unless kube_statefulset_status_update_revision{job=~".*kube-state-metrics"}) * (kube_statefulset_replicas{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The StatefulSet update hasn't been applied to some Pods |
| Frozen DaemonSet change | `((kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} != 0) or (kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_available{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The DaemonSet change has been stuck for more than 15 minutes |
| DaemonSet not scheduled on some nodes | `kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} > 0` | 10m | The DaemonSet is not scheduled on some nodes |
| Faulty scheduling of DaemonSet on some nodes | `kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} > 0` | 15m | The DaemonSet is incorrectly scheduled to some nodes |
| Excessive Job execution duration | `kube_job_spec_completions{job=~".*kube-state-metrics"} - kube_job_status_succeeded{job=~".*kube-state-metrics"} > 0` | 12h | The Job has been running for more than 12 hours without completing |
| Job execution failure | `kube_job_failed{job=~".*kube-state-metrics"} > 0` | 15m | The Job execution failed |
| Mismatch between replica quantity and HPA | `(kube_hpa_status_desired_replicas{job=~".*kube-state-metrics"} != kube_hpa_status_current_replicas{job=~".*kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0` | 15m | The actual number of replicas differs from the number set by the HPA |
| Number of replicas reaching maximum value in HPA | `kube_hpa_status_current_replicas{job=~".*kube-state-metrics"} == kube_hpa_spec_max_replicas{job=~".*kube-state-metrics"}` | 15m | The number of replicas has reached the maximum configured in the HPA |
| Exceptional PersistentVolume status | `kube_persistentvolume_status_phase{phase=~"Failed\|Pending",job=~".*kube-state-metrics"} > 0` | 15m | The PersistentVolume is in the `Failed` or `Pending` status |
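
For the two "version mismatch" rules: `metadata_generation` is incremented by the API server every time an object's spec changes, while `status_observed_generation` is set by the controller once it has processed that change. A difference that persists for the full 15-minute duration therefore means the controller has not (or cannot) act on the latest spec, for example because a rollout is stuck.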

Kubernetes Node

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Imminent runout of filesystem space | `(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The filesystem space is estimated to run out within 4 hours (see the note after this table) |
| High filesystem space utilization | `(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The available filesystem space is below 5% |
| Imminent runout of filesystem inodes | `(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The filesystem inodes are estimated to run out within 4 hours |
| High filesystem inode utilization | `(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The proportion of available inodes is below 3% |
| Unstable network interface status | `changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2` | 2m | The network interface status is unstable and frequently changes between "up" and "down" |
| Network interface data reception errors | `increase(node_network_receive_errs_total[2m]) > 10` | 1h | Errors occurred while the network interface was receiving data |
| Network interface data sending errors | `increase(node_network_transmit_errs_total[2m]) > 10` | 1h | Errors occurred while the network interface was sending data |
| Unsynced server clock | `min_over_time(node_timex_sync_status[5m]) == 0` | 10m | The server time has not been synced recently. Please check whether NTP is correctly configured |
| Server clock skew | `(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)` | 10m | The server clock is skewed by more than 0.05 seconds and is not converging. Please check whether NTP is correctly configured |
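
The two "imminent runout" rules rely on `predict_linear()`, which fits a linear regression to the series over the sampled window (here, the past 6 hours) and extrapolates it the given number of seconds ahead (here, `4*60*60`, i.e., 4 hours); a predicted value below zero means the filesystem is on track to be exhausted within that window. Read-only filesystems are excluded because they cannot fill up further, and the `< 15` and `< 20` availability guards keep the prediction from firing while plenty of space remains.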