Rule Type Description

Last updated: 2022-05-19 10:50:52

TMP provides preset alert rule templates for TKE clusters, covering the master components, the kubelet, resource usage, workloads, and nodes.

Kubernetes Master Components

The following alert rules are provided for self-deployed (non-managed) clusters:

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Error with client access to APIServer | `(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id) / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)) > 0.01` | 15m | The error rate of client access to the APIServer is above 1% |
| Imminent expiration of the client certificate for APIServer access | `apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (cluster_id, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400` | None | The client certificate for APIServer access will expire within 24 hours |
| Aggregated API error | `sum by (cluster_id, name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2` | None | The aggregated API reported errors in the last 5 minutes |
| Low aggregated API availability | `(1 - max by (name, namespace, cluster_id) (avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90` | 5m | The availability of the aggregated API service in the last 5 minutes was below 90% |
| APIServer fault | `absent(sum(up{job="apiserver"}) by (cluster_id) > 0)` | 5m | The APIServer disappeared from the collection targets |
| Scheduler fault | `absent(sum(up{job="kube-scheduler"}) by (cluster_id) > 0)` | 15m | The scheduler disappeared from the collection targets |
| Controller manager fault | `absent(sum(up{job="kube-controller-manager"}) by (cluster_id) > 0)` | 15m | The controller manager disappeared from the collection targets |
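
These presets behave like ordinary Prometheus alerting rules: the "Rule Expression" column is the PromQL condition, and the "Duration" column is the `for:` clause, i.e., how long the condition must hold before the alert fires. As a minimal sketch, assuming a standard Prometheus rule file (the group name, alert name, and severity label below are illustrative, not the exact identifiers TMP uses), the first rule above corresponds to:

```yaml
groups:
  - name: apiserver.rules                      # illustrative group name
    rules:
      - alert: APIServerClientErrorRateHigh    # illustrative alert name
        expr: |
          (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id)
            / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)) > 0.01
        for: 15m                               # the "Duration" column
        labels:
          severity: warning                    # illustrative severity
        annotations:
          description: The error rate of client access to the APIServer is above 1%.
```

Rules whose duration is "None" simply omit the `for:` clause and fire as soon as the expression returns a result.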

Kubelet

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Exceptional node status | `kube_node_status_condition{job=~".*kube-state-metrics",condition="Ready",status="true"} == 0` | 15m | The node has not been in the `Ready` status for over 15 minutes |
| Unreachable node | `kube_node_spec_taint{job=~".*kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1` | 15m | The node is unreachable, and its workloads will be rescheduled |
| Too many Pods running on node | `count by (cluster_id, node) ((kube_pod_status_phase{job=~".*kube-state-metrics",phase="Running"} == 1) * on (instance, pod, namespace, cluster_id) group_left(node) topk by (instance, pod, namespace, cluster_id) (1, kube_pod_info{job=~".*kube-state-metrics"})) / max by (cluster_id, node) (kube_node_status_capacity_pods{job=~".*kube-state-metrics"} != 1) > 0.95` | 15m | The number of Pods running on the node is close to the upper limit |
| Node status fluctuation | `sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_id, node) > 2` | 15m | The node status fluctuates between normal and exceptional |
| Imminent expiration of the kubelet client certificate | `kubelet_certificate_manager_client_ttl_seconds < 86400` | None | The kubelet client certificate will expire within 24 hours |
| Imminent expiration of the kubelet server certificate | `kubelet_certificate_manager_server_ttl_seconds < 86400` | None | The kubelet server certificate will expire within 24 hours |
| Kubelet client certificate renewal error | `increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0` | 15m | An error occurred while renewing the kubelet client certificate |
| Kubelet server certificate renewal error | `increase(kubelet_server_expiration_renew_errors[5m]) > 0` | 15m | An error occurred while renewing the kubelet server certificate |
| Time-consuming PLEG | `histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (cluster_id, instance, le) * on (instance, cluster_id) group_left(node) kubelet_node_name{job="kubelet"}) >= 10` | 5m | The 99th percentile of PLEG operation duration exceeds 10 seconds |
| Time-consuming Pod start | `histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m])) by (cluster_id, instance, le)) * on (cluster_id, instance) group_left(node) kubelet_node_name{job="kubelet"} > 60` | 15m | The 99th percentile of Pod start duration exceeds 60 seconds |
| Kubelet fault | `absent(sum(up{job="kubelet"}) by (cluster_id) > 0)` | 15m | The kubelet disappeared from the collection targets |
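
The "fault" rules here and in the previous table share an absence-detection pattern that is worth spelling out: a plain `up{job="kubelet"} == 0` check cannot fire if the target disappears from the scrape configuration entirely, because the `up` series then stops existing. A minimal sketch of the pattern as a standard rule file (the group and alert names are illustrative):

```yaml
groups:
  - name: kubelet.rules        # illustrative group name
    rules:
      # absent(v) returns no data while v has at least one series, and a
      # single 1-valued series once v is completely empty. The inner
      # expression keeps only clusters with at least one healthy target,
      # so the alert fires when no healthy kubelet target remains among
      # the collection targets.
      - alert: KubeletAbsent   # illustrative alert name
        expr: absent(sum(up{job="kubelet"}) by (cluster_id) > 0)
        for: 15m
```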

Kubernetes Resource Use

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Cluster CPU resource overload | `sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"}) * on (cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending\|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > (count by (cluster_id) (kube_node_status_allocatable_cpu_cores) - 1) / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)` | 5m | Pods in the cluster have requested too many CPU cores, and the cluster can no longer tolerate a node failure (see the note after this table) |
| Cluster memory resource overload | `sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job=~".*kube-state-metrics"}) * on (cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending\|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > (count by (cluster_id) (kube_node_status_allocatable_memory_bytes) - 1) / count by (cluster_id) (kube_node_status_allocatable_memory_bytes)` | 5m | Pods in the cluster have requested too much memory, and the cluster can no longer tolerate a node failure |
| Cluster CPU quota overload | `sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="cpu"}) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > 1.5` | 5m | The CPU quota in the cluster exceeds 1.5 times the total allocatable CPU cores |
| Cluster memory quota overload | `sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="memory"}) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > 1.5` | 5m | The memory quota in the cluster exceeds 1.5 times the total allocatable memory |
| Imminent runout of quota resources | `sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="used"}) / sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="hard"} > 0) >= 0.9` | 15m | The quota resource utilization exceeds 90% |
| High proportion of throttled CPU periods | `sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster_id, container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (cluster_id, container, pod, namespace) > (25 / 100)` | 15m | More than 25% of the container's CPU scheduling periods were throttled |
| High Pod CPU utilization | `sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75` | 15m | The Pod CPU utilization exceeds 75% of the limit |
| High Pod memory utilization | `sum(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_memory_bytes) by (cluster_id, namespace, pod, container) > 0.75` | 15m | The Pod memory utilization exceeds 75% of the limit |
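
The threshold `(count by (cluster_id) (...) - 1) / count by (cluster_id) (...)` in the first two rules encodes the ability to survive a single node failure: with N nodes, the alert fires once the Pods' total requests exceed the fraction (N - 1)/N of the cluster's allocatable capacity. For example, in a cluster of 4 equally sized nodes the threshold is 3/4, so the alert fires when requested CPU exceeds 75% of the cluster total, which is exactly the point at which the remaining 3 nodes could no longer accommodate all requests after one node fails.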

Kubernetes Workload

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Frequent Pod restarts | `increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0` | 15m | The Pod was restarted repeatedly in the last 5 minutes |
| Exceptional Pod status | `sum by (namespace, pod, cluster_id) (max by (namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kube-state-metrics", phase=~"Pending\|Unknown"}) * on (namespace, pod, cluster_id) group_left(owner_kind) topk by (namespace, pod) (1, max by (namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0` | 15m | The Pod has been in the `NotReady` status for over 15 minutes |
| Exceptional container status | `sum by (namespace, pod, container, cluster_id) (kube_pod_container_status_waiting_reason{job=~".*kube-state-metrics"}) > 0` | 1h | The container has been in the `Waiting` status for a long time |
| Deployment version mismatch | `kube_deployment_status_observed_generation{job=~".*kube-state-metrics"} != kube_deployment_metadata_generation{job=~".*kube-state-metrics"}` | 15m | The observed Deployment version differs from the configured version, indicating that the Deployment change hasn't taken effect (see the note after this table) |
| Deployment replica quantity mismatch | `(kube_deployment_spec_replicas{job=~".*kube-state-metrics"} != kube_deployment_status_replicas_available{job=~".*kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The actual number of replicas differs from the configured number |
| StatefulSet version mismatch | `kube_statefulset_status_observed_generation{job=~".*kube-state-metrics"} != kube_statefulset_metadata_generation{job=~".*kube-state-metrics"}` | 15m | The observed StatefulSet version differs from the configured version, indicating that the StatefulSet change hasn't taken effect |
| StatefulSet replica quantity mismatch | `(kube_statefulset_status_replicas_ready{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas{job=~".*kube-state-metrics"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The actual number of replicas differs from the configured number |
| Ineffective StatefulSet update | `(max without(revision) (kube_statefulset_status_current_revision{job=~".*kube-state-metrics"} unless kube_statefulset_status_update_revision{job=~".*kube-state-metrics"}) * (kube_statefulset_replicas{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The StatefulSet update hasn't been applied to some Pods |
| Frozen DaemonSet change | `((kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} != 0) or (kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_available{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}[5m]) == 0)` | 15m | The DaemonSet change has been stuck for more than 15 minutes |
| DaemonSet not scheduled on some nodes | `kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} > 0` | 10m | The DaemonSet is not scheduled on some nodes |
| Faulty scheduling of DaemonSet on some nodes | `kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} > 0` | 15m | The DaemonSet is incorrectly scheduled to some nodes |
| Excessive Job execution duration | `kube_job_spec_completions{job=~".*kube-state-metrics"} - kube_job_status_succeeded{job=~".*kube-state-metrics"} > 0` | 12h | The Job has been running for more than 12 hours without completing |
| Job execution failure | `kube_job_failed{job=~".*kube-state-metrics"} > 0` | 15m | The Job execution failed |
| Mismatch between replica quantity and HPA | `(kube_hpa_status_desired_replicas{job=~".*kube-state-metrics"} != kube_hpa_status_current_replicas{job=~".*kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0` | 15m | The actual number of replicas differs from the number set by the HPA |
| Number of replicas reaching maximum value in HPA | `kube_hpa_status_current_replicas{job=~".*kube-state-metrics"} == kube_hpa_spec_max_replicas{job=~".*kube-state-metrics"}` | 15m | The number of replicas has reached the maximum configured in the HPA |
| Exceptional PersistentVolume status | `kube_persistentvolume_status_phase{phase=~"Failed\|Pending",job=~".*kube-state-metrics"} > 0` | 15m | The PersistentVolume is in the `Failed` or `Pending` status |
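
For the two "version mismatch" rules: `metadata_generation` is incremented by the API server every time an object's spec changes, while `status_observed_generation` is set by the controller once it has processed that change. A difference that persists for the full 15-minute duration therefore means the controller has not (or cannot) act on the latest spec, for example because a rollout is stuck.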

Kubernetes Node

| Rule Name | Rule Expression | Duration | Description |
| --- | --- | --- | --- |
| Imminent runout of filesystem space | `(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The filesystem space is estimated to run out within 4 hours (see the note after this table) |
| High filesystem space utilization | `(node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The available filesystem space is below 5% |
| Imminent runout of filesystem inodes | `(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The filesystem inodes are estimated to run out within 4 hours |
| High filesystem inode utilization | `(node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)` | 1h | The proportion of available inodes is below 3% |
| Unstable network interface status | `changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2` | 2m | The network interface status is unstable and frequently changes between "up" and "down" |
| Network interface data reception errors | `increase(node_network_receive_errs_total[2m]) > 10` | 1h | Errors occurred while the network interface was receiving data |
| Network interface data sending errors | `increase(node_network_transmit_errs_total[2m]) > 10` | 1h | Errors occurred while the network interface was sending data |
| Unsynced server clock | `min_over_time(node_timex_sync_status[5m]) == 0` | 10m | The server time has not been synced recently. Please check whether NTP is correctly configured |
| Server clock skew | `(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)` | 10m | The server clock is skewed by more than 0.05 seconds and is not converging. Please check whether NTP is correctly configured |
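
The two "imminent runout" rules rely on `predict_linear()`, which fits a linear regression to the series over the sampled window (here, the past 6 hours) and extrapolates it the given number of seconds ahead (here, `4*60*60`, i.e., 4 hours); a predicted value below zero means the filesystem is on track to be exhausted within that window. Read-only filesystems are excluded because they cannot fill up further, and the `< 15` and `< 20` availability guards keep the prediction from firing while plenty of space remains.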