
Rule Type Description (Old)

Last updated: 2025-10-24 14:48:01
TMP (Tencent Managed Service for Prometheus) provides preset alert templates for TKE clusters, covering the master components, kubelet, resource usage, workloads, and nodes.
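These preset templates correspond to standard Prometheus alerting rules. As a rough sketch (not the exact group or alert names TMP uses internally), the "APIServer fault" preset below could be expressed in ordinary open-source Prometheus rule-file syntax like this:

```yaml
# Illustrative sketch only: group name, alert name, and labels are
# assumptions; the expression is taken verbatim from the table below.
groups:
  - name: kubernetes-master
    rules:
      - alert: APIServerDown
        expr: absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: APIServer disappeared from the collection targets
```

The `for:` field plays the role of the "Duration" column: the expression must hold continuously for that long before the alert fires.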

Kubernetes Master Component

The following alert rules are provided for non-managed clusters:

Rule: Error with client access to APIServer
Expression: (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id) / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)) > 0.01
Duration: 15m
Description: The error rate of client access to the APIServer is above 1%.

Rule: Imminent expiration of the client certificate for APIServer access
Expression: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (cluster_id, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
Duration: None
Description: The client certificate for APIServer access will expire within 24 hours.

Rule: Aggregated API error
Expression: sum by(cluster_id, name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2
Duration: None
Description: The aggregated API reported errors in the last 5 minutes.

Rule: Low aggregated API availability
Expression: (1 - max by(name, namespace, cluster_id)(avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90
Duration: 5m
Description: The availability of the aggregated API service in the last 5 minutes was below 90%.

Rule: APIServer fault
Expression: absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
Duration: 5m
Description: The APIServer disappeared from the collection targets.

Rule: Scheduler fault
Expression: absent(sum(up{job="kube-scheduler"}) by (cluster_id) > 0)
Duration: 15m
Description: The scheduler disappeared from the collection targets.

Rule: Controller manager fault
Expression: absent(sum(up{job="kube-controller-manager"}) by (cluster_id) > 0)
Duration: 15m
Description: The controller manager disappeared from the collection targets.
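Any of these expressions can be evaluated directly against a Prometheus-compatible endpoint via the standard instant-query API (`/api/v1/query`); any series returned means the alert condition currently holds. A minimal sketch, assuming a reachable endpoint URL (the base URL below is a placeholder):

```python
# Sketch: evaluate a preset rule expression against a Prometheus
# HTTP API endpoint. /api/v1/query is the standard instant-query API;
# the base URL is an assumption for illustration.
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, expr: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def firing_series(base_url: str, expr: str) -> list:
    """Run the expression; a non-empty result means the condition holds."""
    with urllib.request.urlopen(build_query_url(base_url, expr)) as resp:
        body = json.load(resp)
    return body["data"]["result"]


# Example: the "Error with client access to APIServer" expression.
EXPR = (
    '(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) '
    "by (instance, job, cluster_id) "
    "/ sum(rate(rest_client_requests_total[5m])) "
    "by (instance, job, cluster_id)) > 0.01"
)
# firing_series("http://localhost:9090", EXPR) would return the offending
# instances, if any, on a live endpoint.
```

Note that the query API only reports the instantaneous state; the "Duration" column (e.g. 15m) is enforced by the alerting engine, not by the expression itself.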

Kubelet

Rule: Exceptional node status
Expression: kube_node_status_condition{job=~".*kube-state-metrics",condition="Ready",status="true"} == 0
Duration: 15m
Description: The node status has been exceptional for over 15 minutes.

Rule: Unreachable node
Expression: kube_node_spec_taint{job=~".*kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
Duration: 15m
Description: The node is unreachable, and its workloads will be rescheduled.

Rule: Too many Pods running on node
Expression: count by(cluster_id, node) ((kube_pod_status_phase{job=~".*kube-state-metrics",phase="Running"} == 1) * on(instance,pod,namespace,cluster_id) group_left(node) topk by(instance,pod,namespace,cluster_id) (1, kube_pod_info{job=~".*kube-state-metrics"})) / max by(cluster_id, node) (kube_node_status_capacity_pods{job=~".*kube-state-metrics"} != 1) > 0.95
Duration: 15m
Description: The number of Pods running on the node is close to the upper limit.

Rule: Node status fluctuation
Expression: sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_id, node) > 2
Duration: 15m
Description: The node status fluctuates between normal and exceptional.

Rule: Imminent expiration of the kubelet client certificate
Expression: kubelet_certificate_manager_client_ttl_seconds < 86400
Duration: None
Description: The kubelet client certificate will expire within 24 hours.

Rule: Imminent expiration of the kubelet server certificate
Expression: kubelet_certificate_manager_server_ttl_seconds < 86400
Duration: None
Description: The kubelet server certificate will expire within 24 hours.

Rule: Kubelet client certificate renewal error
Expression: increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0
Duration: 15m
Description: An error occurred while renewing the kubelet client certificate.

Rule: Kubelet server certificate renewal error
Expression: increase(kubelet_server_expiration_renew_errors[5m]) > 0
Duration: 15m
Description: An error occurred while renewing the kubelet server certificate.

Rule: Time-consuming PLEG
Expression: histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (cluster_id, instance, le) * on(instance, cluster_id) group_left(node) kubelet_node_name{job="kubelet"}) >= 10
Duration: 5m
Description: The 99th percentile of PLEG operation duration exceeds 10 seconds.

Rule: Time-consuming Pod start
Expression: histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m])) by (cluster_id, instance, le)) * on(cluster_id, instance) group_left(node) kubelet_node_name{job="kubelet"} > 60
Duration: 15m
Description: The 99th percentile of Pod start duration exceeds 60 seconds.

Rule: Kubelet fault
Expression: absent(sum(up{job="kubelet"}) by (cluster_id) > 0)
Duration: 15m
Description: The kubelet disappeared from the collection targets.
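The certificate-expiry rules above compare a TTL gauge (seconds remaining until expiry) against the constant 86400, i.e. 24 hours. A trivial sketch of the same check, useful when reproducing the logic outside PromQL:

```python
# Sketch of the certificate-expiry check used by the kubelet rules:
# the TTL gauge (seconds remaining) is compared against 86400 s = 24 h.
CERT_EXPIRY_THRESHOLD_SECONDS = 24 * 60 * 60  # 86400, as in the expressions above


def cert_expiring_soon(ttl_seconds: float) -> bool:
    """True if the certificate expires within 24 hours."""
    return ttl_seconds < CERT_EXPIRY_THRESHOLD_SECONDS
```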

Kubernetes Resource Use

Rule: Cluster CPU resource overload
Expression: sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > (count by (cluster_id) (kube_node_status_allocatable_cpu_cores) - 1) / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)
Duration: 5m
Description: Pods in the cluster have requested too many CPU cores; the cluster cannot tolerate another node failure.

Rule: Cluster memory resource overload
Expression: sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > (count by (cluster_id) (kube_node_status_allocatable_memory_bytes) - 1) / count by (cluster_id) (kube_node_status_allocatable_memory_bytes)
Duration: 5m
Description: Pods in the cluster have requested too much memory; the cluster cannot tolerate another node failure.

Rule: Cluster CPU quota overload
Expression: sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="cpu"}) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > 1.5
Duration: 5m
Description: The CPU quota in the cluster exceeds the total number of allocatable CPU cores.

Rule: Cluster memory quota overload
Expression: sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="memory"}) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > 1.5
Duration: 5m
Description: The memory quota in the cluster exceeds the total amount of allocatable memory.

Rule: Imminent runout of quota resources
Expression: sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="used"}) / sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="hard"} > 0) >= 0.9
Duration: 15m
Description: The quota resource utilization exceeds 90%.

Rule: High proportion of restricted CPU execution cycles
Expression: sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster_id, container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster_id, container, pod, namespace) > (25 / 100)
Duration: 15m
Description: More than 25% of the container's CPU execution cycles are being throttled.

Rule: High Pod CPU utilization
Expression: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75
Duration: 15m
Description: The Pod CPU utilization exceeds 75% of its limit.

Rule: High Pod memory utilization
Expression: sum(rate(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_memory_bytes) by (cluster_id, namespace, pod, container) > 0.75
Duration: 15m
Description: The Pod memory utilization exceeds 75% of its limit.
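The two "resource overload" rules above compare total requested resources against the fraction (n-1)/n of allocatable capacity, where n is the node count: the alert fires when the cluster could no longer reschedule all requests after losing one node. A sketch of that arithmetic:

```python
# Sketch of the logic behind the "Cluster CPU resource overload" rule:
# the alert fires when total requested cores exceed the capacity that
# would remain if one node failed, i.e. requested/allocatable > (n-1)/n.
def overcommitted(requested_cores: float, allocatable_cores: float,
                  node_count: int) -> bool:
    """True if the cluster cannot tolerate the loss of one more node."""
    if node_count <= 1:
        # A single-node cluster has no failover headroom at all.
        return requested_cores > 0
    threshold = (node_count - 1) / node_count
    return requested_cores / allocatable_cores > threshold


# Example: 4 nodes with 16 allocatable cores in total -> the rule
# fires once more than 12 cores (0.75 of capacity) are requested.
```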

Kubernetes Workload

Rule: Frequent Pod restarts
Expression: increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0
Duration: 15m
Description: The Pod was frequently restarted in the last 5 minutes.

Rule: Exceptional Pod status
Expression: sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
Duration: 15m
Description: The Pod has been in the `NotReady` status for over 15 minutes.

Rule: Exceptional container status
Expression: sum by (namespace, pod, container, cluster_id) (kube_pod_container_status_waiting_reason{job=~".*kube-state-metrics"}) > 0
Duration: 1h
Description: The container has been in the `Waiting` status for a long period of time.

Rule: Deployment version mismatch
Expression: kube_deployment_status_observed_generation{job=~".*kube-state-metrics"} != kube_deployment_metadata_generation{job=~".*kube-state-metrics"}
Duration: 15m
Description: The Deployment version differs from the set version, which indicates that the Deployment change hasn't taken effect.

Rule: Deployment replica quantity mismatch
Expression: (kube_deployment_spec_replicas{job=~".*kube-state-metrics"} != kube_deployment_status_replicas_available{job=~".*kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
Duration: 15m
Description: The actual number of replicas differs from the set number of replicas.

Rule: StatefulSet version mismatch
Expression: kube_statefulset_status_observed_generation{job=~".*kube-state-metrics"} != kube_statefulset_metadata_generation{job=~".*kube-state-metrics"}
Duration: 15m
Description: The StatefulSet version differs from the set version, which indicates that the StatefulSet change hasn't taken effect.

Rule: StatefulSet replica quantity mismatch
Expression: (kube_statefulset_status_replicas_ready{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas{job=~".*kube-state-metrics"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
Duration: 15m
Description: The actual number of replicas differs from the set number of replicas.

Rule: Ineffective StatefulSet update
Expression: (max without(revision) (kube_statefulset_status_current_revision{job=~".*kube-state-metrics"} unless kube_statefulset_status_update_revision{job=~".*kube-state-metrics"}) * (kube_statefulset_replicas{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
Duration: 15m
Description: The StatefulSet update hasn't been applied to some Pods.

Rule: Frozen DaemonSet change
Expression: ((kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} != 0) or (kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_available{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}[5m]) == 0)
Duration: 15m
Description: The DaemonSet change has made no progress for more than 15 minutes.

Rule: DaemonSet not scheduled on some nodes
Expression: kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} > 0
Duration: 10m
Description: The DaemonSet is not scheduled on some nodes.

Rule: Faulty scheduling of DaemonSet on some nodes
Expression: kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} > 0
Duration: 15m
Description: The DaemonSet is incorrectly scheduled to some nodes.

Rule: Excessive Job execution
Expression: kube_job_spec_completions{job=~".*kube-state-metrics"} - kube_job_status_succeeded{job=~".*kube-state-metrics"} > 0
Duration: 12h
Description: The execution duration of the Job exceeds 12 hours.

Rule: Job execution failure
Expression: kube_job_failed{job=~".*kube-state-metrics"} > 0
Duration: 15m
Description: Job execution failed.

Rule: Mismatch between replica quantity and HPA
Expression: (kube_hpa_status_desired_replicas{job=~".*kube-state-metrics"} != kube_hpa_status_current_replicas{job=~".*kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0
Duration: 15m
Description: The actual number of replicas differs from the number set in the HPA.

Rule: Number of replicas reaching maximum value in HPA
Expression: kube_hpa_status_current_replicas{job=~".*kube-state-metrics"} == kube_hpa_spec_max_replicas{job=~".*kube-state-metrics"}
Duration: 15m
Description: The actual number of replicas has reached the maximum value configured in the HPA.

Rule: Exceptional PersistentVolume status
Expression: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kube-state-metrics"} > 0
Duration: 15m
Description: The PersistentVolume is in the `Failed` or `Pending` status.

Kubernetes Node

Rule: Imminent runout of filesystem space
Expression: (node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
Duration: 1h
Description: The filesystem space is predicted to run out within 4 hours.

Rule: High filesystem space utilization
Expression: (node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
Duration: 1h
Description: The available filesystem space is below 5%.

Rule: Imminent runout of filesystem inodes
Expression: (node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
Duration: 1h
Description: The filesystem inodes are predicted to run out within 4 hours.

Rule: High filesystem inode utilization
Expression: (node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
Duration: 1h
Description: The proportion of available inodes is below 3%.

Rule: Unstable network interface status
Expression: changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m])
Duration: 2m
Description: The network interface status is unstable, frequently changing between "up" and "down".

Rule: Network interface data reception error
Expression: increase(node_network_receive_errs_total[2m]) > 10
Duration: 1h
Description: Errors occurred while the network interface was receiving data.

Rule: Network interface data sending error
Expression: increase(node_network_transmit_errs_total[2m]) > 10
Duration: 1h
Description: Errors occurred while the network interface was sending data.

Rule: Unsynced server clock
Expression: min_over_time(node_timex_sync_status[5m]) == 0
Duration: 10m
Description: The server clock has not been synced recently. Check whether NTP is correctly configured.

Rule: Server clock skew
Expression: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
Duration: 10m
Description: The server clock is skewed by more than 0.05 seconds and is not converging. Check whether NTP is correctly configured.
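The "Imminent runout" rules rely on PromQL's predict_linear, which fits a least-squares line to the samples in the range window and extrapolates it forward (4*60*60 = 4 hours here); a negative prediction means the resource is on track to run out. A rough sketch of that idea (extrapolating from the newest sample, which only approximates PromQL's exact evaluation-time semantics):

```python
# Sketch of the idea behind predict_linear(...[6h], 4*60*60): fit a
# least-squares line to recent (timestamp, value) samples and
# extrapolate horizon_s seconds past the newest sample. A negative
# result for a "free space" series means imminent exhaustion.
def predict_linear(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Least-squares linear extrapolation over (timestamp, value) pairs."""
    n = len(samples)
    t0 = samples[-1][0]                     # shift times so the newest sample is x = 0
    xs = [t - t0 for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x     # fitted value at the newest sample
    return intercept + slope * horizon_s


# Example: free space dropping 1 unit/second -> the prediction goes
# negative once the horizon exceeds the remaining runway.
```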

