
Rule Type Description (Old)

Last updated: 2025-10-24 14:48:01
TMP (Tencent Managed Service for Prometheus) provides preset alert templates for TKE clusters, covering the master components, kubelet, resource usage, workloads, and nodes.
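These preset templates correspond to standard Prometheus alerting rules. As a rough sketch (not the exact group or alert names TMP uses internally), the "APIServer fault" preset below could be expressed in ordinary open-source Prometheus rule-file syntax like this:

```yaml
# Illustrative sketch only: group name, alert name, and labels are
# assumptions; the expression is taken verbatim from the table below.
groups:
  - name: kubernetes-master
    rules:
      - alert: APIServerDown
        expr: absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: APIServer disappeared from the collection targets
```

The `for:` field plays the role of the "Duration" column: the expression must hold continuously for that long before the alert fires.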

Kubernetes Master Component

The following alert rules are provided for non-managed clusters:

Rule: Error with client access to APIServer
Expression: (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job, cluster_id) / sum(rate(rest_client_requests_total[5m])) by (instance, job, cluster_id)) > 0.01
Duration: 15m
Description: The error rate of client access to the APIServer is above 1%.

Rule: Imminent expiration of the client certificate for APIServer access
Expression: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (cluster_id, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
Duration: None
Description: The client certificate for APIServer access will expire within 24 hours.

Rule: Aggregated API error
Expression: sum by(cluster_id, name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2
Duration: None
Description: The aggregated API reported errors in the last 5 minutes.

Rule: Low aggregated API availability
Expression: (1 - max by(name, namespace, cluster_id)(avg_over_time(aggregator_unavailable_apiservice[5m]))) * 100 < 90
Duration: 5m
Description: The availability of the aggregated API service in the last 5 minutes was below 90%.

Rule: APIServer fault
Expression: absent(sum(up{job="apiserver"}) by (cluster_id) > 0)
Duration: 5m
Description: The APIServer disappeared from the collection targets.

Rule: Scheduler fault
Expression: absent(sum(up{job="kube-scheduler"}) by (cluster_id) > 0)
Duration: 15m
Description: The scheduler disappeared from the collection targets.

Rule: Controller manager fault
Expression: absent(sum(up{job="kube-controller-manager"}) by (cluster_id) > 0)
Duration: 15m
Description: The controller manager disappeared from the collection targets.
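Any of these expressions can be evaluated directly against a Prometheus-compatible endpoint via the standard instant-query API (`/api/v1/query`); any series returned means the alert condition currently holds. A minimal sketch, assuming a reachable endpoint URL (the base URL below is a placeholder):

```python
# Sketch: evaluate a preset rule expression against a Prometheus
# HTTP API endpoint. /api/v1/query is the standard instant-query API;
# the base URL is an assumption for illustration.
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, expr: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def firing_series(base_url: str, expr: str) -> list:
    """Run the expression; a non-empty result means the condition holds."""
    with urllib.request.urlopen(build_query_url(base_url, expr)) as resp:
        body = json.load(resp)
    return body["data"]["result"]


# Example: the "Error with client access to APIServer" expression.
EXPR = (
    '(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) '
    "by (instance, job, cluster_id) "
    "/ sum(rate(rest_client_requests_total[5m])) "
    "by (instance, job, cluster_id)) > 0.01"
)
# firing_series("http://localhost:9090", EXPR) would return the offending
# instances, if any, on a live endpoint.
```

Note that the query API only reports the instantaneous state; the "Duration" column (e.g. 15m) is enforced by the alerting engine, not by the expression itself.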

Kubelet

Rule: Exceptional node status
Expression: kube_node_status_condition{job=~".*kube-state-metrics",condition="Ready",status="true"} == 0
Duration: 15m
Description: The node status has been exceptional for over 15 minutes.

Rule: Unreachable node
Expression: kube_node_spec_taint{job=~".*kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
Duration: 15m
Description: The node is unreachable, and its workloads will be rescheduled.

Rule: Too many Pods running on node
Expression: count by(cluster_id, node) ((kube_pod_status_phase{job=~".*kube-state-metrics",phase="Running"} == 1) * on(instance,pod,namespace,cluster_id) group_left(node) topk by(instance,pod,namespace,cluster_id) (1, kube_pod_info{job=~".*kube-state-metrics"})) / max by(cluster_id, node) (kube_node_status_capacity_pods{job=~".*kube-state-metrics"} != 1) > 0.95
Duration: 15m
Description: The number of Pods running on the node is close to the upper limit.

Rule: Node status fluctuation
Expression: sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_id, node) > 2
Duration: 15m
Description: The node status fluctuates between normal and exceptional.

Rule: Imminent expiration of the kubelet client certificate
Expression: kubelet_certificate_manager_client_ttl_seconds < 86400
Duration: None
Description: The kubelet client certificate will expire within 24 hours.

Rule: Imminent expiration of the kubelet server certificate
Expression: kubelet_certificate_manager_server_ttl_seconds < 86400
Duration: None
Description: The kubelet server certificate will expire within 24 hours.

Rule: Kubelet client certificate renewal error
Expression: increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0
Duration: 15m
Description: An error occurred while renewing the kubelet client certificate.

Rule: Kubelet server certificate renewal error
Expression: increase(kubelet_server_expiration_renew_errors[5m]) > 0
Duration: 15m
Description: An error occurred while renewing the kubelet server certificate.

Rule: Time-consuming PLEG
Expression: histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) by (cluster_id, instance, le) * on(instance, cluster_id) group_left(node) kubelet_node_name{job="kubelet"}) >= 10
Duration: 5m
Description: The 99th percentile of PLEG operation duration exceeds 10 seconds.

Rule: Time-consuming Pod start
Expression: histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m])) by (cluster_id, instance, le)) * on(cluster_id, instance) group_left(node) kubelet_node_name{job="kubelet"} > 60
Duration: 15m
Description: The 99th percentile of Pod start duration exceeds 60 seconds.

Rule: Kubelet fault
Expression: absent(sum(up{job="kubelet"}) by (cluster_id) > 0)
Duration: 15m
Description: The kubelet disappeared from the collection targets.
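The certificate-expiry rules above compare a TTL gauge (seconds remaining until expiry) against the constant 86400, i.e. 24 hours. A trivial sketch of the same check, useful when reproducing the logic outside PromQL:

```python
# Sketch of the certificate-expiry check used by the kubelet rules:
# the TTL gauge (seconds remaining) is compared against 86400 s = 24 h.
CERT_EXPIRY_THRESHOLD_SECONDS = 24 * 60 * 60  # 86400, as in the expressions above


def cert_expiring_soon(ttl_seconds: float) -> bool:
    """True if the certificate expires within 24 hours."""
    return ttl_seconds < CERT_EXPIRY_THRESHOLD_SECONDS
```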

Kubernetes Resource Use

Rule: Cluster CPU resource overload
Expression: sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > (count by (cluster_id) (kube_node_status_allocatable_cpu_cores) - 1) / count by (cluster_id) (kube_node_status_allocatable_cpu_cores)
Duration: 5m
Description: Pods in the cluster have requested too many CPU cores; the cluster cannot tolerate another node failure.

Rule: Cluster memory resource overload
Expression: sum by (cluster_id) (max by (cluster_id, namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job=~".*kube-state-metrics"}) * on(cluster_id, namespace, pod) group_left() max by (cluster_id, namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > (count by (cluster_id) (kube_node_status_allocatable_memory_bytes) - 1) / count by (cluster_id) (kube_node_status_allocatable_memory_bytes)
Duration: 5m
Description: Pods in the cluster have requested too much memory; the cluster cannot tolerate another node failure.

Rule: Cluster CPU quota overload
Expression: sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="cpu"}) / sum by (cluster_id) (kube_node_status_allocatable_cpu_cores) > 1.5
Duration: 5m
Description: The CPU quota in the cluster exceeds the total number of allocatable CPU cores.

Rule: Cluster memory quota overload
Expression: sum by (cluster_id) (kube_resourcequota{job=~".*kube-state-metrics", type="hard", resource="memory"}) / sum by (cluster_id) (kube_node_status_allocatable_memory_bytes) > 1.5
Duration: 5m
Description: The memory quota in the cluster exceeds the total amount of allocatable memory.

Rule: Imminent runout of quota resources
Expression: sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="used"}) / sum by (cluster_id, namespace, resource) (kube_resourcequota{job=~".*kube-state-metrics", type="hard"} > 0) >= 0.9
Duration: 15m
Description: The quota resource utilization exceeds 90%.

Rule: High proportion of restricted CPU execution cycles
Expression: sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster_id, container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster_id, container, pod, namespace) > (25 / 100)
Duration: 15m
Description: More than 25% of the container's CPU execution cycles are being throttled.

Rule: High Pod CPU utilization
Expression: sum(rate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_cpu_cores) by (cluster_id, namespace, pod, container) > 0.75
Duration: 15m
Description: The Pod CPU utilization exceeds 75% of its limit.

Rule: High Pod memory utilization
Expression: sum(rate(container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!="", container!="POD"}[1m])) by (cluster_id, namespace, pod, container) / sum(kube_pod_container_resource_limits_memory_bytes) by (cluster_id, namespace, pod, container) > 0.75
Duration: 15m
Description: The Pod memory utilization exceeds 75% of its limit.
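The two "resource overload" rules above compare total requested resources against the fraction (n-1)/n of allocatable capacity, where n is the node count: the alert fires when the cluster could no longer reschedule all requests after losing one node. A sketch of that arithmetic:

```python
# Sketch of the logic behind the "Cluster CPU resource overload" rule:
# the alert fires when total requested cores exceed the capacity that
# would remain if one node failed, i.e. requested/allocatable > (n-1)/n.
def overcommitted(requested_cores: float, allocatable_cores: float,
                  node_count: int) -> bool:
    """True if the cluster cannot tolerate the loss of one more node."""
    if node_count <= 1:
        # A single-node cluster has no failover headroom at all.
        return requested_cores > 0
    threshold = (node_count - 1) / node_count
    return requested_cores / allocatable_cores > threshold


# Example: 4 nodes with 16 allocatable cores in total -> the rule
# fires once more than 12 cores (0.75 of capacity) are requested.
```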

Kubernetes Workload

Rule: Frequent Pod restarts
Expression: increase(kube_pod_container_status_restarts_total{job=~".*kube-state-metrics"}[5m]) > 0
Duration: 15m
Description: The Pod was frequently restarted in the last 5 minutes.

Rule: Exceptional Pod status
Expression: sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
Duration: 15m
Description: The Pod has been in the `NotReady` status for over 15 minutes.

Rule: Exceptional container status
Expression: sum by (namespace, pod, container, cluster_id) (kube_pod_container_status_waiting_reason{job=~".*kube-state-metrics"}) > 0
Duration: 1h
Description: The container has been in the `Waiting` status for a long period of time.

Rule: Deployment version mismatch
Expression: kube_deployment_status_observed_generation{job=~".*kube-state-metrics"} != kube_deployment_metadata_generation{job=~".*kube-state-metrics"}
Duration: 15m
Description: The Deployment version differs from the set version, which indicates that the Deployment change hasn't taken effect.

Rule: Deployment replica quantity mismatch
Expression: (kube_deployment_spec_replicas{job=~".*kube-state-metrics"} != kube_deployment_status_replicas_available{job=~".*kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
Duration: 15m
Description: The actual number of replicas differs from the set number of replicas.

Rule: StatefulSet version mismatch
Expression: kube_statefulset_status_observed_generation{job=~".*kube-state-metrics"} != kube_statefulset_metadata_generation{job=~".*kube-state-metrics"}
Duration: 15m
Description: The StatefulSet version differs from the set version, which indicates that the StatefulSet change hasn't taken effect.

Rule: StatefulSet replica quantity mismatch
Expression: (kube_statefulset_status_replicas_ready{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas{job=~".*kube-state-metrics"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
Duration: 15m
Description: The actual number of replicas differs from the set number of replicas.

Rule: Ineffective StatefulSet update
Expression: (max without(revision) (kube_statefulset_status_current_revision{job=~".*kube-state-metrics"} unless kube_statefulset_status_update_revision{job=~".*kube-state-metrics"}) * (kube_statefulset_replicas{job=~".*kube-state-metrics"} != kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"})) and (changes(kube_statefulset_status_replicas_updated{job=~".*kube-state-metrics"}[5m]) == 0)
Duration: 15m
Description: The StatefulSet update hasn't been applied to some Pods.

Rule: Frozen DaemonSet change
Expression: ((kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} != 0) or (kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"}) or (kube_daemonset_status_number_available{job=~".*kube-state-metrics"} != kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"})) and (changes(kube_daemonset_updated_number_scheduled{job=~".*kube-state-metrics"}[5m]) == 0)
Duration: 15m
Description: The DaemonSet change has made no progress for more than 15 minutes.

Rule: DaemonSet not scheduled on some nodes
Expression: kube_daemonset_status_desired_number_scheduled{job=~".*kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job=~".*kube-state-metrics"} > 0
Duration: 10m
Description: The DaemonSet is not scheduled on some nodes.

Rule: Faulty scheduling of DaemonSet on some nodes
Expression: kube_daemonset_status_number_misscheduled{job=~".*kube-state-metrics"} > 0
Duration: 15m
Description: The DaemonSet is incorrectly scheduled to some nodes.

Rule: Excessive Job execution
Expression: kube_job_spec_completions{job=~".*kube-state-metrics"} - kube_job_status_succeeded{job=~".*kube-state-metrics"} > 0
Duration: 12h
Description: The execution duration of the Job exceeds 12 hours.

Rule: Job execution failure
Expression: kube_job_failed{job=~".*kube-state-metrics"} > 0
Duration: 15m
Description: Job execution failed.

Rule: Mismatch between replica quantity and HPA
Expression: (kube_hpa_status_desired_replicas{job=~".*kube-state-metrics"} != kube_hpa_status_current_replicas{job=~".*kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0
Duration: 15m
Description: The actual number of replicas differs from the number set in the HPA.

Rule: Number of replicas reaching maximum value in HPA
Expression: kube_hpa_status_current_replicas{job=~".*kube-state-metrics"} == kube_hpa_spec_max_replicas{job=~".*kube-state-metrics"}
Duration: 15m
Description: The actual number of replicas has reached the maximum value configured in the HPA.

Rule: Exceptional PersistentVolume status
Expression: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kube-state-metrics"} > 0
Duration: 15m
Description: The PersistentVolume is in the `Failed` or `Pending` status.

Kubernetes Node

Rule: Imminent runout of filesystem space
Expression: (node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
Duration: 1h
Description: The filesystem space is predicted to run out within 4 hours.

Rule: High filesystem space utilization
Expression: (node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
Duration: 1h
Description: The available filesystem space is below 5%.

Rule: Imminent runout of filesystem inodes
Expression: (node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
Duration: 1h
Description: The filesystem inodes are predicted to run out within 4 hours.

Rule: High filesystem inode utilization
Expression: (node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0)
Duration: 1h
Description: The proportion of available inodes is below 3%.

Rule: Unstable network interface status
Expression: changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m])
Duration: 2m
Description: The network interface status is unstable, frequently changing between "up" and "down".

Rule: Network interface data reception error
Expression: increase(node_network_receive_errs_total[2m]) > 10
Duration: 1h
Description: Errors occurred while the network interface was receiving data.

Rule: Network interface data sending error
Expression: increase(node_network_transmit_errs_total[2m]) > 10
Duration: 1h
Description: Errors occurred while the network interface was sending data.

Rule: Unsynced server clock
Expression: min_over_time(node_timex_sync_status[5m]) == 0
Duration: 10m
Description: The server clock has not been synced recently. Check whether NTP is correctly configured.

Rule: Server clock skew
Expression: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
Duration: 10m
Description: The server clock is skewed by more than 0.05 seconds and is not converging. Check whether NTP is correctly configured.
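The "Imminent runout" rules rely on PromQL's predict_linear, which fits a least-squares line to the samples in the range window and extrapolates it forward (4*60*60 = 4 hours here); a negative prediction means the resource is on track to run out. A rough sketch of that idea (extrapolating from the newest sample, which only approximates PromQL's exact evaluation-time semantics):

```python
# Sketch of the idea behind predict_linear(...[6h], 4*60*60): fit a
# least-squares line to recent (timestamp, value) samples and
# extrapolate horizon_s seconds past the newest sample. A negative
# result for a "free space" series means imminent exhaustion.
def predict_linear(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Least-squares linear extrapolation over (timestamp, value) pairs."""
    n = len(samples)
    t0 = samples[-1][0]                     # shift times so the newest sample is x = 0
    xs = [t - t0 for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x     # fitted value at the newest sample
    return intercept + slope * horizon_s


# Example: free space dropping 1 unit/second -> the prediction goes
# negative once the horizon exceeds the remaining runway.
```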

