
Description of tke-monitor-agent

Last updated: 2024-02-01 10:07:57

Overview

Tencent Cloud has upgraded the basic monitoring architecture to improve the stability of TKE basic monitoring and alarming. After the upgrade, a DaemonSet named tke-monitor-agent is deployed in the kube-system namespace of the cluster, and the Kubernetes resource objects used for authentication and authorization are created: a ClusterRole, a ServiceAccount, and a ClusterRoleBinding. All of these objects are named tke-monitor-agent.
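To confirm that these objects exist in a cluster, a quick check along the following lines may help. This is a minimal sketch, not an official procedure: it assumes kubectl is already configured for the cluster and that the ServiceAccount lives in kube-system alongside the DaemonSet (the text above states the namespace only for the DaemonSet). The function name check_agent_objects is hypothetical.

```shell
# Illustrative helper (hypothetical name): list the objects named tke-monitor-agent.
check_agent_objects() {
  # Namespaced objects. Assumption: the ServiceAccount is in kube-system, like the DaemonSet.
  kubectl get daemonset/tke-monitor-agent serviceaccount/tke-monitor-agent -n kube-system
  # ClusterRole and ClusterRoleBinding are cluster-scoped, so no namespace flag is needed.
  kubectl get clusterrole/tke-monitor-agent clusterrolebinding/tke-monitor-agent
}
```

Calling check_agent_objects prints one table per command. A NotFound error for any object suggests that the upgrade has not been applied to the cluster or that the object was removed.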

Strengths

This add-on collects monitoring data for containers, Pods, nodes, and community add-ons. The collected data is used in the console for basic monitoring metric display, metric alarming, and metric-based HPA. Deploying this add-on resolves cases where monitoring data cannot be obtained because the basic monitoring service is unstable, which makes monitoring, alarming, and HPA more stable.

Impact

Deploying this add-on does not affect the normal running of the cluster.
If node resources are poorly allocated, the node is heavily loaded, or node resources are insufficient, the Pods of the tke-monitor-agent DaemonSet may enter the Pending, Evicted, OOMKilled, CrashLoopBackOff, or ContainerCreating status. Each status is described below:
Pending: The cluster node does not have enough resources to schedule the Pod. You can get the Pod scheduled by setting the requested resources of the tke-monitor-agent DaemonSet to 0. For more information, see Pod Remains in Pending.
Evicted: This status may be caused by insufficient node resources or a heavy load on the node. To find the cause, run kubectl describe pod -n kube-system <podName> and check the descriptions in the Message field and the Events field.
CrashLoopBackOff or OOMKilled: Run kubectl describe pod -n kube-system <podName> to check whether an OOM error occurred. If so, increase the memory limit, which cannot exceed 100 MB. If the error persists with the limit set to 100 MB, submit a ticket for assistance.
ContainerCreating: Run kubectl describe pod -n kube-system <podName> and check the Events field. If the following message is displayed, the container data disk is full, and you can clear the data disk to recover: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<pod name >": Error response from daemon: Failed to set projid for /data/docker/overlay2/xxx-init: no space left on device
Note:
If the problem persists, submit a ticket for assistance.
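The checks in the list above can be sketched as a few kubectl helpers. This is an illustration under stated assumptions rather than a prescribed workflow: the function names are hypothetical, the Pods are found by name prefix (Pods created by a DaemonSet are named after it), and the memory limit is passed as an argument because the text does not say whether 100 MB should be written as 100M or 100Mi. This document also does not state whether TKE later overwrites manual changes to the DaemonSet.

```shell
# List the agent Pods with their status and node. DaemonSet Pods are named
# <daemonset-name>-<suffix>, so filtering on the name prefix is enough here.
list_agent_pods() {
  kubectl get pods -n kube-system -o wide | grep tke-monitor-agent
}

# Show the full description of one Pod, including the Message and Events fields.
describe_agent_pod() {
  kubectl describe pod -n kube-system "$1"
}

# OOMKilled: raise the memory limit of the DaemonSet's containers.
# The caller picks the value; per the text above it must not exceed 100 MB.
set_agent_memory_limit() {
  kubectl set resources daemonset tke-monitor-agent -n kube-system --limits=memory="$1"
}
```

For example, list_agent_pods shows which Pods are not Running, describe_agent_pod <podName> shows the reason, and set_agent_memory_limit 100Mi raises the limit if the reason is an OOM error.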
The resources consumed by each Pod managed by the tke-monitor-agent DaemonSet are positively correlated with the number of Pods and containers running on the node. The following sample stress test shows low memory and CPU usage:

Data volume: 220 Pods are deployed on a node, and each Pod contains three containers.

Resources consumed:

| MEM (peak) | CPU (peak) |
| --- | --- |
| About 40 MiB | 0.01 cores |

[Figure: stress test result for CPU usage]

[Figure: stress test result for memory usage]

Component Permission Description

Permission Description

The permissions granted to this component are the minimum required for its current features to work.

Permission Scenarios

| Feature | Involved Object | Involved Operation Permission |
| --- | --- | --- |
| Gathering the number of Pods and related information in the cluster | ReplicaSets, Deployments, and Pods | list/watch |
| Obtaining cAdvisor metrics by accessing the /metrics endpoint on the kubelet of the node | nodes, nodes/proxy, and nodes/metrics | list/watch/get |
| Delivering metric data with cluster-monitor | services | list/watch |
| Reporting metrics to HPA-Metrics-Server | custommetrics | update |

Permission Definition

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tke-monitor-agent
rules:
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "nodes/metrics"]
    verbs: ["list", "watch", "get"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "watch"]
  - apiGroups: ["monitor.tencent.io"]
    resources: ["custommetrics"]
    verbs: ["update"]
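
To see whether these rules are in effect for the component, kubectl auth can-i can be run while impersonating its ServiceAccount. The sketch below makes two assumptions: the ServiceAccount is kube-system/tke-monitor-agent (the namespace is inferred from the DaemonSet, not stated for the ServiceAccount), and the caller is allowed to impersonate it. The variable AGENT_SA and the function name check_agent_rbac are hypothetical.

```shell
# Assumption: the ServiceAccount lives in kube-system, next to the DaemonSet.
AGENT_SA="system:serviceaccount:kube-system:tke-monitor-agent"

# Ask the API server whether the agent's ServiceAccount may perform a sample
# of the operations listed in the ClusterRole. Each command prints "yes" or "no".
check_agent_rbac() {
  kubectl auth can-i list pods --as="$AGENT_SA"
  kubectl auth can-i watch deployments.apps --as="$AGENT_SA"
  kubectl auth can-i get nodes --subresource=metrics --as="$AGENT_SA"
  kubectl auth can-i update custommetrics.monitor.tencent.io --as="$AGENT_SA"
}
```

If the ClusterRole and ClusterRoleBinding are in place, each line of output from check_agent_rbac should be yes. A no points to a missing or modified binding.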

