Description of tke-monitor-agent

Last updated: 2024-02-01 10:07:57

    Overview

    Tencent Cloud upgraded the basic monitoring architecture to improve the stability of the TKE basic monitoring and alarming feature. After the upgrade, a DaemonSet named tke-monitor-agent is deployed in the kube-system namespace of the cluster, together with the Kubernetes resource objects it needs for authentication and authorization: a ClusterRole, a ServiceAccount, and a ClusterRoleBinding. These resource objects are all named tke-monitor-agent.
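To confirm that these objects exist in a cluster, each one can be looked up by name. The following is a minimal sketch, not an official procedure. It assumes kubectl is already configured for the target cluster and that the ServiceAccount lives in kube-system next to the DaemonSet (the text above does not state its namespace). The function name check_tke_monitor_agent is hypothetical.

```shell
# Hypothetical helper: prints one line per object named tke-monitor-agent and
# returns non-zero if any of them is missing.
# Assumption: the ServiceAccount is in kube-system, like the DaemonSet.
check_tke_monitor_agent() {
  local name="tke-monitor-agent" ns="kube-system" missing=0 target
  # Namespaced kinds carry "-n kube-system"; cluster-scoped kinds do not.
  for target in "daemonset -n $ns" "serviceaccount -n $ns" "clusterrole" "clusterrolebinding"; do
    # $target is left unquoted on purpose so "-n kube-system" splits into arguments.
    if kubectl get $target "$name" >/dev/null 2>&1; then
      echo "found: ${target%% *}/$name"
    else
      echo "missing: ${target%% *}/$name"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_tke_monitor_agent prints a found or missing line for each of the four kinds, so the exit status can be used in a script to detect an incomplete deployment.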

    Strengths

    This add-on collects the monitoring data of containers, Pods, nodes, and community add-ons. The collected data is used in the console for displaying basic monitoring metrics, for metric alarming, and for the metric-based HPA service. Deploying this add-on fixes the problem where monitoring data could not be obtained because the basic monitoring service was unstable, so monitoring, alarming, and HPA services become more stable.

    Impact

    Deploying this add-on does not affect the normal running of the cluster.
    If node resources are poorly allocated, the node load is too heavy, or node resources are insufficient, the Pods of the tke-monitor-agent DaemonSet may end up in the Pending, Evicted, OOMKilled, CrashLoopBackOff, or ContainerCreating status after the basic monitoring add-on is deployed. The statuses are described below:
    Pending: The resources on the cluster node are not enough to schedule a Pod. You can schedule the Pod to the node by setting the quantity of requested resources for the tke-monitor-agent DaemonSet to 0. For more information, see Pod Remains in Pending.
    Evicted: This status may be caused by insufficient node resources or a heavy load on the node. You can find out the cause and solve the problem in the following ways:
    Run kubectl describe pod -n kube-system <podName> to check the cause according to the description in the Message field.
    Run kubectl describe pod -n kube-system <podName> to check the cause according to the description in the Events field.
    CrashLoopBackOff or OOMKilled: Run kubectl describe pod -n kube-system <podName> to check whether an OOM error occurred. If so, increase the memory limit, up to a maximum of 100 MB. If the error still occurs after the limit is set to 100 MB, submit a ticket for assistance.
    ContainerCreating: Run kubectl describe pod -n kube-system <podName> and check the Events field. If the message 'Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "<pod name>": Error response from daemon: Failed to set projid for /data/docker/overlay2/xxx-init: no space left on device' is displayed, the container data disk is full. Clear the data disk to restore the Pod.
    Note:
    If the problem persists, submit a ticket for assistance.
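The Pending and OOMKilled remedies above both come down to editing the resource settings of the DaemonSet. The following is a rough sketch using kubectl set resources, which by default applies to all containers in the Pod template. It assumes that the "100 MB" cap corresponds to 100Mi in Kubernetes units and that manual edits to the add-on are not reverted by the platform; neither point is confirmed by the text above. Both function names are hypothetical.

```shell
# Hypothetical helper for OOMKilled: clamps the requested memory limit (in Mi)
# to the 100 MB cap mentioned above, then applies it to the DaemonSet.
# Assumption: "100 MB" is expressed as 100Mi.
set_agent_memory_limit() {
  local want_mi="$1" cap_mi=100
  if [ "$want_mi" -gt "$cap_mi" ]; then
    echo "requested ${want_mi}Mi exceeds the cap; using ${cap_mi}Mi" >&2
    want_mi="$cap_mi"
  fi
  kubectl set resources daemonset tke-monitor-agent -n kube-system \
    --limits="memory=${want_mi}Mi"
}

# Hypothetical helper for Pending: sets the resource requests to 0 so that the
# scheduler can place the Pod on a node with little free capacity.
zero_agent_requests() {
  kubectl set resources daemonset tke-monitor-agent -n kube-system \
    --requests=cpu=0,memory=0
}
```

For example, set_agent_memory_limit 80 would set the limit to 80Mi, while set_agent_memory_limit 150 would be clamped to 100Mi. If the Pod is still OOMKilled at the cap, the guidance above is to submit a ticket rather than raise the limit further.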
    The quantity of resources consumed by each Pod managed by the tke-monitor-agent DaemonSet is positively correlated with the number of Pods and containers running on the node. Below is a sample stress test, which shows low memory and CPU usage:
    Data volume: 220 Pods are deployed on a node, and each Pod contains three containers.
    Resources consumed:
    MEM (peak) | CPU (peak)
    About 40 MiB | 0.01C
    [Figure: stress test result of the CPU usage]
    [Figure: stress test result of the memory usage]
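To compare a live cluster against the stress test figures, the current usage of the agent Pods can be listed and filtered by a memory threshold. This is only a sketch: it relies on kubectl top, which works only if the cluster serves the metrics API (not confirmed here), and it assumes memory is reported in Mi. The function name agent_pods_over_mem is hypothetical.

```shell
# Hypothetical helper: prints the tke-monitor-agent Pods whose current memory
# usage (in Mi) is above the given threshold.
# Assumption: `kubectl top` is usable, i.e. the metrics API is available.
agent_pods_over_mem() {
  local threshold_mi="$1"
  kubectl top pod -n kube-system --no-headers </dev/null |
    awk -v max="$threshold_mi" '
      # Columns are NAME, CPU(cores), MEMORY(bytes); memory looks like 38Mi.
      $1 ~ /^tke-monitor-agent/ && $3 ~ /Mi$/ {
        mem = $3; sub(/Mi$/, "", mem)
        if (mem + 0 > max + 0) print $1, $3
      }'
}
```

Running agent_pods_over_mem 40 lists the agent Pods that currently use more than the roughly 40 MiB peak observed in the sample stress test, which can hint at nodes hosting an unusually large number of Pods or containers.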

    Component Permission Description

    Permission Description

    This component is granted only the minimum permissions required for its current features to operate.

    Permission Scenarios

    Feature | Involved Object | Involved Operation Permission
    Gathering the number of Pods and related information in the cluster | ReplicaSets, Deployments, and Pods | list/watch
    Obtaining cAdvisor metrics by accessing the /metrics endpoint on the kubelet of the node | nodes, nodes/proxy, and nodes/metrics | list/watch/get
    Delivering metric data with cluster-monitor | services | list/watch
    Reporting metrics to HPA-Metrics-Server | custommetrics | update

    Permission Definition

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: tke-monitor-agent
    rules:
      - apiGroups: ["apps"]
        resources: ["replicasets"]
        verbs: ["list", "watch"]
      - apiGroups: ["apps"]
        resources: ["deployments"]
        verbs: ["list", "watch"]
      - apiGroups: [""]
        resources: ["nodes", "nodes/proxy", "nodes/metrics"]
        verbs: ["list", "watch", "get"]
      - apiGroups: [""]
        resources: ["services"]
        verbs: ["list", "watch"]
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["list", "watch"]
      - apiGroups: ["monitor.tencent.io"]
        resources: ["custommetrics"]
        verbs: ["update"]
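One way to check that the ClusterRole above is actually bound to the agent is to ask the API server with kubectl auth can-i while impersonating its ServiceAccount. The sketch below samples one verb per resource from the rules. It assumes the ServiceAccount is in the kube-system namespace and that the caller is allowed to impersonate it (for example, a cluster administrator). The function name check_agent_permissions is hypothetical.

```shell
# Hypothetical helper: checks a sample of the permissions from the ClusterRole
# and returns non-zero if any of them is denied.
# Assumption: the ServiceAccount is kube-system/tke-monitor-agent.
check_agent_permissions() {
  local sa="system:serviceaccount:kube-system:tke-monitor-agent" failed=0
  local verb resource extra
  # Each input line is: verb, resource (resource.group for non-core groups),
  # and an optional extra flag selecting a subresource.
  while read -r verb resource extra; do
    [ -z "$verb" ] && continue
    # $extra is left unquoted so an empty value adds no argument.
    if kubectl auth can-i "$verb" "$resource" $extra --as="$sa" </dev/null >/dev/null 2>&1; then
      echo "allowed: $verb $resource${extra:+ $extra}"
    else
      echo "denied: $verb $resource${extra:+ $extra}"
      failed=1
    fi
  done <<'EOF'
list replicasets.apps
list deployments.apps
list pods
list services
get nodes
get nodes --subresource=proxy
get nodes --subresource=metrics
update custommetrics.monitor.tencent.io
EOF
  return "$failed"
}
```

A denied line for a rule that appears in the ClusterRole would suggest that the ClusterRoleBinding is missing or points at a different ServiceAccount.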
    