Using Cluster Audit for Troubleshooting

Last updated: 2023-05-06 17:36:46

    Overview

    Cluster resources can be deleted or modified unexpectedly because of operator mistakes, application bugs, or API calls to the apiserver from malicious programs. The cluster audit feature keeps logs of apiserver API calls, so you can search and analyze the audit logs to find the cause of a problem. This document describes how to use the cluster audit feature for troubleshooting.
    Note
    This document applies only to TKE clusters.

    Prerequisites

    You have enabled the cluster audit feature in the TKE console. For more information, see Enabling cluster audit.

    Use Cases

    Obtaining the analysis result

    1. Log in to the Cloud Log Service (CLS) console. In the left sidebar, click Search and Analysis.
    2. On the Search and Analysis page, select the logset and log topic to search, and specify a time range.
    3. Enter an analysis statement and click Search and Analysis to obtain the analysis result.

    Example 1: querying the operator who cordoned a node

    To query the operator who cordoned a node, use the following search statement:
    objectRef.resource:nodes AND requestObject:unschedulable
    On the Search and Analysis page, select Default Configuration for the layout. The following figure shows the query result:
    
    

    Example 2: querying the operator who deleted a workload

    To query the operator who deleted a workload, use the following search statement:
    objectRef.resource:deployments AND objectRef.name:"nginx" AND verb:"delete"
    You can obtain detailed information about the operator sub-account from the query result.
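The filters in Example 1 and Example 2 can also be reproduced offline on exported audit events. The following is a minimal sketch, assuming each event is a JSON object with the standard Kubernetes audit fields (verb, user.username, objectRef, requestObject, requestReceivedTimestamp). The helper names and the sample events are hypothetical, and the substring test on requestObject only approximates the CLS search semantics.

```python
import json

def get_path(event, dotted):
    """Follow a dotted key such as 'objectRef.resource' through nested dicts."""
    node = event
    for part in dotted.split("."):
        if not isinstance(node, dict):
            return None
        node = node.get(part)
    return node

def is_cordon(event):
    # Approximates: objectRef.resource:nodes AND requestObject:unschedulable
    body = event.get("requestObject")
    text = body if isinstance(body, str) else json.dumps(body)
    return get_path(event, "objectRef.resource") == "nodes" and "unschedulable" in text

def is_delete(event, resource, name):
    # Approximates: objectRef.resource:<resource> AND objectRef.name:"<name>" AND verb:"delete"
    return (get_path(event, "objectRef.resource") == resource
            and get_path(event, "objectRef.name") == name
            and event.get("verb") == "delete")

def operators(events, predicate):
    """Return (timestamp, username, verb) for every event the predicate accepts."""
    return [(e.get("requestReceivedTimestamp"), get_path(e, "user.username"), e.get("verb"))
            for e in events if predicate(e)]

# Made-up sample events, for illustration only.
SAMPLE = [
    {"verb": "patch", "user": {"username": "alice"},
     "objectRef": {"resource": "nodes", "name": "node-1"},
     "requestObject": {"spec": {"unschedulable": True}},
     "requestReceivedTimestamp": "2023-01-01T00:00:00Z"},
    {"verb": "delete", "user": {"username": "bob"},
     "objectRef": {"resource": "deployments", "name": "nginx"},
     "requestReceivedTimestamp": "2023-01-01T00:05:00Z"},
    {"verb": "get", "user": {"username": "carol"},
     "objectRef": {"resource": "pods", "name": "web-0"},
     "requestReceivedTimestamp": "2023-01-01T00:06:00Z"},
]

if __name__ == "__main__":
    print(operators(SAMPLE, is_cordon))
    print(operators(SAMPLE, lambda e: is_delete(e, "deployments", "nginx")))
```

Like the search statement, the cordon filter matches any node request whose body mentions unschedulable, so uncordon operations are matched as well.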
    

    Example 3: locating the causes of apiserver access limitation

    To protect the apiserver and etcd from being overloaded by frequent requests from malicious or buggy programs, the apiserver enables an access limit mechanism by default. If the access limit is reached, you can use audit logs to identify the clients that have sent large numbers of requests.
    1. To analyze requests by client based on userAgent, modify the log topic in the Key-Value Index window and enable statistics for the userAgent field, as shown below:
    
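    2. Use the following statement to count the requests that each client sends to the apiserver. The qps column holds the number of requests in each 1-minute interval; divide it by 60 to get the average rate per second: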
    * | SELECT histogram( cast(__TIMESTAMP__ as timestamp),interval 1 minute) AS time, COUNT(1) AS qps,userAgent GROUP BY time,userAgent ORDER BY time
    3. Switch to the statistical chart and select the sequence diagram. Specify the basic information and coordinate axes, as shown below:
    
    You can click specific statistics to add the statistics to the dashboard for zoomed-in display, as shown below:
    
    The figure above shows that the kube-state-metrics client sends far more requests than the other clients. Its logs show that it sends requests to the apiserver frequently because of an RBAC permission issue (its list requests are forbidden), which triggers the apiserver access limit. The relevant logs are as follows:
    I1009 13:13:09.760767 1 request.go:538] Throttling request took 1.393921018s, request: GET:https://172.16.252.1:443/api/v1/endpoints?limit=500&resourceVersion=1029843735
    E1009 13:13:09.766106 1 reflector.go:156] pkg/mod/k8s.io/client-go@v0.0.0-20191109102209-3c0d1af94be5/tools/cache/reflector.go:108: Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:monitoring:kube-state-metrics" cannot list resource "endpoints" in API group "" at the cluster scope
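The per-client statistics can also be computed offline from exported audit events. The following is a minimal sketch, assuming each event is a JSON object whose requestReceivedTimestamp is an RFC 3339 UTC string, as in the Kubernetes audit event format. The function names, the client names, and the sample timestamps are hypothetical.

```python
from collections import Counter

def lookup(event, dotted):
    """Follow a dotted key such as 'user.username' through nested dicts."""
    node = event
    for part in dotted.split("."):
        if not isinstance(node, dict):
            return None
        node = node.get(part)
    return node

def requests_per_minute(events, key="userAgent"):
    """Count events per (minute, client), like histogram(...) with GROUP BY time, client."""
    counts = Counter()
    for event in events:
        # "2020-10-09T13:13:09.760767Z"[:16] -> "2020-10-09T13:13"
        minute = event["requestReceivedTimestamp"][:16]
        counts[(minute, lookup(event, key))] += 1
    return counts

def top_clients(counts, n=3):
    """Sum the per-minute counts per client and return the n busiest clients."""
    totals = Counter()
    for (_minute, client), count in counts.items():
        totals[client] += count
    return totals.most_common(n)

# Made-up sample events, for illustration only.
SAMPLE = [
    {"userAgent": "kube-state-metrics", "requestReceivedTimestamp": "2020-10-09T13:13:09.760767Z"},
    {"userAgent": "kube-state-metrics", "requestReceivedTimestamp": "2020-10-09T13:13:10.000000Z"},
    {"userAgent": "kube-state-metrics", "requestReceivedTimestamp": "2020-10-09T13:13:11.500000Z"},
    {"userAgent": "kubectl", "requestReceivedTimestamp": "2020-10-09T13:13:30.000000Z"},
    {"userAgent": "kube-state-metrics", "requestReceivedTimestamp": "2020-10-09T13:14:01.000000Z"},
]

if __name__ == "__main__":
    counts = requests_per_minute(SAMPLE)
    for (minute, client), count in sorted(counts.items()):
        print(minute, client, count)
    print(top_clients(counts))
```

Passing key="user.username" groups the counts by the requesting user instead of by userAgent.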
    To distinguish clients by another field, such as user.username, modify the SQL statement accordingly. For example:
    * | SELECT histogram( cast(__TIMESTAMP__ as timestamp),interval 1 minute) AS time, COUNT(1) AS qps,user.username GROUP BY time,user.username ORDER BY time
    The following figure shows the display result:
    

    References

    For more information about the TKE cluster audit feature and basic operations, see Cluster Audit.
    Cluster audit data is stored in CLS. To query and analyze audit data in the CLS console, see Syntax Rules for the search syntax.
    To analyze audit data, an SQL statement supported by CLS is required. For more information, see Overview.