Quick Troubleshooting Using TKE Audit and Event Services

Last updated: 2024-12-13 21:12:47

Use Cases

The cluster audit and event storage features of TKE provide rich visual dashboards that display audit logs and cluster events in multiple dimensions. They are simple to operate and cover most common cluster Ops scenarios, making it easy for you to find and locate problems, improve Ops efficiency, and maximize the value of audit and event data. This document describes how to use the audit and event dashboards to quickly locate cluster problems in several use cases.

Prerequisites

You have logged in to the TKE console and enabled cluster audit and event storage.

Examples

Sample 1. Troubleshooting workload disappearance

1. Log in to the TKE console.
2. Select Log Management > Audit Logs in the left sidebar to go to the Audit log search page.
3. Select the K8s Object Operation Overview tab and specify the operation type and resource object to be checked in Filters as shown below:


4. The query result is displayed, as shown in the figure below:

As shown above, account 10001****7138 deleted the nginx application at 2020-11-30T03:37:13. For more information on the account, go to CAM > User List.
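The console filter above is conceptually equivalent to scanning audit events for a given verb and resource object. As an illustrative sketch (the field names follow the Kubernetes audit.k8s.io/v1 event format; the sample data is hypothetical), you could apply the same filter to exported audit logs:

```python
import json

# Hypothetical exported audit events in audit.k8s.io/v1 format, one JSON object per line.
audit_lines = [
    '{"verb": "delete", "user": {"username": "10001****7138"}, '
    '"objectRef": {"resource": "deployments", "name": "nginx"}, '
    '"requestReceivedTimestamp": "2020-11-30T03:37:13Z"}',
    '{"verb": "get", "user": {"username": "system:kube-scheduler"}, '
    '"objectRef": {"resource": "pods", "name": "nginx-5dbf784b68-tq8rd"}, '
    '"requestReceivedTimestamp": "2020-11-30T03:40:00Z"}',
]

def find_operations(lines, verb, resource):
    """Return (user, object name, timestamp) for audit events matching the filter."""
    hits = []
    for line in lines:
        event = json.loads(line)
        ref = event.get("objectRef", {})
        if event.get("verb") == verb and ref.get("resource") == resource:
            hits.append((event["user"]["username"], ref.get("name"),
                         event["requestReceivedTimestamp"]))
    return hits

print(find_operations(audit_lines, "delete", "deployments"))
```

This is what the K8s Object Operation Overview dashboard does for you without any scripting; the sketch is only meant to show which audit fields the filter operates on.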

Sample 2. Troubleshooting node cordoning

1. Log in to the TKE console.
2. Select Log Management > Audit Logs in the left sidebar to go to the Audit log search page.
3. Select the Node Operation Overview tab and specify the name of the cordoned node in Filters as shown below:


4. Click Filter to start the query. The result is as shown below:

As shown in the above figure, account 10001****7138 cordoned the node 172.16.18.13 at 2020-11-30T06:22:18.

Sample 3. Troubleshooting slow API server response

1. Log in to the TKE console.
2. Select Log Management > Audit Logs in the left sidebar to go to the Audit log search page.
3. Select the Aggregated Search tab, which provides trend graphs of API server access requests in multiple dimensions, such as user, operation type, and return status code, as shown below:

Operator distribution trend:





Operation type distribution trend:





Status code distribution trend:

As shown above, the tke-kube-state-metrics user generates far more access requests than any other user. The operation type distribution trend shows that most of the operations are LIST operations, and the status code distribution trend shows that most of the status codes are 403. The business logs show that the tke-kube-state-metrics add-on kept retrying requests to the API server due to an RBAC permission issue, causing a sharp increase in API server access requests. Below is a sample log:
E1130 06:19:37.368981 1 reflector.go:156] pkg/mod/k8s.io/client-go@v0.0.0-20191109102209-3c0d1af94be5/tools/cache/reflector.go:108: Failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:tke-kube-state-metrics" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
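The 403 in the log means the add-on's ServiceAccount lacks permission to list volumeattachments in the storage.k8s.io API group. A minimal fix sketch, assuming you manage the add-on's RBAC yourself (the ClusterRole/ClusterRoleBinding names below are illustrative), would grant that permission:

```yaml
# Illustrative RBAC granting the tke-kube-state-metrics ServiceAccount
# permission to list volumeattachments; resource names are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tke-kube-state-metrics-volumeattachments
rules:
- apiGroups: ["storage.k8s.io"]
  resources: ["volumeattachments"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tke-kube-state-metrics-volumeattachments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tke-kube-state-metrics-volumeattachments
subjects:
- kind: ServiceAccount
  name: tke-kube-state-metrics
  namespace: kube-system
```

Once the binding is applied, the LIST requests stop returning 403 and the retry storm against the API server subsides.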

Sample 4. Troubleshooting a node exception

1. Log in to the TKE console.
2. Select Log Management > Event Logs in the left sidebar to go to the Event search page.
3. Select the Event Overview tab and enter the abnormal node IP in the Resource Object filter as shown below:


4. Click Filter to start the query. The results show an Insufficient disk space event on the node.
5. Click the event to further view the trend of the abnormal event.



As shown in the above figure, starting from 2020-11-25, node 172.16.18.13 was abnormal due to insufficient disk space, and kubelet began evicting pods on the node to reclaim disk space.
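kubelet starts evicting pods when node disk usage crosses its eviction threshold; by default, the hard eviction threshold for nodefs is nodefs.available < 10%. A rough sketch of that decision (the 10% value is kubelet's documented default; the disk figures are illustrative):

```python
# Sketch of kubelet's disk-pressure check. The 10% value is kubelet's
# default hard eviction threshold for nodefs.available.
NODEFS_AVAILABLE_THRESHOLD = 0.10

def under_disk_pressure(capacity_bytes, available_bytes,
                        threshold=NODEFS_AVAILABLE_THRESHOLD):
    """Return True when available disk falls below the eviction threshold."""
    return available_bytes / capacity_bytes < threshold

# Illustrative figures: a 100 GiB filesystem with only 5 GiB free.
gib = 1024 ** 3
print(under_disk_pressure(100 * gib, 5 * gib))  # True: kubelet starts evicting
```

In the event flow above, the node's available disk stayed below this threshold from 2020-11-25 onward, which is why the eviction events kept recurring until disk space was freed.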

Sample 5. Locating a node scale-out trigger

When node pool auto scaling is enabled, the cluster-autoscaler (CA) add-on automatically increases or decreases the number of nodes in the cluster according to the load. If nodes in the cluster are automatically scaled, you can backtrack the whole scaling process through event search.
1. Log in to the TKE console.
2. Select Log Management > Event Logs in the left sidebar to go to the Event search page.
3. Select the Global Search tab and enter the following search command in the search box:
event.source.component : "cluster-autoscaler"
4. Select event.reason, event.message, and event.involvedObject.name from the Hidden Fields on the left for display. Click Search and Analysis and view the results.
5. Sort the search results by Log Time in reverse order as shown below:

According to the event flow in the above figure, the node scale-out occurred around 2020-11-25 20:35:45 and was triggered by three Nginx pods (nginx-5dbf784b68-tq8rd, nginx-5dbf784b68-fpvbx, and nginx-5dbf784b68-v9jv5). After three nodes were scaled out, no further scale-out was triggered because the number of nodes in the node pool had reached the upper limit.
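The reverse sort in step 5 is a plain sort on the event timestamp. As an illustrative sketch (the field names mirror those selected in step 4, such as event.reason and event.involvedObject.name; the sample events are hypothetical), sorting cluster-autoscaler events newest-first looks like this:

```python
# Hypothetical cluster-autoscaler events, mirroring the fields selected in
# step 4 (event.reason, event.message, event.involvedObject.name).
events = [
    {"time": "2020-11-25T20:35:45Z", "reason": "TriggeredScaleUp",
     "involvedObject": {"name": "nginx-5dbf784b68-tq8rd"}},
    {"time": "2020-11-25T20:38:02Z", "reason": "ScaledUpGroup",
     "involvedObject": {"name": "np-xxxxxxxx"}},
]

# Sort by log time in reverse order, newest events first.
newest_first = sorted(events, key=lambda e: e["time"], reverse=True)
print([e["reason"] for e in newest_first])
```

Reading the result from bottom to top then reconstructs the scaling timeline in chronological order, which is exactly how the event flow above was interpreted.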
