
NodeProblemDetectorPlus Add-on

Last updated: 2024-02-01 10:15:37

Overview

Add-on description

Node-Problem-Detector-Plus (NPD) is an add-on that monitors the health of Kubernetes cluster nodes. It runs in the TKE environment as a DaemonSet, detects node exceptions in real time, and reports the detection results to the upstream kube-apiserver.

Kubernetes objects deployed in a cluster

| Kubernetes Object Name | Type | Resource Amount | Namespace |
|---|---|---|---|
| node-problem-detector | DaemonSet | 0.5 CPU cores, 80 MB memory | kube-system |
| node-problem-detector | ServiceAccount | - | kube-system |
| node-problem-detector | ClusterRole | - | - |
| node-problem-detector | ClusterRoleBinding | - | - |

Use Cases

Node-Problem-Detector-Plus monitors the running status of nodes, including kernel deadlocks, OOM, system thread pressure, system file descriptor pressure, and other metrics. It reports this information to the API Server as Node Conditions and Events. By watching these metrics you can estimate the resource pressure on a node and manually release or scale out node resources before the node starts evicting pods. This helps prevent losses caused by Kubernetes resource reclamation or node unavailability.

Limits

To use NPD, you must install this add-on in your cluster. The system resources used by each NPD container are restricted to 0.5 CPU cores and 80 MB of memory.
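Because the add-on runs as a DaemonSet, its total footprint grows with the number of nodes. The following is a minimal sketch of that arithmetic, assuming exactly one NPD pod per node and the per-container limits stated above; the function name `npd_cluster_overhead` is hypothetical and used only for illustration.

```python
from typing import Tuple

# Per-container limits stated in this document.
NPD_CPU_CORES = 0.5
NPD_MEMORY_MB = 80


def npd_cluster_overhead(node_count: int) -> Tuple[float, int]:
    """Return the (cpu_cores, memory_mb) upper bound for NPD cluster-wide.

    Assumption: one NPD pod is scheduled on every node, which is the
    usual behavior of a DaemonSet when no node is excluded.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return node_count * NPD_CPU_CORES, node_count * NPD_MEMORY_MB


if __name__ == "__main__":
    cpu, mem = npd_cluster_overhead(20)
    print(f"20 nodes: up to {cpu} CPU cores and {mem} MB memory for NPD")
```

These figures are ceilings derived from the limits, not measured usage.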

Component Permission Description

Permission Description

The permissions granted to this component are the minimum required for its current features to operate.

Permission Scenarios

| Feature | Involved Object | Involved Operation Permission |
|---|---|---|
| When a node malfunctions, report the fault information and modify the node's condition. | nodes/status | patch |
| Send event notifications to the cluster. | events | create/patch/update |

Permission Definition

rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
  - apiGroups:
      - ""
    resources:
      - nodes/status
    verbs:
      - patch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - patch
      - update
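To make the scope of these rules concrete, the sketch below encodes them as Python data and checks whether a given request would be granted. It is a simplified illustration, not the Kubernetes authorizer: it handles only exact matches and the `*` wildcard, and ignores `resourceNames` and non-resource URLs. The names `RULES` and `is_allowed` are hypothetical.

```python
# The rules from the permission definition above, expressed as Python data.
RULES = [
    {"apiGroups": [""], "resources": ["nodes"], "verbs": ["get"]},
    {"apiGroups": [""], "resources": ["nodes/status"], "verbs": ["patch"]},
    {"apiGroups": [""], "resources": ["events"],
     "verbs": ["create", "patch", "update"]},
]


def is_allowed(rules, api_group, resource, verb):
    """Return True if any rule grants `verb` on `resource` in `api_group`.

    Simplification: exact string match or "*" only.
    """
    def matches(values, wanted):
        return "*" in values or wanted in values

    return any(
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
        for rule in rules
    )


if __name__ == "__main__":
    for resource, verb in [("nodes/status", "patch"), ("nodes", "delete")]:
        print(resource, verb, is_allowed(RULES, "", resource, verb))
```

Against a live cluster, `kubectl auth can-i` with the `--as` flag set to the add-on's ServiceAccount gives the authoritative answer.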

Usage

1. Log in to the TKE console and select Cluster in the left sidebar.
2. On the Cluster Management page, click the ID of the target cluster to go to the cluster details page.
3. In the left sidebar, click Add-on Management to go to the Add-on List page.
4. On the Add-on List page, click Create to go to the Create Add-on page, and select NodeProblemDetectorPlus.
5. Click Complete. After the installation is successful, the corresponding node-problem-detector resources are available in your cluster, and the corresponding conditions will be added to Node Conditions.
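After installation, one way to confirm the result is to inspect a Node object (for example, the JSON printed by `kubectl get node <name> -o json`) and check its `status.conditions`. The sketch below assumes every condition type listed in the appendix is added to each node; the function name `npd_condition_report` and the sample node are hypothetical.

```python
import json

# Condition types taken from the appendix of this document.
NPD_CONDITIONS = {
    "ReadonlyFilesystem", "FDPressure", "FrequentKubeletRestart",
    "CorruptDockerOverlay2", "KubeletProblem", "KernelDeadlock",
    "FrequentDockerRestart", "FrequentContainerdRestart",
    "DockerdProblem", "ContainerdProblem", "ThreadPressure",
    "NetworkUnavailable", "SerfFailed",
}


def npd_condition_report(node):
    """Return (missing, active) NPD condition types for one Node object.

    missing: expected types absent from status.conditions
    active:  expected types whose status is "True"
    """
    conditions = node.get("status", {}).get("conditions", [])
    status_by_type = {c["type"]: c["status"] for c in conditions}
    missing = sorted(NPD_CONDITIONS - set(status_by_type))
    active = sorted(
        t for t in NPD_CONDITIONS if status_by_type.get(t) == "True"
    )
    return missing, active


if __name__ == "__main__":
    # Hypothetical, heavily trimmed Node object for illustration only.
    sample = json.loads("""
    {"metadata": {"name": "node-1"},
     "status": {"conditions": [
        {"type": "Ready", "status": "True"},
        {"type": "KernelDeadlock", "status": "False"},
        {"type": "FDPressure", "status": "True"}]}}
    """)
    missing, active = npd_condition_report(sample)
    print("missing:", missing)
    print("active:", active)
```

An empty `missing` list suggests the add-on is reporting on that node; any entry in `active` is a condition that currently deviates from its default value of False.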

Appendix

Node Conditions

After the NPD add-on is installed, the following Conditions are added to nodes:

| Condition | Default Value | Description |
|---|---|---|
| ReadonlyFilesystem | False | Indicates whether the file system is read-only. |
| FDPressure | False | Indicates whether the number of file descriptors on the host has reached 80% of the maximum. |
| FrequentKubeletRestart | False | Indicates whether kubelet has restarted more than 5 times in 20 minutes. |
| CorruptDockerOverlay2 | False | Indicates whether the Docker image is faulty. |
| KubeletProblem | False | Indicates whether the kubelet service is running. |
| KernelDeadlock | False | Indicates whether a deadlock exists in the kernel. |
| FrequentDockerRestart | False | Indicates whether Docker has restarted more than 5 times in 20 minutes. |
| FrequentContainerdRestart | False | Indicates whether containerd has restarted more than 5 times in 20 minutes. |
| DockerdProblem | False | Indicates whether the Docker service is running (if the node runtime is containerd, the value is always False). |
| ContainerdProblem | False | Indicates whether the containerd service is running (if the node runtime is Docker, the value is always False). |
| ThreadPressure | False | Indicates whether the current number of system threads has reached 90% of the maximum. |
| NetworkUnavailable | False | Indicates whether the NTP service status is running. |
| SerfFailed | False | Detects the node network health status in distributed mode. |
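The numeric thresholds in the table can be read as simple predicates. The sketch below illustrates that reading only; it is not the add-on's implementation, and the function names, the use of "greater than or equal" for "reaches", and the sliding-window interpretation of "in 20 minutes" are assumptions.

```python
FD_THRESHOLD_PERCENT = 80       # FDPressure: 80% of the maximum
THREAD_THRESHOLD_PERCENT = 90   # ThreadPressure: 90% of the maximum
RESTART_LIMIT = 5               # Frequent*Restart: more than 5 restarts ...
RESTART_WINDOW_SECONDS = 20 * 60  # ... within 20 minutes


def under_pressure(current, maximum, threshold_percent):
    """Return True if `current` has reached `threshold_percent` of `maximum`.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return current * 100 >= threshold_percent * maximum


def frequent_restart(restart_times, now):
    """Return True if more than RESTART_LIMIT restarts fall in the window.

    `restart_times` and `now` are timestamps in seconds.
    """
    recent = [
        t for t in restart_times
        if 0 <= now - t <= RESTART_WINDOW_SECONDS
    ]
    return len(recent) > RESTART_LIMIT


if __name__ == "__main__":
    print(under_pressure(8200, 10000, FD_THRESHOLD_PERCENT))
    print(under_pressure(8200, 10000, THREAD_THRESHOLD_PERCENT))
    print(frequent_restart([0, 100, 200, 300, 400, 500], now=600))
```

On Linux, the current and maximum file handle counts can be read from `/proc/sys/fs/file-nr`; whether the add-on uses that source is not stated here.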
