NodeProblemDetectorPlus Add-on

Last updated: 2024-02-01 10:15:37

Overview
Add-on description
Node-Problem-Detector-Plus is an add-on that monitors the health of Kubernetes cluster nodes. It runs in the TKE environment as a DaemonSet, detects exceptions on nodes in real time, and reports the detection results to the kube-apiserver.
Kubernetes objects deployed in a cluster
Kubernetes Object Name	Type	Resource Amount	Namespace
node-problem-detector	DaemonSet	0.5C 80M	kube-system
node-problem-detector	ServiceAccount	-	kube-system
node-problem-detector	ClusterRole	-	-
node-problem-detector	ClusterRoleBinding	-	-
Use Cases
Node-Problem-Detector-Plus can be used to monitor the running status of nodes, including kernel deadlocks, OOM, system thread pressure, system file descriptor pressure, and other metrics. It reports such information to the API Server as Node Conditions and Events.
You can estimate the resource pressure on a node from the corresponding metrics, and then manually release or scale out node resources before the node starts evicting pods. This helps prevent losses caused by Kubernetes reclaiming resources or by the node becoming unavailable.
Limits
To use NPD, you must install this add-on in your cluster. The system resources used by NPD containers are restricted to 0.5 CPU core and 80 MB of memory.
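As a rough illustration, a limit like this is usually expressed in the DaemonSet's pod template as shown below. This is a sketch, not the manifest that TKE deploys: the container name is hypothetical, and `80Mi` assumes that "80 MB" is expressed in mebibytes.

```yaml
# Hypothetical fragment of the node-problem-detector DaemonSet.
# Only the 0.5 core / 80 MB figures come from this document.
spec:
  template:
    spec:
      containers:
      - name: node-problem-detector   # assumed container name
        resources:
          limits:
            cpu: 500m       # 0.5 CPU core
            memory: 80Mi    # assumes "80 MB" means 80Mi
```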
Component Permission Description
Permission Description
The permissions granted to this component are the minimum required for its current features to operate.
Permission Scenarios
Feature	Involved Object	Involved Operation Permission
Report fault information and update the node's condition when a node malfunctions.	nodes/status	patch
Send event notifications to the cluster.	events	create/patch/update
Permission Definition
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
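For context, rules like these normally live in a ClusterRole that is bound to the add-on's ServiceAccount. The sketch below assembles one possible binding from the object names and namespace listed under "Kubernetes objects deployed in a cluster"; it is an assumption about how the objects relate, and the manifest that TKE actually installs may differ.

```yaml
# Illustrative only: binds the node-problem-detector ClusterRole
# to the node-problem-detector ServiceAccount in kube-system.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-problem-detector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
```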
Usage
1. Log in to the TKE console and select Cluster in the left sidebar.
2. On the Cluster Management page, click the ID of the target cluster to go to the cluster details page.
3. In the left sidebar, click Add-on Management to go to the Add-on List page.
4. On the Add-on List page, click Create to go to the Create Add-on page, and select NodeProblemDetectorPlus.
5. Click Complete. After the installation is successful, the corresponding node-problem-detector resources are available in your cluster, and the corresponding conditions will be added to Node Conditions.
Appendix
Node Conditions
After the NPD add-on is installed, the following conditions are added to nodes:
Condition	Default Value	Description
ReadonlyFilesystem	False	Indicates whether the file system is read-only.
FDPressure	False	Indicates whether the number of file descriptors on the host has reached 80% of the maximum value.
FrequentKubeletRestart	False	Indicates whether Kubelet has restarted more than 5 times in 20 minutes.
CorruptDockerOverlay2	False	Indicates whether the Docker image is faulty.
KubeletProblem	False	Indicates whether the Kubelet service is Running.
KernelDeadlock	False	Indicates whether a deadlock exists in the kernel.
FrequentDockerRestart	False	Indicates whether Docker has restarted more than 5 times in 20 minutes.
FrequentContainerdRestart	False	Indicates whether Containerd has restarted more than 5 times in 20 minutes.
DockerdProblem	False	Indicates whether the Docker service is Running (if the node runtime is Containerd, the value is always False).
ContainerdProblem	False	Indicates whether the Containerd service is Running (if the node runtime is Docker, the value is always False).
ThreadPressure	False	Indicates whether the current number of system threads has reached 90% of the maximum value.
NetworkUnavailable	False	Indicates whether the NTP service status is Running.
SerfFailed	False	Detects the node network health status in distributed mode.
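These conditions appear in the `status.conditions` list of each Node object, alongside the built-in Kubernetes conditions. The excerpt below sketches what two healthy entries might look like. The field names are the standard Kubernetes NodeCondition fields; the `reason`, `message`, and timestamp values are placeholders, not output captured from the add-on.

```yaml
# Illustrative excerpt of a Node object's status; values are placeholders.
status:
  conditions:
  - type: FDPressure
    status: "False"                      # default value when no problem is detected
    reason: ExampleReason                # placeholder
    message: example message             # placeholder
    lastHeartbeatTime: "2024-01-01T00:00:00Z"
    lastTransitionTime: "2024-01-01T00:00:00Z"
  - type: ThreadPressure
    status: "False"
    reason: ExampleReason                # placeholder
    message: example message             # placeholder
    lastHeartbeatTime: "2024-01-01T00:00:00Z"
    lastTransitionTime: "2024-01-01T00:00:00Z"
```

A condition whose `status` changes to `"True"` indicates that the corresponding check in the table has been triggered on that node.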
