Node Cordon

Last updated: 2026-03-31 23:02:35

Background and Significance

In a Kubernetes cluster, nodes serve as the critical infrastructure for running Pods. When it is necessary to temporarily suspend workload scheduling on a node (for example, for node maintenance, upgrade, or troubleshooting), the node is typically cordoned. By setting the node to an unschedulable status, you can effectively protect the running environment on the node, preventing new Pods from being assigned to that node.
TSA-Chaotic Fault Generator (TSA-CFG) provides a node cordon fault action, which helps users verify the following:
1. Whether the cluster scheduler responds correctly to the node cordon event.
2. Whether the service maintains business continuity while the node is cordoned.
3. Whether cluster scheduling policies are reliable and flexible enough to absorb the loss of a schedulable node.
By simulating the node cordon operation, users can verify the cluster's behavior when a specific node becomes unavailable and optimize disaster recovery and scheduling policies.
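The cordon operation that the fault action simulates corresponds to the standard kubectl workflow sketched below. The node name is a placeholder, and the commands assume access to a live cluster, so this is an illustrative sketch rather than part of the TSA procedure:

```shell
# Placeholder node name; substitute a real node from `kubectl get nodes`.
NODE="10.0.0.12"

# Cordon: sets spec.unschedulable=true, so the scheduler places no new Pods here.
kubectl cordon "$NODE"

# The node STATUS column now includes "SchedulingDisabled";
# Pods already running on the node are not affected.
kubectl get node "$NODE"

# Uncordon: clears the flag and makes the node schedulable again.
kubectl uncordon "$NODE"
```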

Experiment Steps

Step 1: Preparing an Experiment

Create container nodes: Purchase a new native node in the Tencent Kubernetes Engine (TKE) cluster, deploy the services to be tested on the node, and ensure that running Pods exist on this node.
Use existing nodes: If the cluster already has running native nodes, existing nodes can be directly used to conduct experiments.

Step 2: Creating an Experiment

1. Log in to the Tencent Cloud Smart Advisor (TSA) console, choose Architecture Governance, select Governance Mode, and click CFG. (For details about how to create an experiment, see Using TSA to Execute a Chaos Experiment on CFG.)
2. Click Create Experiment, enter the basic information about the experiment, and click Next.
3. Choose Container > Standard Cluster Ordinary Node or Container > Standard Cluster Native Node from the Experiment Instance drop-down list, click Add via Search, and add an instance resource. Alternatively, click Add via Architecture Diagram, click a TKE resource on the architecture diagram, select the required instance, and add it.
4. In the experiment action, click Add Immediately to add a fault action. Select the Node Cordon fault action, and click Next.
5. Node cordon does not require additional parameter configuration. Click OK.
6. After completing the parameter configuration, set Execution Mode and Guardrail Policy, and add metrics for Observability Metrics in the Global Configuration section. After the configuration is complete, click Submit to complete the experiment creation.

Step 3: Executing the Experiment

1. Log in to the TKE console and select Cluster in the left sidebar.
2. Click the cluster name to go to the cluster details page.
3. In the Node Management module, view the node status before fault execution.
Node health check: Before the experiment begins, ensure that the target node is in a normal running status.
Workload check: Check whether the Pods on the node are running normally.
4. Log in to the TSA console, select CFG, go to the experiment details panel, and click Execute in the fault action group or Start Experiment in the lower part of the panel.
5. Click the action card to view the action execution details.
6. View the execution log to confirm that the action succeeded, check that the node status has changed to Cordoned, and verify that the cluster scheduler has stopped assigning new Pods to this node.
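From the command line, one way to confirm the cordon is to filter `kubectl get nodes` output for the SchedulingDisabled status. The sketch below runs the filter against sample output so it is self-contained; in a real check, pipe `kubectl get nodes --no-headers` into the same awk filter (node names are illustrative only):

```shell
# Sample `kubectl get nodes --no-headers` output; in a live cluster,
# replace the printf with the real command. Node names are illustrative.
printf '%s\n' \
  'node-a   Ready                      <none>   10d   v1.28.3' \
  'node-b   Ready,SchedulingDisabled   <none>   10d   v1.28.3' \
| awk '$2 ~ /SchedulingDisabled/ {print $1}'
# node-b
```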

Step 4: Verifying the Experiment Effect

1. Node status: In the TKE console, the node status should have changed to Cordoned on the Node Management page.
2. Pod scheduling status: Check whether no new Pods are scheduled to this node and ensure that the existing Pods remain running normally on the node.
3. Service availability: Verify whether the service can continue to function properly when the node is cordoned.
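The checks above can be sketched with kubectl as follows. The node name and the `demo` Deployment (with label `app=demo`) are assumptions for illustration; adapt them to the workload deployed in Step 1:

```shell
# Placeholder node name; substitute the cordoned node.
NODE="10.0.0.12"

# Existing Pods on the cordoned node should still be Running:
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName="$NODE"

# Trigger fresh scheduling (assumes a test Deployment named "demo" exists)
# and confirm the new replicas land on other nodes:
kubectl scale deployment demo --replicas=3
kubectl get pods -o wide -l app=demo
```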

Step 5: Performing the Recovery Operation

1. Go to the experiment details panel.
2. Execute the fault recovery action to uncordon the target node.
3. Check whether the target node has returned to normal after recovery, and ensure that new Pods can be scheduled to this node normally.

Step 6: Verifying the Recovery Effect

1. Node status: In the TKE console, ensure that the node has returned to the Healthy status on the Node Management page and can accept new Pods for scheduling.
2. Service health: Check whether all services within the cluster are running normally.
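The recovery checks can also be run from the command line, as sketched below against a placeholder node name (cluster access assumed):

```shell
# Placeholder node name; substitute the recovered node.
NODE="10.0.0.12"

# STATUS should read "Ready" with no ",SchedulingDisabled" suffix:
kubectl get node "$NODE"

# The node.kubernetes.io/unschedulable taint added by the cordon
# should no longer be listed:
kubectl describe node "$NODE" | grep Taints
```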

Must-Knows

1. Service impact assessment: Node cordoning does not affect Pods already running on the node but prevents new Pods from being scheduled to it.
2. Compatibility with scheduling policies: Ensure that scheduling constraints (such as node affinity) and disruption policies (such as PodDisruptionBudget) still allow Pods to be placed on other nodes normally while the target node is cordoned.
3. Uncordoning: After the experiment is completed, be sure to uncordon the node to restore it to a schedulable status.
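As an illustration of the PodDisruptionBudget point above, the manifest below (all names are placeholders) keeps at least two replicas of a hypothetical `demo` workload available. Note that cordoning alone does not evict Pods; a PDB matters when a cordon is followed by a drain during node maintenance:

```shell
# Placeholder names; adapt the selector and threshold to your workload.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-pdb
spec:
  minAvailable: 2          # voluntary disruptions may not drop below 2 Pods
  selector:
    matchLabels:
      app: demo
EOF
```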
