
Node drain

Last updated: 2026-03-31 23:02:35

Background and Significance

In a Kubernetes environment, nodes are the infrastructure on which Pods run. When a node needs maintenance or an upgrade, or encounters a fault, a node drain operation is typically used to evict its Pods so that they are rescheduled onto other nodes, preserving service continuity and high availability. TSA-Chaotic Fault Generator (TSA-CFG) provides a node drain fault action that helps users verify the following:
1. Whether the cluster scheduler automatically reschedules the evicted Pods.
2. Whether the service maintains business continuity while the node is unavailable.
3. Whether the system's disaster recovery capabilities and resilience hold up under extreme conditions.
By simulating a node drain, users can identify potential scheduling issues and optimize their disaster recovery policies.

Experiment Steps

Step 1: Preparing an Experiment

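1. Prepare a standard cluster: Ensure that a Kubernetes standard cluster has been purchased and deployed. The cluster needs at least one other schedulable node with spare capacity; otherwise, evicted Pods cannot be rescheduled and remain in the Pending state.
2. Deploy test services: Deploy at least one test service on the target node so that the impact of the node drain can be observed.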

Step 2: Configuring Experiment Resources

Create container nodes: Create a new node, add it to the cluster, and deploy test services (such as Nginx or a simple Pod service) on it.
Use existing nodes: If the cluster already has running native nodes, you can run the experiment directly on those nodes.
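
As an illustration of the test service mentioned above, the following sketch builds a minimal Nginx Deployment manifest and prints it as JSON, which kubectl accepts in place of YAML. The Deployment name, replica count, and image tag are assumptions chosen for this example and are not prescribed by TSA-CFG.

```python
import json


def build_test_deployment(name="drain-test-nginx", replicas=2, image="nginx:stable"):
    """Return a Deployment manifest (as a dict) for a simple Nginx test service.

    The name, replica count, and image are illustrative defaults only.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the Pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "nginx",
                            "image": image,
                            "ports": [{"containerPort": 80}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Save the output to a file and apply it with: kubectl apply -f <file>.json
    print(json.dumps(build_test_deployment(), indent=2))
```

Running more than one replica makes it easier to see whether the service stays available while the Pods on the drained node are being rescheduled.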

Step 3: Creating an Experiment

1. Log in to the Tencent Cloud Smart Advisor (TSA) console, choose Architecture Governance, select Governance Mode, and click CFG. (For details about how to create an experiment, see Using TSA to Execute a Chaos Experiment on CFG.)
2. Click Create Experiment, enter the basic information about the experiment, and click Next.
3. Choose Container > Standard Cluster Ordinary Node or Container > Standard Cluster Native Node from the Experiment Instance drop-down list, click Add via Search, and add an instance resource. Alternatively, click Add via Architecture Diagram, click a Tencent Kubernetes Engine (TKE) resource on the architecture diagram, select the required instance, and add it.
4. In the experiment action, click Add Immediately to add a fault action. Select the Node Drain fault action, and click Next.
5. Set action parameters, and click OK.
Pod Eviction Timeout (s): the timeout, in seconds, for Pod eviction. If the Pods are not evicted within this period, the action fails.
Delete Pods with Local Storage: equivalent to the kubectl drain flag --delete-local-data (renamed --delete-emptydir-data in newer kubectl versions). If this parameter is set to Yes, Pods that use emptyDir volumes are also evicted, and the data in those volumes is deleted.
6. After configuring the parameters, set Execution Mode and Guardrail Policy, and add the metrics to observe under Observability Metrics in the Global Configuration section. Then click Submit to create the experiment.
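
To make the two action parameters concrete, the sketch below maps them onto a manual kubectl drain command line. This is only an approximation for comparison: the source does not state how TSA-CFG performs the drain internally, and the function name, the parameter names, and the use of --ignore-daemonsets are assumptions of this example.

```python
import shlex


def build_drain_command(node, eviction_timeout_s, delete_local_storage):
    """Build a kubectl drain argument list that mirrors the two action parameters.

    Hypothetical helper: the mapping to kubectl flags is an assumption,
    not the documented behavior of TSA-CFG.
    """
    if eviction_timeout_s < 0:
        raise ValueError("eviction_timeout_s must be >= 0")
    cmd = [
        "kubectl", "drain", node,
        # Skip DaemonSet-managed Pods; without this flag, drain refuses to
        # proceed when such Pods exist on the node.
        "--ignore-daemonsets",
        # Give up if eviction takes longer than this duration.
        f"--timeout={eviction_timeout_s}s",
    ]
    if delete_local_storage:
        # Newer kubectl releases name this flag --delete-emptydir-data;
        # --delete-local-data is the older, deprecated spelling.
        cmd.append("--delete-emptydir-data")
    return cmd


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(build_drain_command("node-1", 120, True)))
```

The command is returned as an argument list rather than a single string so that it can be passed to a process launcher without shell quoting issues.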

Step 4: Executing the Experiment

1. Log in to the TKE console and select Cluster in the left sidebar.
2. Click the cluster name to go to the cluster details page.
3. In the Node Management module, view the node status before fault execution.
Node health check: Before the experiment begins, ensure that the target node is in a normal running status.
Workload check: Check whether the Pods on the node are running normally.
4. Log in to the TSA console, select CFG, go to the experiment details panel, and click Execute in the fault action group or Start Experiment in the lower part of the panel.
5. Click the action card to view the action execution details.
6. View the execution log and confirm that the action was executed successfully. Then verify that the node has become unschedulable and that the Pods on the node have been rescheduled to other available nodes.
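
The node checks before and after the fault can also be done outside the console. The following sketch assumes the node object has been fetched as JSON (for example with kubectl get node <name> -o json) and parsed into a dict; the function name and the sample data are made up for illustration.

```python
def summarize_node(node):
    """Return (ready, schedulable) for a Node object parsed from JSON."""
    conditions = node.get("status", {}).get("conditions", [])
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )
    # spec.unschedulable is set to true when a node is cordoned and is
    # normally absent otherwise.
    schedulable = not node.get("spec", {}).get("unschedulable", False)
    return ready, schedulable


# Hand-written sample objects standing in for real kubectl output.
ready_condition = {"type": "Ready", "status": "True"}
before = {"spec": {}, "status": {"conditions": [ready_condition]}}
after = {"spec": {"unschedulable": True}, "status": {"conditions": [ready_condition]}}

print(summarize_node(before))  # (True, True)
print(summarize_node(after))   # (True, False)
```

A drained node usually stays Ready but is no longer schedulable, which is why the two values are reported separately.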

Step 5: Verifying the Experiment Effect

1. Node status: In the TKE console, check whether the node status has changed to Cordoned on the Node Management page.
2. Pod scheduling status: Check whether all Pods on the node have been successfully migrated to other nodes and remain running normally.
3. Service availability: Verify whether the service can continue to function properly when the node is unavailable.
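
The Pod scheduling check can be sketched as follows, assuming the Pods of the test service have been listed as JSON (for example with kubectl get pods -o json and a label selector) and parsed into a dict. Ignoring DaemonSet-managed Pods is an assumption borrowed from how kubectl drain is commonly run; the source does not say how TSA-CFG treats them. All names and sample data are illustrative.

```python
def pods_left_on_node(pod_list, node_name):
    """Names of Pods still bound to node_name, ignoring DaemonSet-managed Pods."""
    left = []
    for pod in pod_list.get("items", []):
        if pod.get("spec", {}).get("nodeName") != node_name:
            continue
        owners = pod.get("metadata", {}).get("ownerReferences", [])
        if any(owner.get("kind") == "DaemonSet" for owner in owners):
            continue
        left.append(pod["metadata"]["name"])
    return left


def pods_not_running(pod_list):
    """Names of Pods whose phase is anything other than Running."""
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") != "Running"
    ]


# Hand-written sample standing in for real kubectl output after draining node-1.
pods = {
    "items": [
        {"metadata": {"name": "web-1"}, "spec": {"nodeName": "node-2"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "web-2"}, "spec": {},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "log-agent", "ownerReferences": [{"kind": "DaemonSet"}]},
         "spec": {"nodeName": "node-1"}, "status": {"phase": "Running"}},
    ]
}

print(pods_left_on_node(pods, "node-1"))  # []
print(pods_not_running(pods))             # ['web-2']
```

In this sample, web-2 is still Pending, which would suggest that the remaining nodes lack capacity or that a scheduling constraint prevents placement.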

Step 6: Performing the Recovery Operation

1. Go to the experiment details panel.
2. Execute the fault recovery action, and confirm that the recovery action is successfully executed.
3. Check whether the node status is healthy after recovery.
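
The post-recovery check can be automated by polling until the node is schedulable again. The sketch below assumes that recovery makes the node schedulable (the source only says to check that the node is healthy) and that the caller supplies a function returning the current node object as a dict; both helper names are hypothetical.

```python
import time


def node_is_schedulable(node):
    """True if the Node object (parsed from JSON) is not cordoned."""
    return not node.get("spec", {}).get("unschedulable", False)


def wait_until(check, timeout_s=60.0, interval_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. The sleep and clock
    arguments are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Simulated sequence of node states: cordoned twice, then schedulable again.
states = iter([
    {"spec": {"unschedulable": True}},
    {"spec": {"unschedulable": True}},
    {"spec": {}},
])
recovered = wait_until(
    lambda: node_is_schedulable(next(states)),
    timeout_s=10,
    interval_s=0,
    sleep=lambda seconds: None,  # no real waiting in this demonstration
)
print(recovered)  # True
```

In real use, the check callable would fetch the live node object on each call instead of reading from a fixed sequence.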
