Background and Significance
In a Kubernetes cluster, nodes are the infrastructure on which Pods run. When you need to temporarily suspend workload scheduling on a node (for example, for node maintenance, an upgrade, or troubleshooting), you typically cordon the node. Marking the node unschedulable protects its running environment by preventing new Pods from being assigned to it.
TSA-Chaotic Fault Generator (TSA-CFG) provides the CFG feature for node cordon, helping users verify the following capabilities:
1. Whether the cluster scheduler responds correctly to a node cordon event.
2. Whether services maintain business continuity while a node is cordoned.
3. Whether cluster scheduling policies are reliable and flexible enough to handle the loss of a node's schedulable capacity.
By simulating the node cordon operation, users can verify the cluster's behavior when a specific node becomes unavailable and optimize disaster recovery and scheduling policies.
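Under the hood, cordoning is simply setting `spec.unschedulable` to `true` on the Node object (this is what `kubectl cordon` does). A minimal sketch using the official Kubernetes Python client, assuming the `kubernetes` package is installed and a valid kubeconfig is available; the node name is a placeholder:

```python
def unschedulable_patch(cordoned: bool) -> dict:
    """Build the strategic-merge patch that cordons (True) or uncordons (False) a node."""
    return {"spec": {"unschedulable": cordoned}}

def set_node_cordon(node_name: str, cordoned: bool) -> None:
    """Apply the cordon/uncordon patch to the named node."""
    # Assumes a reachable cluster and a local kubeconfig.
    from kubernetes import client, config
    config.load_kube_config()
    v1 = client.CoreV1Api()
    v1.patch_node(node_name, unschedulable_patch(cordoned))

if __name__ == "__main__":
    set_node_cordon("demo-node", True)  # "demo-node" is a placeholder node name
```

TSA-CFG performs the equivalent operation for you when the fault action executes; the sketch is only meant to show the API-level effect being simulated.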
Experiment Steps
Step 1: Preparing an Experiment
Create container nodes: Purchase a new native node in the Tencent Kubernetes Engine (TKE) cluster, deploy the services to be tested on the node, and ensure that running Pods exist on this node.
Use existing nodes: If the cluster already has running native nodes, existing nodes can be directly used to conduct experiments.
Step 2: Creating an Experiment
1. Log in to the TSA console and select CFG in the left sidebar.
2. Click Create Experiment, enter the basic information about the experiment, and click Next.
3. Choose Container > Standard Cluster Ordinary Node or Container > Standard Cluster Native Node from the Experiment Instance drop-down list, click Add via Search, and add an instance resource. Alternatively, click Add via Architecture Diagram, click a TKE resource on the architecture diagram, select the required instance, and add it.
4. In the experiment action, click Add Immediately to add a fault action. Select the Node Cordon fault action, and click Next.
5. Node cordon does not require additional parameter configuration. Click OK.
6. In the Global Configuration section, set Execution Mode and Guardrail Policy and add metrics under Observability Metrics, then click Submit to create the experiment.
Step 3: Executing the Experiment
1. Log in to the TKE console and select Cluster in the left sidebar.
2. Click the cluster name to go to the cluster details page.
3. In the Node Management module, view the node status before fault execution.
Node health check: Before the experiment begins, ensure that the target node is in a normal running status.
Workload check: Check whether the Pods on the node are running normally.
4. Log in to the TSA console, select CFG, go to the experiment details panel, and click Execute in the fault action group or Start Experiment in the lower part of the panel.
5. Click the action card to view the action execution details.
6. View the execution log to confirm that the action succeeded, that the node status has changed to Cordoned, and that the cluster scheduler has stopped assigning new Pods to this node.
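The node-status and workload checks above can also be scripted. A sketch, again assuming the Kubernetes Python client and a kubeconfig; the node name is a placeholder:

```python
def node_is_cordoned(node: dict) -> bool:
    """A node is cordoned when spec.unschedulable is True; the field is None/absent otherwise."""
    return bool(node.get("spec", {}).get("unschedulable"))

def check_cordon(node_name: str) -> None:
    """Print whether the node is cordoned and how many Pods are still on it."""
    from kubernetes import client, config
    config.load_kube_config()
    v1 = client.CoreV1Api()
    node = v1.read_node(node_name).to_dict()
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}").items
    print(f"cordoned={node_is_cordoned(node)}, pods_on_node={len(pods)}")

if __name__ == "__main__":
    check_cordon("demo-node")  # placeholder node name
```

A cordoned node should report `cordoned=True` while its existing Pods remain counted, since cordoning blocks scheduling but does not evict running workloads.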
Step 4: Verifying the Experiment Effect
1. Node status: In the TKE console, the node status should have changed to Cordoned on the Node Management page.
2. Pod scheduling status: Check that no new Pods are scheduled to this node and that the existing Pods on it remain running normally.
3. Service availability: Verify whether the service can continue to function properly when the node is cordoned.
Step 5: Performing the Recovery Operation
1. Go to the experiment details panel.
2. Execute the fault recovery action to uncordon the target node.
3. Check whether the target node has returned to normal after recovery, and ensure that new Pods can be scheduled to this node normally.
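One way to confirm that the node accepts new Pods again is to create a short-lived probe Pod that can only land on it. The sketch below pins the Pod with a `nodeSelector` on the `kubernetes.io/hostname` label (which usually, but not always, equals the node name) rather than `spec.nodeName`, because `nodeName` bypasses the scheduler and would place the Pod even on a cordoned node. All names are placeholders:

```python
def probe_pod_manifest(node_name: str) -> dict:
    """Build a minimal Pod manifest that the scheduler may only place on the given node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"cordon-probe-{node_name}"},
        "spec": {
            # Constrains scheduling to the target node without bypassing the scheduler.
            "nodeSelector": {"kubernetes.io/hostname": node_name},
            "restartPolicy": "Never",
            "containers": [{
                "name": "probe",
                "image": "busybox",
                "command": ["sh", "-c", "echo scheduled && sleep 5"],
            }],
        },
    }

def create_probe_pod(node_name: str, namespace: str = "default") -> None:
    from kubernetes import client, config
    config.load_kube_config()
    client.CoreV1Api().create_namespaced_pod(namespace, probe_pod_manifest(node_name))

if __name__ == "__main__":
    create_probe_pod("demo-node")  # placeholder node name
```

If the node is still cordoned, the probe Pod stays Pending; after uncordoning, it should be scheduled and run to completion. Remember to delete the probe Pod afterwards.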
Step 6: Verifying the Recovery Effect
1. Node status: In the TKE console, ensure that the node has returned to the Healthy status on the Node Management page and can accept newly scheduled Pods.
2. Service health: Check whether all services in the cluster are running normally.
Must-Knows
1. Service impact assessment: Node cordoning does not affect Pods already running on the node but prevents new Pods from being scheduled to it.
2. Compatibility with scheduling policies: Cordoning alone does not evict running Pods, but if Pods are evicted or rescheduled while the node is cordoned (for example, during a subsequent drain), ensure that policies such as PodDisruptionBudget allow the workload to keep enough replicas available elsewhere.
3. Uncordoning: After the experiment is completed, be sure to uncordon the node to restore it to a schedulable status.
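For the PodDisruptionBudget mentioned in the must-knows, a minimal manifest might look like the following (the name and label selector are placeholders); it keeps at least one replica available if matching Pods are evicted, for example during a node drain that follows a cordon:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-pdb          # placeholder name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: demo           # placeholder label; must match your workload's Pod labels
```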