
CKafka Broker Node Failure

Last updated: 2026-03-31 23:00:19

Background

A Kafka cluster consists of multiple servers, each called a Broker node. Although Kafka is designed for high availability and fault tolerance, Broker node failures cannot always be avoided in production. Common causes include:
Hardware faults: Kafka relies heavily on disk performance, so disk problems can take a Broker node down. An unstable power supply or a power outage can also cause a Broker node to fail suddenly.
Network issues: Connection problems or network hardware faults can prevent a Broker node from communicating with the other nodes in the cluster, resulting in a Broker node failure.
Memory leaks: A long-running process can exhaust the available memory because of a memory leak, resulting in a Broker node failure.
File descriptor exhaustion: Kafka relies heavily on file operations. If file descriptors are exhausted, the service can become unavailable.
When any of these unexpected errors occurs, a Broker node may fail, causing message backlog and service unavailability. TSA-Chaotic Fault Generator (TSA-CFG) provides a fault action that simulates this situation by proactively shutting down Broker nodes. With this action, users can verify whether the business recovers and runs normally in extreme cases, including data loss, understand the impact of a Broker node failure, and adopt appropriate response policies to keep the Kafka cluster stable.

Experiment Preparation

Prepare a CKafka Professional Edition instance that can be used for experiments. The instance must have a topic that carries some traffic.
Download the Kafka client from Apache Kafka, decompress it, and use the shell scripts in the bin directory as the producer and consumer for testing.
Note:
The client version must match the instance version. For a version 2.8.1 instance, you can download the client with the following command:
wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz
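
The download and decompression can be scripted as a minimal sketch like the one below. The variables KAFKA_VERSION, SCALA_VERSION, and RUN and the helper functions are illustrative assumptions, not part of TSA or CKafka. The URL pattern is copied from the wget command above, and a Scala 2.12 build may not exist for every Kafka version, so check the archive listing before changing the defaults.

```shell
#!/usr/bin/env bash
# Sketch: fetch and unpack the Kafka client that matches the instance version.
# KAFKA_VERSION / SCALA_VERSION defaults mirror the 2.8.1 example (assumption:
# adjust them to your own instance version).
KAFKA_VERSION="${KAFKA_VERSION:-2.8.1}"
SCALA_VERSION="${SCALA_VERSION:-2.12}"

# Archive file name, following the pattern kafka_<scala>-<kafka>.tgz.
kafka_archive() {
  echo "kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz"
}

# Download URL on the Apache archive server.
kafka_url() {
  echo "https://archive.apache.org/dist/kafka/${KAFKA_VERSION}/$(kafka_archive)"
}

# Download and extract only when RUN=1, so the script can be inspected first.
if [ "${RUN:-0}" = "1" ]; then
  wget "$(kafka_url)"
  tar -xzf "$(kafka_archive)"
  # The perf-test scripts used later in this guide live in the bin directory.
  ls "kafka_${SCALA_VERSION}-${KAFKA_VERSION}/bin/"kafka-*-perf-test.sh
fi
```

Run the script with RUN=1 to perform the download. Without it, the script only defines the helper functions.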

Step 1: Creating an Experiment

1. Log in to the Tencent Cloud Smart Advisor (TSA) console, choose Architecture Governance, select Governance Mode, and click CFG. (For details about how to create an experiment, see Using TSA to Execute a Chaos Experiment on CFG.)
2. Click Create Experiment, enter the basic information about the experiment, and click Next.
3. Choose Middleware > Ckafka from the Experiment Instance drop-down list, click Add via Search, and add instance resources. Alternatively, click Add via Architecture Diagram, click the Ckafka resources on the architecture diagram, select the required instance, and add it.
4. After the instance is added, click Add Action, select Broker Node Fault as the experiment action, and click Next.
5. After completing the parameter configuration, set Execution Mode and Guardrail Policy, and add metrics under Observability Metrics in the Global Configuration section. Then click Submit to create the experiment.

Step 2: Executing the Experiment

1. Start the kafka-producer-perf-test.sh script in the bin directory to produce messages, and start the kafka-consumer-perf-test.sh script to consume messages.

2. Go to the experiment details panel, and click Execute in the fault action group or Start Experiment in the lower part of the panel to inject a fault.
3. Observe the fault impact. The producer disconnects because no Leader node is available.

The Broker node survival rate decreases, and a Broker node in another AZ is elected as the new Leader node.
The number of unsynchronized replicas increases sharply.
4. After the HA switchover is complete, the instance resumes normal production and consumption.

5. Execute the recovery action to restart the Broker node that was shut down. The Broker node resumes operation, and the unsynchronized replicas are synchronized again.
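
The client-side parts of the steps above (generating load in step 1 and checking replica synchronization in steps 3 and 5) can be sketched with the stock Kafka scripts. This is only an illustration under stated assumptions: the access point, topic name, record counts, and the RUN switch are hypothetical placeholders, and an instance that requires authentication needs additional client configuration that is not shown here.

```shell
#!/usr/bin/env bash
# Sketch of the client-side commands for Step 2.
# All values below are placeholders (assumptions), not settings from this guide.
BOOTSTRAP="${BOOTSTRAP:-broker.example.com:9092}"   # hypothetical access point
TOPIC="${TOPIC:-cfg-test-topic}"                    # hypothetical topic name
KAFKA_BIN="${KAFKA_BIN:-./kafka_2.12-2.8.1/bin}"    # bin directory of the client

# Print the command; execute it only when RUN=1.
run() {
  echo "$*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Step 1: produce 1 KB records, throttled to about 1000 records per second.
start_producer() {
  run "$KAFKA_BIN/kafka-producer-perf-test.sh" \
    --topic "$TOPIC" --num-records 1000000 --record-size 1024 \
    --throughput 1000 --producer-props "bootstrap.servers=$BOOTSTRAP"
}

# Step 1: consume from the same topic.
start_consumer() {
  run "$KAFKA_BIN/kafka-consumer-perf-test.sh" \
    --bootstrap-server "$BOOTSTRAP" --topic "$TOPIC" --messages 1000000
}

# Steps 3 and 5: list partitions whose replicas are not fully in sync.
check_under_replicated() {
  run "$KAFKA_BIN/kafka-topics.sh" --bootstrap-server "$BOOTSTRAP" \
    --describe --under-replicated-partitions
}
```

With RUN=1, call start_producer and start_consumer in separate terminals before injecting the fault, because both run in the foreground. Call check_under_replicated during the fault and again after the recovery action. In this sketch, an empty listing after recovery would suggest that the replicas have caught up.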
