Background
Message middleware often plays a critical role in distributed systems. However, in actual production environments, various factors can lead to high CPU load on Broker nodes. Here are some common scenarios:
High message throughput: If a topic or partition in a CKafka cluster receives very high message throughput, the Broker nodes need to handle a large number of read and write operations.
Large number of consumer groups: If a large number of consumer groups subscribe to the same topic or partition, the Broker nodes need to handle message distribution and management for each consumer group.
Replication and synchronization: If the data replication and synchronization feature is enabled in the CKafka cluster, the Broker nodes need to handle replicated read and write operations and synchronize with other Broker nodes.
Compression and decompression: If messages are stored in compressed format, the Broker nodes need to compress and decompress them, which may consume a significant amount of CPU resources.
Index and log compression: CKafka uses indexes to accelerate message lookup. If the index volume is too large or needs to be compressed, the Broker nodes need to maintain and compress the indexes.
High concurrent connections: If a large number of producers and consumers connect to the Broker nodes, the Broker nodes need to establish and maintain these connections, which increases CPU load.
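Of these scenarios, the CPU cost of compression and decompression is easy to illustrate even outside CKafka. The sketch below is a plain-Python illustration using the standard library's gzip module (it is not CKafka code, and the message payload is made up); it times compressing and then decompressing one batch of messages:

```python
import gzip
import time

# A batch of repetitive JSON-like messages, roughly the shape a producer
# might send (the payload here is a made-up illustration).
batch = b"".join(
    b'{"order_id": %d, "status": "PAID", "amount": 19.99}\n' % i
    for i in range(10_000)
)

start = time.perf_counter()
compressed = gzip.compress(batch)       # CPU-bound compression work
restored = gzip.decompress(compressed)  # CPU-bound decompression work
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{len(batch)} B -> {len(compressed)} B in {elapsed_ms:.1f} ms")
```

Multiplied across the thousands of batches per second a busy Broker node handles, this per-batch cost is what surfaces as sustained CPU load.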
When Broker nodes are under high CPU load, the following issues may occur:
Increased latency: High CPU load may slow down message processing, thereby increasing message transmission and processing latency. This slows the rate at which consumers read messages from CKafka and may prevent them from obtaining the latest messages in a timely manner.
Decreased throughput: Since CPU resources are consumed by high-load tasks, CKafka Broker nodes may be unable to process additional messages, resulting in a decrease in overall throughput. This reduces the rate at which producers send messages and consumers consume them.
Network congestion: High CPU load may prevent CKafka Broker nodes from processing network requests promptly, leading to network congestion. This affects data replication and synchronization with other Broker nodes, potentially causing increased replication latency or untimely data synchronization.
Increased response time: Due to high CPU load, CKafka Broker nodes may fail to respond promptly to client requests, resulting in increased wait time for clients. This affects the performance and response time of applications accessing the CKafka cluster.
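The latency impact above is easiest to reason about with numbers. If test messages carry their send timestamp, a consumer can record (send time, receive time) pairs and compare a baseline window against the fault window. A minimal summary helper (pure Python; the pair layout and the function name `latency_summary` are assumptions for illustration):

```python
import statistics

def latency_summary(sent_received):
    """Summarize end-to-end latencies (in seconds) from (sent_at,
    received_at) pairs collected by a consumer of timestamped messages."""
    latencies = [recv - sent for sent, recv in sent_received]
    return {
        "avg_ms": 1000 * statistics.mean(latencies),
        # statistics.quantiles with n=100 yields 99 cut points;
        # the last one is the 99th percentile.
        "p99_ms": 1000 * statistics.quantiles(latencies, n=100)[-1],
    }
```

Comparing `avg_ms` and `p99_ms` before and during fault injection shows how much of an observed slowdown is attributable to the injected load.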
To prevent these issues, TSA-Chaotic Fault Generator (TSA-CFG) provides a high CPU load experiment action for CKafka Broker nodes. It tests the response and recovery capabilities of business systems when they face unexpected situations, such as latency caused by high load on CKafka Broker nodes, thereby enhancing business security and stability.
Must-Knows
Instance type: This action supports fault injection only for CKafka Professional Edition instances. CKafka Standard Edition instances do not currently support experiments.
Instance status: It is recommended (but optional) that instances undergoing experiments have active message production and consumption traffic and more than 3 topic partitions, so that you can better observe the impact of faults on the business.
Experiment Preparation
Prepare a CKafka Professional Edition instance available for experiments.
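To satisfy the traffic recommendation in the Must-Knows, you can run a simple load generator against the instance before starting the experiment. The sketch below uses the open-source kafka-python client (CKafka speaks the open-source Kafka protocol); the access point and topic name are placeholders you must replace with your own:

```python
import json
import time

# Placeholders -- substitute your instance's access point and a topic
# with more than 3 partitions.
BOOTSTRAP = "your-ckafka-access-point:9092"
TOPIC = "chaos-test-topic"

def build_message(seq: int) -> bytes:
    """Serialize one test message; the embedded send timestamp lets a
    consumer measure end-to-end latency during the fault window."""
    return json.dumps({"seq": seq, "sent_at": time.time()}).encode()

def run_producer(rate_per_sec: int = 100) -> None:
    """Send timestamped messages at a steady rate until interrupted."""
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
    seq = 0
    while True:
        producer.send(TOPIC, build_message(seq))
        seq += 1
        time.sleep(1 / rate_per_sec)

# To generate load before and during the experiment:
# run_producer(rate_per_sec=200)
```

Keeping this producer (and a matching consumer) running through the whole experiment makes the latency and throughput effects visible in monitoring.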
Step 1: Creating an Experiment
1. Log in to the TSA-CFG console.
2. Click Create Experiment, enter the basic information about the experiment, and click Next.
3. Choose Middleware > CKafka from the Experiment Instance drop-down list, click Add via Search, and add instance resources. Alternatively, click Add via Architecture Diagram, click the CKafka resources on the architecture diagram, select the required instance, and add it.
4. After the instance is added, click Add Action, select Broker High CPU Load as the experiment action, and click Next.
5. Set action parameters. For example, select a CPU load rate of 80% and a duration of 200s, and then click OK.
6. After completing the parameter configuration, set Execution Mode and Guardrail Policy, and add metrics under Observability Metrics in the Global Configuration section. Then click Submit to complete the experiment creation.
Step 2: Executing the Experiment
1. Observe the instance monitoring data before the experiment. You can go to the TDMQ for CKafka console to view the monitoring metrics in Advanced Monitoring.
2. Go to the experiment details panel, and click Execute in the fault action group or Start Experiment in the lower part of the panel to inject a fault.
3. During fault injection, you can click the link in the log to go to Advanced Monitoring for observation.
4. Observe that the CPU utilization has reached the set value.
5. After the fault is injected, click Recovery Action to recover from the injected fault.
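In addition to the Advanced Monitoring metrics used in the steps above, consumer lag is a useful client-side signal to watch during fault injection: if the loaded broker serves fetch requests more slowly, the gap between log-end offsets and committed offsets grows. A sketch using the open-source kafka-python client (the connection details are placeholders, and `total_lag`/`snapshot_lag` are hypothetical helper names):

```python
def total_lag(end_offsets: dict, committed: dict) -> int:
    """Sum per-partition lag: log-end offset minus committed offset.
    Rising lag during the fault window suggests consumers are falling
    behind the loaded broker."""
    return sum(end_offsets[tp] - committed.get(tp, 0) for tp in end_offsets)

def snapshot_lag(bootstrap: str, topic: str, group_id: str) -> int:
    """Fetch current offsets for one topic and return the total lag."""
    from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python
    consumer = KafkaConsumer(bootstrap_servers=bootstrap, group_id=group_id,
                             enable_auto_commit=False)
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic)]
    end = consumer.end_offsets(partitions)
    committed = {tp: (consumer.committed(tp) or 0) for tp in partitions}
    consumer.close()
    return total_lag(end, committed)
```

Sampling `snapshot_lag` periodically before, during, and after the fault window, and checking that lag returns to its baseline after recovery, is a simple way to confirm the business has fully recovered.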