Performance testing and capacity assessment are basic methods for ensuring overall system stability. However, running an effective stress test is not simple. We recommend the following approach. Before the test, clarify its purpose based on the business scenario. Then define the test plan, including the test objects, the load scenarios to simulate, the tools to use, and the key metrics to monitor. Finally, analyze whether the results meet expectations and purchase an instance with suitable specifications.
Clarifying Stress Test Targets
As shown above, for RocketMQ business scenarios, the focus is generally on stress testing message sending or message consumption.
When stress testing message sending, focus on the sending rate, sending duration, and success rate, as well as how the application behaves when throttling is triggered at peak traffic.
When stress testing message consumption, focus on the consumption rate, consumption duration, success rate, the retry policy after a failure, and the impact of message backlog and the resulting delay on the business.
Analyzing Stress Test Objects
In the scenario of stress testing message sending, the main test object is the RocketMQ instance. Focus on the sending duration and success rate of the RocketMQ instance. Taking the RocketMQ 5.x instance as an example, distributed rate limiting is enabled by default to prevent overwhelming the cluster with excessive traffic. Therefore, pay attention to the impact of throttling on business.
In the scenario of stress testing message consumption, the main test object is the downstream consumer application. Focus on the message consumption capability of the business downstream. Key indicators include consumption processing time, concurrent thread count, whether consumption times out, whether retries occur due to exceptions, and whether a message backlog is generated.
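The relationship between consumption processing time, concurrent thread count, and message backlog can be sketched as follows. This is a minimal illustration, not part of any RocketMQ API: the function names and the sample numbers (20 threads, 50 ms per message, 500 TPS of incoming traffic) are hypothetical, and the model assumes every thread is always busy and processing time is constant.

```python
def consume_capacity_tps(threads: int, avg_process_ms: float) -> float:
    """Approximate upper bound on consumption rate for a consumer.

    Assumes each thread handles one message at a time and is never idle.
    """
    return threads * 1000.0 / avg_process_ms


def backlog_after(send_tps: float, capacity_tps: float, seconds: float) -> float:
    """Estimated number of accumulated messages after `seconds` of steady load.

    The backlog grows only while messages arrive faster than they are consumed.
    """
    return max(0.0, (send_tps - capacity_tps) * seconds)


# Hypothetical consumer: 20 threads, 50 ms of business logic per message.
capacity = consume_capacity_tps(threads=20, avg_process_ms=50)
print(capacity)                                   # 400.0
# Upstream sends 500 TPS for 10 minutes.
print(backlog_after(500, capacity, seconds=600))  # 60000.0
```

If the estimated backlog is unacceptable for the business, the levers are the same ones listed above: shorten the processing time or raise the consumer concurrency.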
Simulating Stress Test Scenarios
For message sending, there are usually two methods. The first is to use the built-in stress test script in the Apache RocketMQ open-source code to generate test traffic. The second is to run the producer application with its business logic code and simulate business traffic for end-to-end stress testing.
We normally recommend starting with the built-in stress test script to quickly obtain benchmark figures for the RocketMQ instance. This builds confidence that the RocketMQ instance itself meets the required standard. Then, based on the business model, simulate business traffic with a reasonable concurrency setting and confirm that the final stress test results meet business needs.
For message consumption, there are also usually two methods. The first is to use the built-in stress test script in the Apache RocketMQ open-source code to subscribe to the test topic and acknowledge each message immediately upon receipt, which confirms that the RocketMQ instance provides sufficient consumption capacity. The second is end-to-end stress testing: the upstream sender sends messages that meet the requirements of the consumer's business code, so that the consumption logic is covered by the test and the stress test traffic can even be propagated further downstream.
Analyzing Performance Testing Metrics
Send metrics
Focus on the sending rate, sending duration, and success rate, and on whether throttling is triggered, how it affects the business, and what retry policy applies.
Consumption metrics
Focus on consumption rate, business duration, delay in message consumption, and success rate, as well as the impact of message backlog and delay on business.
Common Issue Analysis
How to Troubleshoot Low Sending Rate
Two core factors determine the sending rate: the sending duration and the degree of concurrency. If the average sending duration is 5 ms and the degree of concurrency is 1, the sending rate is 200 TPS. Therefore, if the stress test target is not met, first check the sending duration. For example, traffic that crosses the public network or passes through a proxy can have a relatively high sending duration. If the sending duration is as expected, check the degree of concurrency: whether the sender has enough parallel threads, whether the load on the sender's node is normal, and whether higher-level factors such as lock contention are limiting performance.
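The arithmetic above can be written out as a short sketch. The helper names are hypothetical, and the model assumes synchronous sends in which each sender thread waits for one send to complete before issuing the next.

```python
import math


def max_send_tps(avg_send_ms: float, concurrency: int) -> float:
    """Upper bound on the sending rate.

    Each sender completes 1000 / avg_send_ms sends per second.
    """
    return concurrency * 1000.0 / avg_send_ms


def required_concurrency(target_tps: float, avg_send_ms: float) -> int:
    """Smallest number of parallel senders needed to reach target_tps."""
    return math.ceil(target_tps * avg_send_ms / 1000.0)


# The example from the text: 5 ms per send, one sender.
print(max_send_tps(avg_send_ms=5, concurrency=1))              # 200.0
# Hypothetical target of 10,000 TPS at the same sending duration.
print(required_concurrency(target_tps=10_000, avg_send_ms=5))  # 50
```

If the measured rate falls well below this bound at the configured concurrency, the sending duration under load is probably higher than assumed, or the sender threads are not actually running in parallel.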
How to Efficiently Simulate Traffic Volume for Downstream Business Stress Testing
To achieve good stress test results, the test traffic needs to closely resemble actual business traffic. To simulate traffic, in addition to full-link stress testing, you can reset the consumer offset to replay historical messages, efficiently generating traffic for downstream businesses. This eliminates the need for upstream services to repeatedly send traffic.
How to Analyze the Causes of Triggered Throttling
Since RocketMQ 5.x instances have traffic throttling enabled by default, if throttling is triggered during stress testing, focus on the following reasons:
1. Micro bursts. For example, monitoring has minute-level granularity, but all of the traffic may be concentrated in the first second of the minute. The rate-limiting token window is updated every 10 seconds, so minute-level monitoring can show that the throttling threshold was not exceeded while the traffic was actually rate-limited at 10-second granularity.
2. The message body is too large. Throttling counts messages in 4KB units, so a 100KB message, for example, counts as 25 messages for rate limiting. The throttling value and the number of messages produced are therefore not one-to-one.
3. The throttling ratio is inappropriate. The quota ratio between sending and consumption is adjustable. It defaults to 5:5 and can be set as far as 2:8 or 8:2. If throttling is triggered, check whether the configured quota ratio is reasonable.
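The three causes above can be checked with some quick arithmetic, sketched below. This is a simplified model, not the actual rate limiter: the function names are hypothetical, rounding partial 4KB units up is an assumption, the per-window budget is assumed to be the TPS limit multiplied by the 10-second window, and the quota ratio is assumed to split a single total quota.

```python
import math

UNIT_BYTES = 4 * 1024  # 4KB conversion unit described in the text


def throttle_units(body_bytes: int) -> int:
    """Messages counted for rate limiting (rounding up is an assumption)."""
    return max(1, math.ceil(body_bytes / UNIT_BYTES))


def split_quota(total_tps: int, send_share: int, consume_share: int):
    """Split a total quota by a send:consume ratio between 2:8 and 8:2."""
    if send_share + consume_share != 10 or not 2 <= send_share <= 8:
        raise ValueError("ratio must sum to 10 and lie between 2:8 and 8:2")
    return total_tps * send_share // 10, total_tps * consume_share // 10


def throttled_windows(per_second_counts, limit_tps, window_s=10):
    """Indexes of windows whose traffic exceeds the assumed window budget."""
    budget = limit_tps * window_s
    hits = []
    for start in range(0, len(per_second_counts), window_s):
        if sum(per_second_counts[start:start + window_s]) > budget:
            hits.append(start // window_s)
    return hits


# A 100KB message counts as 25 messages, as in the text.
print(throttle_units(100 * 1024))            # 25

# Hypothetical total quota of 1000 TPS with a 2:8 ratio.
print(split_quota(1000, 2, 8))               # (200, 800)

# Micro burst: 6000 messages in the first second, then nothing for 59 seconds.
burst = [6000] + [0] * 59
print(sum(burst) / 60)                       # 100.0 average TPS over the minute
print(throttled_windows(burst, limit_tps=500))  # [0]
```

In the micro-burst case, the one-minute average of 100 TPS is far below the hypothetical 500 TPS limit, yet the first 10-second window still exceeds its budget, which matches the mismatch between minute-level monitoring and actual throttling described in item 1.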