TDMQ for RocketMQ

High Availability

Last updated: 2026-01-23 17:09:49
Tencent Cloud TDMQ for RocketMQ adopts a multi-primary architecture with cross-availability zone (AZ) deployment to ensure high service availability, while utilizing a three-replica cloud disk mechanism to guarantee high data availability. This architecture abandons the traditional primary-secondary mode, significantly reducing Ops complexity while delivering financial-grade reliability.

Cluster-Level High Availability

The core objective of cluster-level high availability is to ensure uninterrupted read and write operations for the messaging service, even if certain components fail. Tencent Cloud TDMQ for RocketMQ builds a highly available service cluster with no single point of failure and regional-level disaster recovery capability by combining NameServer clustering, stateless proxies, multi-primary brokers, and cross-AZ deployment.

NameServer Clustering and Cross-AZ Deployment

NameServer serves as the "brain" of RocketMQ, responsible for service discovery and routing management. It must be highly available by design. In the Tencent Cloud solution:
Clustered deployment: At least two NameServer nodes are deployed to form a stateless cluster.
Cross-AZ deployment: These NameServer nodes are distributed across multiple AZs within the same region. AZs are physically isolated data centers. A failure in one AZ does not impact others.
Thus, any failure of a single NameServer node or even an entire AZ does not impact the retrieval of routing information. Producers and consumers can still retrieve the list of broker addresses from other healthy NameServer nodes.
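The fallback behavior described above can be sketched as follows. This is a minimal, self-contained simulation (the endpoint names are illustrative, not real Tencent Cloud addresses), not actual RocketMQ SDK code:

```python
import random

# Hypothetical NameServer endpoints, one per AZ (illustrative names).
NAMESERVERS = ["ns-az1:9876", "ns-az2:9876", "ns-az3:9876"]

def fetch_route(nameserver, topic, down):
    """Simulated route query; raises if the NameServer node is down."""
    if nameserver in down:
        raise ConnectionError(f"{nameserver} unreachable")
    return {"topic": topic, "brokers": ["broker-a", "broker-b"]}

def get_route_with_failover(topic, down=frozenset()):
    """Try NameServers in a random order; return the first successful reply."""
    for ns in random.sample(NAMESERVERS, len(NAMESERVERS)):
        try:
            return fetch_route(ns, topic, down)
        except ConnectionError:
            continue  # skip the failed node and try the next one
    raise RuntimeError("no NameServer reachable")

# Even with the AZ1 NameServer down, routing lookups still succeed.
route = get_route_with_failover("OrderTopic", down={"ns-az1:9876"})
```

Real clients behave similarly: they hold the full NameServer address list and transparently retry against the next node when one is unreachable.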

Multi-primary Broker Architecture

Unlike the traditional primary-secondary architecture, the multi-primary architecture eliminates hierarchical roles. All broker nodes operate as primary nodes and can handle message write requests from producers at any time.

Read-Write Load Balancing

When sending messages, the producer retrieves the list of all available primary brokers from the NameServer and writes messages to different brokers using strategies such as round-robin scheduling, naturally achieving load balancing for write traffic.
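The round-robin strategy can be illustrated with a short sketch (broker names are hypothetical):

```python
import itertools

class RoundRobinSelector:
    """Minimal sketch of write load balancing across primary brokers."""
    def __init__(self, brokers):
        self._cycle = itertools.cycle(brokers)

    def next_broker(self):
        return next(self._cycle)

brokers = ["broker-az1", "broker-az2", "broker-az3"]
selector = RoundRobinSelector(brokers)
# Six consecutive sends hit each primary broker exactly twice,
# spreading write traffic evenly across the cluster.
targets = [selector.next_broker() for _ in range(6)]
```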

Seamless Failover

Single node failure: If a broker node loses heartbeat with the NameServer due to crashes or network issues, the NameServer immediately removes it from the route table.
Automatic client retry: Producers and consumers periodically update the broker list from the NameServer. When they detect an unavailable broker, they automatically skip the node and seamlessly switch requests (such as message sending and pulling) to other healthy broker nodes in the cluster.
The entire process is completely transparent to business applications and requires no manual intervention, achieving failover within seconds.
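The skip-and-retry behavior can be sketched as below, assuming a freshly fetched route and a health check (both hypothetical; the real SDK handles this internally):

```python
def send_with_failover(message, route, is_alive, max_attempts=3):
    """Skip brokers that fail and retry on the next healthy one."""
    last_error = None
    attempts = 0
    for broker in route:
        if attempts >= max_attempts:
            break
        attempts += 1
        if not is_alive(broker):
            last_error = ConnectionError(broker)
            continue  # transparently switch to the next broker
        return f"stored {message!r} on {broker}"
    raise RuntimeError("send failed") from last_error

route = ["broker-az1", "broker-az2", "broker-az3"]
alive = lambda b: b != "broker-az1"   # broker-az1 has just crashed
# The send transparently lands on broker-az2; the caller never sees the failure.
result = send_with_failover("hello", route, alive)
```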

Cross-AZ Deployment for Brokers

To mitigate data center-level disasters, we deploy multiple primary broker nodes across multiple AZs. A cross-AZ distribution policy is enforced based on the AZs you have selected.

Failure Scenario Simulation

Assume that a cluster is deployed across three AZs (AZ1, AZ2, and AZ3) in the Guangzhou region, with broker nodes distributed in each AZ.

Scenario 1: A Broker Server in AZ1 Fails

The NameServer detects a heartbeat timeout and removes this broker from the routing information.
New production and consumption requests are automatically routed to other brokers in AZ1 as well as brokers in AZ2 and AZ3, ensuring uninterrupted service.

Scenario 2: AZ1 Becomes Entirely Unavailable Due to a Power or Network Failure

All NameServer and broker nodes in AZ1 go offline.
Since healthy NameServer and broker nodes still remain in AZ2 and AZ3, clients will connect to these nodes.
The overall service capacity of the cluster will be temporarily degraded, but core message production and consumption features remain available, ensuring business continuity.

Cross-AZ Deployment for Proxies

For the 5.x product form, similar to broker deployment, we enforce cross-AZ distribution of proxies based on the AZs you have selected. The stateless nature of proxies means:
No business data storage: Persistent status information such as message data and consumer offsets remains stored on the brokers' cloud disks.
No long-lived session status: Proxies do not maintain critical, non-recoverable session information. Any proxy node can handle requests from any client.
At the deployment level, this is achieved through a multi-node cluster combined with a frontend load balancer (LB):
Deploy multiple proxy instances: We deploy at least three proxy instances, distributed across different AZs, similar to NameServers and brokers.
Configure a frontend LB: We configure a load balancer, such as Tencent Cloud's Cloud Load Balancer (CLB), in front of all proxy instances. This LB provides a single virtual IP (VIP) address externally.
Clients connect to the VIP: All producer and consumer clients are configured to connect solely to this unified LB VIP address, rather than connecting directly to NameServers or brokers.
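Because proxies hold no session state, the LB can forward any request to any healthy proxy. A toy VIP dispatcher (component names are illustrative) shows why a whole-AZ proxy outage is invisible to clients:

```python
class LoadBalancer:
    """Toy VIP: forwards each request to any healthy stateless proxy."""
    def __init__(self, proxies):
        self.proxies = list(proxies)
        self._i = 0

    def handle(self, request, healthy):
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self._i % len(self.proxies)]
            self._i += 1
            if proxy in healthy:
                # No session state lives on the proxy, so any node
                # can serve any client's request.
                return f"{proxy} handled {request}"
        raise RuntimeError("no healthy proxy behind the VIP")

lb = LoadBalancer(["proxy-az1", "proxy-az2", "proxy-az3"])
healthy = {"proxy-az2", "proxy-az3"}  # proxy-az1's AZ is down
reply = lb.handle("SendMessage", healthy)
```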

Data High Availability

Data high availability is the lifeline of a message queue, with the core objective of ensuring that successfully written message data is never lost due to any hardware failure. Traditional RocketMQ relies on primary-secondary synchronous replication (SYNC_MASTER) to guarantee zero data loss, but this approach introduces significant performance overhead and complex primary-secondary switch procedures.
Tencent Cloud leverages IaaS-layer capabilities to inherently resolve data persistence high availability through the three-replica mechanism of Cloud Block Storage (CBS).
The "brokers + three-replica cloud disks" architecture represents a typical compute-storage separation architecture. This model transforms RocketMQ brokers into stateless compute nodes, while delegating stateful data storage to a professional, highly available distributed block storage service, thereby achieving high data availability.

What Are Three-Replica Cloud Disks?

CBS is a network block storage device provided by Tencent Cloud, offering high availability, high reliability, and low latency. One of its core features is the three-replica data mechanism.
When a RocketMQ broker writes a piece of message data (such as CommitLog or ConsumeQueue) to its mounted cloud disk, the underlying storage system of the cloud disk automatically and synchronously writes three physical copies of this data to three different physical racks within the same AZ.
The write operation returns success to the upper-layer application (the RocketMQ broker) only after all three copies have been successfully written.
This process is completely transparent to the upper-layer application. The broker simply performs what appears to be a standard local disk write operation.
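The write-acknowledgment semantics can be sketched as follows (a simplified model of the storage layer, not the actual CBS protocol):

```python
def replicated_write(data, replicas):
    """Acknowledge a write only after all three replicas have persisted it,
    mirroring how the storage layer confirms writes to the broker."""
    assert len(replicas) == 3, "three copies are kept per block"
    for replica in replicas:
        replica.append(data)          # synchronous write to each rack
    if all(r[-1] == data for r in replicas):
        return "ack"                  # success is reported only now
    raise IOError("replica write incomplete")

# Three lists stand in for disks on three different physical racks.
rack_a, rack_b, rack_c = [], [], []
status = replicated_write(b"commitlog-entry", [rack_a, rack_b, rack_c])
```

From the broker's point of view this is a single local disk write; the fan-out to three racks happens entirely below it.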

How Do Three-Replica Cloud Disks Ensure Data High Availability?

This architecture shifts the guarantee of data reliability from the application layer (RocketMQ primary-secondary replication) down to a more reliable and efficient infrastructure layer (distributed storage).

Immune to Single-Disk/Single-Machine Failures

If the physical disk or server hosting any data replica fails, the system automatically recovers the data from the two other healthy replicas and creates a new replica in a new location, always maintaining the three-replica status. The entire process is transparent to services, with zero data loss.
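The re-replication step can be modeled in a few lines (again a simplified model, not the real storage service):

```python
def rebuild_replica(replicas, failed):
    """Recreate a lost copy from a surviving replica so the system
    always returns to the three-replica state."""
    survivors = [r for i, r in enumerate(replicas)
                 if i != failed and r is not None]
    if not survivors:
        raise RuntimeError("all replicas lost")  # cannot happen with <=2 failures
    replicas[failed] = list(survivors[0])  # copy onto a fresh location
    return replicas

replicas = [[b"m1", b"m2"], [b"m1", b"m2"], [b"m1", b"m2"]]
replicas[1] = None            # the disk hosting the second copy fails
rebuild_replica(replicas, 1)  # the three-replica invariant is restored
```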

Simplified Architecture, Eliminating Primary-Secondary Replication

Since high availability of data is already achieved at the storage layer, we no longer need to deploy secondary nodes or configure complex primary-secondary synchronous replication. This brings significant strengths:
No replication latency: This eliminates data synchronization latency between primary and secondary nodes.
Simplified Ops: This eliminates the need to handle complex Ops scenarios such as primary-secondary switch and data replenishment.
Rapid recovery: When a broker node (virtual machine) fails, we only need to start a new broker instance and remount the original cloud disk. Since all message data remains intact on the cloud disk, the broker instance can resume service immediately, significantly reducing the recovery time objective (RTO).
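The remount-based recovery path can be sketched as follows; the disk outlives any single broker VM, so a replacement instance picks up exactly where the failed one left off (class and instance names are illustrative):

```python
class CloudDisk:
    """A network-attached disk that outlives any single broker VM."""
    def __init__(self):
        self.commitlog = []

class Broker:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk          # "remounting" = attaching the same disk

    def write(self, msg):
        self.disk.commitlog.append(msg)

disk = CloudDisk()
old = Broker("broker-vm-1", disk)
old.write(b"order-created")
del old                           # the broker VM fails

# Recovery: start a fresh instance and remount the original cloud disk.
new = Broker("broker-vm-2", disk)
messages = new.disk.commitlog     # all previously written data is intact
```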

Containerized Deployment

Building upon the high-availability architecture described above, if we containerize the entire TDMQ for RocketMQ cluster (including NameServer and broker) and deploy it on a container orchestration platform represented by Tencent Kubernetes Engine (TKE), we can achieve standardized delivery, rapid scaling, and automatic recovery in abnormal scenarios.

