Overview
Partitioned tables (Partition Table) are the core means for TDSQL Boundless to handle massive data and achieve horizontal scaling. Unlike single tables, partitioned tables distribute data across multiple storage nodes in a cluster through partition keys (Partition Key). This enables the database to leverage the overall computing power and storage space of the cluster, breaking through the performance bottlenecks of a single machine.
When Should Partitioned Tables Be Used?
If your business meets any one of the following characteristics, be sure to choose partitioned tables:
1. Large data volumes or high-concurrency writes: requiring extremely high write throughput (TPS) beyond the capacity of a single machine.
2. Unlimited growth: The business is in a rapid development phase, unable to estimate the future data upper limit.
Select a Partitioned Strategy
TDSQL Boundless supports four main partitioning policies. Selecting the correct policy is the most critical step in table creation.
1. Partition using HASH
Introduction: Performs modulo operation on the partition key value to randomly and evenly scatter the data.
Applicable scenarios: The most recommended default choice. Suitable for scenarios where the partition key is an integer type (INT/BIGINT), such as user IDs or purely numeric order numbers.
Advantages: Excellent load balancing; high write performance.
2. Partition using KEY
Introduction: Similar to HASH partitioning but supports multiple data types except BLOB/TEXT. The system processes data using an internal hash function.
Applicable scenarios: Scenarios where the partition key is a string type (VARCHAR/CHAR), such as UUIDs, ID card numbers, or business serial numbers (String).
Advantages: Solves the issue where strings cannot be directly processed by HASH modulo operations, eliminating the need for conversion at the business layer (application layer).
3. RANGE partitioning
Introduction: Data is divided based on continuous ranges of key values (such as time periods and ID ranges).
Applicable scenarios: Time-series data (logs, transactions, monitoring).
Advantages: Fast range queries, and expired historical data can be instantly deleted via DROP PARTITION.
4. LIST partitioning
Introduction: Divides data based on enumerated values (discrete list).
Applicable scenarios: Data with clear categorical attributes, such as regions (provinces/cities) or tenant IDs.
Advantages: Clear data categorization facilitates isolated management by category.
Risk: Data skew. If the data volume of a specific category (such as "Guangdong Province") far exceeds that of other categories, the corresponding node will become a bottleneck.
Syntax and Example
Scenario A: Standard E-Commerce Orders (Using HASH)
The partition key is a purely numeric ID.
CREATE TABLE `mall`.`orders` (
`order_id` bigint NOT NULL,
`user_id` bigint NOT NULL,
`amount` decimal(10,2),
`create_time` datetime DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`order_id`)
)
PARTITION BY HASH(`order_id`)
PARTITIONS 64;
Scenario B: User Center (Using KEY)
The partition key is a UUID string, and KEY partitioning automatically handles hashing.
CREATE TABLE `user_center`.`user_profiles` (
`user_uuid` varchar(36) NOT NULL,
`nick_name` varchar(50),
`register_time` datetime,
PRIMARY KEY (`user_uuid`)
)
PARTITION BY KEY(`user_uuid`)
PARTITIONS 64;
Scenario C: Operation Logs (Using RANGE)
Storing logs monthly facilitates regular cleanup of old data.
CREATE TABLE `sys`.`audit_logs` (
`log_id` bigint NOT NULL,
`content` text,
`log_time` datetime NOT NULL,
PRIMARY KEY (`log_id`, `log_time`)
)
PARTITION BY RANGE COLUMNS(`log_time`) (
PARTITION p202301 VALUES LESS THAN ('2023-02-01'),
PARTITION p202302 VALUES LESS THAN ('2023-03-01'),
PARTITION p_future VALUES LESS THAN (MAXVALUE)
);
Scenario D: By Geographic Distribution (Using LIST)
Businesses are explicitly partitioned by city, with data isolation between different cities.
CREATE TABLE `crm`.`customers` (
`cust_id` bigint NOT NULL,
`city_id` int NOT NULL,
`name` varchar(50),
PRIMARY KEY (`cust_id`, `city_id`)
)
PARTITION BY LIST(`city_id`) (
PARTITION p_bj VALUES IN (10),
PARTITION p_sh VALUES IN (20),
PARTITION p_others VALUES IN (30)
);
Critical Constraints
Partitioned tables have a mandatory hard constraint that must be adhered to: violating this constraint will result in table creation failure.
The primary key and unique index must include the partition key.
Error example:
CREATE TABLE `error_table` (
`id` bigint NOT NULL PRIMARY KEY,
`user_id` bigint NOT NULL
) PARTITION BY HASH(`user_id`);
Correct example:
CREATE TABLE `correct_table` (
`id` bigint NOT NULL,
`user_id` bigint NOT NULL,
PRIMARY KEY (`id`, `user_id`)
) PARTITION BY HASH(`user_id`);
Use Cases
1. the number of partitions
Policy: Pre-sharding should not only consider the current number of physical nodes, but also account for future cluster scale and expected total data volume.
Recommended value: It is recommended to set it to the maximum expected future number of nodes * 2, which ensures sufficient data dispersion.
2. Avoid modifying the partition key.
Try to avoid using UPDATE statements to modify the value of the partition key. This may cause data rows to migrate between physical nodes (Delete + Insert) at a high cost.
3. Query Optimization
Include the partition key: When querying, always include the partition key in the WHERE clause (for example, WHERE user_id = ?). This allows TDSQL Boundless to directly route the query to the specific node, delivering optimal performance.
Avoid full table scans: Queries without the partition key will trigger broadcast scans (Scatter-Gather), querying all nodes and aggregating the results, which consumes significant resources.
Frequently Asked Questions (FAQ)
What Is the Difference Between HASH and KEY Partitioning?
A: HASH primarily handles integers, with a simple and straightforward algorithm; KEY primarily handles strings (also supports integers), internally using a complex hash function. For string IDs, directly use KEY partitioning.
Q: Can I Use Multiple Columns as the Partition Key?
A: Yes, for example PARTITION BY HASH(col1 + col2) or KEY(col1, col2). However, this increases complexity, and it is generally recommended to use a single high-cardinality column as the partition key.
Q: Can the partition key or partitioning type be modified after table creation?
A: Strongly discouraged. Although syntactically supported, this essentially constitutes a "reshuffle" of the entire dataset, consuming massive system resources (IO/CPU) and potentially causing prolonged business disruption.