Data Import Practice Tutorial

Last updated: 2026-03-06 18:50:07
To achieve optimal performance in bulk-load import mode and maximize the utilization of system resources, we have summarized below the tuning experience accumulated in practice for your reference.

Sequential Import

Sequential import refers to importing data in ascending order of the primary key, which allows the database to process the data more efficiently. Compared with unordered import, sequential import improves performance by 20% to 300%.
Ordering here applies at two levels:
Internally ordered bulk-load transactions: within a bulk-load transaction, data must be arranged in ascending primary-key order.
Orderable bulk-load transactions: taking the minimum and maximum primary keys of the data written by a transaction as its range, transactions whose ranges do not overlap are considered orderable. In other words, when the data written by two transactions can be sorted in ascending primary-key order without any interleaving, the two transactions are orderable.
When both of the above ordering conditions are met simultaneously, the data import is considered fully ordered, achieving optimal performance.
If the first condition is not met, meaning the individual records within a transaction are not sorted by primary key, the parameter tdsql_bulk_load_allow_unsorted must be enabled before import; otherwise, the operation will fail with an error. With this parameter enabled, data within each transaction is first staged in temporary files, and during the commit phase an external merge sort is performed before the data is written sequentially to external SST files. If the data within a transaction is already ordered, this step is skipped and the data is written directly to external SST files. Consequently, unordered transactions consume more CPU and I/O resources than ordered ones.
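The stage-and-merge path described above can be sketched as a simplified in-memory model (the engine stages sorted runs in temporary files rather than Python lists; the helper names here are illustrative, not the engine's actual API):

```python
import heapq
import itertools

def sorted_runs(rows, run_size):
    """Stage unsorted rows into sorted runs, standing in for
    the temporary files written before the commit phase."""
    it = iter(rows)
    while chunk := list(itertools.islice(it, run_size)):
        yield sorted(chunk)

def external_merge(rows, run_size=4):
    """K-way merge of the sorted runs, yielding rows in ascending
    primary-key order, as the commit phase does before writing
    the external SST files."""
    return list(heapq.merge(*sorted_runs(rows, run_size)))

print(external_merge([5, 1, 9, 3, 7, 2, 8, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```

`heapq.merge` only ever holds one row per run in memory, which is why this scheme keeps memory bounded while still producing fully sorted output.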
If the second condition is not met, the external SST files generated between transactions will have overlapping ranges. When they are inserted into the LSM-tree, this overlap prevents finding suitable insertion positions in lower levels, causing more SST files to be ultimately inserted into the top level (level-0). This triggers compaction operations to lower levels, consuming additional I/O resources during bulk-load data import and adversely affecting overall performance.
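The two ordering conditions can be checked before import with a small sketch (hypothetical helper names; primary keys modeled as integers):

```python
def is_internally_sorted(txn):
    """Condition 1: rows within one bulk-load transaction
    ascend by primary key."""
    return all(a < b for a, b in zip(txn, txn[1:]))

def are_orderable(txns):
    """Condition 2: the [min, max] primary-key ranges of the
    transactions do not interleave."""
    ranges = sorted((min(t), max(t)) for t in txns)
    return all(prev_max < next_min
               for (_, prev_max), (next_min, _) in zip(ranges, ranges[1:]))

txns = [[1, 2, 3], [10, 11, 12], [5, 6, 7]]
print(all(is_internally_sorted(t) for t in txns))  # True
print(are_orderable(txns))  # True: ranges (1,3), (5,7), (10,12) don't overlap
```

When both checks pass, the import is fully ordered and should hit the optimal-performance path described above.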

Concurrent Import

When disk performance is good (for example, NVMe disks), the bottleneck for bulk-load performance is more likely to be the CPU rather than I/O. You can try appropriately increasing the concurrency of import threads (50 to 100) to improve import efficiency.
The compression algorithm used for the external SST files and temporary data files generated during bulk-load import can be adjusted via the tdstore_bulk_load_sorted_data_sst_compression, tdstore_bulk_load_unsorted_data_sst_compression, and tdstore_bulk_load_temp_file_compression parameters (super administrator privileges required), balancing CPU and I/O consumption against disk space utilization. Supported values include zstd, snappy, lz4, auto, and nocompression, for example:
SET GLOBAL tdstore_bulk_load_sorted_data_sst_compression = zstd;

Control of Transaction Size

Unlike regular transactions, which temporarily store uncommitted data in an in-memory write batch, bulk-load transactions write uncommitted data directly to external SST files. This prevents large transactions from causing excessive memory consumption and OOM (Out of Memory) issues.
On the other hand, bulk-load transactions should not be too small: overly small transactions generate small SST files, and frequent bulk-load commits also lead to suboptimal import performance.
Typically, a bulk-load transaction with a data volume of 200MB or more delivers optimal performance. Note that compression is enabled by default for external SST files. Taking ZSTD as an example, compression ratios can reach 10:1 or higher compared to raw data. Therefore, it is recommended to have a bulk-load transaction write raw data volumes exceeding 2GB.
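The sizing guidance above amounts to accumulating rows until a raw-data threshold is reached before committing. A minimal sketch (batch_rows and row_size_fn are illustrative names, and the 2 GB production target is replaced by a tiny threshold for the demo):

```python
def batch_rows(rows, row_size_fn, target_bytes):
    """Group rows into bulk-load transactions whose raw data
    volume meets the target (>= 2 GB raw data in production;
    a tiny threshold is used here for demonstration)."""
    batch, size = [], 0
    for row in rows:
        batch.append(row)
        size += row_size_fn(row)
        if size >= target_bytes:
            yield batch
            batch, size = [], 0
    if batch:  # commit the final, possibly smaller, transaction
        yield batch

rows = [f"row-{i}" for i in range(10)]       # each row is 5 bytes here
batches = list(batch_rows(rows, len, target_bytes=20))
# -> batches of 4, 4, and 2 rows
```

The same generator shape works for a real loader: replace `row_size_fn` with the encoded row size and issue one bulk-load commit per yielded batch.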

Automatic Transaction Batching and Committing

For large-scale data migration imports with higher performance requirements and no read requests during the migration phase, consider enabling automatic batching and committing of bulk-load transactions:
SET GLOBAL tdsql_bulk_load_allow_auto_organize_txn = ON;
This allows the database to automatically accumulate bulk-load transaction data to an optimal scale before initiating commits, maintaining high import performance. Note that when this option is enabled, data from a commit request that the client sees as successful may still be in the batching phase, not yet having reached the auto-commit threshold, and therefore may not be immediately visible. It is recommended to perform row count verification after the import completes.
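The visibility caveat can be illustrated with a toy model of the batching behavior (this is not the engine's actual code, just a sketch of the acknowledged-but-not-yet-visible window):

```python
class AutoBatcher:
    """Toy model of tdsql_bulk_load_allow_auto_organize_txn = ON:
    client commits are acknowledged immediately, but rows only
    become visible once the accumulated batch reaches the
    auto-commit threshold and is flushed."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.pending = []   # acknowledged to the client, not yet visible
        self.visible = []   # flushed, visible to reads

    def commit(self, rows):
        self.pending.extend(rows)
        if len(self.pending) >= self.flush_threshold:
            self.visible.extend(self.pending)
            self.pending = []

b = AutoBatcher(flush_threshold=5)
b.commit([1, 2, 3])       # acknowledged, but below the threshold
assert b.visible == []    # not yet visible to reads
b.commit([4, 5, 6])       # threshold reached, batch is flushed
assert b.visible == [1, 2, 3, 4, 5, 6]
```

This is why row count verification after the import, rather than read-after-commit checks during it, is the recommended way to confirm completeness.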

Secondary Indexes

For tables with secondary indexes, bulk-load performance degrades as the number of secondary indexes increases. This occurs because secondary index data is inherently unordered, requiring additional sorting operations that introduce computational overhead during the import process. To optimize scenarios involving secondary indexes, it is recommended to appropriately increase the size of bulk-load transactions.
In versions 21.x and later, a more recommended approach is to create the table without secondary indexes when importing data via bulk load, and, after the import completes, create the secondary indexes with fast online DDL optimization enabled (manual activation required); this delivers better overall performance.

Partitioned Tables

From the compute layer's perspective, each partition is assigned a unique t_index_id, and its processing logic is essentially the same as for a regular table. What requires attention is the scenario where a bulk-load transaction spans multiple partitions.
For range partitioning with sequential data import, the data within a bulk-load transaction typically concentrates in a single range partition or a small number of adjacent range partitions. In this case, a bulk-load transaction generally does not involve many partitions, the situation is similar to that of a regular table, and no additional handling is required.
For hash partitioning with sequential data import, the data within a bulk-load transaction is evenly distributed across all hash partitions, so a single bulk-load transaction involves nearly every partition. The size of bulk-load transactions must then be increased further to ensure that the data volume allocated to each hash partition within a transaction is not too small. Assuming there are N hash partitions, it is recommended to scale the bulk-load transaction size by a factor of N compared to the regular-table scenario. Currently, with 16 hash partitions, performance is roughly 15% lower than with non-partitioned (regular) tables.
If sub-partitions exist, also consider how many sub-partitions the data within a bulk-load transaction will involve, again ensuring that the data volume allocated to each sub-partition is not too small.
In summary, performance gradually degrades as the number of partitions involved in a bulk-load transaction increases. From the storage layer's perspective, this is because a single bulk-load transaction engages too many RGs as transaction participants; as with regular transactions, a greater number of participants at the RG level reduces efficiency.
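The scaling rule above (per-transaction raw data target multiplied by the number of hash partitions) can be sketched as simple arithmetic (function name is illustrative):

```python
def recommended_txn_bytes(base_bytes, n_hash_partitions):
    """Scale the regular-table raw-data target by the number of
    hash partitions, so that each partition still receives a
    reasonable share of each bulk-load transaction."""
    return base_bytes * n_hash_partitions

GB = 1 << 30
# 2 GB baseline for a regular table, 16 hash partitions:
print(recommended_txn_bytes(2 * GB, 16) // GB)  # 32 (GB per transaction)
```

With 16 hash partitions and a 2 GB regular-table baseline, each transaction should therefore carry roughly 32 GB of raw data so that every partition still receives about 2 GB.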
