Storing data in partition catalogs can greatly reduce the scanned data volume of a computing task in Data Lake Compute and thereby significantly enhance the computing performance. The general practice of data partitioning is to store data in different catalogs by time. For example, data generated on the same day can be stored in the same catalog, and catalogs can be organized in a "year-month-day" structure. In Data Lake Compute, a table and its partitions must adopt the same data format.
To create a partition table, you need to specify the partition field in the table creation statement.
Specifying a partition during data table creation is only to configure the partition field and doesn't allow running a query statement immediately to get data. You need to add partitioned data to a data table. If new partitioned data is added to the data catalog, you also need to add the partition information to the data table.
ALTER TABLE ADD PARTITION statement to add a specified partition catalog to a data table. If the partition catalog is compatible with the Hive partitioning rule (partition column name=partition column value), you don't need to specify the data path; otherwise, you need to.
Sample 1: Adding a single partition catalog
ALTER TABLE tabel_demo ADD PARTITION (dt = '2021-01-01');
Sample 2: Adding multi-level nested partition catalogs
ALTER TABLE tabel_demo ADD PARTITION (year = '2021', month='01', day='01');
Sample 3: Displaying the specified partition path
ALTER TABLE tabel_demo ADD PARTITION (year = '2021', month='01', day='01') LOCATION 'cosn://tablea_demo' ;
MSCK REPAIR TABLE statement to scan the data catalog specified during table creation. If there is a new partition catalog, the system will automatically add the partitions to the metadata of the data table. Below is a sample:
MSCK REPAIR TABLE table_demo
MSCK REPAIR TABLEonly adds partitions to the metadata of the data table but does not delete them. To delete an added partition, run the
ALTER TABLE table-name DROP PARTITIONstatement.
MSCK REPAIR TABLEis not recommended if the data volume is large, as the system will scan all the data, which may take a long time, cause the task to time out, and make the partition information of the data table incomplete.
ALTER TABLE ADD PARTITIONto load a partition.
cosn://tablea_adata in table A and the
s3://table_a/table_bdata in table B are stored in COS and both tables are partitioned by string, then
MSCK REPAIR TABLEwill add partitions of table B to table A. To avoid this, use separate folder structures, such as