
CREATE TABLE

Last updated: 2024-08-07 17:12:15

Description

Supported engines: Presto and SparkSQL
Applicable tables: native Iceberg tables and external tables
Purpose: Creates a table with the specified attributes. CREATE TABLE AS is also supported (see the sketch after this list).
Storage path of the created table: the path must be a COS directory, not a file.
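
As a quick illustration, here is a minimal CREATE TABLE AS sketch; the catalog, database, and table names are placeholders, not names from this page.

-- Placeholder names: replace the catalog, database, and table with your own.
-- The new table's columns and types are derived from the query result.
CREATE TABLE DataLakeCatalog.demo_db.orders_2024 AS
SELECT * FROM DataLakeCatalog.demo_db.orders
WHERE date_month >= DATE '2024-01-01';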

Statement for External Table

Statement

CREATE TABLE [ IF NOT EXISTS ] table_identifier
( col_name[:] col_type [ COMMENT col_comment ], ... )
USING data_source
[ COMMENT table_comment ]
[ OPTIONS ( 'key1'='value1', 'key2'='value2' )]
[ PARTITIONED BY ( col_name1, transform(col_name2), ... ) ]
[ LOCATION path ]
[ TBLPROPERTIES ( property_name=property_value, ... ) ]

Parameter

USING data_source: Data source used during table creation. Currently, data_source can be CSV, ORC, PARQUET, or ICEBERG.
table_identifier: Table name in the three-part format. Example: catalog.database.table.
COMMENT: Table description.
OPTIONS: Additional parameters for USING data_source, used to pass dynamic values for storage.
PARTITIONED BY: Creates partitions based on the specified columns.
LOCATION path: Storage path of the data table.
TBLPROPERTIES: A set of key-value pairs used to specify table parameters.

Detailed Description of USING and OPTIONS

USING CSV
Reference: Working with CSV.
Configurations supported by CSV data tables are as follows:

| Key Supported by OPTIONS | Default Value | Description |
| --- | --- | --- |
| sep or delimiter | , | Delimiter between columns for storage in CSV files; comma by default. |
| mode | PERMISSIVE | Processing mode when data conversion does not meet expectations. PERMISSIVE (default): lenient mode; if a row has extra columns after conversion, only the required columns are kept. DROPMALFORMED: discard rows that do not meet expectations, such as a row with an extra column. FAILFAST: strictly enforce the CSV format; conversion fails on unexpected data such as extra columns in a row. |
| encoding or charset | UTF-8 | String encoding scheme. Examples: UTF-8, US-ASCII, ISO-8859-1, UTF-16BE, UTF-16LE, and UTF-16. |
| quote | " | Quotation mark character (single or double); remember to use an escape character. |
| escape | \ | Escape character. |
| charToEscapeQuoteEscaping | - | Character used to escape the quote-escape character inside quoted values. |
| comment | \u0000 | Comment character. |
| header | false | Whether a header row exists. |
| inferSchema | false | Whether to infer the data type of each column. If inference is not performed, every column is read as string. |
| ignoreLeadingWhiteSpace | Read: false; Write: true | Ignore leading whitespace. |
| ignoreTrailingWhiteSpace | Read: false; Write: true | Ignore trailing whitespace. |
| columnNameOfCorruptRecord | _corrupt_record | Name of the column that stores records that cannot be converted. If the value is inconsistent with that of spark.sql.columnNameOfCorruptRecord, the value specified in the CREATE TABLE statement prevails. |
| nullValue | - | Storage format for null values; an empty string by default, in which case the format defined by emptyValue is used. |
| nanValue | NaN | Storage format for non-numeric values. |
| positiveInf | Inf | Storage format for positive infinity values. |
| negativeInf | -Inf | Storage format for negative infinity values. |
| compression or codec | - | Name of the compression algorithm; no compression by default. Abbreviations such as bzip2, deflate, gzip, lz4, and snappy can be used. |
| timeZone | System default time zone | Time zone. If the value is inconsistent with that of spark.sql.session.timeZone (example: Asia/Shanghai), the value specified in the CREATE TABLE statement prevails. |
| locale | en-US | Locale. |
| dateFormat | yyyy-MM-dd | Default date format. |
| timestampFormat | yyyy-MM-dd'T'HH:mm:ss.SSSXXX | Default timestamp format; yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] in non-LEGACY mode. |
| multiLine | false | Allow a value to span multiple lines. |
| maxColumns | 20480 | Maximum number of columns. |
| maxCharsPerColumn | -1 | Maximum number of characters per column; -1 means unlimited. |
| escapeQuotes | true | Whether to escape quotation marks. |
| quoteAll | false | Enclose all values in quotation marks. |
| samplingRatio | 1.0 | Sampling ratio used for schema inference. |
| enforceSchema | true | Forcibly apply the specified schema when reading, ignoring file headers. |
| emptyValue | Read: (empty string); Write: "" | Read and write formats for empty values. |
| lineSep | - | Line separator. |
| inputBufferSize | - | Buffer size for reads. If the value is inconsistent with that of spark.sql.csv.parser.inputBufferSize, the value specified in the CREATE TABLE statement prevails. |
| unescapedQuoteHandling | STOP_AT_DELIMITER | Policy for handling unescaped quotation marks. STOP_AT_DELIMITER: stop reading when a delimiter is found. BACK_TO_DELIMITER: go back to the delimiter. STOP_AT_CLOSING_QUOTE: stop reading when a closing quotation mark is found. SKIP_VALUE: skip the column value. RAISE_ERROR: report an error. |
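
Putting a few of these options together, here is a hedged sketch of an external CSV table; the cosn:// bucket path and all names are placeholders.

CREATE TABLE IF NOT EXISTS DataLakeCatalog.demo_db.csv_demo (
id bigint,
name string
)
USING csv
-- Override the defaults listed above: semicolon delimiter, header row, drop malformed rows.
OPTIONS ('sep'=';', 'header'='true', 'mode'='DROPMALFORMED')
COMMENT 'external CSV table'
-- LOCATION must be a COS directory, not a file.
LOCATION 'cosn://demo-bucket-1250000000/warehouse/csv_demo/';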

USING ORC

Configurations supported by ORC data tables are as follows:

| Key Supported by OPTIONS | Default Value | Description |
| --- | --- | --- |
| compression or orc.compress | snappy | Compression algorithm. Abbreviations such as snappy, zlib, lzo, lz4, and zstd are supported. If the value is inconsistent with that of spark.sql.orc.compression.codec, the value specified in the CREATE TABLE statement prevails. |
| mergeSchema | false | Whether to merge schemas. If the value is inconsistent with that of spark.sql.orc.mergeSchema, the value specified in the CREATE TABLE statement prevails. |

If you use the Hive reader and writer (by setting spark.sql.hive.convertMetastoreOrc=false), OPTIONS also supports native ORC configurations. For details, see LanguageManual ORC.
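
For example, a minimal hedged sketch that overrides the ORC compression default; the names and path are placeholders.

CREATE TABLE IF NOT EXISTS DataLakeCatalog.demo_db.orc_demo (
id bigint,
name string
)
USING orc
-- 'compression'='zlib' overrides the snappy default listed above.
OPTIONS ('compression'='zlib')
LOCATION 'cosn://demo-bucket-1250000000/warehouse/orc_demo/';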
USING PARQUET

Most parameters for PARQUET data tables can be configured through SparkConf (recommended). Configurations supported by OPTIONS are as follows:

| Key Supported by OPTIONS | Default Value | Description |
| --- | --- | --- |
| compression or parquet.compression | snappy | Compression algorithm; snappy by default. If the value is inconsistent with that of spark.sql.parquet.compression.codec, the value specified in the CREATE TABLE statement prevails. |
| mergeSchema | false | Whether to merge schemas. If the value is inconsistent with that of spark.sql.parquet.mergeSchema, the value specified in the CREATE TABLE statement prevails. |
| datetimeRebaseMode | EXCEPTION | Policy of date conversion during PARQUET file read. Dates are rebased to the Gregorian calendar in LEGACY mode and not converted in CORRECTED mode; in EXCEPTION mode, an error is reported when dates differ between the calendars. If the value is inconsistent with that of spark.sql.parquet.datetimeRebaseModeInRead, the value specified in the CREATE TABLE statement prevails. |
| int96RebaseMode | EXCEPTION | Policy of INT96 timestamp conversion during PARQUET file read. Timestamps are rebased to the Gregorian calendar in LEGACY mode and not converted in CORRECTED mode; in EXCEPTION mode, an error is reported when timestamps differ between the calendars. If the value is inconsistent with that of spark.sql.parquet.int96RebaseModeInRead, the value specified in the CREATE TABLE statement prevails. |

If you use the Hive reader and writer (by setting spark.sql.hive.convertMetastoreParquet to false), OPTIONS also supports native PARQUET configurations. Reference: Hadoop integration.
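
Similarly, a hedged PARQUET sketch with placeholder names; the table-level 'compression' option overrides spark.sql.parquet.compression.codec for this table only.

CREATE TABLE IF NOT EXISTS DataLakeCatalog.demo_db.parquet_demo (
id bigint,
name string
)
USING parquet
OPTIONS ('compression'='gzip', 'mergeSchema'='false')
LOCATION 'cosn://demo-bucket-1250000000/warehouse/parquet_demo/';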

Example

CREATE TABLE dempts(
id bigint COMMENT 'id number',
num int,
eno float,
dno double,
cno decimal(9,3),
flag boolean,
data string,
ts_year timestamp,
date_month date,
bno binary,
point struct<x: double, y: double>,
points array<struct<x: double, y: double>>,
pointmaps map<struct<x: int>, struct<a: int>>
)
USING iceberg
COMMENT 'table documentation'
PARTITIONED BY (bucket(16,id), years(ts_year), months(date_month), identity(bno), bucket(3,num), truncate(10,data))
LOCATION '/warehouse/db_001/dempts'
TBLPROPERTIES ('write.format.default'='orc');

Notes

Spark's USING and Hive's STORED AS behave differently as CREATE TABLE keywords, so the file format and read path of the created table may not meet expectations. Note:
USING data_source: Used in Spark statements. This keyword specifies the data source used as the input format for table creation and directly determines the file format and read path of the table under the location. The value can be CSV, TXT, Iceberg, Parquet, Orc, etc.
STORED AS file_format: Used in Hive statements to specify the file format for table storage. The value can be TXT, Parquet, Orc, etc. This keyword is not recommended because Spark's native reader/writer may not support the specified file format, such as CSV.
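
To make the difference concrete, here is a hedged side-by-side sketch with placeholder names: both statements store Parquet files, but only the first records the Spark data source that governs the native read path.

-- Spark syntax: USING fixes both the file format and the Spark-native read path.
CREATE TABLE demo_db.t_spark (id bigint) USING parquet;

-- Hive syntax: STORED AS only fixes the storage format; not recommended here.
CREATE TABLE demo_db.t_hive (id bigint) STORED AS PARQUET;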

Statement for Native Iceberg Table

Caution
This statement is supported for creating native tables only.

Statement

CREATE TABLE [ IF NOT EXISTS ] table_identifier
( col_name[:] col_type [ COMMENT col_comment ], ... )
[ COMMENT table_comment ]
[ PARTITIONED BY ( col_name1, transform(col_name2), ... ) ]

Parameter

table_identifier: Table name in the three-part format. Example: catalog.db.name.

Schemas and Data Types
col_type
: primitive_type
| nested_type

primitive_type
: boolean
| int/integer
| long/bigint
| float
| double
| decimal(p,s), p=maximum number of digits, s=maximum number of decimal places, s<=p<=38
| date
| timestamp, which is a timestamp with time zone; the time type and timestamp without time zone are not supported.
| string, which corresponds to the UUID type of Iceberg tables.
| binary, which corresponds to the FIXED type of Iceberg tables.

nested_type
: struct
| list
| map
Partition Transforms
transform
: identity, supports any source type. DLC does not support this transform.
| bucket[N], partition by hashed value mod N buckets, supported col_type: int, long, decimal, date, timestamp, string, binary
| truncate[L], partition by value truncated to L, supported col_type: int, long, decimal, string
| years, partition by year, supported col_type: date, timestamp
| months, partition by month, supported col_type: date, timestamp
| days/date, partition by date, supported col_type: date, timestamp
| hours/date_hour, partition by hour, supported col_type: timestamp

Example

CREATE TABLE dempts(
id bigint COMMENT 'id number',
num int,
eno float,
dno double,
cno decimal(9,3),
flag boolean,
data string,
ts_year timestamp,
date_month date,
bno binary,
point struct<x: double, y: double>,
points array<struct<x: double, y: double>>,
pointmaps map<struct<x: int>, struct<a: int>>
)
COMMENT 'table documentation'
PARTITIONED BY (bucket(16,id), years(ts_year), months(date_month), identity(bno), bucket(3,num), truncate(10,data));

