For the read-only analysis engine, collecting basic statistics on table objects is essential for achieving optimal SQL execution results. Only after statistics collection is completed can the system provide more accurate and efficient execution plan recommendations for user queries.
Statistics Overview
Statistics refer to the features and distribution of data within database objects. They enable the database to estimate execution costs more accurately and optimize execution plan selection. Types of statistics include the number of rows in a table, the number of distinct values (NDV) in a column, and column histograms. Collecting statistics involves scanning database objects and storing the gathered information in the system's data dictionary. In addition, the system maintains a local cache of statistics to improve the optimizer's access efficiency. When you collect and use statistics, the following two points should be noted:
The accuracy of statistics is crucial to the optimizer's decision-making process. Therefore, regularly updating and maintaining statistics is key to ensuring consistent database performance.
Collecting statistics may have some impact on database performance, as it involves scanning and analyzing database objects. Therefore, it is important to balance performance and the accuracy of statistics during the collection process.
In summary, statistics plays a vital role in database management systems by providing key information about database objects. They assist the optimizer in making more informed decisions, thereby enhancing query performance and improving the overall efficiency of the database system.
Note:
Starting from version 2.2410.1.0, random sampling is supported. This feature allows random sampling based on the sampling rate specified by the TableScan operator. It also supports block-level sampling and enables statistics collection according to the configured sampling ratio, significantly reducing the overhead caused by full table scans during the statistics collection process.
Statistics Collection
The read-only analysis engine currently supports two different methods for collecting statistics:
Automatic collection
Starting from version 2.2410.1.0, the read-only analysis engine supports dynamic sampling and automatic statistics collection during the process of loading data into columnar storage. Users do not need to manually collect statistics, as the system automatically generates the relevant statistics for databases and tables.
Manual collection
If you need to manually obtain the latest statistics for a specific table, you can log in to the read-only analysis engine instance using the MySQL client and execute the following statement to collect statistics for the specified table object:
ANALYZE TABLE <table name>;