UPDATE operations (including INSERT VALUES, UPDATE, DELETE, ALTER TABLE, and ADD COLUMN), it will leave garbage data that is no longer used in the system table and the updated table, causing system performance degradation and taking up a lot of disk space. Therefore, you need to regularly monitor the data bloat of the table.ANALYZE collects table statistics in a database and stores the results in the pg_statistic system directory. The query planner uses such statistics to help determine the most efficient execution plan for a query. The statistics contain various information such as the amount of data and indexes in the table. A good query plan is based on accurate table statistics.ANALYZE descriptionANALYZE is a command provided by Greenplum to collect statistics and supports column, table, and database granularities as shown below:CREATE TABLE foo (id int NOT NULL, bar text NOT NULL) DISTRIBUTED BY (id); // Create the test table `foo`ANALYZE foo(bar); // Collect statistics of the `bar` column onlyANALYZE foo; // Collect statistics of the `foo` tableANALYZE; // Collect statistics of all tables in the current database. You need to have the permission to do so
ANALYZEANALYZE will put a SHARE UPDATE EXCLUSIVE lock on the target table, which will conflict with DDL statements.ANALYZE speedANALYZE is a sampling statistical algorithm that usually does not scan all data in a table, but still consumes some time and computing resources for large tables.ANALYZEANALYZE locks tables and consumes system resources, so it is important to run it at the right time and as little as possible. We recommend you run ANALYZE in the following four scenarios.INSERT, UPDATE, and DELETE operations are performed on a large amount of data.VACUUM FULL is executed.optimizer_analyze_root_partition, there is no difference in manipulating a partitioned table. Just run ANALYZE on the root table, and the system will automatically collect the statistics of partitioned tables on all leaf nodes.ANALYZE on the root table can be time-consuming. Partitioned tables are usually time-dimensional, and historical partitioned tables are not modified; therefore, we recommend you run ANALYZE separately in partitions where data changes.gp_toolkit schema provides a gp_bloat_diag view that determines the table bloat by the ratio of the actual number of pages to the expected number of pages. To use this view, make sure that the latest statistics are collected for all tables in the database. Then, run the following SQL:gpadmin=# SELECT * FROM gp_toolkit.gp_bloat_diag;bdirelid | bdinspname | bdirelname | bdirelpages | bdiexppages | bdidiag----------+------------+------------+-------------+-------------+---------------------------------------21488 | public | t1 | 97 | 1 | significant amount of bloat suspected(1 row)
bdirelpage is the actual number of pages and bdiexppages is the expected number of pages in the t1 table. A bloat ratio exceeding 4 will be recorded in the table, and there may also be a slight bloat even when no ratio is recorded. You can also compare the space of tables at different times to determine whether data has bloated.VACUUM <tablename> command adds expired rows to the shared free space map, so such space can be reused. When VACUUM is run periodically on a table with frequent UPDATE operations, the space occupied by expired rows can be reused quickly, easing table bloat. The cycle to run VACUUM is determined by the speed of UPDATE and DELETE on the table.VACUUM holds a SHARE UPDATE EXCLUSIVE lock as ANALYZE does and may interlock with DDL operations.VACUUM can only slow down the process but not immediately reclaim the space. Therefore, you need to run VACUUM FULL to immediately reclaim all the bloat space. However, VACUUM FULL will add ACCESS EXCLUSIVE to the manipulated table, during which all other DDL and DML statements on the table will be locked. Proceed with caution when running VACUUM FULL as the operation may be time-consuming for large tables.ALTER TABLE <tablename> SET WITH (REORGANIZE=false)DISTRIBUTED randomly;
ALTER TABLE <table_name> SET WITH (REORGANIZE=true)DISTRIBUTED BY (<original distribution columns>);
VACUUM FULL command will only reclaim spaces from tables. To reclaim spaces from indexes, you need to rebuild them with the REINDEX command.REINDEX TABLE <table_name> --- Rebuild all indexes in the tableREINDEX INDEX <index_name> --- Rebuild a specified index
REINDEX and VACUUM FULL will add ACCESS EXCLUSIVE.VACUUM <tablename> can mark reclaimable space without table locking and slow down data bloat. Cleaning without locking the table will not affect data table reads/writes, but DDL statements cannot be used, and some CPU and memory resources will be used.#!/bin/bashexport PGHOST=xxx.xxx.x.xexport PGPORT=5436export PGUSER=testexport PGPASSWORD=testdb="test"psql -d $db -e -c "VACUUM test_table;"
VACUUM FULL and REINDEX to reclaim all the bloat space in the table. Cleaning with table locking will put an ACCESS EXCLUSIVE lock on the manipulated table, during which you cannot perform any operations on the table.VACUUM FULL <tablename>. REINDEX TABLE <tablename> (you can skip this step for tables without indexes). ANALYZE <tablename> .#!/bin/bashexport PGHOST=xxx.xxx.x.xexport PGPORT=5436export PGUSER=testexport PGPASSWORD=testdb="test"psql -d $db -e -c "VACUUM FULL test_table;"psql -d $db -e -c "REINDEX TABLE test_table;"psql -d $db -e -c "ANALYZE test_table;"
Feedback