Log analysis of the ETL background system is currently underway. By reviewing which tasks run the longest, we can find the jobs that consume the most time and optimize them at both the logic layer and the database layer. This article focuses only on database optimization (SQL statement tuning and adjusting the distribution key, or DK, of Greenplum tables). Take a job that runs for about 30 minutes, find its source table, and perform the following analysis:
dw=# select gp_segment_id, count(*) from tb_name group by gp_segment_id order by count(*) desc;
 gp_segment_id | count
---------------+-------
            65 | 16655
Note: gp_segment_id is a hidden column in every Greenplum table that marks which segment (node) a row is stored on. The output shows that all of this table's data sits on a single segment, number 65 (segment information can be found in gp_segment_configuration), while my cluster has 96 segments in total. The table clearly fails to use Greenplum's multi-node computing capability because its DK is set incorrectly. Therefore, use alter table tb_name set distributed by (col1, ...) to reset the table's DK, then run the statement above again and check two indicators: the number of segments holding data (is the table spread across every segment?) and the row count per segment (is the distribution balanced?). Once both indicators roughly meet expectations, run vacuum full and vacuum analyze to fully reclaim space and refresh statistics. Capture the source tables of the long-running jobs and analyze them one by one in this way; the execution time of the whole task dropped from 3 hours to 2 hours (mainly because the original table design was so poor that there was this much room for optimization). Later we will optimize the logic and SQL and increase concurrency, which is where the real gains lie. A minimal sketch of the steps is shown below.
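The following is only a sketch of the workflow described above, not the exact script I ran: tb_name and col1 are placeholders, and the primary-segment count is read from gp_segment_configuration rather than assumed to be 96.

-- How many primary segments does the cluster have? (content = -1 is the master, so exclude it.)
select count(*) as segments
from gp_segment_configuration
where content >= 0 and role = 'p';

-- Reset the distribution key; Greenplum redistributes the rows across all segments.
alter table tb_name set distributed by (col1);

-- Re-run the check, then reclaim space and refresh statistics.
select gp_segment_id, count(*) from tb_name group by gp_segment_id order by count(*) desc;
vacuum full tb_name;
vacuum analyze tb_name;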
To make this statistical analysis easier, the following two tables and a helper function were created to collect per-table distribution data and identify which tables still need their DK adjusted.
-- Create Table "public ". "table_segment_statistics" ("table_name" varchar (200) default null, "segment_count" int4 default null, "table_rows" int8 default null); Create Table "public ". "table_segment_statistics_balance" ("table_name" varchar (200) default null, "segment_id" int4 default null, "segment_count" int8 default null); -- functioncreate or replace function "public ". "analyze_table_dk_balance" (v_schemaname varchar) returns "pg_catalog ". "int4" as $ body $ declare v_tbvarchar (200); v_cur_tb cursor for select schemaname | '. '| tablename from pg_tables where schemaname <> 'information _ scheme' and schemaname <> 'pg _ catalog' and schemaname <> 'gp _ Toolkit 'and tablename not like' % PRT % 'and schemaname = v_schemaname; begin truncate table public. table_segment_statistics; truncate table public. table_segment_statistics_balance; open v_cur_tb; loop fetch v_cur_tb into v_tb; if not found then exit; end if; execute 'insert into public. table_segment_statistics select ''' | v_tb | ''' as table_name, count (*) as segment_id, sum (Num) as table_rows from (select gp_segment_id, count (*) num from '| v_tb | 'group by gp_segment_id) T'; execute 'insert into public. table_segment_statistics_balance select ''' | v_tb | ''' as table_name, gp_segment_id, count (*) as CNT from '| v_tb | 'group by gp_segment_id order by gp_segment_id '; end loop; return 0; end; $ body $ language 'plpgsql' volatile;
The analysis statements are as follows:
-- 96 is the number of Greenplum segments (my cluster has 96).
-- Find tables that are not spread across all segments; only tables with more than
-- 10,000 rows are counted, since small tables are not worth analyzing.
select *
from public.table_segment_statistics
where table_rows is not null
  and segment_count < 96
  and table_rows > 10000
order by table_rows desc;

-- Find segments whose row count deviates from the per-segment average by more than 10%.
-- The threshold can be adjusted as needed.
select a."table_name",
       b.segment_id,
       a.table_rows / a.segment_count as reldk,   -- average rows per segment
       b.segment_count                            -- actual rows on this segment
from "public".table_segment_statistics a
inner join "public".table_segment_statistics_balance b
        on a."table_name" = b."table_name"
where a."table_name" is not null
  and a.table_rows > 10000
  and abs(a.table_rows / a.segment_count - b.segment_count) / (a.table_rows / a.segment_count) > 0.1;
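As a small variation (my own assumption, not part of the original script), the hard-coded 96 in the first statement can be replaced by reading the number of primary segments from gp_segment_configuration, so the same check works on a cluster of any size:

-- Same check as above, with the segment count derived from gp_segment_configuration.
select s.*
from public.table_segment_statistics s
where s.table_rows is not null
  and s.table_rows > 10000
  and s.segment_count < (select count(*)
                         from gp_segment_configuration
                         where content >= 0 and role = 'p')
order by s.table_rows desc;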