Greenplum: viewing a table's data distribution to adjust the DK (distribution key) value


We are currently analyzing the logs of an ETL background system: the logs show which tasks have been running for a long time, so we can find the jobs that consume the most time and optimize them at both the logic layer and the database layer. This article covers only database optimization (SQL statement adjustment and Greenplum table DK adjustment). Take a job that runs for about 30 minutes, find its source table, and analyze it as follows:

dw=# select gp_segment_id, count(*) from tb_name group by gp_segment_id order by count(*) desc;

 gp_segment_id | count
---------------+-------
            65 | 16655

Note: gp_segment_id is a hidden column in every Greenplum table that marks which segment (node) a row belongs to. The output shows that this table's data sits entirely on segment 65 (for segment information, see gp_segment_configuration), while my cluster has 96 segments in total. The table clearly does not use the cluster's multi-node computing capability, so its DK is set incorrectly. Use alter table tb_name set distributed by (col1, ...) to reset the table's DK, then run the statement above again and check two indicators: the number of segments that hold data (is the table distributed to every segment?) and the row count on each segment (is the distribution balanced?). Once both indicators roughly meet the requirements, run vacuum full and vacuum analyze to fully reclaim space and collect statistics. Capture the source tables of the long-running jobs and analyze them one by one in this way. The execution time of the whole task dropped from 3 hours to 2 hours (mainly because the earlier table design was so poor that there was this much room for optimization). Later we will also optimize the logic and the SQL and increase concurrency, which is where the real gains are.
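Put together, the fix for one table looks like this (a minimal sketch; tb_name and col1 are placeholders, and the new DK should be a high-cardinality column):

-- reset the distribution key; data is redistributed across all segments
alter table tb_name set distributed by (col1);

-- verify the new distribution: one row per segment, counts roughly equal
select gp_segment_id, count(*) from tb_name group by gp_segment_id order by count(*) desc;

-- reclaim space and refresh statistics once the distribution looks good
vacuum full tb_name;
vacuum analyze tb_name;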

To make the statistical analysis easier, I designed the following two tables and a helper function that collect each table's distribution, so we can find which tables still need their DK adjusted.

-- Create Table "public ". "table_segment_statistics" ("table_name" varchar (200) default null, "segment_count" int4 default null, "table_rows" int8 default null); Create Table "public ". "table_segment_statistics_balance" ("table_name" varchar (200) default null, "segment_id" int4 default null, "segment_count" int8 default null); -- functioncreate or replace function "public ". "analyze_table_dk_balance" (v_schemaname varchar) returns "pg_catalog ". "int4" as $ body $ declare v_tbvarchar (200); v_cur_tb cursor for select schemaname | '. '| tablename from pg_tables where schemaname <> 'information _ scheme' and schemaname <> 'pg _ catalog' and schemaname <> 'gp _ Toolkit 'and tablename not like' % PRT % 'and schemaname = v_schemaname; begin truncate table public. table_segment_statistics; truncate table public. table_segment_statistics_balance; open v_cur_tb; loop fetch v_cur_tb into v_tb; if not found then exit; end if; execute 'insert into public. table_segment_statistics select ''' | v_tb | ''' as table_name, count (*) as segment_id, sum (Num) as table_rows from (select gp_segment_id, count (*) num from '| v_tb | 'group by gp_segment_id) T'; execute 'insert into public. table_segment_statistics_balance select ''' | v_tb | ''' as table_name, gp_segment_id, count (*) as CNT from '| v_tb | 'group by gp_segment_id order by gp_segment_id '; end loop; return 0; end; $ body $ language 'plpgsql' volatile;

The analysis statement is as follows:

-- 96 is the number of segments in my cluster: find tables that do not
-- reach every segment and still hold a meaningful number of rows
select *
from public.table_segment_statistics
where table_rows is not null
  and segment_count < 96
  and table_rows > 10000
order by table_rows desc;

-- Find segments whose row count deviates from the per-segment average by
-- more than 10%. Both thresholds can be adjusted; only tables with more
-- than 10,000 rows are counted, since small tables are not worth analyzing.
select a."table_name",
       b.segment_id,
       a.table_rows::numeric / a.segment_count as reldk,
       b.segment_count
from "public".table_segment_statistics a
inner join "public".table_segment_statistics_balance b
    on a."table_name" = b."table_name"
where a."table_name" is not null
  and a.table_rows > 10000
  and abs(a.table_rows::numeric / a.segment_count - b.segment_count)
      / (a.table_rows::numeric / a.segment_count) > 0.1;
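Rather than hardcoding 96, the segment count can be read from the catalog. A small sketch: this counts only primary segments, since the master has content = -1:

select count(*) from gp_segment_configuration where content >= 0 and role = 'p';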

 
