The system is not active recently and has not been upgraded. Therefore, you can optimize the entire ETL system on a stable basis. Top 10 cost time jobs are listed on a daily basis for analysis. The top1 costtime job uses the window functions first_value and last_value. The result SQL uses first_value, the window functions are sorted twice. Use the explain Section Code It can be found that the two sort consumption is about 1.7 times that of one sort, the second sort is improved to one, and the SQL is migrated from datastage TO THE greenplum function, the entire process is reduced from 24 minutes to 40 seconds.
Analyze why only the first_value statement is used, and the second sort is used for full use of first_value. The reason is as follows:
Window call: Window window_name (partition by XXX order by XXX)
When first_value and last_value are used, partition and order by will affect the result.
The analysis function contains three analysis clauses: partition by, order by, and window. the rows in the window are as follows: unbounded preceding (first row), current row (current row), unbounded following (last row)
(1) If the statement is (partition by XXX order by XXX), the default window is unbounded preceding and current row.
(2) The statement is (partition by XXX). The window defaults to unbounded preceding and unbounded following.
(3) The statement is (), and the window defaults to all result sets.
A possible problem occurs when the first method is used.
The test environment is as follows:
Create Table windowcheck
(
Oc_date integer,
City varchar (50 ),
Id integer,
Sale integer
);
Select * From windowcheck;
Oc_date | city | ID | sale
---------- + ------ + -------
20120701 | bj| 3299 | 10040
20120701 | Cs | 3210 | 7100
20120701 | NJ | 3300 | 8900
20120701 | NJ | 3301 | 9000
20120701 | TJ | 3303 | 3890
20120701 | wh | 3302 | 4700
Select oc_date, city, ID, sale, last_value (sale) over (partition by oc_date order by city) last_value
From windowcheck;
Oc_date | city | ID | sale | last_value
---------- + ------ + ------- + ------------
20120701 | BJ | 3299 | 10040 | 10040
20120701 | Cs | 3210 | 7100 | 7100
20120701 | NJ | 3301 | 9000 | 8900
20120701 | NJ | 3300 | 8900 | 8900
20120701 | TJ | 3303 | 3890 | 3890
20120701 | wh | 3302 | 4700 | 4700
The problem arises: We partition by oc_date and sort the city. The final value of the result set is wh, and its sale value is 4700. the original idea is that all last_values In the result set should be 4700. Where is the problem? The problem lies in the scope of the previously written window. The default window with partition and order is unbounded preceding and current row. At this time, when reading the first entry, the window range is only one, and its last_value value is 10040. when reading the second entry, the window range is the first entry and the current entry, the value of last_value is 7100, and so on. the solution is to expand the window. The statement is as follows:
Select oc_date, city, ID, sale, last_value (sale) over (partition by oc_date order by city rows between unbounded preceding and unbounded following) last_value
From windowcheck;
Here, the Ambiguity Caused by the size of the window range is solved. Another question is: what if this SQL statement is also sorted when the partition function is used? Because the order of the result set of the SQL statement affects the value of ast_value or first_value. Here we need to analyze the sequence of execution of the entire statement: the analysis function is performed after the end of the entire SQL query (the execution of order by in the SQL statement is special), that is
The order by statement also affects the execution result of the analysis function. The analysis statement is as follows:
-- The order by statement of the SQL statement that generates the result set is consistent with the order by statement in the partition function.
Select oc_date, city, ID, sale, last_value (sale) over (partition by oc_date order by city rows between unbounded preceding and unbounded following) last_value
From windowcheck order by city;
Oc_date | city | ID | sale | last_value
---------- + ------ + ------- + ------------
20120701 | BJ | 3299 | 10040 | 4700
20120701 | Cs | 3210 | 7100 | 4700
20120701 | NJ | 3301 | 9000 | 4700
20120701 | NJ | 3300 | 8900 | 4700
20120701 | TJ | 3303 | 3890 | 4700
20120701 | wh | 3302 | 4700 | 4700
-- The order by statement of the SQL statement that generates the result set is inconsistent with the order by statement in the partition function.
Select oc_date, city, ID, sale, last_value (sale) over (partition by oc_date order by city rows between unbounded preceding and unbounded following) last_value
From windowcheck order by city DESC;
Oc_date | city | ID | sale | last_value
---------- + ------ + ------- + ------------
20120701 | wh | 3302 | 4700 | 4700
20120701 | TJ | 3303 | 3890 | 4700
20120701 | NJ | 3301 | 9000 | 4700
20120701 | NJ | 3300 | 8900 | 4700
20120701 | Cs | 3210 | 7100 | 4700
20120701 | BJ | 3299 | 10040 | 4700
The preceding two statements show that the order by statement in the SQL statement that generates the result set does not affect the results in the partition function, the reason is that if the order by clause in the result set is inconsistent with that in the partition function, the order by clause in the partition function is used to calculate the result first, and then the order by clause in the result set is executed. if they are consistent, the partition function does not need to be sorted during analysis.
Use Window in GP
Select oc_date, city, ID, sale, last_value (sale) over (w) last_value
From windowcheck
Where oc_date = 20120701
Window w as (partition by oc_date order by city rows between unbounded preceding and unbounded following );