Optimization process
Pig's second philosophy: Pigs Are Domestic Animals. In other words, the user should have sufficient control. For the optimization of the logical execution plan specifically, users can choose the optimization rules that suit their own situation (which also suggests that this area still has plenty of room for further tuning).
Before it is compiled into a physical execution plan, the logical execution plan is processed by LogicalPlanOptimizer, which matches it against a series of optimization rules. Each matching rule transforms the plan, and the result is a new, optimized execution plan. The whole process is shown in the figure:
Pig's logical optimizer works by simplifying, merging, inserting, and reordering the LogicalRelationalOperator nodes in the logical execution plan. Each optimization rule is described below.
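As a rough illustration of this control, a script can switch individual rules off. The sketch below assumes a Pig release that supports the pig.optimizer.rules.disabled property (the -optimizer_off command-line option, short form -t, serves the same purpose). The choice of SplitFilter and the paths 'input1' and 'output' are only placeholders, and some mandatory rules cannot be disabled.

```pig
-- Assumption: this Pig version honours pig.optimizer.rules.disabled.
-- SplitFilter is an arbitrary example; a comma-separated list is also accepted.
set pig.optimizer.rules.disabled 'SplitFilter';

A = LOAD 'input1' AS (a0, a1);
B = FILTER A BY a0 > 0 AND a1 > 0;
STORE B INTO 'output';
```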
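One way to observe the effect of these rules is the EXPLAIN statement, which prints the logical, physical, and MapReduce plans for an alias. The following is a minimal sketch with placeholder input and field names; the exact plan output depends on the Pig version.

```pig
A = LOAD 'input1' AS (a0, a1);
B = FILTER A BY a1 > 0;
C = FOREACH B GENERATE a0;

-- Prints the plans Pig would use to compute C, including the logical plan.
EXPLAIN C;
```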
Rule-based Optimizer
PartitionFilterOptimizer
Pushes filter conditions on partition fields to the loader. This requires loader support; HCatLoader, for example, supports partition-field pushdown (see the LoadMetadata interface described earlier).
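A sketch of the kind of script this rule targets is shown below. The table name web_events, the partition column dt, and the status column are hypothetical, and the HCatLoader package name may differ between HCatalog releases.

```pig
-- Hypothetical HCatalog table 'web_events', assumed to be partitioned by dt.
A = LOAD 'web_events' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = FILTER A BY dt == '20130101' AND status == 200;

-- With the rule applied, the dt condition can be handed to the loader so that
-- only the matching partition is read; the status condition stays in the FILTER.
```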
FilterLogicExpressionSimplifier
Simplifies the logical conditional expressions in filter statements. There are many rules here, and the work is delegated to LogicalExpressionProxy: constant pre-computation, conversion of AND/OR operations according to De Morgan's laws, and normalization of logical formulas into DNF (disjunctive normal form).
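Two of these simplifications can be sketched as follows. The aliases and field names are placeholders, and the simplified forms in the comments describe the logical result rather than Pig's exact internal representation.

```pig
A = LOAD 'input' AS (a0:int, a1:int);

-- Constant pre-computation: 5 + 7 can be evaluated once, at plan time.
B1 = FILTER A BY a0 > 5 + 7;
-- simplified form: FILTER A BY a0 > 12;

-- Negation removal using De Morgan's laws: NOT (NOT p OR q) becomes p AND NOT q.
B2 = FILTER A BY NOT (NOT (a0 > 5) OR a1 > 10);
-- simplified form: FILTER A BY (a0 > 5) AND (a1 <= 10);
```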
SplitFilter
Splits the conditions in a filter statement so that each one can be pushed separately. For example:
A = LOAD 'input1' AS (a0, a1);
B = LOAD 'input2' AS (b0, b1);
C = JOIN A BY a0, B BY b0;
D = FILTER C BY a1 > 0 AND b1 > 0;
The conditions on A and on B in D can be separated into two filters, so that each can be pushed independently:
X = FILTER C BY a1 > 0;
D = FILTER X BY b1 > 0;
PushUpFilter
Pushes filter conditions toward the data source (up the data-flow DAG; databases usually call this predicate pushdown), reducing the amount of data transferred.
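A minimal before-and-after sketch, assuming an inner join and a filter that only references a field from one join input (aliases and paths are placeholders):

```pig
-- As written by the user: the filter runs after the join.
A = LOAD 'input1' AS (a0, a1);
B = LOAD 'input2' AS (b0, b1);
C = JOIN A BY a0, B BY b0;
D = FILTER C BY a1 > 0;

-- Logically equivalent plan after the filter is moved ahead of the join:
-- X = FILTER A BY a1 > 0;
-- D = JOIN X BY a0, B BY b0;
```

Because a1 comes only from A, filtering A first means fewer records have to take part in the join.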
FilterAboveForeach
Moves a filter above (before) a preceding foreach statement when the fields the filter uses are already available in the input of that foreach.
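As the rule's name suggests, a filter that follows a FOREACH can be placed above it when the filtered field already exists before the FOREACH. The sketch below uses placeholder names and shows the equivalent rewritten plan in comments; it is an illustration of the idea, not Pig's exact output.

```pig
-- As written by the user: project first, then filter.
A = LOAD 'input' AS (a0, a1, a2);
B = FOREACH A GENERATE a0, a1;
C = FILTER B BY a0 > 10;

-- Logically equivalent plan with the filter above the foreach:
-- X = FILTER A BY a0 > 10;
-- C = FOREACH X GENERATE a0, a1;
```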
ImplicitSplitInserter
Inserts a split statement (for more details, see the split section in "Other optimizations" below).
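A typical situation, sketched here with placeholder names, is a relation that feeds more than one downstream statement. No SPLIT appears in the script, so the optimizer inserts one implicitly.

```pig
A = LOAD 'input' AS (a0, a1);

-- A is consumed twice, so an implicit split is placed after the load.
B = FILTER A BY a0 > 0;
C = FILTER A BY a1 > 0;

STORE B INTO 'out_b';
STORE C INTO 'out_c';
```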
MergeFilter
After PushUpFilter has run, merges filter conditions to reduce the number of filter statements.
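For instance, two consecutive filters on the same data flow can be combined into one. This is a sketch with placeholder names; the merged form is shown in a comment.

```pig
A = LOAD 'input' AS (a0, a1);
B = FILTER A BY a0 > 0;
C = FILTER B BY a1 > 0;

-- Logically equivalent single filter after merging:
-- C = FILTER A BY (a0 > 0) AND (a1 > 0);
```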
PushDownForEachFlatten
Moves the FLATTEN in a foreach statement later (down the data-flow DAG) to reduce the amount of data handled by a subsequent join. Flattening a bag turns one record into multiple records, which slows down the join that follows; after optimization, the flatten is performed after the join.
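The idea can be sketched as follows, assuming the join key is not the flattened field. The schemas and names are hypothetical, and the rewritten plan is shown in comments.

```pig
-- As written by the user: the bag is flattened before the join.
A = LOAD 'input1' AS (id: int, items: bag{t: tuple(item: chararray)});
B = LOAD 'input2' AS (id: int, name: chararray);
C = FOREACH A GENERATE id, FLATTEN(items);
D = JOIN C BY id, B BY id;

-- Logically equivalent plan with the flatten after the join:
-- X = JOIN A BY id, B BY id;
-- D = FOREACH X GENERATE A::id, FLATTEN(A::items), B::id, B::name;
```

In the rewritten form the join sees one record per id from A instead of one record per bag element.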
LimitOptimizer
Pushes the limit statement toward the data source so that the amount of data transferred is reduced as early as possible.
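A simple case, sketched with placeholder names, is a LIMIT that follows a FOREACH without FLATTEN. Such a FOREACH produces exactly one output record per input record, so the limit can be applied first.

```pig
-- As written by the user: project every record, then keep 10.
A = LOAD 'input' AS (a0, a1);
B = FOREACH A GENERATE a0;
C = LIMIT B 10;

-- Logically equivalent plan with the limit applied earlier:
-- X = LIMIT A 10;
-- C = FOREACH X GENERATE a0;
```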