Query Optimization for Distributed Database Systems
Calcite is an open source query engine framework used by many other open source projects, particularly for its optimizer; Drill, Hive, and Flink, for example, use Calcite as their optimization engine.
Calcite implements two planners, HepPlanner and VolcanoPlanner. HepPlanner is a greedy, heuristic (rule-driven) planner, while VolcanoPlanner is a cost-based planner. The following describes the source code implementation of HepPlanner.
1 HepPlanner Introduction
HepPlanner is a greedy planner: it treats every rule application as an improvement. Whenever a rule matches and fires, its result is assumed to be the better plan.
Example: Calcite's ProjectFilterTransposeRule, whose main function is to swap a Project and a Filter.
SubTree1:
Project($0, $1)
  Filter($0 > 1)
--> after the rule is applied
SubTree2:
Filter($0 > 1)
  Project($0, $1)
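The rewrite above can be sketched in a few lines of Java. This is a toy model, not Calcite code: the Node class and the projectFilterTranspose and applyGreedy methods are hypothetical names, expressions are opaque strings, and the field indexes ($0, $1) are made up for illustration. It only shows the greedy idea that whatever the rule returns replaces the input plan.

```java
// Toy model only; Node, projectFilterTranspose and applyGreedy are
// hypothetical names, not part of Calcite.
final class TransposeSketch {
  /** A plan node: operator name, opaque expression text, optional input. */
  static final class Node {
    final String op;
    final String expr;
    final Node input; // null for a leaf such as a table scan

    Node(String op, String expr, Node input) {
      this.op = op;
      this.expr = expr;
      this.input = input;
    }

    @Override public String toString() {
      return op + "(" + expr + ")" + (input == null ? "" : " <- " + input);
    }
  }

  /** Rewrites Project-over-Filter as Filter-over-Project; null if no match. */
  static Node projectFilterTranspose(Node n) {
    if (n.op.equals("Project") && n.input != null
        && n.input.op.equals("Filter")) {
      Node filter = n.input;
      // The toy skips the field-index remapping a real rule would need.
      Node project = new Node("Project", n.expr, filter.input);
      return new Node("Filter", filter.expr, project);
    }
    return null;
  }

  /** Greedy step: if the rule fires, its result replaces the plan as is. */
  static Node applyGreedy(Node plan) {
    Node rewritten = projectFilterTranspose(plan);
    return rewritten != null ? rewritten : plan;
  }

  public static void main(String[] args) {
    Node scan = new Node("TableScan", "t", null);
    Node plan =
        new Node("Project", "$0, $1", new Node("Filter", "$0 > 1", scan));
    System.out.println(applyGreedy(plan));
  }
}
```

Running main prints the plan from the root down, with the Filter now above the Project. A plan that does not match the pattern is returned unchanged.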
HepPlanner then takes SubTree2 as the better plan.
2 Principle of implementation
The implementation is introduced in four parts:
1) HepRelVertex
2) Graph & Vertex
3) HepProgram
4) Rule Apply
2.1 HepRelVertex
HepRelVertex is a simple wrapper around RelNode, the relational algebra expression. Every node in HepPlanner is a HepRelVertex, and each HepRelVertex points to an actual RelNode.
For example,
Project($0, $1)
  TableScan($0, $1, $2)
is wrapped by HepPlanner as
HepRelVertex#1 -> currentRel(Project($0, $1))
HepRelVertex#2 -> currentRel(TableScan($0, $1, $2))
So overall, the whole RelNode tree is made up of HepRelVertex nodes, each of which points to an internal RelNode.
2.2 Graph & Vertex
While converting the RelNode tree into HepRelVertex nodes, HepPlanner also builds a graph whose vertices capture the relationships among all the RelNode nodes. For each HepRelVertex, it records a directed source -> dest edge from the vertex itself to each child it points to,
namely a DirectedGraph. For
HepRelVertex#1 -> currentRel(Project($0, $1))
HepRelVertex#2 -> currentRel(TableScan($0, $1, $2))
the graph holds the mapping
HepRelVertex#1 -> (HepRelVertex#1 -> HepRelVertex#2)
HepRelVertex#2 -> () -- no outgoing edges
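The wrapping and the edge map can be sketched together as follows. This is a minimal illustration under stated assumptions, not Calcite's implementation: Rel, Vertex, and VertexGraphSketch are hypothetical stand-ins for RelNode, HepRelVertex, and the planner's graph, and vertex ids are handed out top-down only so that they match the numbering used in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; Rel, Vertex and VertexGraphSketch are hypothetical
// stand-ins for RelNode, HepRelVertex and the planner's graph.
final class VertexGraphSketch {
  static final class Rel {
    final String desc;
    final List<Rel> inputs;

    Rel(String desc, Rel... inputs) {
      this.desc = desc;
      this.inputs = Arrays.asList(inputs);
    }
  }

  static final class Vertex {
    final int id;
    final Rel currentRel; // the wrapped node

    Vertex(int id, Rel currentRel) {
      this.id = id;
      this.currentRel = currentRel;
    }

    @Override public String toString() {
      return "HepRelVertex#" + id;
    }
  }

  /** Source vertex -> destination (child) vertices, in insertion order. */
  final Map<Vertex, List<Vertex>> edges = new LinkedHashMap<>();
  private int nextId = 1;

  /** Wraps rel and, recursively, its inputs; records one edge per input. */
  Vertex addRel(Rel rel) {
    // Top-down id assignment is an assumption of this sketch.
    Vertex vertex = new Vertex(nextId++, rel);
    List<Vertex> children = new ArrayList<>();
    edges.put(vertex, children);
    for (Rel input : rel.inputs) {
      children.add(addRel(input));
    }
    return vertex;
  }

  public static void main(String[] args) {
    Rel scan = new Rel("TableScan($0, $1, $2)");
    Rel project = new Rel("Project($0, $1)", scan);
    VertexGraphSketch graph = new VertexGraphSketch();
    graph.addRel(project);
    System.out.println(graph.edges);
  }
}
```

The printed map has one entry per vertex: the Project vertex lists the TableScan vertex as its only destination, and the TableScan vertex has an empty list.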
All of these mappings together form the graph of the whole plan tree.
2.3 HepProgram
To drive the greedy algorithm, HepPlanner runs rules as assembled by a HepProgram. A HepProgram is composed of HepInstructions, which are added to the program in sequence and executed in that order; the instructions include MatchLimit, MatchOrder, and Rules. MatchLimit caps the number of rule matches the program may perform and is unlimited if not set. MatchOrder determines the order in which vertices are visited when rules are applied and has four values: ARBITRARY, BOTTOM_UP, TOP_DOWN, and DEPTH_FIRST. ARBITRARY was long regarded as the most efficient order and was the original default.
1. ARBITRARY
After each rule application, iteration continues from the current (newly produced) vertex; once no vertices remain, another pass starts from the root node.
2. BOTTOM_UP
Rules are applied to all vertices in reverse order, that is, from the leaves up to the root.
3. TOP_DOWN
The opposite of BOTTOM_UP: all vertices are visited in order from the root down.
4. DEPTH_FIRST
A newer apply mode, depth-first. It addresses the problem that restarting from the root node after every new plan re-applies rules far too many times. Because it is more efficient, it also serves as the default for HepPlanner.
See https://issues.apache.org/jira/browse/CALCITE-2111
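The visiting orders can be illustrated on a small plan. The sketch below is a toy, assuming a hypothetical plan Join(Filter(ScanA), ScanB) stored as a parent-to-children map; the class and method names are made up, and Kahn's algorithm is used here as one way to obtain a top-down (topological) order, which is not necessarily how Calcite's own iterators are written. It only shows which vertex is visited when; the restart behavior after a transformation is not modeled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; all names here are hypothetical.
final class MatchOrderSketch {
  /** Parent -> children for the toy plan Join(Filter(ScanA), ScanB). */
  static Map<String, List<String>> samplePlan() {
    Map<String, List<String>> g = new LinkedHashMap<>();
    g.put("Join", Arrays.asList("Filter", "ScanB"));
    g.put("Filter", Arrays.asList("ScanA"));
    g.put("ScanA", Collections.<String>emptyList());
    g.put("ScanB", Collections.<String>emptyList());
    return g;
  }

  /** Top-down: topological order (Kahn), every parent before its children. */
  static List<String> topDown(Map<String, List<String>> g) {
    Map<String, Integer> inDegree = new LinkedHashMap<>();
    for (String v : g.keySet()) {
      inDegree.put(v, 0);
    }
    for (List<String> children : g.values()) {
      for (String c : children) {
        inDegree.put(c, inDegree.get(c) + 1);
      }
    }
    Deque<String> ready = new ArrayDeque<>();
    for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
      if (e.getValue() == 0) {
        ready.add(e.getKey());
      }
    }
    List<String> order = new ArrayList<>();
    while (!ready.isEmpty()) {
      String v = ready.remove();
      order.add(v);
      for (String c : g.get(v)) {
        int d = inDegree.get(c) - 1;
        inDegree.put(c, d);
        if (d == 0) {
          ready.add(c);
        }
      }
    }
    return order;
  }

  /** Bottom-up: the top-down order reversed, children before parents. */
  static List<String> bottomUp(Map<String, List<String>> g) {
    List<String> order = topDown(g);
    Collections.reverse(order);
    return order;
  }

  /** Depth-first pre-order from start; assumes a tree, so no visited set. */
  static List<String> depthFirst(Map<String, List<String>> g, String start) {
    List<String> order = new ArrayList<>();
    visit(g, start, order);
    return order;
  }

  private static void visit(
      Map<String, List<String>> g, String v, List<String> order) {
    order.add(v);
    for (String c : g.get(v)) {
      visit(g, c, order);
    }
  }

  public static void main(String[] args) {
    Map<String, List<String>> g = samplePlan();
    System.out.println("top-down:    " + topDown(g));
    System.out.println("bottom-up:   " + bottomUp(g));
    System.out.println("depth-first: " + depthFirst(g, "Join"));
  }
}
```

On this plan, the top-down order finishes a whole level before descending, while the depth-first order follows the Filter branch down to ScanA before it reaches ScanB.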
Rules specifies the optimization rules that need to be run.
HepPlanner runs a HepProgram as follows:
for (HepInstruction instruction : currentProgram.instructions) {
  instruction.execute(this);
  int delta = nTransformations - nTransformationsLastGC;
  if (delta > graphSizeLastGC) {
    // The number of transformations performed since the last
    // garbage collection is greater than the number of vertices in
    // the graph at that time. That means there should be a
    // reasonable amount of garbage to collect now. We do it this
    // way to amortize garbage collection cost over multiple
    // instructions, while keeping the highwater memory usage
    // proportional to the graph size.
    collectGarbage();
  }
}
As the loop shows, HepPlanner simply executes each HepInstruction in sequence and, once the number of transformations since the last collection exceeds the graph size recorded at that time, clears out the RelNodes left over as intermediate results.
2.4 Rule Apply
How is a set of rules handled each time?
The algorithm is as follows:
int nMatches = 0;
boolean fixpoint;
do {
  Iterator<HepRelVertex> iter = getGraphIterator(root);
  fixpoint = true;
  while (iter.hasNext()) {
    HepRelVertex vertex = iter.next();
    for (RelOptRule rule : rules) {
      HepRelVertex newVertex =
          applyRule(rule, vertex, forceConversions);
      if (newVertex != null) {
        ++nMatches;
        if (nMatches >= currentProgram.matchLimit) {
          return;
        }
        if (fullRestartAfterTransformation) {
          iter = getGraphIterator(root);
        } else {
          // To the extent possible, pick up where we left
          // off; have to create a new iterator because the old
          // one was invalidated by the transformation.
          iter = getGraphIterator(newVertex);
          // Remember to go around again since we're
          // skipping some stuff.
          fixpoint = false;
        }
        break;
      }
    }
  }
} while (!fixpoint);
The algorithm is fairly concise. It first builds an iterator over the whole tree starting from the root HepRelVertex, then visits each HepRelVertex and tries to match and apply the given rules. If a rule fires, a transformation occurs and its result is taken to be the better plan; a new HepRelVertex iterator is then built according to MatchOrder so that application continues, until the whole tree yields no further transformations or the matchLimit is reached.
3 Advantages and disadvantages
HepPlanner looks simple, but it has quite a few pitfalls, and its shortcomings are just as obvious.
1) The plan is not optimal
The result of each rule application becomes the starting point of the next iteration and is assumed to be the best plan, so other optimization opportunities may be lost and the final plan may not be optimal.
2) Implementation shortcomings
Infinite loop: every rule application produces a result, so if a rule generates a pattern that the same rule matches again, it will be applied forever. Guarding against this is another use of matchLimit.
There are roughly three situations that lead to repeated application:
1) The internal logic of a single rule triggers repeated application, such as JoinCommuteRule, which swaps the two sides of a join (left -> right, right -> left) and, if not restricted, keeps applying.
2) A combination of rules triggers repeated application, such as ProjectFilterTransposeRule and FilterProjectTransposeRule,
for example with the SQL: select 1 from Dept where ABS(-1) = 20
3) A single rule always produces a duplicate RelNode.
No such rule has been found in Calcite itself, but we wrote a similar case in our own usage,
such as FilterXxxTransposeRule: when handling OR, it extracts an expression, but the remainder is identical to the original filter, so each newly generated filter keeps matching the rule.
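The second situation can be reproduced with a toy loop. This is a sketch under stated assumptions, not Calcite code: the plan is a flat list of operator names from root to leaf, SwapRule and run are hypothetical names, and the loop mimics only the fixpoint and matchLimit logic. With a single transpose rule the loop reaches a fixpoint after one match; with a rule and its inverse it ping-pongs and stops only when matchLimit is hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only; SwapRule and run are hypothetical names.
final class MatchLimitSketch {
  /** Swaps `upper` and `lower` when `upper` sits directly above `lower`. */
  static final class SwapRule {
    final String upper;
    final String lower;

    SwapRule(String upper, String lower) {
      this.upper = upper;
      this.lower = lower;
    }

    /** Rewrites the plan in place and returns true if position i matches. */
    boolean apply(List<String> plan, int i) {
      if (i + 1 < plan.size()
          && plan.get(i).equals(upper)
          && plan.get(i + 1).equals(lower)) {
        Collections.swap(plan, i, i + 1);
        return true;
      }
      return false;
    }
  }

  /** Applies rules until nothing fires or matchLimit is hit; returns matches. */
  static int run(List<String> plan, List<SwapRule> rules, int matchLimit) {
    int nMatches = 0;
    boolean fixpoint;
    do {
      fixpoint = true;
      for (int i = 0; i < plan.size(); i++) {
        for (SwapRule rule : rules) {
          if (rule.apply(plan, i)) {
            if (++nMatches >= matchLimit) {
              return nMatches; // the only exit when rules undo each other
            }
            fixpoint = false; // something changed, so go around again
            break;
          }
        }
      }
    } while (!fixpoint);
    return nMatches;
  }

  public static void main(String[] args) {
    SwapRule projectFilter = new SwapRule("Project", "Filter");
    SwapRule filterProject = new SwapRule("Filter", "Project");

    List<String> plan =
        new ArrayList<>(Arrays.asList("Project", "Filter", "Scan"));
    System.out.println(run(plan, Arrays.asList(projectFilter), 100));

    List<String> pingPong =
        new ArrayList<>(Arrays.asList("Project", "Filter", "Scan"));
    System.out.println(
        run(pingPong, Arrays.asList(projectFilter, filterProject), 100));
  }
}
```

The first call stops on its own after a single match. The second never reaches a fixpoint, so the match count it returns equals the limit passed in.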
Too many rule applications: after each rule application the next iteration is rebuilt, so rules end up being applied more times than necessary.
4 Use
In Calcite, HepPlanner is mainly used for up-front pruning, handing VolcanoPlanner a reasonable but not necessarily optimal plan to continue optimizing; Drill, among others, works this way.
PS: A follow-up post will cover the VolcanoPlanner source implementation.