Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the heart of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features, such as Scala's pattern matching and quasiquotes, in a novel way to build an extensible query optimizer. We recently published a paper on Spark SQL that will appear in SIGMOD 2015 (co-authored by Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, and Ali Ghodsi). In this blog post we look back at some of that paper's content, explaining the internals of the Catalyst optimizer for a broader audience.

To implement Spark SQL, we designed a new extensible optimizer, Catalyst, based on functional programming constructs in Scala. Catalyst's extensible design had two purposes. First, we wanted to make it easy to add new optimization techniques and features to Spark SQL, especially to tackle the various problems we see with big data (for example, semi-structured data and advanced analytics). Second, we wanted to enable external developers to extend the optimizer, for example by adding data-source-specific rules that can push filtering or aggregation into external storage systems, or by adding support for new data types.

Catalyst supports both rule-based and cost-based optimization. At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them. On top of this framework it provides libraries specific to relational query processing (e.g., expressions, logical query plans) and several sets of rules that handle the different phases of query execution: analysis, logical optimization, physical planning, and code generation, which compiles parts of queries to Java bytecode. For the latter, Catalyst uses another Scala feature, quasiquotes, which makes it easy to generate code at runtime from composable expressions. Finally, Catalyst offers several public extension points, including external data sources and user-defined types.
Trees
The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined in Scala as subclasses of the TreeNode class. These objects are immutable and can be manipulated using functional transformations, as discussed in the next section. As a simple example, consider a very simple expression language described by the following three node classes:
- Literal(value: Int): a constant value
- Attribute(name: String): an attribute from an input row, e.g., "x"
- Add(left: TreeNode, right: TreeNode): the sum of two expressions
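In Scala, these could be declared as plain case classes over a common base type, roughly as in the following minimal sketch (an illustration only; Catalyst's real TreeNode class carries considerably more machinery):

// Illustrative sketch only, not Catalyst's actual API.
sealed trait TreeNode
case class Literal(value: Int) extends TreeNode                   // constant value
case class Attribute(name: String) extends TreeNode               // input-row attribute, e.g. "x"
case class Add(left: TreeNode, right: TreeNode) extends TreeNode  // sum of two expressions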
These classes can be used to build up trees; for example, the tree for the expression x + (1 + 2) would be represented in Scala code as follows:
Add(Attribute("x"), Add(Literal(1), Literal(2)))
Rules
Trees can be manipulated using rules, which are functions from one tree to another tree. While a rule can run arbitrary code on its input tree (since a tree is just a Scala object), the most common approach is to use a set of pattern-matching functions that find and replace subtrees with a specific structure. Pattern matching is a feature of many functional languages that allows values to be extracted from potentially nested structures of algebraic data types. In Catalyst, trees offer a transform method that applies a pattern-matching function recursively on all nodes of the tree, transforming the ones that match each pattern to a result. For example, we could implement a rule that folds Add operations between constants as follows:
tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}
Applying this to the tree for x + (1 + 2) would yield the new tree x + 3. The key here is the use of Scala's standard pattern-matching syntax, which can be used to match on the types of objects as well as to give names to extracted values (c1 and c2 here).
The pattern-matching expression passed to transform is a partial function, which means that it only needs to match a subset of all possible input trees. Catalyst tests which parts of a tree a given rule applies to, automatically skipping over and descending into subtrees that do not match. This ability means that rules only need to reason about the trees where a given optimization applies, and not about those that do not match. As a result, rules do not need to be modified as new types of operators are added to the system. Rules (and Scala pattern matching in general) can match multiple patterns in the same transform call, which makes it very concise to implement multiple transformations at once:
tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
  case Add(left, Literal(0)) => left
  case Add(Literal(0), right) => right
}
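To make the mechanics concrete, here is a hypothetical, minimal transform over the toy node classes sketched earlier (not Catalyst's implementation; the real TreeNode works over arbitrary node types and offers both top-down and bottom-up variants, whereas this toy version rewrites bottom-up):

// Rewrite the children first, then apply the rule to the rebuilt node if it matches.
def transform(node: TreeNode)(rule: PartialFunction[TreeNode, TreeNode]): TreeNode = {
  val withNewChildren = node match {
    case Add(left, right) => Add(transform(left)(rule), transform(right)(rule))
    case leaf             => leaf  // Literal and Attribute have no children
  }
  rule.applyOrElse(withNewChildren, (n: TreeNode) => n)
}

// Example: folding the constants in x + (1 + 2) yields x + 3.
val folded = transform(Add(Attribute("x"), Add(Literal(1), Literal(2)))) {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}
// folded == Add(Attribute("x"), Literal(3))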
In practice, rules may need to execute multiple times to fully transform a tree. Catalyst groups rules into batches and executes each batch until it reaches a fixed point, that is, until the tree stops changing after its rules are applied. Running rules to fixed point means that each rule can be simple and self-contained, yet still eventually have a larger global effect on the tree. In the example above, repeated application would constant-fold larger trees, such as (x + 0) + (3 + 3). As another example, a first batch might analyze an expression to assign types to all of its attributes, while a second batch might use those types to do constant folding. After each batch, developers can also run sanity checks on the new tree (for example, to verify that all attributes were assigned types), often also written as recursive matches.
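A hypothetical sketch of running a batch to fixed point, reusing the toy transform above (Catalyst's actual batch executor is more general, but the idea is the same):

// Keep applying every rule in the batch until the tree stops changing,
// with a maximum iteration count as a safety valve.
def runToFixedPoint(tree: TreeNode,
                    batch: Seq[PartialFunction[TreeNode, TreeNode]],
                    maxIterations: Int = 100): TreeNode = {
  var current = tree
  var changed = true
  var iterations = 0
  while (changed && iterations < maxIterations) {
    val next = batch.foldLeft(current)((t, rule) => transform(t)(rule))
    changed = next != current   // case classes give structural equality
    current = next
    iterations += 1
  }
  current
}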
Finally, rule conditions and their bodies can contain arbitrary Scala code. This gives Catalyst more power than domain-specific languages for optimizers, while keeping it concise for simple rules. In our experience, functional transformations on immutable trees make the whole optimizer easy to reason about and debug. Rules also enable parallelization within the optimizer, although we do not yet exploit this.
Using Catalyst in Spark SQL
Catalyst's general tree transformation framework is used in four phases: (1) analysis to resolve references in a logical plan, (2) logical plan optimization, (3) physical planning, and (4) code generation to compile parts of the query to Java bytecode. In the physical planning phase, Catalyst may generate multiple plans and compare them based on cost; all other phases are purely rule-based. Each phase uses different types of tree nodes; Catalyst includes libraries of nodes for expressions, data types, and logical and physical operators. The phases are described below.
Analysis
Consider a query over a table sales: the type of a column, or even whether the column name is valid, is unknown until we look up the metadata of the sales table. An attribute is called unresolved if we do not know its type or have not matched it to an input table (or an alias). Spark SQL resolves these attributes using Catalyst rules and a Catalog object that records all table metadata. It starts by building an "unresolved logical plan" tree with unbound attributes and data types, and then applies rules that do the following:

1. Look up relations by name in the catalog.
2. Map named attributes, such as col, to the input provided by an operator's children.
3. Determine which attributes refer to the same value in order to give them a unique ID (later allowing optimization of expressions such as col = col).
4. Propagate and coerce types through expressions: for example, we cannot know the return type of 1 + col until col is resolved and its subexpressions are possibly cast to compatible types.

In total, the analyzer's rules are about 1000 lines of code.
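As a toy illustration of one aspect of resolution (a hypothetical sketch over the simple node classes from earlier, not Spark SQL's actual analyzer), checking that every Attribute names a known column could look like this:

// Verify that every Attribute in an expression tree refers to a known column;
// the real analyzer also binds data types and assigns unique IDs.
def resolve(expr: TreeNode, knownColumns: Set[String]): Either[String, TreeNode] = expr match {
  case a @ Attribute(name) =>
    if (knownColumns.contains(name)) Right(a)
    else Left(s"cannot resolve attribute '$name'")
  case Add(left, right) =>
    for {
      l <- resolve(left, knownColumns)   // Either is right-biased in Scala 2.12+
      r <- resolve(right, knownColumns)
    } yield Add(l, r)
  case other => Right(other)
}

// resolve(Add(Attribute("col"), Literal(1)), Set("col"))  returns Right(...)
// resolve(Attribute("bogus"), Set("col"))                 returns Left("cannot resolve attribute 'bogus'")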
Logical Plan Optimization
In the logical optimization phase, standard rule-based optimizations are applied to the logical plan. (Cost-based optimization is performed by generating multiple plans using rules and then computing their costs.) These optimizations include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, and other rules. In general, we have found it extremely simple to add rules for a wide variety of situations. For example, when we added a fixed-precision DECIMAL type to Spark SQL, we wanted to optimize aggregations such as SUM and AVG over decimals with small precision; it took only 12 lines of code to write a rule that finds such decimals in SUM and AVG expressions, converts them to unscaled 64-bit longs, performs the aggregation on those, and then converts the result back. A simplified version of this rule that only optimizes SUM expressions looks like this:
object DecimalAggregates extends Rule[LogicalPlan] {
  /** Maximum number of decimal digits in a Long */
  val MAX_LONG_DIGITS = 18

  def apply(plan: LogicalPlan): LogicalPlan = {
    plan transformAllExpressions {
      case Sum(e @ DecimalType.Expression(prec, scale))
          if prec + 10 <= MAX_LONG_DIGITS =>
        MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
    }
  }
}
As another example, a rule of 12 lines of code optimizes LIKE expressions that use simple regular-expression patterns into String.startsWith or String.contains calls. The freedom to use arbitrary Scala code in rules made optimizations like these, which go beyond pattern-matching on the structure of a subtree, easy to express.
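The actual rule lives inside Catalyst, but the essence of the rewrite can be illustrated in plain Scala (a hypothetical sketch, not Spark's code): patterns with a single trailing '%' or a single surrounding pair of '%' can be answered without a regular expression.

// Map simple LIKE patterns to cheap String operations; anything else falls back to the regex path.
def simplifiedLike(pattern: String): Option[String => Boolean] = {
  def noWildcards(s: String) = !s.contains("%") && !s.contains("_")
  pattern match {
    case p if p.endsWith("%") && noWildcards(p.dropRight(1)) =>
      val prefix = p.dropRight(1)
      Some((s: String) => s.startsWith(prefix))        // "abc%"  -> s.startsWith("abc")
    case p if p.length >= 2 && p.startsWith("%") && p.endsWith("%") &&
              noWildcards(p.drop(1).dropRight(1)) =>
      val infix = p.drop(1).dropRight(1)
      Some((s: String) => s.contains(infix))           // "%abc%" -> s.contains("abc")
    case _ => None                                     // general case: compile the regex
  }
}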
In total, the logical optimization rules are about 800 lines of code.
Physical Planning
In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine. It then selects a plan using a cost model. At the moment, cost-based optimization is only used to select join algorithms: for relations that are known to be small, Spark SQL uses a broadcast join, which relies on a peer-to-peer broadcast facility available in Spark. The framework supports broader use of cost-based optimization, however, since costs can be estimated recursively for an entire tree using rules, and we intend to implement richer cost-based optimization in the future. The physical planner also performs rule-based physical optimizations, such as pipelining projections or filters into a single Spark map operation. In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown. We will describe the API for these data sources in a later section. In total, the physical planning rules are about 500 lines of code.
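As a toy sketch of the kind of cost-based choice described above (hypothetical; the size estimates and threshold are illustrative and not tied to Spark's configuration):

sealed trait JoinStrategy
case object BroadcastJoin extends JoinStrategy
case object ShuffleJoin extends JoinStrategy

// Broadcast the smaller side when its estimated size is below a threshold,
// otherwise fall back to a shuffle-based join.
def chooseJoin(leftSizeBytes: Long, rightSizeBytes: Long,
               broadcastThreshold: Long = 10L * 1024 * 1024): JoinStrategy =
  if (math.min(leftSizeBytes, rightSizeBytes) <= broadcastThreshold) BroadcastJoin
  else ShuffleJoin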
Code Generation
The final phase of query optimization involves generating Java bytecode to run on each machine. Because Spark SQL often operates on in-memory datasets, where processing is CPU-bound, we wanted to support code generation to speed up execution. However, building a code generation engine is often complicated, essentially amounting to writing a compiler. Catalyst relies on quasiquotes, a special feature of the Scala language, to make code generation simpler. Quasiquotes allow abstract syntax trees (ASTs) to be constructed programmatically in Scala and then fed to the Scala compiler at runtime to generate bytecode. We use Catalyst to transform a tree representing a SQL expression into an AST for Scala code that evaluates that expression, and then compile and run the generated code. As a simple example, consider the Add, Attribute, and Literal tree nodes introduced earlier, which allowed us to write expressions such as (x + y) + 1. Without code generation, such expressions would have to be interpreted for each row of data by walking down the tree of Add, Attribute, and Literal nodes. This introduces a large number of branches and virtual function calls that slow down execution. With code generation, we can write a function to translate a specific expression tree into a Scala AST, as follows:
def compile(node: Node): AST = node match {
  case Literal(value) => q"$value"
  case Attribute(name) => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}
The strings beginning with q are quasiquotes: although they look like strings, they are parsed by the Scala compiler at compile time and represent ASTs of the code inside them. Quasiquotes use the $ notation to splice variables or other ASTs into them. For example, Literal(1) would become the AST for the Scala expression 1, and Attribute("x") becomes row.get("x"). Finally, a tree like Add(Literal(1), Attribute("x")) becomes the AST for a Scala expression such as 1 + row.get("x").
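A minimal, self-contained sketch of the quasiquote mechanics described above (this is plain Scala 2.11/2.12 reflection, not Catalyst code; it assumes the scala-reflect and scala-compiler artifacts are on the classpath):

import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

object QuasiquoteDemo extends App {
  val toolbox = scala.reflect.runtime.currentMirror.mkToolBox()

  // Build an AST for 1 + 2 by splicing smaller ASTs together with $.
  val one: Tree = q"1"
  val two: Tree = q"2"
  val sum: Tree = q"$one + $two"

  println(showCode(sum))       // the generated Scala source for the spliced tree
  println(toolbox.eval(sum))   // compiled and evaluated at runtime: prints 3
}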
Quasiquotes are type-checked at compile time to ensure that only appropriate ASTs or literals are substituted in, which makes them significantly more usable than string concatenation, and they result directly in a Scala AST instead of running the Scala parser at runtime. Moreover, they are highly composable, because the code generation rule for each node does not need to know how the trees returned by its children were constructed. Finally, if there are expression-level optimizations that Catalyst misses, the resulting code is further optimized by the Scala compiler. Quasiquotes let us generate code with performance similar to hand-tuned programs. We have found quasiquotes very straightforward to use for code generation, and we observed that even new contributors to Spark SQL could quickly add rules for new types of expressions.

Quasiquotes also work well with our goal of running on native Java objects: when accessing fields from these objects, we can generate a direct access to the required field, instead of having to copy the object into a Spark SQL Row and use the Row's accessor methods. Finally, it was straightforward to combine code-generated evaluation with interpreted evaluation for expressions that we do not yet generate code for, since the Scala code we compile can directly call into our expression interpreter. In total, the code generator is about 700 lines of code.

This blog post has described the internals of Spark SQL's Catalyst optimizer. Its new, simple design has enabled the Spark community to quickly prototype, implement, and extend the engine. You can read the rest of the paper here. You can also find more information about Spark SQL in the following resources:
- Spark SQL and DataFrame Programming Guide from Apache Spark
- Data Source API in Spark presentation by Yin Huai
- Introducing DataFrames in Spark for Large Scale Data Science by Reynold Xin
- Beyond SQL: Speeding up Spark with DataFrames by Michael Armbrust
Original English blog post: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html. Original paper: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf. For reprints, please credit http://www.cnblogs.com/shishanyuan/p/8455786.html.