Spark SQL / Catalyst Internals and RBO


This article is forwarded from the blog TechnicalWorld.

The contents of this article are based on Spark 2.3.1, the latest release as of September 10, 2018, and will continue to be updated for subsequent releases.

Spark SQL Architecture

The overall architecture of Spark SQL is shown in the figure below.

As the figure shows, whether SQL statements are used directly or the DataFrame API is used, a query goes through the following steps before being executed as RDD operations on a DAG:

    • Parser parses the SQL and generates an Unresolved Logical Plan
    • Analyzer combines the Unresolved Logical Plan with Catalog information to generate a Resolved Logical Plan
    • Optimizer optimizes the Resolved Logical Plan based on predefined rules and generates an Optimized Logical Plan
    • Query Planner converts the Optimized Logical Plan into multiple Physical Plans
    • CBO computes the cost of each Physical Plan based on the Cost Model and chooses the cheapest one as the final Physical Plan
    • Spark executes the chosen Physical Plan as a DAG
    • During DAG execution, Adaptive Execution dynamically adjusts the execution plan based on runtime information to improve execution efficiency

Spark SQL uses ANTLR for lexical and syntactic analysis, generating the Unresolved Logical Plan.

When a user submits SQL via SparkSession.sql(sqlText: String), SparkSession eventually calls the parsePlan method of SparkSqlParser. The method proceeds in two steps:

    • Use the ANTLR-generated SqlBaseLexer to perform lexical analysis of the SQL and produce a CommonTokenStream
    • Use the ANTLR-generated SqlBaseParser to perform syntactic analysis and obtain the LogicalPlan
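The two phases can be illustrated without ANTLR. The following Python sketch (the grammar and all names are toy stand-ins, not Spark's SqlBaseLexer/SqlBaseParser) first tokenizes the text, then builds a tree from the token stream:

```python
import re

# Phase 1: lexical analysis - turn the raw SQL text into a token stream
# (ANTLR's generated lexer plays this role in Spark; this regex lexer is a toy).
def lex(sql):
    tokens = []
    for num, word, sym in re.findall(r"\s*(?:(\d+)|(\w+)|(.))", sql):
        if num:
            tokens.append(("NUM", num))
        elif word:
            tokens.append(("WORD", word))
        elif sym.strip():
            tokens.append(("SYM", sym))
    return tokens

# Phase 2: syntactic analysis - build a tree from the tokens (the generated
# parser's role). This toy only handles "SELECT <col> FROM <table>".
def parse(tokens):
    kinds = [t[1].upper() if t[0] == "WORD" else t[1] for t in tokens]
    assert kinds[0] == "SELECT" and kinds[2] == "FROM"
    return {"op": "Project", "column": tokens[1][1],
            "child": {"op": "UnresolvedRelation", "table": tokens[3][1]}}

plan = parse(lex("SELECT name FROM people"))
```

Note that the result is still "unresolved": the parser records only that `people` is a table, with no idea of its schema or storage.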

Suppose two tables are defined as follows:

CREATE TABLE score (
  id INT,
  math_score INT,
  english_score INT
);
CREATE TABLE people (
  id INT,
  age INT,
  name INT
);

The following query joins the two tables:

SELECT sum(v)
FROM (
  SELECT score.id,
    100 + 80 + score.math_score + score.english_score AS v
  FROM people
  JOIN score
  ON people.id = score.id
  AND people.age > 10
) tmp

The generated Unresolved Logical Plan is shown below.

As can be seen:

    • The two tables involved in the query are parsed into two UnresolvedRelations; that is, Spark only knows that they are tables, but does not yet know whether they are EXTERNAL or MANAGED tables, where their data is stored, or what their table structures are
    • The result of sum(v) is not named
    • The Project part only knows which attributes are selected, but not which tables the attributes belong to or what their data types are
    • The Filter part likewise does not know the data types

The Unresolved Logical Plan that Spark SQL parses out is shown below:

== Parsed Logical Plan ==
'Project [unresolvedalias('sum('v), None)]
+- 'SubqueryAlias tmp
   +- 'Project ['score.id, (((100 + 80) + 'score.math_score) + 'score.english_score) AS v#493]
      +- 'Filter (('people.id = 'score.id) && ('people.age > 10))
         +- 'Join Inner
            :- 'UnresolvedRelation `people`
            +- 'UnresolvedRelation `score`

As can be seen from the constructor of Analyzer:

    • Analyzer holds a reference to a SessionCatalog object
    • Analyzer extends RuleExecutor[LogicalPlan], so it can transform a LogicalPlan
class Analyzer(
    catalog: SessionCatalog,
    conf: SQLConf,
    maxIterations: Int)
  extends RuleExecutor[LogicalPlan] with CheckAnalysis {
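The RuleExecutor pattern itself is simple. The following Python sketch (purely illustrative; Spark's real implementation is in Scala and far richer) shows the core idea of a FixedPoint batch: apply each rule in turn, and repeat until the plan stops changing:

```python
# Toy rule executor: each rule is a plan -> plan function; a FixedPoint
# batch reapplies its rules until the plan stops changing (or a cap is hit).
def execute_batch(plan, rules, max_iterations=100):
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if new_plan == plan:      # fixed point reached
            return new_plan
        plan = new_plan
    return plan

# Example rule over tuple-based trees: rewrite ("add", 0, x) to x, bottom-up.
def remove_add_zero(plan):
    if isinstance(plan, tuple):
        plan = tuple(remove_add_zero(c) for c in plan)
        if plan[0] == "add" and plan[1] == 0:
            return plan[2]
    return plan

result = execute_batch(("add", 0, ("add", 0, 5)), [remove_add_zero])
```

Because every rule is an equivalence transformation, rules can be composed freely into batches like the ones below.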

Analyzer contains the following conversion rules

lazy val batches: Seq[Batch] = Seq(
  Batch("Hints", fixedPoint,
    new ResolveHints.ResolveBroadcastHints(conf),
    ResolveHints.RemoveAllHints),
  Batch("Simple Sanity Check", Once,
    LookupFunctions),
  Batch("Substitution", fixedPoint,
    CTESubstitution,
    WindowsSubstitution,
    EliminateUnions,
    new SubstituteUnresolvedOrdinals(conf)),
  Batch("Resolution", fixedPoint,
    ResolveTableValuedFunctions ::
      ResolveRelations ::
      ResolveReferences ::
      ResolveCreateNamedStruct ::
      ResolveDeserializer ::
      ResolveNewInstance ::
      ResolveUpCast ::
      ResolveGroupingAnalytics ::
      ResolvePivot ::
      ResolveOrdinalInOrderByAndGroupBy ::
      ResolveAggAliasInGroupBy ::
      ResolveMissingReferences ::
      ExtractGenerator ::
      ResolveGenerate ::
      ResolveFunctions ::
      ResolveAliases ::
      ResolveSubquery ::
      ResolveSubqueryColumnAliases ::
      ResolveWindowOrder ::
      ResolveWindowFrame ::
      ResolveNaturalAndUsingJoin ::
      ExtractWindowExpressions ::
      GlobalAggregates ::
      ResolveAggregateFunctions ::
      TimeWindowing ::
      ResolveInlineTables(conf) ::
      ResolveTimeZone(conf) ::
      ResolvedUuidExpressions ::
      TypeCoercion.typeCoercionRules(conf) ++
      extendedResolutionRules: _*),
  Batch("Post-hoc Resolution", Once, postHocResolutionRules: _*),
  Batch("View", Once,
    AliasViewChild(conf)),
  Batch("Nondeterministic", Once,
    PullOutNondeterministic),
  Batch("UDF", Once,
    HandleNullInputsForUDF),
  Batch("FixNullability", Once,
    FixNullability),
  Batch("Subquery", Once,
    UpdateOuterReferences),
  Batch("Cleanup", fixedPoint,
    CleanupAliases)
)

For example, ResolveRelations resolves the tables or views used by the query. In this example, UnresolvedRelation(`people`) and UnresolvedRelation(`score`) are resolved to HiveTableRelation(`jason`.`people`) and HiveTableRelation(`jason`.`score`), each listing the names of the fields it contains.
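Conceptually, this resolution is a lookup against the catalog. The toy Python sketch below (the dict-based catalog and plan nodes are hypothetical stand-ins for SessionCatalog and Catalyst's tree nodes) replaces UnresolvedRelation nodes with resolved relations carrying a schema:

```python
# Hypothetical catalog: table name -> column names, standing in for the
# metadata that SessionCatalog fetches from the Hive metastore.
CATALOG = {
    "people": ["id", "age", "name"],
    "score": ["id", "math_score", "english_score"],
}

def resolve_relations(plan):
    """Recursively replace UnresolvedRelation nodes with resolved relations."""
    if plan["op"] == "UnresolvedRelation":
        return {"op": "HiveTableRelation",
                "table": plan["table"],
                "schema": CATALOG[plan["table"]]}
    if "child" in plan:
        plan = dict(plan, child=resolve_relations(plan["child"]))
    return plan

resolved = resolve_relations(
    {"op": "Project", "column": "age",
     "child": {"op": "UnresolvedRelation", "table": "people"}})
```

Once every relation carries a schema, later rules can resolve column references and data types against it.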

The Resolved Logical Plan produced by Analyzer is as follows:

== Analyzed Logical Plan ==
sum(v): bigint
Aggregate [sum(cast(v#493 as bigint)) AS sum(v)#504L]
+- SubqueryAlias tmp
   +- Project [id#500, (((100 + 80) + math_score#501) + english_score#502) AS v#493]
      +- Filter ((id#496 = id#500) && (age#497 > 10))
         +- Join Inner
            :- SubqueryAlias people
            :  +- HiveTableRelation `jason`.`people`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#496, age#497, name#498]
            +- SubqueryAlias score
               +- HiveTableRelation `jason`.`score`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#500, math_score#501, english_score#502]

A comparison of the LogicalPlan before and after Analyzer's analysis is shown in the figure below.

As can be seen, after analysis, the set of fields for each table, the field types, and the data storage locations have all been determined. The field types and positions within the tables for the Project and Filter operations have also been determined.

With this information, the LogicalPlan could already be converted directly into a Physical Plan and executed.

However, the quality of the SQL submitted by different users varies. If executed directly, semantically identical SQL submitted by different users could differ enormously in execution efficiency. In other words, to guarantee high efficiency, users would have to do a lot of SQL tuning themselves, which greatly hurts usability.

To ensure that SQL executes efficiently as far as possible, regardless of whether users are familiar with SQL optimization or of the quality of the SQL they submit, Spark SQL optimizes the LogicalPlan before execution.


Most of Spark SQL's current optimizations are rule-based, i.e. RBO (Rule-Based Optimization):

    • Each optimization exists in the form of a Rule, and each Rule is an equivalence transformation of the Analyzed Plan
    • RBO is well designed and easy to extend; new rules can easily be embedded into the Optimizer
    • RBO currently works well, but still needs more rules to cover more scenarios
    • The optimization ideas are mainly to reduce the amount of data involved in computation and the cost of the computation itself

PushDownPredicate (predicate pushdown) is the most common way to reduce the amount of data involved in computation.

In this article's query, the two tables are joined first and the Filter is applied afterwards. With PushDownPredicate, each table is filtered first and the results are then joined, as shown in the figure below.

When the Filter removes most of the data, the amount of data involved in the Join is greatly reduced, making the Join operation much faster.
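The rewrite itself can be pictured as a tree transformation. The toy Python sketch below (not Spark's PushDownPredicate code; the dict-based plan nodes are illustrative) pushes a Filter that references only one side's columns below the Join:

```python
# Toy predicate pushdown: a Filter sitting on top of a Join is moved to
# whichever side of the Join contains all the columns its predicate uses.
def push_down_filter(plan):
    if plan["op"] == "Filter" and plan["child"]["op"] == "Join":
        join = plan["child"]
        side = "left" if plan["uses"] <= join["left"]["columns"] else "right"
        pushed = dict(plan, child=join[side])   # Filter now wraps one input
        return dict(join, **{side: pushed})     # Join sits above the Filter
    return plan

people = {"op": "Scan", "table": "people", "columns": {"id", "age", "name"}}
score = {"op": "Scan", "table": "score",
         "columns": {"id", "math_score", "english_score"}}
join = {"op": "Join", "left": people, "right": score,
        "columns": people["columns"] | score["columns"]}
plan = {"op": "Filter", "condition": "age > 10", "uses": {"age"},
        "child": join}

optimized = push_down_filter(plan)
```

After the rewrite, the Filter on age > 10 runs before the Join instead of after it, which is exactly the transformation shown for this article's query.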

It should be noted that this is an optimization of the LogicalPlan: logically, pushing the Filter down guarantees that less data participates in the Join, improving performance. Beyond that, at the physical level, for storage that supports filter pushdown, the table no longer needs to be fully scanned and then filtered; data can be scanned directly according to the filter conditions, greatly reducing the cost of scanning the table and further improving execution speed.

In this article's query, the Project contains 100 + 80 + math_score + english_score. Without optimization, if there are 100 million records, 100 + 80 would be computed 100 million times, which wastes resources. ConstantFolding therefore folds these constants together, reducing unnecessary computation and improving execution speed.
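Constant folding is a simple bottom-up rewrite of the expression tree. The following toy Python sketch (illustrative only, not Spark's ConstantFolding rule) folds the constant subtree of this article's expression once at optimization time:

```python
# Toy constant folding over expression trees of the form (op, left, right):
# subtrees whose operands are both literals are evaluated once at
# optimization time instead of once per row at execution time.
def fold_constants(expr):
    if isinstance(expr, tuple):
        op, left, right = expr
        left, right = fold_constants(left), fold_constants(right)
        if op == "+" and isinstance(left, int) and isinstance(right, int):
            return left + right
        return (op, left, right)
    return expr  # a literal (int) or a column reference (str)

# 100 + 80 + math_score + english_score, left-associated as in the plan
expr = ("+", ("+", ("+", 100, 80), "math_score"), "english_score")
folded = fold_constants(expr)
```

The folded result matches the (180 + math_score) + english_score expression visible in the Optimized Logical Plan below.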

In the figure above, the Filter and Join operations retain all fields from both sides, and only the Project afterwards selects the specific columns wanted. If the Project can be pushed down, only the minimal set of fields needed by subsequent operations is kept when the tables are scanned, which greatly reduces the size of the intermediate result sets for the Filter and Join operations and thus greatly improves execution speed.

Again, this is a logical optimization. At the physical level, with the Project pushed down, columnar storage formats such as Parquet and ORC can scan only the needed columns and skip the rest, further reducing scan overhead and increasing execution speed.
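Column pruning can be modeled as propagating the set of columns that upstream operators actually reference down to each scan. The toy Python sketch below (not Spark's ColumnPruning rule; the plan representation is hypothetical) shrinks a Scan to the minimal needed column set:

```python
# Toy column pruning: walk the plan top-down, accumulating which columns
# the operators above actually reference, and shrink each Scan to that set.
def prune_columns(plan, needed):
    op = plan["op"]
    if op == "Scan":
        return dict(plan, columns=plan["columns"] & needed)
    if op == "Project":
        needed = set(plan["uses"])          # only the projected columns matter
    elif op in ("Filter", "Join"):
        needed = needed | set(plan["uses"])  # plus the predicate's columns
    children = {k: prune_columns(v, needed)
                for k, v in plan.items() if k in ("child", "left", "right")}
    return dict(plan, **children)

scan = {"op": "Scan", "table": "people", "columns": {"id", "age", "name"}}
plan = {"op": "Project", "uses": {"id"},
        "child": {"op": "Filter", "uses": {"age"}, "child": scan}}

pruned = prune_columns(plan, set())
```

Here the name column is dropped from the scan because neither the Project nor the Filter references it; the same reasoning removes name from the people scan in this article's query.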

The optimized LogicalPlan is as follows:

== Optimized Logical Plan ==
Aggregate [sum(cast(v#493 as bigint)) AS sum(v)#504L]
+- Project [((180 + math_score#501) + english_score#502) AS v#493]
   +- Join Inner, (id#496 = id#500)
      :- Project [id#496]
      :  +- Filter ((isnotnull(age#497) && (age#497 > 10)) && isnotnull(id#496))
      :     +- HiveTableRelation `jason`.`people`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#496, age#497, name#498]
      +- Filter isnotnull(id#500)
         +- HiveTableRelation `jason`.`score`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#500, math_score#501, english_score#502]

SparkPlanner then converts the optimized LogicalPlan into a SparkPlan, the physical plan.

In this case, because the score table's data volume is small, Spark uses a broadcast join: after the Filter, the score table's data is broadcast directly via BroadcastExchangeExec, and the people table is then joined with the broadcast data via BroadcastHashJoinExec. Finally, HashAggregateExec performs the aggregation after the Project.
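The idea behind BroadcastHashJoinExec can be sketched in a single process (a toy model; real Spark broadcasts the small side to every executor and runs the probe in parallel across partitions): build a hash table from the small side once, then stream the large side against it:

```python
# Toy broadcast hash join: build a hash table from the small ("broadcast")
# side keyed on the join key, then probe it while streaming the large side.
def broadcast_hash_join(large_rows, small_rows, key):
    built = {}
    for row in small_rows:                 # broadcast side: hashed once
        built.setdefault(row[key], []).append(row)
    out = []
    for row in large_rows:                 # streamed side: one hash probe each
        for match in built.get(row[key], []):
            out.append({**row, **{k: v for k, v in match.items() if k != key}})
    return out

people = [{"id": 1, "age": 20}, {"id": 2, "age": 30}]
score = [{"id": 1, "math_score": 90}]
joined = broadcast_hash_join(people, score, "id")
```

Because the small side is hashed once and no shuffle of the large side is needed, this is much cheaper than a shuffle-based join when one input is small.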

At this point, the complete flow of a SQL statement from submission through parsing, analysis, optimization, and execution is finished.

The Optimizer described in this article is an RBO; its implementation is simple yet effective. It optimizes the LogicalPlan, and all of its optimizations are based on properties of the LogicalPlan itself, taking into account neither the characteristics of the data nor the cost of the operators. The CBO, described next, considers both the characteristics of the data (such as size and distribution) and of the operators (the size and distribution of intermediate result sets), along with their costs, and can therefore better choose the physical execution plan, i.e. the SparkPlan, with the lowest execution cost.
