Organizing an Understanding of Spark SQL


Catalyst


Catalyst is positioned as a library decoupled from Spark: a framework for generating and optimizing implementation-free ("impl-free") execution plans.

At the moment it is still coupled with Spark core; there have been questions about this on the user mailing list (see the mail archive).

The following is an earlier Catalyst architecture diagram, showing the code structure and the processing flow.


Catalyst Positioning

If another system wants to run queries on Spark, whether SQL, standard SQL, or even another query language, it needs to build on what Catalyst provides (the parser, the tree structure for execution plans, the rule system for processing logical execution plans, and so on) in order to implement the parsing, generation, optimization, and mapping of execution plans.

Correspondingly, the TreeNode library on the left and the class structures involved in the three conversions in the middle are provided by Catalyst. As for the mapping and generation of the physical execution plan on the right, the physical execution plan is chosen based on a cost-optimization model, and the execution of the physical operators is implemented by the consuming system itself.

Catalyst Status

What the parser module provides is a simple SQL parser written in Scala, with limited semantic support; it still needs to be extended toward standard SQL.

In terms of rules, the optimization rules provided are relatively basic (not as rich as those of Pig/Hive), and some optimization rules are actually tied to specific physical operators, so some rules need to be designed and implemented in the consuming system itself (for example, SparkStrategies in spark-sql).

Catalyst also has its own set of data types.


The following are some of the important class structures in Catalyst.


TreeNode System

TreeNode is the data structure used to represent Catalyst execution plans. It is a tree structure with some Scala collection-like capabilities and tree traversal capabilities. The tree is kept in memory rather than dumped to disk in some file format, and whether in the mapping phase of the logical execution plan or in its optimization phase, the tree is modified by replacing existing nodes.

Internally, TreeNode holds a children: Seq[BaseType] for its child nodes. It offers foreach, map, collect, and other methods for operating on nodes, as well as transformDown (the default, pre-order traversal) and transformUp, which traverse the tree and apply a change to every matching node.

Three traits are provided, UnaryNode, BinaryNode, and LeafNode; that is, a non-leaf node is allowed to have one or two child nodes.

TreeNode thus provides a common template for the nodes of a plan tree.

TreeNode has two subclass inheritance hierarchies, QueryPlan and Expression. Under QueryPlan sit the two systems of logical and physical execution plans: the former has a detailed implementation inside Catalyst, while the latter needs to be implemented by the consuming system. Expression is the expression system, described in a later section.


Implementation of tree transformation:

A PartialFunction[TreeType, TreeType] is passed in; if a node is matched, it is replaced by the result of the partial function, otherwise it is left unchanged. The whole process is applied recursively to the children.
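As an illustration of this transform mechanism, here is a minimal sketch that folds constant additions in a Catalyst expression tree (a hedged example: the Add/Literal constructors and the transform method follow the Catalyst code of this period, and names may differ slightly across versions):

    import org.apache.spark.sql.catalyst.expressions.{Add, Expression, Literal}

    // (1 + 2) + 3 built directly as a Catalyst expression tree.
    val expr: Expression = Add(Add(Literal(1), Literal(2)), Literal(3))

    // The partial function only matches Add of two integer literals; any other
    // node is left unchanged, and the rule is applied recursively to children.
    val folded = expr transform {
      case Add(Literal(a: Int, _), Literal(b: Int, _)) => Literal(a + b)
    }
    // folded is Add(Literal(3), Literal(3)); a second pass would fold it to Literal(6),
    // which is why rule executors often run such rules repeatedly (see the Rules section).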


Execution Plan Representation Model


Logical Execution Plan

QueryPlan inherits from TreeNode. Internally it holds an output: Seq[Attribute] and provides the transformExpressionsDown and transformExpressionsUp methods.

In Catalyst, the primary subclass of QueryPlan is LogicalPlan, the representation of the logical execution plan. The physical execution plan representation is implemented by the consuming system (in the spark-sql project).

LogicalPlan inherits from QueryPlan. Internally it holds a references: Set[Attribute], and its main method is resolve(name: String): Option[NamedExpression], used to resolve a name to the corresponding NamedExpression.

LogicalPlan has many concrete subclasses, likewise divided into the three categories UnaryNode, BinaryNode, and LeafNode, located under the org.apache.spark.sql.catalyst.plans.logical package.




Logical Execution Plan Implementation

The primary LeafNode subclasses form the command system:


The semantics of each command can be read from its subclass name; they represent the non-query commands that the system can execute, such as DDL.

Subclasses of UnaryNode:



Subclasses of BinaryNode:



Physical Execution Plan

The physical execution plan nodes, on the other hand, are implemented in a specific system, such as the SparkPlan inheritance hierarchy in the spark-sql project.



Physical Execution Plan Implementation

Each subclass implements the execute() method. The implementation subclasses are roughly as follows (incomplete):

Subclasses of LeafNode:



Subclasses of UnaryNode:



Subclasses of BinaryNode:


Relating to the physical execution plan, it is also worth mentioning the partitioning representation model that Catalyst provides.


Execution Plan Mapping

Catalyst also provides a QueryPlanner[PhysicalPlan <: TreeNode[PhysicalPlan]] abstract class, which requires its subclasses to define a set of strategies: Seq[Strategy]. Its apply method maps logical execution plan operators to physical execution plan operators according to those concrete strategies. Since the physical execution plan nodes are implemented in the concrete system, the QueryPlanner and its strategies must also be implemented in the concrete system.



In the spark-sql project, SparkStrategies inherits QueryPlanner[SparkPlan] and internally defines LeftSemiJoin, HashJoin, PartialAggregation, BroadcastNestedLoopJoin, CartesianProduct, and several other strategies. Each strategy accepts a LogicalPlan and generates a Seq[SparkPlan], where each SparkPlan can be understood as an operator acting on concrete RDDs.

For example, in the BasicOperators strategy, a number of basic operators are handled with match-case patterns (a direct one-to-one mapping to RDD operators), as follows:

    case logical.Project(projectList, child) =>
      execution.Project(projectList, planLater(child)) :: Nil
    case logical.Filter(condition, child) =>
      execution.Filter(condition, planLater(child)) :: Nil
    case logical.Aggregate(group, agg, child) =>
      execution.Aggregate(partial = false, group, agg, planLater(child))(sqlContext) :: Nil
    case logical.Sample(fraction, withReplacement, seed, child) =>
      execution.Sample(fraction, withReplacement, seed, planLater(child)) :: Nil
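A note on planLater (my reading, not stated in the original text): it is a helper on QueryPlanner that defers the planning of the child, so each strategy only maps the operator it recognizes and leaves the children to be planned recursively by the planner.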

Expression system


An expression is a node that can be computed or processed directly, without going through the execution engine; this includes cast operations, projection operations, arithmetic and logical operator operations, and so on.

For details, refer to the classes under the org.apache.spark.sql.catalyst.expressions package.
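A small, hedged illustration of "directly computable" (the evaluation entry point is named eval and takes an input row in the 1.1-era code; earlier versions differ slightly):

    import org.apache.spark.sql.catalyst.expressions.{Add, Cast, Literal}
    import org.apache.spark.sql.catalyst.types.StringType

    // Cast(1 + 2, String) needs no execution engine: it can be evaluated
    // on the spot, with no input row at all.
    val e = Cast(Add(Literal(1), Literal(2)), StringType)
    val v = e.eval(null)   // => "3"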


Rules System

Everything that needs to process the execution plan tree (the analyze process, the optimize process, the SparkStrategies process) by matching rules and transforming nodes inherits from the RuleExecutor[TreeType] abstract class.

RuleExecutor internally provides a Seq[Batch], which defines the processing steps of that RuleExecutor. Each Batch represents a set of rules, together with a strategy describing how many times the batch is iterated (once, or repeatedly up to a fixed point).

    protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)

Rule[TreeType <: TreeNode[_]] is an abstract class; subclasses need to override the apply(plan: TreeType) method to provide their processing logic.

RuleExecutor's apply(plan: TreeType): TreeType method traverses the nodes of the incoming plan following the order of the batches and the order of the rules within each batch; the processing logic itself is implemented by the concrete Rule subclasses.
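A minimal sketch of what such a rule looks like (illustrative only; it mirrors the shape of the built-in rules rather than reproducing any one of them):

    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule

    // Drops Filter nodes whose condition is the literal true, using the
    // transform machinery described in the TreeNode section.
    object RemoveTrivialFilters extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case Filter(Literal(true, _), child) => child
      }
    }

A RuleExecutor would then list such a rule inside one of its batches.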


Hive-related


Hive Support Mode

Spark SQL's support for Hive lives in a separate spark-hive project. Hive support covers HQL queries, Hive Metastore information, Hive SerDes, and Hive UDFs/UDAFs/UDTFs, similar to Shark.

Only data sets obtained through the Hive API under HiveContext can be queried with HQL. HQL parsing relies on the parse method of the org.apache.hadoop.hive.ql.parse.ParseDriver class to generate the Hive AST.

In fact, SQL and HQL are not supported together. HQL can be understood as being supported independently: data sets that can be queried with HQL must be read through the Hive API, while support for other formats such as Parquet and JSON files exists only in the SQL environment (SQLContext).
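As a usage sketch of this mode (hedged; hql was the HQL entry point of this period, and src is a hypothetical table already registered in the Hive metastore):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext

    // HQL only sees data sets known to the Hive metastore.
    val rows = hiveContext.hql("SELECT key, value FROM src LIMIT 10")
    rows.collect().foreach(println)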



Hive on Spark

Hive has officially proposed the Hive on Spark JIRA. After Shark wound down, the work split into two directions:



From this point on, compatibility support for Hive will move to Hive on Spark, and the earlier Shark experience will feed into that support within the Hive community. My understanding is that the kind of Hive support currently in Spark SQL exists only to integrate with and manipulate Hive data from within the Spark environment; its HQL execution invokes the Hive client driver and runs on Hadoop MR, so it is not itself an implementation of Hive on Spark. It merely uses RDDs to manipulate Hive data sets indirectly.

So if you want to migrate an existing Hive workload to Spark, you should use Shark or wait for Hive on Spark.

The Hive support inside Spark SQL is not an implementation of Hive on Spark; it is more like a client that reads and writes Hive data. Its HQL support covers only Hive data, and it is independent of the SQL environment.

The above two sections cover the differences between, and my understanding of, Spark SQL's Hive support, Shark, and Hive on Spark.


SQL Core

The core of Spark SQL is to attach schema information to an existing RDD and then register it as a SQL-queryable "table". This splits into two main parts: generating a SchemaRDD, and executing queries against it.


Generate SchemaRDD

In the spark-hive project, reading the metadata information as the schema and reading the data on HDFS are both handled by Hive; the SchemaRDD is then generated from these two parts and queried via hql() under HiveContext.

For Spark SQL itself:

On the data side, the RDD can be any existing RDD, or it can come from a supported third-party format such as a JSON file or a Parquet file.

SQLContext implicitly converts an RDD of case classes into a SchemaRDD:

    implicit def createSchemaRDD[A <: Product: TypeTag](rdd: RDD[A]) =
      new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))

ExistingRdd reflects over the case class attributes and converts the RDD data into Catalyst GenericRow objects, finally returning an RDD[Row], which is the SchemaRDD. For the concrete conversion logic, see ExistingRdd's productToRowRdd and convertToCatalyst methods.

You can then use the table-registration operations that SchemaRDD provides, the subset of RDD transformations that carry the schema along, DSL operations, saveAs operations, and so on.
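Putting this path together, a usage sketch along the lines of the programming guide of this period (registerAsTable was the registration method then; the file path is hypothetical):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)      // sc: an existing SparkContext
    import sqlContext.createSchemaRDD        // brings the implicit conversion into scope

    // An ordinary RDD of case classes is implicitly converted to a SchemaRDD.
    val people = sc.textFile("people.txt")   // hypothetical path, "name,age" per line
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.registerAsTable("people")
    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.collect().foreach(println)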

Row and GenericRow are the row representation models in Catalyst.

Row uses a Seq[Any] to represent its values; GenericRow is a subclass of Row that represents the values with an Array. The data types supported by Row include Int, Long, Double, Float, Boolean, Short, Byte, and String. Reading a column's value by ordinal is supported, and isNullAt(i: Int) should be checked before reading.

Each has a mutable variant that provides setXxx(i: Int, value: Any) methods to modify the value at a given ordinal.
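A small sketch of the row model (hedged: the constructor and accessor names follow the Catalyst Row of this period):

    import org.apache.spark.sql.catalyst.expressions.{GenericRow, Row}

    // A row backed by an Array[Any], as produced by ExistingRdd and the columnar code.
    val row: Row = new GenericRow(Array[Any]("alice", 29, null))

    val name  = row.getString(0)                              // read by ordinal
    val age   = if (row.isNullAt(1)) -1 else row.getInt(1)    // check nulls before reading
    val third = row.isNullAt(2)                               // => true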


Hierarchical structure


The following roughly compares Pig, Spark SQL, and Shark at the level of their implementation layers; it is for reference only.





Query process

The parsing and execution process of a SQL statement in SQLContext:

1. The first step, parseSql(sql: String): the simple SQL parser performs lexical and syntactic parsing and generates a LogicalPlan.

2. The second step, analyzer(logicalPlan): performs preliminary analysis and mapping on the execution plan produced by the parsing step.

The analyzer currently used in SQLContext is provided by Catalyst and is defined as follows:

    new Analyzer(catalog, EmptyFunctionRegistry, caseSensitive = true)

The catalog here is a SimpleCatalog; the catalog is used to register tables and to look up relations.

The FunctionRegistry passed in does not support the lookupFunction method, so this analyzer does not support function registration, i.e. UDFs.

Several batches of rules are defined inside the Analyzer:

    val batches: Seq[Batch] = Seq(
      Batch("MultiInstanceRelations", Once,
        NewRelationInstances),
      Batch("CaseInsensitiveAttributeReferences", Once,
        (if (caseSensitive) Nil else LowercaseAttributeReferences :: Nil) : _*),
      Batch("Resolution", fixedPoint,
        ResolveReferences ::
        ResolveRelations ::
        NewRelationInstances ::
        ImplicitGenerate ::
        StarExpansion ::
        ResolveFunctions ::
        GlobalAggregates ::
        typeCoercionRules : _*),
      Batch("Check Analysis", Once,
        CheckResolution),
      Batch("AnalysisOperators", fixedPoint,
        EliminateAnalysisOperators)
    )

3. The second step yields a preliminary LogicalPlan; the third step is then optimizer(plan).

The Optimizer likewise defines several batches of rules, which optimize the execution plan in order:

    val batches =
      Batch("Combine Limits", FixedPoint(100),
        CombineLimits) ::
      Batch("ConstantFolding", FixedPoint(100),
        NullPropagation,
        ConstantFolding,
        LikeSimplification,
        BooleanSimplification,
        SimplifyFilters,
        SimplifyCasts,
        SimplifyCaseConversionExpressions) ::
      Batch("Filter Pushdown", FixedPoint(100),
        CombineFilters,
        PushPredicateThroughProject,
        PushPredicateThroughJoin,
        ColumnPruning) :: Nil
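A reading note (not in the original text): Once runs a batch a single time, while FixedPoint(n) re-runs the batch until the plan stops changing or n iterations have been reached, so rules such as constant folding can keep simplifying nodes that earlier passes exposed.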

4. The optimized execution plan is then handed to the SparkPlanner, which applies the strategies it defines to turn the logical execution plan tree into a physical execution plan tree that can actually be executed, i.e. a SparkPlan:

    val strategies: Seq[Strategy] =
      CommandStrategy(self) ::
      TakeOrdered ::
      PartialAggregation ::
      LeftSemiJoin ::
      HashJoin ::
      InMemoryScans ::
      ParquetOperations ::
      BasicOperators ::
      CartesianProduct ::
      BroadcastNestedLoopJoin :: Nil

5. Before the physical execution plan is actually executed, two final batches of rules are applied. The step SQLContext defines for this is called prepareForExecution, and it is simply a directly instantiated new RuleExecutor[SparkPlan]:

    val batches =
      Batch("Add Exchange", Once, AddExchange(self)) ::
      Batch("Prepare Expressions", Once, new BindReferences[SparkPlan]) :: Nil

6. Finally, SparkPlan's execute() is called to perform the computation. This execute() is defined in each SparkPlan implementation and generally calls its children's execute() recursively, so it triggers the computation of the whole tree.
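Tying the six steps together, a hedged sketch of how the intermediate results can be inspected (the queryExecution field of this period exposes each phase lazily; the people table is the one registered in the earlier sketch):

    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13")

    val qe = teenagers.queryExecution
    println(qe.logical)         // step 1: parsed LogicalPlan
    println(qe.analyzed)        // step 2: after the Analyzer
    println(qe.optimizedPlan)   // step 3: after the Optimizer
    println(qe.sparkPlan)       // step 4: after SparkPlanner's strategies
    println(qe.executedPlan)    // step 5: after prepareForExecution

    teenagers.collect()         // step 6: triggers executedPlan.execute()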


Other features
In-Memory Columnar Storage

The columnar storage module is invoked when cacheTable / uncacheTable is called on SQLContext.

This module is borrowed from Shark; it performs the row-to-column conversion when a table is cached in memory so that the table data can be compressed.

Implementation class

The InMemoryColumnarTableScan class is a LeafNode implementation of SparkPlan, i.e. a physical execution plan node. It takes a SparkPlan (an already-determined physical execution plan) and a sequence of attributes; it contains the row-to-column conversion, the triggering of the computation, and the caching process (and it is lazy).

ColumnBuilder writes data of different types (Boolean, Byte, Double, Float, Int, Long, Short, String) into ByteBuffers via different subclasses, i.e. it wraps each field of the rows to produce columns. The corresponding ColumnAccessor reads the columns and turns them back into rows.

CompressibleColumnBuilder and CompressibleColumnAccessor are the builder and accessor for compressed columns; the internal storage structure of their ByteBuffer is as follows:

        .--------------------------- Column type ID (4 bytes)
        |   .----------------------- Null count N (4 bytes)
        |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
        |   |   |     .------------- Compression scheme ID (4 bytes)
        |   |   |     |   .--------- Compressed non-null elements
        V   V   V     V   V
       +---+---+-----+---+---------+
       |   |   | ... |   | ... ... |
       +---+---+-----+---+---------+
        \-----------/ \-----------/
           header         body

The CompressionScheme subclasses are the different compression implementations.


They are all implemented in Scala, without the help of third-party libraries. Each implementation specifies the column data types it supports. At build() time, the candidate compression schemes are compared and the one with the lowest compression ratio is chosen (if the best ratio is still greater than 0.8, the column is not compressed).

The estimation logic comes from the gatherCompressibilityStats method implemented by each subclass.


Cache logic

To cache a table, the physical execution plan for that table must be generated first.

During caching, InMemoryColumnarTableScan is not triggered to execute. Instead, a SparkLogicalPlan wrapping InMemoryColumnarTableScan as the physical execution plan is generated and stored in the catalog as that table's plan.

In other words, during caching the table information and its execution plan are first looked up in the catalog, the plan is processed through to physical execution plan generation, and the table is then put back into the catalog; the execution plan kept there is now the final physical execution plan to be executed. The columnar-module conversions, however, are still not triggered at this point.

The real trigger is still execute(), in the same scenarios in which any other SparkPlan's execute() method is triggered.


Uncache logic

uncacheTable, in addition to deleting the table information from the catalog, also calls InMemoryColumnarTableScan's cachedColumnBuffers method to obtain the RDD collection and performs an unpersist() on it. cachedColumnBuffers is mainly responsible for storing each field of each row of each RDD partition into a ColumnBuilder.
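For completeness, the user-facing entry points for the two paths above (a hedged sketch; the people table comes from the earlier example):

    // cacheTable only swaps the plan kept in the catalog; the row-to-column
    // conversion happens lazily, on the first execute() of the cached plan.
    sqlContext.cacheTable("people")
    sqlContext.sql("SELECT COUNT(*) FROM people").collect()   // first run builds the column buffers

    // uncacheTable removes the catalog entry and unpersists the cached buffers.
    sqlContext.uncacheTable("people")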


UDF (not supported for now)

As noted earlier in the discussion of the SQLContext Analyzer, its FunctionRegistry does not implement lookupFunction.

In the spark-hive project, HiveContext does implement the FunctionRegistry trait, in the form of HiveFunctionRegistry; for the implementation logic see org.apache.spark.sql.hive.hiveUdfs.


Parquet Support

To be organized.

http://parquet.io/

Specific docs and code:

Https://github.com/apache/incubator-parquet-format

Https://github.com/apache/incubator-parquet-mr

http://www.slideshare.net/julienledem/parquet-hadoop-summit-2013


JSON support

SQLContext has added a jsonFile read method. As the code currently stands, it is implemented with Hadoop textFile, i.e. the JSON file is expected to be on HDFS. The JSON file is loaded with TextInputFormat as the InputFormat, LongWritable as the key class and Text as the value class, and finally the string content of the value part is taken, giving an RDD[String].

In addition to jsonFile, jsonRDD is also supported; for examples see:

Http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

After the JSON is read, it is converted into a SchemaRDD. JsonRDD.inferSchema(RDD[String]) contains the detailed process of parsing the JSON and inferring the schema, finally producing the LogicalPlan for the JSON data.
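A minimal usage sketch of the two entry points (paths and field names are hypothetical):

    // From a JSON file (one JSON object per line), read via Hadoop textFile.
    val jsonPeople = sqlContext.jsonFile("hdfs:///data/people.json")   // hypothetical path
    jsonPeople.printSchema()            // the schema inferred by JsonRDD.inferSchema
    jsonPeople.registerAsTable("json_people")

    // From an existing RDD[String] of JSON documents.
    val docs = sc.parallelize("""{"name":"alice","age":29}""" :: Nil)
    val another = sqlContext.jsonRDD(docs)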

JSON parsing uses the FasterXML/jackson-databind library (see its GitHub repository and wiki).

The data is mapped into Map[String, Any].

JSON support enriches the data access scenarios of Spark SQL.


JDBC Support

A JDBC support branch is in progress.


SQL92

The SQL syntax currently supported by Spark SQL can be seen in the SqlParser class. Is the goal full SQL-92 support?

1. For basic usage, SQL Server and Oracle both follow the SQL-92 syntax standard.

2. In practice, everyone goes beyond that standard, using the rich custom function libraries and syntax extensions provided by each database vendor.

3. Microsoft SQL Server's SQL extension is called T-SQL (Transact-SQL).

4. Oracle's SQL extension is called PL/SQL.


Open Issues
You can follow the community mailing list; to be organized.

Http://apache-spark-developers-list.1001551.n3.nabble.com/sparkSQL-thread-safe-td7263.html

Http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-td9538.html


Summary
This covers the implementation of each module of Spark SQL, its code structure, its execution flow, and my own understanding of Spark SQL.
Where that understanding deviates, discussion and corrections are welcome :)

End of the full text :)

