Spark SQL Source Code Analysis, Part One: The Core Execution Process


/** Spark SQL Source Code Analysis series of articles */

A little over a year has passed since Michael Armbrust shared his Catalyst at last year's Spark Summit. In that time the number of Spark SQL contributors has grown from a handful to dozens, and development has been extremely rapid. Personally, I think there are two reasons for this:

1. Integration: the SQL-style query language is integrated into Spark's core RDD abstraction, so it can be applied to a variety of workloads: stream processing, batch processing, and even machine learning can bring in SQL.
2. Efficiency: because Shark was constrained by Hive's programming model, it could no longer be optimized to fit well within the Spark model.

After testing Shark for a while and then testing Spark SQL, I still could not resist exploring Spark SQL further, so this article looks at the core execution process of Spark SQL from the point of view of the source code.

I. Primer

Let's take a look at a simple Spark SQL program:

[Scala]
  1. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  2. import sqlContext._
  3. case class Person(name: String, age: Int)
  4. val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
  5. people.registerAsTable("people")
  6. val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
  7. teenagers.map(t => "Name: " + t(0)).collect().foreach(println)


Lines 1 and 2 of the program create the SQLContext and import everything under it, which sets up the context in which Spark SQL runs.
Lines 3 to 5 load the data source and register it as a table.
Line 6 is the real entry point: the sql function. It takes a SQL string and first returns a SchemaRDD. This step is lazy; the SQL is only executed when an action runs, triggered here by the collect in line 7.

II. SQLContext

SQLContext is the context object for executing SQL. Let's first look at the members it holds:

Catalog

A class that stores a <tableName, logicalPlan> map; it is used to look up relations, register tables, unregister tables, and query tables, mapping them to their logical plans (a conceptual sketch follows this list).

SqlParser

Parses the incoming SQL text into tokens, builds the syntax tree, and returns a logical plan.

Analyzer

The analyzer for logical plans: it resolves the unresolved logical plan produced by the parser.

Optimizer

The optimizer for logical plans.

LogicalPlan

The logical plan, composed of Catalyst TreeNodes; we will see that there are three kinds of syntax trees.

SparkPlanner

Contains the different strategies used to turn an optimized logical plan into a physical execution plan.

QueryExecution

The environment context in which a SQL query is executed.
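To make the Catalog's role concrete, here is a tiny conceptual sketch of such a <tableName, logicalPlan> map. This is an illustration of my own, not the Catalyst Catalog itself, and the LogicalPlan type is left abstract:

[Scala]
  import scala.collection.mutable

  // Conceptual stand-in for the Catalog described above: a map from table name
  // to logical plan, supporting register, unregister, and lookup.
  class ToyCatalog[LogicalPlan] {
    private val tables = mutable.Map.empty[String, LogicalPlan]

    def registerTable(name: String, plan: LogicalPlan): Unit = tables(name) = plan
    def unregisterTable(name: String): Unit = tables -= name
    def lookupRelation(name: String): Option[LogicalPlan] = tables.get(name)
  }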


It is these objects that make up the Spark SQL runtime, and it looks quite complete: static metadata storage, a parser, an optimizer, logical plans, physical plans, and an execution environment.
So how do these objects work together to execute a SQL statement?

III. Spark SQL Execution Process

Not much more to say; let's start with a diagram. I drew it with an online drawing tool, so it is not pretty, but it gets the idea across:



The core components are the green boxes, the result of each step is a blue box, and the methods being called are the orange boxes.

To summarize, the approximate execution process is:
Parse SQL -> Analyze logical plan -> Optimize logical plan -> Generate physical plan -> Prepare SparkPlan -> Execute SQL -> Generate RDD

In more detail, the execution process is:

SQL or HQL -> SqlParser (parse) generates an unresolved logical plan -> Analyzer (analysis) produces an analyzed logical plan -> Optimizer (optimize) produces an optimized logical plan -> SparkPlanner (use strategies to plan) generates the physical plan, applying different strategies to produce a SparkPlan -> SparkPlan (prepare) produces a prepared SparkPlan -> calling toRdd (an execute() call) executes the SQL and produces the RDD
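Before diving into the real code, here is a tiny self-contained toy sketch that mirrors the shape of this pipeline. Every name in it is a stand-in of my own, not an actual Catalyst class; it only shows how each stage consumes the previous stage's result:

[Scala]
  // Toy model of the pipeline (illustration only, not Spark source).
  object ToyPipeline {
    sealed trait LogicalPlan
    case class Unresolved(sql: String) extends LogicalPlan        // output of the parser
    case class Analyzed(child: LogicalPlan) extends LogicalPlan   // output of the analyzer
    case class Optimized(child: LogicalPlan) extends LogicalPlan  // output of the optimizer
    case class Physical(child: LogicalPlan)                       // stand-in for SparkPlan

    def parse(sql: String): LogicalPlan       = Unresolved(sql)   // SqlParser
    def analyze(p: LogicalPlan): LogicalPlan  = Analyzed(p)       // Analyzer
    def optimize(p: LogicalPlan): LogicalPlan = Optimized(p)      // Optimizer
    def plan(p: LogicalPlan): Physical        = Physical(p)       // SparkPlanner strategies
    def prepare(p: Physical): Physical        = p                 // prepareForExecution
    def execute(p: Physical): Seq[String]     = Seq(p.toString)   // stands in for RDD[Row]

    // SQL text -> unresolved -> analyzed -> optimized -> physical -> prepared -> "RDD"
    def run(sql: String): Seq[String] =
      execute(prepare(plan(optimize(analyze(parse(sql))))))
  }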

3.1. Parse SQL

Back to the program at the beginning: calling the SQL function actually invokes the sql function on SQLContext. Its implementation constructs a new SchemaRDD, and parseSql is called during that construction.

[Scala]
  /**
   * Executes a SQL query using Spark, returning the result as a SchemaRDD.
   *
   * @group userf
   */
  def sql(sqlText: String): SchemaRDD = new SchemaRDD(this, parseSql(sqlText))

parseSql, in turn, generates a logical plan:

[Scala]
  @transient
  protected[sql] val parser = new catalyst.SqlParser

  protected[sql] def parseSql(sql: String): LogicalPlan = parser(sql)
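Since parseSql simply applies the Catalyst SqlParser to the SQL text, the parser can in principle be exercised on its own. A small hedged sketch, assuming the Spark 1.0-era package layout implied by the code above:

[Scala]
  import org.apache.spark.sql.catalyst.SqlParser
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

  // The parser only builds the syntax tree; at this point "people" and "name"
  // are just names -- nothing has been resolved against the catalog yet.
  val parser = new SqlParser
  val unresolvedPlan: LogicalPlan = parser("SELECT name FROM people WHERE age >= 13")
  println(unresolvedPlan)  // prints an *unresolved* logical plan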
3.2. Analyze to Execution

When we call the collect method on the SchemaRDD, queryExecution is initialized and execution begins.

[Scala]
  override def collect(): Array[Row] = queryExecution.executedPlan.executeCollect()

We can clearly see the execution steps:

[Scala]
  protected abstract class QueryExecution {
    def logical: LogicalPlan

    lazy val analyzed = analyzer(logical)               // first, the Analyzer analyzes the logical plan
    lazy val optimizedPlan = optimizer(analyzed)        // then the Optimizer optimizes the analyzed logical plan
    // TODO: Don't just pick ...
    lazy val sparkPlan = planner(optimizedPlan).next()  // the planner applies its strategies to generate the physical plan
    // executedPlan should not be used to initialize any SparkPlan. It should be
    // only used for execution.
    lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)  // finally, the prepared SparkPlan

    /** Internal version of the RDD. Avoids copies and has no schema */
    lazy val toRdd: RDD[Row] = executedPlan.execute()   // calling toRdd executes the job and returns the result as an RDD

    protected def stringOrError[A](f: => A): String =
      try f.toString catch { case e: Throwable => e.toString }

    def simpleString: String = stringOrError(executedPlan)

    override def toString: String =
      s"""== Logical Plan ==
         |${stringOrError(analyzed)}
         |== Optimized Logical Plan ==
         |${stringOrError(optimizedPlan)}
         |== Physical Plan ==
         |${stringOrError(executedPlan)}
      """.stripMargin.trim
  }
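Because of the toString above, the intermediate plans can be printed directly, which is handy for following along. A small usage sketch for the primer query, assuming (as in the Spark 1.0-era API) that the SchemaRDD exposes its queryExecution:

[Scala]
  val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

  // Forces the lazy vals above to be built and prints the
  // == Logical Plan ==, == Optimized Logical Plan == and == Physical Plan == sections.
  println(teenagers.queryExecution)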


This completes the process.

IV. Summary

By analyzing SQLContext we know which components Spark SQL contains: SqlParser (the parser), Analyzer, Optimizer, LogicalPlan, SparkPlanner (including SparkPlan, the physical plan), and QueryExecution.
By debugging the code, we know the execution flow of Spark SQL:
SQL or HQL -> SqlParser (parse) generates an unresolved logical plan -> Analyzer (analysis) produces an analyzed logical plan -> Optimizer (optimize) produces an optimized logical plan -> SparkPlanner (use strategies to plan) generates the physical plan, applying different strategies to produce a SparkPlan -> SparkPlan (prepare) produces a prepared SparkPlan -> calling toRdd (an execute() call) executes the SQL and produces the RDD

Next, each of these component objects will be studied in turn, to see what optimizations Catalyst actually performs.

--eof--

This is an original article; please credit the source when reposting.

Reposted from: OopsOutOfMemory, Shengli's blog

Link to this article: http://blog.csdn.net/oopsoom/article/details/37658021

Note: This article is published under the Attribution-NonCommercial-NoDerivs 2.5 China (CC BY-NC-ND 2.5 CN) license. You are welcome to repost, share, and comment on it, but please keep the author's attribution and the link to the article. Please contact me for commercial use or anything beyond what the license permits.

Transferred from: http://blog.csdn.net/oopsoom/article/details/37658021
