The Core Process of Spark SQL Source Code Analysis


/** Spark SQL Source Code Analysis series of articles */

More than a year has passed since Michael Armbrust shared Catalyst at last year's Spark Summit. In that time the number of Spark SQL contributors has grown from a handful to dozens, and development has been extremely rapid. Personally, I think there are two reasons for this:

1. Integration: the SQL-style query language is integrated into Spark's core RDD concept, so SQL can be applied to a wide variety of tasks: stream processing, batch processing, and even machine learning.
2. Efficiency: because Shark was constrained by Hive's programming model, it could no longer be optimized to fit well into the Spark model.

A while ago I benchmarked Shark and also ran some tests against Spark SQL, but testing alone cannot get to the bottom of Spark SQL, so here I explore its core execution process from the point of view of the source code.

First, let's look at a simple Spark SQL program:

1. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2. import sqlContext._
3. case class Person(name: String, age: Int)
4. val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
5. people.registerAsTable("people")
6. val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
7. teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Lines 1 and 2 of the program create the SQLContext and import everything under it, giving us the context in which Spark SQL executes.
Lines 3, 4, and 5 load the data source and register it as a table.
Line 6 is the real entry point: the sql function, which takes a SQL statement and first returns a SchemaRDD. This step is lazy; the SQL is not actually run until the collect action on line 7 is invoked.


Second, SQLContext
SQLContext is the context object for running SQL. Let's first look at what members it holds:
Catalog

A class that holds a <tableName, logicalPlan> map structure; it is the catalog used to look up relations, register tables, unregister tables, and query the logical plan associated with a table.
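To make this concrete, here is a minimal sketch of such a catalog; the names (SimpleCatalog, UnresolvedRelation) are simplified stand-ins for illustration, not Spark's actual classes:

import scala.collection.mutable

// Simplified stand-in for a Catalyst logical plan node (illustration only).
sealed trait LogicalPlan
case class UnresolvedRelation(tableName: String) extends LogicalPlan

// A toy catalog: a map from table name to its logical plan,
// supporting register / unregister / lookup as described above.
class SimpleCatalog {
  private val tables = mutable.Map.empty[String, LogicalPlan]

  def registerTable(name: String, plan: LogicalPlan): Unit = tables(name) = plan
  def unregisterTable(name: String): Unit = tables -= name
  def lookupRelation(name: String): LogicalPlan =
    tables.getOrElse(name, sys.error(s"Table not found: $name"))
}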


SqlParser

Parses the incoming SQL, tokenizes it, builds the syntax tree, and returns a logical plan.
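As a rough illustration of what "parse SQL into a logical plan" means, here is a hypothetical toy that only understands one query shape. Spark's real SqlParser is of course a full grammar, but the output idea is the same: a tree of plan nodes with still-unresolved references.

sealed trait LogicalPlan
case class UnresolvedRelation(table: String) extends LogicalPlan
case class Project(columns: Seq[String], child: LogicalPlan) extends LogicalPlan

object ToySqlParser {
  // Only understands "SELECT <column> FROM <table>"; real SQL grammar is far richer.
  private val Select = """(?i)select\s+(\w+)\s+from\s+(\w+)""".r

  def apply(sql: String): LogicalPlan = sql.trim match {
    case Select(col, table) => Project(Seq(col), UnresolvedRelation(table))
    case other              => sys.error(s"Cannot parse: $other")
  }
}

// ToySqlParser("SELECT name FROM people")
// => Project(List(name), UnresolvedRelation(people))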


Analyzer

The analyzer (resolver) for the logical plan.
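In other words, the analyzer takes the still-unresolved plan produced by the parser and binds it to real metadata. A hypothetical sketch of the idea (not Spark's actual Analyzer, which is rule-based and much more general):

sealed trait LogicalPlan
case class UnresolvedRelation(table: String) extends LogicalPlan
case class ResolvedRelation(table: String, columns: Seq[String]) extends LogicalPlan
case class Project(columns: Seq[String], child: LogicalPlan) extends LogicalPlan

// Resolve relations against a catalog of schemas and check that referenced columns exist.
class ToyAnalyzer(catalog: Map[String, Seq[String]]) {
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case Project(cols, UnresolvedRelation(table)) =>
      val schema  = catalog.getOrElse(table, sys.error(s"Unknown table: $table"))
      val missing = cols.filterNot(schema.contains)
      require(missing.isEmpty, s"Unknown columns: ${missing.mkString(", ")}")
      Project(cols, ResolvedRelation(table, schema))
    case other => other
  }
}

// new ToyAnalyzer(Map("people" -> Seq("name", "age")))(Project(Seq("name"), UnresolvedRelation("people")))
// => Project(List(name), ResolvedRelation(people, List(name, age)))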


Optimizer

The optimizer for the logical plan.
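A classic example of such an optimization is constant folding: subtrees that can be computed at planning time are collapsed before anything runs. A minimal, hypothetical sketch of one such rewrite rule over a toy expression tree:

// Toy expression tree (illustration only, not Catalyst's Expression classes).
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// One rewrite rule, applied bottom-up: collapse Add(Literal, Literal) into a single Literal,
// which is the flavor of transformation Catalyst rules perform on plan trees.
object ConstantFolding {
  def apply(e: Expr): Expr = e match {
    case Add(l, r) =>
      (apply(l), apply(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr)                 => Add(fl, fr)
      }
    case other => other
  }
}

// ConstantFolding(Add(Attribute("age"), Add(Literal(1), Literal(2))))
// => Add(Attribute(age), Literal(3))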


LogicalPlan

The logical plan, composed of Catalyst TreeNodes; you can see there are three kinds of syntax trees.
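Catalyst plan trees are built from nodes of three basic shapes: leaf nodes with no children, unary nodes with one child, and binary nodes with two. A simplified sketch of that structure (the shape names mirror Catalyst's LeafNode / UnaryNode / BinaryNode; everything else is made up for illustration):

sealed trait ToyPlan { def children: Seq[ToyPlan] }

trait LeafNode   extends ToyPlan { def children: Seq[ToyPlan] = Nil }
trait UnaryNode  extends ToyPlan { def child: ToyPlan; def children: Seq[ToyPlan] = Seq(child) }
trait BinaryNode extends ToyPlan { def left: ToyPlan; def right: ToyPlan; def children: Seq[ToyPlan] = Seq(left, right) }

case class TableScan(name: String)                   extends LeafNode   // e.g. reading a relation
case class Filter(condition: String, child: ToyPlan) extends UnaryNode  // e.g. a WHERE clause
case class Join(left: ToyPlan, right: ToyPlan)       extends BinaryNode // e.g. a JOIN of two children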


SparkPlanner

Contains the different strategies used to generate and optimize the physical execution plan.
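Conceptually, each strategy is a function from a logical plan to zero or more candidate physical plans, and the planner tries its strategies in order and takes the first result. A hypothetical sketch of that shape (not Spark's actual Strategy API):

// Toy logical and physical plan types (illustration only).
sealed trait LogicalPlan
case class Scan(table: String) extends LogicalPlan

sealed trait PhysicalPlan
case class InMemoryScan(table: String) extends PhysicalPlan
case class FileScan(table: String) extends PhysicalPlan

object ToyPlanner {
  // A strategy maps one logical plan to zero or more physical plan candidates.
  type Strategy = LogicalPlan => Seq[PhysicalPlan]

  val cachedTables = Set("people")

  // Strategy 1: use an in-memory scan when the table is cached.
  val inMemory: Strategy = {
    case Scan(t) if cachedTables(t) => Seq(InMemoryScan(t))
    case _                          => Nil
  }

  // Strategy 2: fall back to scanning the underlying file.
  val fromFile: Strategy = { case Scan(t) => Seq(FileScan(t)) }

  val strategies: Seq[Strategy] = Seq(inMemory, fromFile)

  // Like planner(optimizedPlan).next(): take the first physical plan any strategy produces.
  def plan(logical: LogicalPlan): PhysicalPlan =
    strategies.iterator.flatMap(s => s(logical)).next()
}

// ToyPlanner.plan(Scan("people"))  => InMemoryScan(people)
// ToyPlanner.plan(Scan("logs"))    => FileScan(logs)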


QueryExecution

The environment context in which the SQL runs.


It is these objects that make up Spark SQL's execution, and it looks quite elegant: static metadata storage, an analyzer, an optimizer, logical plans, physical plans, and execution.
So how do these objects work together to run a SQL statement?

Third, the Spark SQL Run Process
Without further ado, here is a diagram I drew with an online drawing tool. It is not pretty, but it gets the idea across:



The core components are the green boxes, the result of each step is a blue box, and the method being called is an orange box.

To summarize, the approximate flow is:
Parse SQL -> Analyze logical plan -> Optimize logical plan -> Generate physical plan -> Prepare Spark plan -> Execute SQL -> Generate RDD

A more detailed running process:

SQL or HQL -> SqlParser (parse) generates an unresolved logical plan -> Analyzer (analysis) generates an analyzed logical plan -> Optimizer (optimize) generates an optimized logical plan -> SparkPlanner (use strategies to plan) generates a physical plan -> the different strategies produce a SparkPlan -> SparkPlan (prepare) yields a prepared SparkPlan -> call toRdd (which calls execute()) to run the SQL and generate the RDD


3.1, Parse SQL
Back at the beginning of the program, we call the sql function. The implementation of SQLContext's sql function simply creates a new SchemaRDD, and calls parseSql while constructing it.
  /**
   * Executes a SQL query using Spark, returning the result as a SchemaRDD.
   *
   * @group userf
   */
  def sql(sqlText: String): SchemaRDD = new SchemaRDD(this, parseSql(sqlText))
This generates a logical plan:
  @transient
  protected[sql] val parser = new catalyst.SqlParser

  protected[sql] def parseSql(sql: String): LogicalPlan = parser(sql)
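For the example query at the top of the article, the unresolved logical plan that comes back is conceptually a small tree like the following (the rendering below is approximate, from memory, not an exact Spark dump):

  Project ['name]
   Filter (('age >= 13) && ('age <= 19))
    UnresolvedRelation people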
3.2, Analyze to Execution
When we call the collect method on the SchemaRDD, the QueryExecution is initialized and the run begins.
  override def collect(): Array[Row] = queryExecution.executedPlan.executeCollect()
We can see the running steps very clearly:

protected abstract class QueryExecution {
  def logical: LogicalPlan

  lazy val analyzed = analyzer(logical)        // First the analyzer resolves the logical plan
  lazy val optimizedPlan = optimizer(analyzed) // Then the optimizer optimizes the analyzed logical plan
  // TODO: Don't just pick the first one...
  lazy val sparkPlan = planner(optimizedPlan).next() // Generate the physical plan from the optimized plan according to the strategies
  // executedPlan should not be used to initialize any SparkPlan. It should be
  // only used for execution.
  lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan) // Finally generate the prepared SparkPlan

  /** Internal version of the RDD. Avoids copies and has no schema */
  lazy val toRdd: RDD[Row] = executedPlan.execute() // Last, calling toRdd runs the task and turns the result into an RDD

  protected def stringOrError[A](f: => A): String =
    try f.toString catch { case e: Throwable => e.toString }

  def simpleString: String = stringOrError(executedPlan)

  override def toString: String =
    s"""== Logical Plan ==
       |${stringOrError(analyzed)}
       |== Optimized Logical Plan ==
       |${stringOrError(optimizedPlan)}
       |== Physical Plan ==
       |${stringOrError(executedPlan)}
    """.stripMargin.trim
}
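Note that everything in QueryExecution is a lazy val: no stage runs until a later one (ultimately toRdd, forced by collect) pulls on it. Here is a tiny self-contained sketch of that same pattern, just to show the evaluation order; the stage names are placeholders, not Spark code:

object LazyPipelineDemo extends App {
  // Each stage prints when it actually runs; nothing runs until the end of the chain is forced.
  lazy val parsed    = { println("parsing");    "unresolved plan" }
  lazy val analyzed  = { println("analyzing");  parsed + " -> analyzed" }
  lazy val optimized = { println("optimizing"); analyzed + " -> optimized" }
  lazy val physical  = { println("planning");   optimized + " -> physical" }

  println("nothing has run yet")
  println(physical)   // forcing the last stage pulls the whole chain through, in order
}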

This completes the process.

Fourth, Summary:
By analyzing SQLContext we know which components Spark SQL contains: SqlParser (the parser), Analyzer, Optimizer, LogicalPlan, SparkPlanner (including the physical plan), and QueryExecution.
By debugging the code, we learned the flow of Spark SQL:
SQL or HQL -> SqlParser (parse) generates an unresolved logical plan -> Analyzer (analysis) generates an analyzed logical plan -> Optimizer (optimize) generates an optimized logical plan -> SparkPlanner (use strategies to plan) generates a physical plan -> the different strategies produce a SparkPlan -> SparkPlan (prepare) yields a prepared SparkPlan -> call toRdd (which calls execute()) to run the SQL and generate the RDD

Next, each of these component objects will be examined in turn, to see exactly what optimizations Catalyst makes.

--eof--

Original article. When reprinting, please cite the source: http://blog.csdn.net/oopsoom/article/details/37658021
