At last year's Spark Summit, Michael Armbrust shared Catalyst. In the year or so since, Spark SQL has grown from a handful of contributors to dozens, and development has been extremely rapid. Personally, I think there are two reasons for this:
1. Integration: a SQL-style query language is integrated into Spark's core RDD concept, so it can be combined with many kinds of workloads; stream processing, batch processing, and even machine learning can all work together with SQL.
2. Efficiency: Shark was constrained by Hive's programming model and could no longer be optimized to fit well into the Spark model.
Having used Shark for a while and tried out Spark SQL, I still couldn't resist digging deeper, so here I look at Spark SQL's core execution process from the source code's point of view.
I. First, let's look at a simple Spark SQL program:
1. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2. import sqlContext._
3. case class Person(name: String, age: Int)
4. val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
5. people.registerAsTable("people")
6. val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
7. teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Lines 1 and 2 create the SQLContext and import everything under sqlContext, which gives us the context needed to run Spark SQL.
Lines 3 to 5 load the data source and register it as the table people.
Line 6 is the real entry point: the sql function takes a SQL string and first returns a SchemaRDD. This step is lazy; the SQL is not actually executed until an action triggers it, here the collect in line 7.
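To make that laziness concrete, here is a small sketch reusing the names from the example above (the count() call is just another action, added here only for illustration):

    val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    // No Spark job has run yet: teenagers only wraps the parsed logical plan.
    teenagers.collect().foreach(println)  // an action: triggers analyze/optimize/plan/execute
    println(teenagers.count())            // count() is also an action and triggers execution again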
II. SQLContext
SQLContext is the context object for executing SQL. Let's first look at the members it holds (a rough code sketch of these members follows below):
Catalog
A map-like structure storing <tableName, LogicalPlan> entries; it is the class used to register tables, unregister tables, look up tables, and fetch their logical plans.
SqlParser
Parses the incoming SQL, tokenizes it, builds the syntax tree, and returns a LogicalPlan.
Analyzer
Analyzer for the logical plan.
Optimizer
Optimizer for the logical plan.
LogicalPlan
The logical plan, composed of Catalyst TreeNodes; you can see there are three kinds of syntax trees.
SparkPlanner
Contains the different strategies used to turn the logical plan into an optimized physical execution plan.
QueryExecution
The environment context for executing a SQL query.
It is these objects that make up the Spark SQL runtime, and it looks pretty complete: static metadata storage, a parser, an optimizer, logical plans, physical plans, and an execution runtime.
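To make this more concrete, here is a rough, abridged sketch of how these members sit inside SQLContext. It is paraphrased from the Spark 1.0-era source rather than quoted verbatim, so modifiers, constructor arguments and the exact set of members may differ in your version:

    class SQLContext(@transient val sparkContext: SparkContext) {
      // Catalog: holds the tableName -> LogicalPlan registrations
      protected[sql] lazy val catalog: Catalog = new SimpleCatalog

      // SqlParser: turns SQL text into an unresolved LogicalPlan
      protected[sql] val parser = new catalyst.SqlParser
      protected[sql] def parseSql(sql: String): LogicalPlan = parser(sql)

      // Analyzer: resolves attributes and relations against the catalog
      protected[sql] lazy val analyzer: Analyzer =
        new Analyzer(catalog, EmptyFunctionRegistry, caseSensitive = true)

      // Optimizer: rule-based optimization of the analyzed logical plan
      protected[sql] val optimizer = Optimizer

      // SparkPlanner: strategies that turn an optimized logical plan into a physical SparkPlan
      protected[sql] val planner = new SparkPlanner

      // QueryExecution: the workflow object that chains all of the above together
      protected[sql] def executePlan(plan: LogicalPlan): QueryExecution =
        new QueryExecution { val logical = plan }

      // ... plus prepareForExecution, the user-facing sql()/createSchemaRDD API, table registration helpers, etc.
    }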
So how do these objects work together to execute SQL statements?
III. Spark SQL Execution Process
Without further ado, here is a diagram first. I drew it with an online drawing tool (ProcessOn), so it is not pretty, but as long as it gets the idea across:
The core components are the green boxes, the result of each step is a blue box, and the methods being called are the orange boxes.
To summarize, the approximate execution process is:
Parse SQL -> Analyze Logical Plan -> Optimize Logical Plan -> Generate Physical Plan -> Prepare Spark Plan -> Execute SQL -> Generate RDD
The more detailed execution process:
SQL or HQL -> SqlParser (parse) generates an unresolved logical plan -> Analyzer (analysis) generates an analyzed logical plan -> Optimizer (optimize) generates an optimized logical plan -> SparkPlanner (uses strategies to plan) generates a physical plan -> call the next function to pick a SparkPlan -> prepareForExecution (prepare) generates the prepared SparkPlan -> call toRdd to execute the SQL and produce the RDD
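The same pipeline can also be written out as the sequence of calls that SQLContext's QueryExecution chains together (a condensed sketch using the member names from section II; sqlText stands for the SQL string passed to sql(), and the full source excerpt appears in section 3.2 below):

    val logical       = parseSql(sqlText)              // SQL text -> unresolved logical plan
    val analyzed      = analyzer(logical)              // resolve attributes/relations -> analyzed logical plan
    val optimizedPlan = optimizer(analyzed)            // rule-based rewrites -> optimized logical plan
    val sparkPlan     = planner(optimizedPlan).next()  // apply strategies, take the first physical plan
    val executedPlan  = prepareForExecution(sparkPlan) // preparation rules -> prepared spark plan
    val rdd           = executedPlan.execute()         // toRdd: run the physical plan and get an RDD[Row]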
3.1 Parse SQL
Going back to the beginning of the program: when we call the sql function, the sql function in SQLContext simply constructs a new SchemaRDD, and while doing so it calls the parseSql method.
  /**
   * Executes a SQL query using Spark, returning the result as a SchemaRDD.
   *
   * @group userf
   */
  def sql(sqlText: String): SchemaRDD = new SchemaRDD(this, parseSql(sqlText))
The result of parseSql is a logical plan, produced by the parser:
  @transient
  protected[sql] val parser = new catalyst.SqlParser

  protected[sql] def parseSql(sql: String): LogicalPlan = parser(sql)
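As an illustration of what that logical plan looks like at this stage, here is a hedged sketch for the query from the example program (parseSql is protected[sql] in the real source, so treat this as what happens inside the sql() call; the exact toString layout may vary by version):

    val unresolvedPlan = parseSql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    println(unresolvedPlan)
    // Printed roughly as follows. Note the ' prefix marking still-unresolved attributes and the
    // UnresolvedRelation node that the Analyzer will later resolve against the catalog:
    //   Project ['name]
    //    Filter (('age >= 13) && ('age <= 19))
    //     UnresolvedRelation None, people, None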
3.2 Analyze to Execution
When we call the collect method on the SchemaRDD, the QueryExecution is initialized and execution begins.
  override def collect(): Array[Row] = queryExecution.executedPlan.executeCollect()
We can clearly see the execution steps:
  protected abstract class QueryExecution {
    def logical: LogicalPlan

    lazy val analyzed = analyzer(logical)                // first the Analyzer analyzes the logical plan
    lazy val optimizedPlan = optimizer(analyzed)         // then the Optimizer optimizes the analyzed logical plan
    // TODO: don't just pick the first one...
    lazy val sparkPlan = planner(optimizedPlan).next()   // generate the physical plan according to the strategies
    // executedPlan should not be used to initialize any SparkPlan. It should be
    // only used for execution.
    lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)  // finally generate the prepared spark plan

    /** Internal version of the RDD. Avoids copies and has no schema */
    lazy val toRdd: RDD[Row] = executedPlan.execute()    // lastly, toRdd executes the task and turns the result into an RDD

    protected def stringOrError[A](f: => A): String =
      try f.toString catch { case e: Throwable => e.toString }

    def simpleString: String = stringOrError(executedPlan)

    override def toString: String =
      s"""== Logical Plan ==
         |${stringOrError(analyzed)}
         |== Optimized Logical Plan ==
         |${stringOrError(optimizedPlan)}
         |== Physical Plan ==
         |${stringOrError(executedPlan)}
      """.stripMargin.trim
  }
This completes the process.
IV. Summary
By analyzing SQLContext we know which components Spark SQL contains: SqlParser, Analyzer, Optimizer, LogicalPlan, SparkPlanner (including the physical plan), and QueryExecution.
By debugging the code, we can see Spark SQL's execution flow:
SQL or HQL -> SqlParser (parse) generates an unresolved logical plan -> Analyzer (analysis) generates an analyzed logical plan -> Optimizer (optimize) generates an optimized logical plan -> SparkPlanner (uses strategies to plan) generates a physical plan -> call the next function to pick a SparkPlan -> prepareForExecution (prepare) generates the prepared SparkPlan -> call toRdd to execute the SQL and produce the RDD
Later I will study each of these component objects in turn, to see what optimizations Catalyst actually makes.
--eof--
Original article. If reprinting, please credit the source: http://blog.csdn.net/oopsoom/article/details/37658021