Reprints are welcome; please credit the original source (author: 徽沪一郎).
Overview
The upcoming Spark 1.0 adds a new feature: support for SQL. Data can now be queried with SQL, which is undoubtedly good news for DBAs, since their existing knowledge keeps working and they do not have to learn Scala or any other scripting language.
In general, any SQL subsystem needs three functional modules: a parser, an optimizer and an execution engine. How are these implemented in Spark, and what are the highlights and weak points of the implementation? With these questions in mind, this article takes a deeper look.
There are several difficulties in analysing the SQL module, namely:
- the general process by which a SQL statement is analysed and executed, which has nothing to do with Spark and is a fairly universal topic
- the overall architecture of the concrete implementation inside Spark SQL
- reading the code requires dealing with Scala's special syntax, the so-called syntactic sugar
Why do we need SQL
SQL is a standard, a standard for data analysis that has existed for many years.
In the big-data era, as data volumes keep growing, are the old analysis techniques obsolete? Clearly not: the original analytical skills remain valid along the existing analysis dimensions. Of course, for new data we also want to dig out more interesting and valuable insights, and that goal can be handed to data mining or machine learning.
So how can existing data analysts switch to a big-data platform quickly? By re-learning a scripting language and writing RDD code directly in Scala or Python? Obviously the price is too high and the learning cost too great. What data analysts hope for is that changes in the underlying storage and analysis engine have no direct impact on the upper-level analysis applications. Put in one sentence, the requirement is: "let me keep analysing the data with plain SQL statements."
That is why Hive sprang up, and its popularity proves that this design caters to a real market need. But because Hive uses Hadoop MapReduce as its execution engine, its processing speed is not satisfactory. Spark is famous for being fast, so soon enough some enthusiasts wrote Shark, which achieved very good results and earned an excellent reputation.
Shark, after all, is a project outside of Spark and not under Spark's control, so the Spark development team set themselves the goal of bringing SQL support into Spark's core functionality. That, in short, is the origin of the SQL feature in Spark.
Application examples
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

case class Person(name: String, age: Int)

val person = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
person.registerAsTable("person")

val teenagers = sql("SELECT name, age FROM person WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
The logic of the code above is clear: print the names of the teenagers aged 13 to 19 listed in people.txt.
Common SQL execution process
Components of a SQL statement
Everyone is very familiar with SQL statements, but have you ever thought about which parts one is composed of? Perhaps you would say, "Is that even a question? Isn't it just something like select * from TableX where f1=?, what more is there to it?"
Let's take a closer look anyway; maybe there is some new food for thought in it.
Label the parts of this simplest of SQL statements: select denotes the specific operation, namely querying data; "f1, f2, f3" denotes the columns to return; TableX is the data source; and the condition part is the query criterion. Have you noticed that the order in which a SQL statement is written differs from the order in which the equivalent RDD processing logic is expressed? The small example below makes the difference explicit.
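A minimal sketch of the contrast, using a plain local collection instead of a real RDD just to show the ordering (the row type and the data are made up for illustration):

```scala
// A toy "table" as a local collection of case-class rows.
case class Row(f1: String, f2: Int, f3: Double)
val tableX = Seq(Row("a", 1, 1.0), Row("b", 20, 2.0), Row("c", 30, 3.0))

// SQL states the result first:  SELECT f1, f2, f3 FROM tableX WHERE f2 > 10
// RDD/collection-style code instead follows the processing order: source -> filter -> project.
val result = tableX
  .filter(row => row.f2 > 10)            // WHERE f2 > 10
  .map(row => (row.f1, row.f2, row.f3))  // SELECT f1, f2, f3
```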
During analysis and execution, a SQL statement goes through the following steps:
- Syntax parsing
- Binding of operations
- Optimization of the execution plan
- Delivery for execution
Syntax parsing
After parsing, a syntax tree is formed. Each node in the tree is an operation (a rule) to execute, and the entire tree is called the execution plan.
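As a rough illustration of what such a tree looks like for the simple statement above, here is a toy sketch in plain Scala (these node classes are invented for illustration; they are not Spark's real Catalyst classes):

```scala
// Toy node types, only for illustration.
sealed trait Node
case class Relation(name: String) extends Node                      // the data source
case class Filter(condition: String, child: Node) extends Node      // the WHERE part
case class Project(columns: Seq[String], child: Node) extends Node  // the SELECT list

// "SELECT f1, f2, f3 FROM tableX WHERE condition" parses into roughly this tree:
val parsed: Node =
  Project(Seq("f1", "f2", "f3"),
    Filter("condition",
      Relation("tableX")))
```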
Plan optimization
Forming the execution plan tree described above is only the first step, because the plan can still be optimized. Optimization means merging nodes in the tree or adjusting their order.
Take the familiar join operation as an example. A join B is equivalent to B join A, yet swapping the two sides can have a significant effect on execution performance.
As another example, it often pays to perform the aggregation (Aggregate) first and only then do the join.
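To make "merging nodes or adjusting their order" concrete, here is a toy rewrite rule (again not Spark's real code; it reuses the illustrative node classes from the previous sketch, repeated so this snippet stands alone). It collapses two stacked Filter nodes into one and is applied repeatedly until the tree stops changing, which is exactly the fixed-point pattern we will meet later in RuleExecutor:

```scala
// Same toy classes as in the previous sketch, repeated so this snippet stands alone.
sealed trait Node
case class Relation(name: String) extends Node
case class Filter(condition: String, child: Node) extends Node
case class Project(columns: Seq[String], child: Node) extends Node

// One toy optimization rule: collapse two adjacent Filter nodes into a single Filter.
def mergeFilters(plan: Node): Node = plan match {
  case Filter(c1, Filter(c2, child)) => mergeFilters(Filter(s"($c1) AND ($c2)", child))
  case Filter(c, child)              => Filter(c, mergeFilters(child))
  case Project(cols, child)          => Project(cols, mergeFilters(child))
  case leaf                          => leaf
}

// Apply the rule repeatedly until the plan stops changing (a fixed point).
def optimize(plan: Node): Node = {
  val next = mergeFilters(plan)
  if (next == plan) plan else optimize(next)
}

optimize(Project(Seq("f1"), Filter("a > 1", Filter("b < 2", Relation("tableX")))))
// => Project(List(f1), Filter((a > 1) AND (b < 2), Relation(tableX)))
```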
Summary
The analysis above is meant to establish two points:
- Syntax parsing produces an execution plan tree
- The execution plan tree can be optimized; optimization means merging nodes in the tree or adjusting their order
For the details of how SQL queries are analysed and optimized, the Query Optimizer Deep Dive series of articles is highly recommended.
The implementation of SQL in Spark
With the groundwork above, you will have realized that if Spark is to support SQL, it inevitably has to implement the same three stages: parsing, optimization and execution.
The code of the whole SQL component can be broadly divided as follows:
- SqlParser generates the LogicalPlan tree
- Analyzer and Optimizer apply various rules to the LogicalPlan tree
- The finally optimized LogicalPlan is turned into a Spark RDD
- The resulting RDD is handed to Spark core for execution
Phase 1: Generating the LogicalPlan
Spark SQL introduces a new kind of RDD, the SchemaRDD.
Take a look at SchemaRDD's constructor:
class SchemaRDD( @transient val sqlContext: SQLContext, @transient protected[spark] val logicalPlan: LogicalPlan)
The constructor takes two arguments: one is the SQLContext, the other a LogicalPlan. So how is the LogicalPlan generated?
To answer that question we have to go back to the entry point of the whole chain, the sql function, which is defined as follows:
def sql(sqlText: String): SchemaRDD = {
  val result = new SchemaRDD(this, parseSql(sqlText))
  result.queryExecution.toRdd
  result
}
parseSql(sqlText) is responsible for generating the LogicalPlan; parseSql is an instance of SqlParser.
The key to understanding the SqlParser code is to figure out the calling conventions of StandardTokenParsers. It is full of strange symbols, and if you do not know what they mean it is hard to find the thread.
Since an apply function can be invoked without being named explicitly, parseSql(sqlText) in fact implicitly calls the apply function of SqlParser:
def apply(input: String): LogicalPlan = {
  phrase(query)(new lexical.Scanner(input)) match {
    case Success(r, x) => r
    case x => sys.error(x.toString)
  }
}
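A quick aside on the apply sugar just mentioned: any object (or instance) that defines an apply method can be "called" as if it were a function, which is exactly what happens to SqlParser here. A minimal, unrelated illustration:

```scala
object Doubler {
  def apply(x: Int): Int = x * 2
}

Doubler(21)  // syntactic sugar for Doubler.apply(21); returns 42
```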
The most puzzling line of code here is phrase(query)(new lexical.Scanner(input)). Put into plain words: if the input string conforms to the lexical rules defined in lexical, continue processing it with the query parser.
Let's see how query is defined:
protected lazy val query: Parser[LogicalPlan] =
  select * (
    UNION ~ ALL ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Union(q1, q2) } |
    UNION ~ opt(DISTINCT) ^^^ { (q1: LogicalPlan, q2: LogicalPlan) => Distinct(Union(q1, q2)) }
  ) | insert
Here, at last, a LogicalPlan appears; in other words, the transformation from an ordinary string into a LogicalPlan happens right here.
The code of query also shows that current Spark SQL supports only the select and insert operations; delete and update are not supported.
Note: even now you are probably still confused about how SqlParser is used. That is fine; please refer to the content of ref [3] and [4], and for what those strange symbols actually mean, see ref [5].
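If you would like a small, self-contained taste of the same machinery, the toy parser below is not Spark's SqlParser, just an illustration built on the same StandardTokenParsers base (it needs the Scala parser-combinator library, which ships with Scala 2.10). It accepts statements of the form "SELECT a, b FROM t" and shows the roles of phrase, the lexical Scanner, and the ~>, ~ and ^^ combinators:

```scala
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

object TinyParser extends StandardTokenParsers {
  // Tell the lexer which words are keywords and which characters are delimiters.
  lexical.reserved += ("SELECT", "FROM")
  lexical.delimiters += ","

  // SELECT ident (, ident)* FROM ident
  lazy val query: Parser[(List[String], String)] =
    ("SELECT" ~> repsep(ident, ",")) ~ ("FROM" ~> ident) ^^ {
      case cols ~ table => (cols, table)
    }

  def parse(input: String): (List[String], String) =
    phrase(query)(new lexical.Scanner(input)) match {
      case Success(result, _) => result
      case failure            => sys.error(failure.toString)
    }
}

TinyParser.parse("SELECT name, age FROM person")  // (List(name, age), person)
```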
Phase 2: QueryExecution
In the first phase the string was transformed into a LogicalPlan tree; the second phase acts on that LogicalPlan.
In the code shown for the first phase, which statement triggers the optimization rules? It is result.queryExecution.toRdd inside the sql function, where queryExecution is a QueryExecution. This again involves a piece of Scala syntactic sugar: QueryExecution is an abstract class, yet we see the following code.
protected[sql] def executePlan(plan: LogicalPlan): this.QueryExecution = new this.QueryExecution { val logical = plan }
How can an instance of an abstract class be created? Has the world collapsed? Don't panic: this is allowed in the Scala world. Scala simply creates an anonymous subclass of QueryExecution and instantiates it; the rules you know from Java still hold, there is just some machinery working behind the scenes. See the small illustration below.
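A minimal illustration of this "anonymous subclass" sugar, unrelated to Spark:

```scala
abstract class Greeter {
  def name: String                       // abstract member
  def greet(): String = s"Hello, $name"
}

// Looks like instantiating an abstract class, but actually defines and instantiates an
// anonymous subclass that supplies the missing member -- the same trick used for QueryExecution.
val g = new Greeter { val name = "Spark SQL" }
g.greet()  // "Hello, Spark SQL"
```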
OK, the most important character in Phase 2 is QueryExecution, defined as follows.
protected abstract class QueryExecution {
  def logical: LogicalPlan

  lazy val analyzed = analyzer(logical)
  lazy val optimizedPlan = optimizer(analyzed)
  lazy val sparkPlan = planner(optimizedPlan).next()
  lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)

  /** Internal version of the RDD. Avoids copies and has no schema */
  lazy val toRdd: RDD[Row] = executedPlan.execute()

  protected def stringOrError[A](f: => A): String =
    try f.toString catch { case e: Throwable => e.toString }

  def simpleString: String = stringOrError(executedPlan)

  override def toString: String =
    s"""== Logical Plan ==
       |${stringOrError(analyzed)}
       |== Optimized Logical Plan ==
       |${stringOrError(optimizedPlan)}
       |== Physical Plan ==
       |${stringOrError(executedPlan)}
    """.stripMargin.trim

  def debugExec() = DebugQuery(executedPlan).execute().collect()
}
Three key steps:
- lazy val analyzed = analyzer(logical)
- lazy val optimizedPlan = optimizer(analyzed)
- lazy val sparkPlan = planner(optimizedPlan).next()
Note that these values (together with executedPlan and toRdd) are all lazy vals, so nothing actually runs until toRdd is finally accessed; the small sketch below shows the effect.
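A tiny stand-alone illustration of this lazy-val chaining (unrelated to Spark itself): accessing the last value pulls the whole chain.

```scala
class Chain {
  lazy val analyzed  = { println("analyzing"); 1 }
  lazy val optimized = { val a = analyzed; println("optimizing"); a + 1 }
  lazy val planned   = { val o = optimized; println("planning"); o + 1 }
}

val q = new Chain  // nothing is printed yet
q.planned          // prints "analyzing", "optimizing", "planning" and returns 3
```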
Both Analyzer and Optimizer are subclasses of RuleExecutor.
The main entry point of a RuleExecutor is apply, which is the same for all subclasses; RuleExecutor's apply function is defined as follows.
def apply(plan: TreeType): TreeType = {
  var curPlan = plan

  batches.foreach { batch =>
    val batchStartPlan = curPlan
    var iteration = 1
    var lastPlan = curPlan
    var continue = true

    // Run until fix point (or the max number of iterations as specified in the strategy.
    while (continue) {
      curPlan = batch.rules.foldLeft(curPlan) {
        case (plan, rule) =>
          val result = rule(plan)
          if (!result.fastEquals(plan)) {
            logger.trace(
              s"""
                |=== Applying Rule ${rule.ruleName} ===
                |${sideBySide(plan.treeString, result.treeString).mkString("\n")}
              """.stripMargin)
          }
          result
      }
      iteration += 1
      if (iteration > batch.strategy.maxIterations) {
        logger.info(s"Max iterations ($iteration) reached for batch ${batch.name}")
        continue = false
      }

      if (curPlan.fastEquals(lastPlan)) {
        logger.trace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.")
        continue = false
      }
      lastPlan = curPlan
    }

    if (!batchStartPlan.fastEquals(curPlan)) {
      logger.debug(
        s"""
          |=== Result of Batch ${batch.name} ===
          |${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")}
        """.stripMargin)
    } else {
      logger.trace(s"Batch ${batch.name} has no effect.")
    }
  }

  curPlan
}
For a RuleExecutor subclass, the main job is to define its own batches. Let's see how batches is defined in Analyzer:
val batches: Seq[Batch] = Seq(
  Batch("MultiInstanceRelations", Once,
    NewRelationInstances),
  Batch("CaseInsensitiveAttributeReferences", Once,
    (if (caseSensitive) Nil else LowercaseAttributeReferences :: Nil) : _*),
  Batch("Resolution", fixedPoint,
    ResolveReferences ::
    ResolveRelations ::
    NewRelationInstances ::
    ImplicitGenerate ::
    StarExpansion ::
    ResolveFunctions ::
    GlobalAggregates ::
    typeCoercionRules :_*),
  Batch("AnalysisOperators", fixedPoint,
    EliminateAnalysisOperators)
)
A Batch defines a series of rules, and here the syntactic-sugar question pops up again: how should the :: operator be understood? :: means cons, i.e. it prepends elements to build up a list.
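A one-line illustration of cons (nothing Spark-specific):

```scala
val rules = "RuleA" :: "RuleB" :: Nil  // builds List("RuleA", "RuleB") by prepending elements
// In the batches above, the trailing ": _*" then expands such a list into the varargs
// parameter of the Batch constructor.
```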
The Batch constructor is given a series of rules; ResolveReferences, for example, is such a rule. The code of the individual rules is not analysed here.
Phase 3: Converting the LogicalPlan into a physical plan
In Phase 3, the main code is just two lines:
- lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
- lazy val toRdd: RDD[Row] = executedPlan.execute()
Compared with LogicalPlan, the most important difference of SparkPlan is its execute function.
The concrete implementations of SparkPlan are divided into UnaryNode, LeafNode and BinaryNode, that is, unary operators, leaf nodes and binary operators. For the details of each subclass, refer to the source code.
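As a rough conceptual sketch (these are toy stand-ins, not Spark's real Row, Expression or SparkPlan classes, and the real execute() returns an RDD[Row] rather than a Seq): a leaf node produces data itself, a unary node transforms the output of its single child, and calling execute() on the root recursively pulls data up through the whole tree.

```scala
// Toy stand-ins for Spark's Row and SparkPlan, for illustration only.
type Row = Map[String, Any]

trait PhysicalPlan { def execute(): Seq[Row] }

// "LeafNode": produces data itself.
case class PhysicalScan(data: Seq[Row]) extends PhysicalPlan {
  def execute(): Seq[Row] = data
}

// "UnaryNode": transforms the output of a single child.
case class PhysicalFilter(pred: Row => Boolean, child: PhysicalPlan) extends PhysicalPlan {
  def execute(): Seq[Row] = child.execute().filter(pred)
}

case class PhysicalProject(cols: Seq[String], child: PhysicalPlan) extends PhysicalPlan {
  def execute(): Seq[Row] = child.execute().map(_.filter { case (k, _) => cols.contains(k) })
}

// Executing the root evaluates the whole tree:
val plan = PhysicalProject(Seq("name"),
  PhysicalFilter(row => row("age").asInstanceOf[Int] >= 13,
    PhysicalScan(Seq(Map("name" -> "Justin", "age" -> 19)))))

plan.execute()  // Seq(Map(name -> Justin))
```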
Phase 4: Triggering RDD execution
How the RDD actually gets triggered and executed should already be familiar after the earlier articles in this series; it all comes down to this line of code.
"name:"+p(0)).foreach(println)
If you really do not follow it, I recommend going back and reading the earlier analysis of how a Spark job is executed.
Summary
Time to lay down the pen. The SQL code touches on quite a few topics; the most important things to keep clear are two: one is the general processing flow of a SQL statement, the other is the concrete implementation mechanism in the Spark SQL subsystem.
The concrete implementation of the Spark SQL sub-module revolves around the LogicalPlan tree: one part uses SqlParser to generate the LogicalPlan, the other uses RuleExecutor to apply various rules to that LogicalPlan. The finally generated, ordinary RDD is then handed to Spark core for processing.
Resources
1. Spark Catalyst Source Analysis
2. Query Optimizer Deep Dive
3. Playing with Scala Parser Combinator
4. Parsing Text with Scala
5. Scala Parser API