SparkSQL 1.1: An in-depth understanding of the SparkSQL execution plan


The previous two chapters spent a lot of space introducing how SparkSQL works, yet many readers still find the concepts abstract: what does an unresolved LogicalPlan, a LogicalPlan, or a PhysicalPlan actually look like? Knowing only the terms leaves the picture hazy.

This chapter focuses on a tool, hive/console, to deepen the reader's understanding of SparkSQL's execution plans.


1: hive/console installation
Starting with version 1.0.0, SparkSQL provides a debugging tool called hive/console.

The tool is intended for developers; it is not included in the generated installation/deployment package and must be compiled and run with SBT. To use it, the following are required:

    • spark1.1.0 Source Code
    • hive0.12 source code, compiled
    • Configured environment variables

1.1: Installing hive/console
The following is the author's installation process:
A: Download the spark1.1.0 source code and install it in the /app/hadoop/spark110_sql folder.
B: Download the hive0.12 source code and install it in the /app/hadoop/hive012 folder, then enter the src folder and compile with the following command:
ant clean package -Dhadoop.version=2.2.0 -Dhadoop-0.23.version=2.2.0 -Dhadoop.mr.rev=23
C: Configure the environment variables in ~/.bashrc, then run source ~/.bashrc to make them take effect:
export HIVE_HOME=/app/hadoop/hive012/src/build/dist
export HIVE_DEV_HOME=/app/hadoop/hive012/src
export HADOOP_HOME=/app/hadoop/hadoop220
D: Switch to the Spark source folder /app/hadoop/spark110_sql and execute the command:
sbt/sbt hive/console
After a lengthy SBT compilation process, the console prompt finally appears.
At the console's Scala prompt you can type :help for help, and press the TAB key to display the methods, functions, and variables currently available. As you continue using the console, variables you have defined or used will also show up in the TAB completions.
1.2: How hive/console works
The principle behind hive/console is simple: it loads several key Catalyst classes into a Scala console, where TestHive pre-defines the table structures and load commands. The data is the test data shipped with the hive0.12 source code and is loaded on demand; it is located in /app/hadoop/hive012/src/data, i.e. $HIVE_DEV_HOME/data.

/* From sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala */
// The test tables that are defined in the Hive QTestUtil.
// /itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java
val hiveQTestUtilTables = Seq(
  TestTable("src",
    "CREATE TABLE src (key INT, value STRING)".cmd,
    s"LOAD DATA LOCAL INPATH '${getHiveFile("data/files/kv1.txt")}' INTO TABLE src".cmd),
  TestTable("src1",
    "CREATE TABLE src1 (key INT, value STRING)".cmd,
    s"LOAD DATA LOCAL INPATH '${getHiveFile("data/files/kv3.txt")}' INTO TABLE src1".cmd),
  TestTable("srcpart", () => {
    runSqlHive(
      "CREATE TABLE srcpart (key INT, value STRING) PARTITIONED BY (ds STRING, hr STRING)")
    for (ds <- Seq("2008-04-08", "2008-04-09"); hr <- Seq("11", "12")) {
      runSqlHive(
        s"""LOAD DATA LOCAL INPATH '${getHiveFile("data/files/kv1.txt")}'
           |OVERWRITE INTO TABLE srcpart PARTITION (ds='$ds',hr='$hr')
         """.stripMargin)
    }
  }),
  ......)
Because the hive0.12 test data is used, the two environment variables HIVE_HOME and HIVE_DEV_HOME must be defined. If hive0.13 is used instead, they need to be pointed at the corresponding folders:
/* From sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala */
/** The location of the compiled hive distribution */
lazy val hiveHome = envVarToFile("HIVE_HOME")
/** The location of the hive source code. */
lazy val hiveDevHome = envVarToFile("HIVE_DEV_HOME")
Also, if the user wants hive/console to pre-load additional classes at startup, the project/SparkBuild.scala file in the Spark source tree can be modified:
/* From project/SparkBuild.scala */
object Hive {
  lazy val settings = Seq(
    javaOptions += "-XX:MaxPermSize=1g",
    // Multiple queries rely on the TestHive singleton. See comments there for more details.
    parallelExecution in Test := false,
    // Supporting all SerDes requires us to depend on deprecated APIs, so we turn off the warnings
    // only for this subproject.
    scalacOptions <<= scalacOptions map { currentOpts: Seq[String] =>
      currentOpts.filterNot(_ == "-deprecation")
    },
    initialCommands in console :=
      """
        |import org.apache.spark.sql.catalyst.analysis._
        |import org.apache.spark.sql.catalyst.dsl._
        |import org.apache.spark.sql.catalyst.errors._
        |import org.apache.spark.sql.catalyst.expressions._
        |import org.apache.spark.sql.catalyst.plans.logical._
        |import org.apache.spark.sql.catalyst.rules._
        |import org.apache.spark.sql.catalyst.types._
        |import org.apache.spark.sql.catalyst.util._
        |import org.apache.spark.sql.execution
        |import org.apache.spark.sql.hive._
        |import org.apache.spark.sql.hive.test.TestHive._
        |import org.apache.spark.sql.parquet.ParquetTestData""".stripMargin)
}

2: Common operations
The following describes the common operations of hive/console, most of which relate to execution plans. Before doing so, first define a table people and a query query:
// Enter on the console line by line
case class Person(name: String, age: Int, state: String)
// sample data for the examples below
sparkContext.parallelize(
  Person("Michael", 29, "CA") ::
  Person("Andy", 30, "NY") ::
  Person("Justin", 19, "CA") ::
  Person("Justin", 25, "CA") :: Nil).registerTempTable("people")
val query = sql("select * from people")
2.1 Viewing the schema of a query
query.printSchema
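For the people table defined above, the output looks roughly like the following (an illustrative sketch; the exact type names and nullable flags can differ slightly by Spark build):
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- state: string (nullable = true)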


2.2 Viewing the entire execution plan for a query
query.queryExecution
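queryExecution prints every phase of the plan at once. For the simple query above, the output is shaped roughly like the sketch below (the header wording and attribute ids vary between Spark versions; the individual phases are examined one by one in sections 2.3 to 2.6):
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
...
== Physical Plan ==
...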


2.3 Viewing the unresolved LogicalPlan of a query
query.queryExecution.logical


2.4 Viewing the analyzed LogicalPlan of a query
query.queryExecution.analyzed


2.5 Viewing the optimized LogicalPlan
query.queryExecution.optimizedPlan


2.6 Viewing the physical plan
query.queryExecution.sparkPlan


2.7 Viewing the RDD conversion process
query.toDebugString


2.8 Other operations
Many other operations can be discovered through the TAB key; you can also consult the SparkSQL API or refer to the methods and functions in the source code.


3: Execution plans for different data sources
The common operations above used data from an RDD. As we know, SparkSQL data can also come from several other sources: jsonFile, parquetFile, and Hive.

Here's a look at the schema of these data sources:

3.1 JSON file
JSON supports nested structures, and SparkSQL can read them in as nested tables. For example, JSON data of the following form can be loaded into SparkSQL with jsonFile after trimming (removing spaces and line breaks).
{
  "fullname": "Sean Kelly",
  "org": "SK Consulting",
  "emailaddrs": [
    {"type": "work", "value": "[email protected]"},
    {"type": "home", "pref": 1, "value": "[email protected]"}
  ],
  "telephones": [
    {"type": "work", "pref": 1, "value": "+1 214 555 1212"},
    {"type": "fax", "value": "+1 214 555 1213"},
    {"type": "mobile", "value": "+1 214 555 1214"}
  ],
  "addresses": [
    {"type": "work", "format": "us", "value": "1234 Main St\nSpringfield, TX 78080-1216"},
    {"type": "home", "format": "us", "value": "5678 Main St\nSpringfield, TX 78080-1316"}
  ],
  "urls": [
    {"type": "work", "value": "http://seankelly.biz/"},
    {"type": "home", "value": "http://seankelly.tv/"}
  ]
}
After removing the spaces and newline characters, save it as /home/mmicky/data/nestjson.json, read it in with jsonFile and register it as the table jsonPerson, then define a query jsonQuery:
Jsonfile ("/home/mmicky/data/nestjson.json"). Registertemptable ("Jsonperson") val jsonquery = SQL ("SELECT * FROM Jsonperson ")
View the schema of jsonQuery:
jsonQuery.printSchema
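For the nested JSON document above, the inferred schema looks roughly like this (an abbreviated, illustrative sketch; field order and inferred numeric types can differ by Spark version):
root
 |-- addresses: array (nullable = true)
 |    |-- element: struct
 |    |    |-- format: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- emailaddrs: array (nullable = true)
 |    |-- ...
 |-- fullname: string (nullable = true)
 |-- org: string (nullable = true)
 |-- telephones: array (nullable = true)
 |    |-- ...
 |-- urls: array (nullable = true)
 |    |-- ...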

View the entire execution plan of jsonQuery:
jsonQuery.queryExecution


3.2 Parquet file
Read in the Parquet file and register it as the table parquetWiki, then define a query parquetQuery:
Parquetfile ("/home/mmicky/data/spark/wiki_parquet"). Registertemptable ("Parquetwiki") val parquetQuery = SQL (" SELECT * FROM Parquetwiki ")
Query the schema of parquetQuery:
parquetQuery.printSchema

Query the entire execution plan of parquetQuery:
parquetQuery.queryExecution


3.3 Hive data
As mentioned earlier, the TestHive class already defines a large number of hive0.12 test tables, such as src and sales, which can be used directly in hive/console; they are loaded the first time hive/console uses them.

Below we use the sales table to look at its schema and its entire execution plan. First define a query hiveQuery:

val hiveQuery = sql("select * from sales")
View the schema of hiveQuery:
hiveQuery.printSchema

View the entire execution plan of hiveQuery:
hiveQuery.queryExecution

As can be seen from the above, the physical plans for data coming from jsonFile, parquetFile, and Hive are quite different.
4: Execution plans for different queries
To deepen our understanding, we list the execution plans and RDD conversion processes of several common queries.
4.1 Aggregate query
SQL ("Select STATE,AVG (age) from people group by State"). Queryexecution

SQL ("Select STATE,AVG (age) from people group by State"). todebugstring

4.2 Join operation
SQL ("Select A.name,b.name from people a join people b where a.name=b.name"). Queryexecution

SQL ("Select A.name,b.name from people a join people b where a.name=b.name"). todebugstring


4.3 distinct operation
SQL ("SELECT distinct a.name,b.name from people a join people b where a.name=b.name"). Queryexecution

SQL ("SELECT distinct a.name,b.name from people a join people b where a.name=b.name"). todebugstring

5: Query optimization
The queries above are relatively simple, so the optimization process is not really visible in them. The following examples illustrate SparkSQL's optimization process.

5.1 CombineFilters
CombineFilters merges Filters; it fires when a plan contains multiple Filters. For example, the following query:
SQL ("Select name from (SELECT * from people where >=19) a where a.age <30"). Queryexecution

During optimization, the two Filters age>=19 and age<30 in the above query are merged into ((age>=19) && (age<30)). There is actually a second optimization at work as well: Project pushdown. The subquery selects all columns of the table while the main query uses only the name column, so after optimization only the name column is read when querying the data.
5.2 PushPredicateThroughProject
PushPredicateThroughProject pushes a predicate (Filter) down through a Project, as in the following example:
SQL ("Select name from (select Name,state as location from people) a where location= ' CA '"). Queryexecution

5.3 ConstantFolding
ConstantFolding evaluates constant expressions at optimization time. For example:
SQL ("Select name,1+2 from People"). Queryexecution

During optimization the constant expression is evaluated directly, so 1+2 becomes 3, represented as a new column.
5.4 Custom optimizations
SparkSQL's Optimizer defines 3 categories with 12 optimization rules, which are not listed one by one here. Custom optimizations are also very easy to debug in hive/console: just build a LogicalPlan and test it with your own optimization function. Here is an example modeled on CombineFilters. First define a rule:
object CombineFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(c1, Filter(c2, grandChild)) =>
      Filter(And(c1, c2), grandChild)
  }
}
Then define a query and use query.queryExecution.analyzed to view the LogicalPlan before optimization:
val query = sql("select * from people").where('age >= 19).where('age < 30)
query.queryExecution.analyzed

Finally, apply the custom optimization function:
CombineFilters(query.queryExecution.analyzed)
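A hedged sketch of the effect (attribute ids illustrative): the analyzed plan contains two nested Filters, and after applying CombineFilters they become a single Filter with an And condition:
before:  Filter (age#1 < 30)
          Filter (age#1 >= 19)
           Project [name#0,age#1,state#2]
            (scan of the people RDD)
after:   Filter ((age#1 < 30) && (age#1 >= 19))
          Project [name#0,age#1,state#2]
           (scan of the people RDD)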

You can see that the two Filters have been merged into one.

You can even apply a rule to a LogicalPlan directly with transform in hive/console. Define a query below and use query.queryExecution.analyzed to view the LogicalPlan before the rule is applied:
val hiveQuery = sql("SELECT * FROM (SELECT * FROM src) a")
hiveQuery.queryExecution.analyzed

Then apply a rule of your own directly with transform:
hiveQuery.queryExecution.analyzed transform {
  case Project(projectList, child) if projectList == child.output => child
}

The transform merges the Projects of the main query and the subquery in the LogicalPlan, removing the redundant Project.
Working through the examples above and combining them with your own understanding, most readers should now have a much better grasp of SparkSQL's execution plans.
