The previous two chapters spent a lot of space introducing how sparkSQL runs, yet many readers may still find the concepts abstract: what do an unresolved LogicalPlan, a LogicalPlan, or a PhysicalPlan actually look like? Knowing only the terms leaves things hazy. This chapter uses a tool, hive/console, to deepen the reader's understanding of sparkSQL run plans.
1: hive/console installation
Starting from version 1.0.0, sparkSQL provides a debugging tool, hive/console. The tool is aimed at developers; it is not included in the built or installed deployment packages and has to be compiled and run with sbt. To use the tool, the following are required:
- Spark 1.1.0 source code
- Hive 0.12 source code, compiled
- Configured environment variables
1.1: Installing hive/console
The following is the author's installation process:
A: Download the Spark 1.1.0 source code and install it in the /app/hadoop/spark110_sql directory.
B: Download the Hive 0.12 source code and install it in the /app/hadoop/hive012 directory; then enter the src directory and compile with the following command:
ant clean package -Dhadoop.version=2.2.0 -Dhadoop-0.23.version=2.2.0 -Dhadoop.mr.rev=23
C: Configure the environment variables in ~/.bashrc, then run source ~/.bashrc to make them take effect:
export HIVE_HOME=/app/hadoop/hive012/src/build/dist
export HIVE_DEV_HOME=/app/hadoop/hive012/src
export HADOOP_HOME=/app/hadoop/hadoop220
D: Switch to the Spark installation directory /app/hadoop/spark110_sql and run the command:
sbt/sbt hive/console
After a lengthy sbt compilation process, the hive/console prompt appears.
At the Scala prompt in the console, type :help for assistance, and press the TAB key to display the currently available methods, functions, and variables. As you continue to use the console, the variables you define or use also appear in the TAB listing, reflecting the current console state.
1.2: How hive/console works
The principle behind hive/console is simple: the Scala console preloads several key Catalyst classes, among which TestHive pre-defines the table structures and the data-loading commands. The data is the test data from the Hive 0.12 source code, located in /app/hadoop/hive012/src/data, i.e. $HIVE_DEV_HOME/data, and is loaded on demand.
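Because TestHive registers these tables lazily, they can be queried straight from the console prompt; the first reference to a table triggers the CREATE TABLE and LOAD DATA commands defined for it. A minimal sketch (src is one of the tables defined in the TestHive snippet below; the output depends on the loaded test data):

// Query one of the preloaded TestHive tables from the console prompt.
sql("select key, value from src limit 3").collect().foreach(println)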
/* from sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala */
// The test tables that are defined in the Hive QTestUtil.
// /itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java
val hiveQTestUtilTables = Seq(
  TestTable("src",
    "CREATE TABLE src (key INT, value STRING)".cmd,
    s"LOAD DATA LOCAL INPATH '${getHiveFile("data/files/kv1.txt")}' INTO TABLE src".cmd),
  TestTable("src1",
    "CREATE TABLE src1 (key INT, value STRING)".cmd,
    s"LOAD DATA LOCAL INPATH '${getHiveFile("data/files/kv3.txt")}' INTO TABLE src1".cmd),
  TestTable("srcpart", () => {
    runSqlHive(
      "CREATE TABLE srcpart (key INT, value STRING) PARTITIONED BY (ds STRING, hr STRING)")
    for (ds <- Seq("2008-04-08", "2008-04-09"); hr <- Seq("11", "12")) {
      runSqlHive(
        s"""LOAD DATA LOCAL INPATH '${getHiveFile("data/files/kv1.txt")}'
           |OVERWRITE INTO TABLE srcpart PARTITION (ds='$ds',hr='$hr')
         """.stripMargin)
    }
  }),
  ......
)
Because hive/console uses the Hive 0.12 test data, the two environment variables HIVE_HOME and HIVE_DEV_HOME need to be defined; if you use Hive 0.13, change them to the corresponding directories:
/* from sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala */
/** The location of the compiled hive distribution */
lazy val hiveHome = envVarToFile("HIVE_HOME")
/** The location of the hive source code. */
lazy val hiveDevHome = envVarToFile("HIVE_DEV_HOME")
In addition, if you want hive/console to preload more classes at startup, you can modify the project/SparkBuild.scala file in the Spark source code.
/* from project/SparkBuild.scala */
object Hive {
  lazy val settings = Seq(
    javaOptions += "-XX:MaxPermSize=1g",
    // Multiple queries rely on the TestHive singleton. See comments there for more details.
    parallelExecution in Test := false,
    // Supporting all SerDes requires us to depend on deprecated APIs, so we turn off the warnings
    // only for this subproject.
    scalacOptions <<= scalacOptions map { currentOpts: Seq[String] =>
      currentOpts.filterNot(_ == "-deprecation")
    },
    initialCommands in console :=
      """
        |import org.apache.spark.sql.catalyst.analysis._
        |import org.apache.spark.sql.catalyst.dsl._
        |import org.apache.spark.sql.catalyst.errors._
        |import org.apache.spark.sql.catalyst.expressions._
        |import org.apache.spark.sql.catalyst.plans.logical._
        |import org.apache.spark.sql.catalyst.rules._
        |import org.apache.spark.sql.catalyst.types._
        |import org.apache.spark.sql.catalyst.util._
        |import org.apache.spark.sql.execution
        |import org.apache.spark.sql.hive._
        |import org.apache.spark.sql.hive.test.TestHive._
        |import org.apache.spark.sql.parquet.ParquetTestData""".stripMargin)
}
2: Common operations
The following describes the common operations in hive/console, mainly those concerning run plans. Before starting, first define a table people and a query query:
Run the following line by line at the console prompt:
case class Person(name: String, age: Int, state: String)
sparkContext.parallelize(
  Person("Michael", 29, "CA") ::
  Person("Andy", 30, "NY") ::
  Person("Justin", 19, "CA") ::
  Person("Justin", 25, "CA") :: Nil).registerTempTable("people")
val query = sql("select * from people")
2.1 Viewing the schema of a query
query.printSchema
2.2 Viewing the entire run plan of a query
query.queryExecution
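queryExecution bundles every stage of the planning process; the plan accessors used in the next few subsections are simply its members. A minimal sketch of how they relate, using the query defined above (field names follow the Spark 1.1 code base):

val qe = query.queryExecution
qe.logical        // unresolved LogicalPlan (section 2.3)
qe.analyzed       // analyzed LogicalPlan (section 2.4)
qe.optimizedPlan  // optimized LogicalPlan (section 2.5)
qe.sparkPlan      // physical plan (section 2.6)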
2.3 Viewing the unresolved LogicalPlan of a query
query.queryExecution.logical
2.4 Viewing the analyzed LogicalPlan of a query
query.queryExecution.analyzed
2.5 Viewing the optimized LogicalPlan
query.queryExecution.optimizedPlan
2.6 Viewing the physical plan
query.queryExecution.sparkPlan
2.7 Viewing the RDD conversion process
query.toDebugString
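The output is the lineage of RDDs that the query compiles to, printed from the resulting SchemaRDD down to the source RDD. A rough sketch of its shape (RDD names and ids are illustrative); the RDD produced by the physical plan can also be inspected directly through queryExecution.toRdd:

// Shape of the output (illustrative):
// (4) SchemaRDD[...] ...
//  |  MapPartitionsRDD[...] ...
//  |  ParallelCollectionRDD[...] ...
query.queryExecution.toRdd.toDebugString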
2.8 More operations can be discovered through the TAB key; you can also consult the sparkSQL API, or look at the methods and functions in the source code.
3: Run plans for different data sources
The common operations above all used data that came from an RDD. As we know, sparkSQL can also read from multiple data sources: jsonFile, parquetFile, and Hive. Below we look at the schemas and run plans for these data sources.
3.1 JSON file
JSON files support nested tables, and sparkSQL can read such nested tables too. For example, after the JSON data below is trimmed (spaces and line breaks removed) and saved, it can be read into sparkSQL with jsonFile.
{"fullname": "Sean Kelly",
 "org": "SK Consulting",
 "emailaddrs": [
    {"type": "work", "value": "[email protected]"},
    {"type": "home", "pref": 1, "value": "[email protected]"}
 ],
 "telephones": [
    {"type": "work", "pref": 1, "value": "+1 214 555 1212"},
    {"type": "fax", "value": "+1 214 555 1213"},
    {"type": "mobile", "value": "+1 214 555 1214"}
 ],
 "addresses": [
    {"type": "work", "format": "us",
     "value": "1234 Main St\nSpringfield, TX 78080-1216"},
    {"type": "home", "format": "us",
     "value": "5678 Main St\nSpringfield, TX 78080-1316"}
 ],
 "urls": [
    {"type": "work", "value": "http://seankelly.biz/"},
    {"type": "home", "value": "http://seankelly.tv/"}
 ]
}
Remove the spaces and newline characters, save the data to /home/mmicky/data/nestjson.json, read it in with jsonFile and register it as the table jsonPerson, then define a query jsonQuery:
jsonFile("/home/mmicky/data/nestjson.json").registerTempTable("jsonPerson")
val jsonQuery = sql("select * from jsonPerson")
View the schema of jsonQuery:
jsonQuery.printSchema
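Because the JSON record is nested, the printed schema contains array and struct fields rather than a flat list of columns. A rough sketch of the output (field order, type rendering, and nullability flags vary with the Spark version):

// root
//  |-- addresses: array
//  |    |-- type: string
//  |    |-- format: string
//  |    |-- value: string
//  |-- emailaddrs: array
//  |    |-- type: string
//  |    |-- pref: integer
//  |    |-- value: string
//  |-- fullname: string
//  |-- org: string
//  ...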
View the entire run plan of jsonQuery:
jsonQuery.queryExecution
3.2 Parquet file
Read the parquet file in and register it as the table parquetWiki, then define a query parquetQuery:
parquetFile("/home/mmicky/data/spark/wiki_parquet").registerTempTable("parquetWiki")
val parquetQuery = sql("select * from parquetWiki")
View the schema of parquetQuery:
parquetQuery.printSchema
View the entire run plan of parquetQuery:
parquetQuery.queryExecution
3.3 Hive data
As mentioned earlier, the TestHive class already defines a large number of Hive 0.12 test tables, such as src and sales, which can be used directly in hive/console; they are loaded the first time they are used. Below we use the sales table to look at its schema and entire run plan. First define a query hiveQuery:
val hiveQuery = sql("select * from sales")
View the schema of hiveQuery:
hiveQuery.printSchema
View the entire run plan of hiveQuery:
hiveQuery.queryExecution
As can be seen from the above, the physical plans for data coming from jsonFile, parquetFile, and Hive are quite different.
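The difference shows up mainly in the leaf operator that scans the data source. A rough sketch of the leaf of each physical plan (operator names follow the Spark 1.1 code base; attribute lists are abbreviated):

// JSON (an RDD produced by jsonFile):
//   ExistingRdd [addresses#0, emailaddrs#1, ...], MapPartitionsRDD[...]
// Parquet:
//   ParquetTableScan [...], (ParquetRelation /home/mmicky/data/spark/wiki_parquet), []
// Hive:
//   HiveTableScan [...], (MetastoreRelation default, sales, None), None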
4: Run plans for different queries
To deepen our understanding, we list the run plans and RDD conversion processes of a few common queries.
4.1 Aggregate queries
sql("select state, avg(age) from people group by state").queryExecution
sql("select state, avg(age) from people group by state").toDebugString
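In the physical plan of this aggregate query, the aggregation is split into a partial aggregation before the shuffle and a final aggregation after it, with an Exchange repartitioning by the grouping key in between. A rough sketch of the shape (operator arguments and expression IDs are illustrative):

// Aggregate false, [state#2], [state#2, AVG(age#1) AS c1#...]     // final aggregation
//  Exchange (HashPartitioning [state#2], 200)                      // shuffle on the grouping key
//   Aggregate true, [state#2], [state#2, ...partial sum/count...]  // partial aggregation
//    ExistingRdd [name#0,age#1,state#2], ...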
4.2 Join operation
sql("select a.name, b.name from people a join people b where a.name = b.name").queryExecution
sql("select a.name, b.name from people a join people b where a.name = b.name").toDebugString
4.3 distinct operation
sql("select distinct a.name, b.name from people a join people b where a.name = b.name").queryExecution
sql("select distinct a.name, b.name from people a join people b where a.name = b.name").toDebugString
5: Query optimization
The queries above are relatively simple, so the optimization process is not visible in them. The following examples illustrate how sparkSQL optimizes queries.
5.1 CombineFilters
CombineFilters merges filters; it fires when a plan contains multiple Filter operators, as in the following query:
sql("select name from (select * from people where age >= 19) a where a.age < 30").queryExecution
In the query above, during optimization the two filters age >= 19 and age < 30 are merged into a single filter ((age >= 19) && (age < 30)). In fact, another optimization also takes place here: projection pushdown (column pruning). The subquery selects all the columns of the table while the main query only uses the column name, so after optimization only the columns that are actually needed are read instead of every column.
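A rough sketch of how the merged filter appears in the optimized LogicalPlan (expression IDs are illustrative; the column-pruning Project is omitted):

// Project [name#0]
//  Filter ((age#1 >= 19) && (age#1 < 30))
//   ExistingRdd [name#0,age#1,state#2], ...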
5.2 PushPredicateThroughProject
PushPredicateThroughProject pushes a filter predicate down through a Project. In the following query, the filter location = 'CA' is pushed below the projection that renames state to location:
sql("select name from (select name, state as location from people) a where location = 'CA'").queryExecution
5.3 ConstantFolding
ConstantFolding folds constant expressions at optimization time, as in the following example:
sql("select name, 1+2 from people").queryExecution
During optimization, the constant expression 1+2 is evaluated to 3 and exposed as a new output column.
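A rough sketch of the optimized LogicalPlan (the generated column alias and expression IDs are illustrative):

// Project [name#0, 3 AS c1#3]
//  ExistingRdd [name#0,age#1,state#2], ...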
5.4 Custom optimizations
The optimizer in sparkSQL defines 3 classes of optimizations with 12 optimization rules in total, which are not shown here one by one. Custom optimizations can also be conveniently debugged in hive/console: first define a LogicalPlan, then test it with the custom optimization function. Below is an example based on CombineFilters. First define a rule:
object CombineFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(c1, Filter(c2, grandChild)) =>
      Filter(And(c1, c2), grandChild)
  }
}
Then define a query and use query.queryExecution.analyzed to view the LogicalPlan before optimization:
val query = sql("select * from people").where('age >= 19).where('age < 30)
query.queryExecution.analyzed
Finally, use a custom optimization function to optimize:
CombineFilters(query.queryExecution.analyzed)
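A rough sketch of the effect (expression IDs are illustrative):

// Before (query.queryExecution.analyzed):
//   Filter (age#1 < 30)
//    Filter (age#1 >= 19)
//     Project [name#0,age#1,state#2]
//      ...
// After CombineFilters(query.queryExecution.analyzed):
//   Filter ((age#1 < 30) && (age#1 >= 19))
//    Project [name#0,age#1,state#2]
//     ...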
You can see that the two filters have been merged into one. In hive/console you can even apply a rule to a LogicalPlan directly with transform, without wrapping it in a Rule object. Below we define a query and use query.queryExecution.analyzed to view the LogicalPlan before applying the rule:
val hiveQuery = sql("select * from (select * from src) a")
hiveQuery.queryExecution.analyzed
Then apply the custom rule directly with transform:
hiveQuery.queryExecution.analyzed transform {
  case Project(projectList, child) if projectList == child.output => child
}
This transform merges the Projects of the main query and the subquery in the analyzed LogicalPlan: a Project that merely selects all of its child's output is replaced by its child.
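A simplified sketch of the effect (the actual analyzed plan contains additional nodes such as Subquery, and expression IDs are illustrative):

// Before (hiveQuery.queryExecution.analyzed, simplified):
//   Project [key#0,value#1]
//    Project [key#0,value#1]
//     MetastoreRelation default, src, None
// After the transform, Projects that merely pass through their child's output are gone:
//   MetastoreRelation default, src, None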
With the examples above, plus your own experimentation, most readers should now have a fairly clear understanding of sparkSQL run plans.