Background
This article mainly introduces the current implementation of the CLI in Spark SQL. The code will certainly change a lot going forward, so I focus only on the core logic. The approach is to compare against the Hive CLI implementation, pointing out where Spark SQL modifies it and where it stays consistent with the Hive CLI. You may want to read the Summary section first.
Spark SQL's hive-thriftserver project contains its CLI implementation code. Below, the main implementation classes of the Hive CLI and the relationships between them are described first, followed by how the Spark SQL CLI is implemented.
Hive CLI
The core startup class is org.apache.hive.service.server.HiveServer2, started as follows:
    try {
      ServerOptionsProcessor oproc = new ServerOptionsProcessor("hiveserver2");
      if (!oproc.process(args)) {
        LOG.fatal("Error starting HiveServer2 with given arguments");
        System.exit(-1);
      }
      HiveConf hiveConf = new HiveConf();
      HiveServer2 server = new HiveServer2();
      server.init(hiveConf);
      server.start();
    } catch (Throwable t) {
      LOG.fatal("Error starting HiveServer2", t);
      System.exit(-1);
    }
HiveServer2 inherits from the CompositeService class, which internally maintains a serviceList and can add, remove, start and stop different services. In init(hiveConf), HiveServer2 adds two services: CLIService and ThriftCLIService. Depending on the transport mode, ThriftHttpCLIService is used for HTTP or HTTPS, otherwise ThriftBinaryCLIService is used. Whichever ThriftCLIService variant is chosen, a reference to the CLIService is passed into it; the thrift layer is just a wrapper.
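A minimal Scala sketch of this composition pattern (hypothetical names, not the actual Hive code), just to illustrate how a CompositeService propagates lifecycle calls to its children:

    // A CompositeService-style parent keeps a list of child services and
    // forwards init/start/stop to each of them.
    abstract class ServiceSketch {
      def init(conf: Map[String, String]): Unit = {}
      def start(): Unit = {}
      def stop(): Unit = {}
    }

    class CompositeServiceSketch extends ServiceSketch {
      private val serviceList = scala.collection.mutable.ListBuffer[ServiceSketch]()

      protected def addService(s: ServiceSketch): Unit = serviceList += s

      override def init(conf: Map[String, String]): Unit = serviceList.foreach(_.init(conf))
      override def start(): Unit = serviceList.foreach(_.start())
      // stop children in reverse order of registration
      override def stop(): Unit = serviceList.reverse.foreach(_.stop())
    }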
After these services are added, they are started.
CLIService also inherits from CompositeService. In init, CLIService adds the SessionManager service and, based on the HiveConf, obtains the server user name from the UGI via the Hadoop shims.
SessionManager manages the opening and closing of Hive connections. Existing connections are kept in a HashMap whose values are HiveSession objects, which roughly hold the user name, password, Hive configuration and other session info.
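A rough, self-contained sketch of this bookkeeping (hypothetical names and simplified fields, not the actual Hive classes):

    case class HiveSessionSketch(username: String, password: String, conf: Map[String, String])

    class SessionManagerSketch {
      // existing connections, keyed by a session handle
      private val handleToSession =
        new java.util.concurrent.ConcurrentHashMap[String, HiveSessionSketch]()

      def openSession(user: String, password: String, conf: Map[String, String]): String = {
        val handle = java.util.UUID.randomUUID().toString
        handleToSession.put(handle, HiveSessionSketch(user, password, conf))
        handle
      }

      def closeSession(handle: String): Unit = handleToSession.remove(handle)
    }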
So almost everything in CLIService is delegated to SessionManager.
Inside SessionManager, the main service is OperationManager, the most important class and the one that carries the execution logic; it is discussed in detail below.
As for ThriftCLIService, there are two subclass implementations, which only override the run() method to set up the thrift server's network connections; the rest of the logic that calls into CLIService stays in the parent ThriftCLIService class itself.
In fact, ThriftCLIService delegates most of its work to CLIService.
That is roughly the Hive CLI: the thrift server startup process and the relationships between its major classes.
Spark SQL CLI
With the Hive CLI logic above in mind, let's look at how the Spark SQL CLI is implemented.
HiveThriftServer2 in Spark (the class name looks a bit odd) inherits Hive's HiveServer2 and overrides the init method. During initialization it adds two services: SparkSQLCLIService and ThriftBinaryCLIService. The former inherits Hive's CLIService with some different logic; the latter uses the Hive class directly, but with the SparkSQLCLIService reference passed in.
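A hedged sketch of that wiring, reusing the stub ServiceSketch / CompositeServiceSketch classes from the sketch in the Hive CLI section above (simplified; the real Spark code also injects a HiveContext and uses reflection helpers to set private fields in the Hive parent classes):

    // Hypothetical stand-ins for CLIService, SparkSQLCLIService and ThriftBinaryCLIService.
    class CLIServiceSketch extends CompositeServiceSketch
    class SparkSQLCLIServiceSketch extends CLIServiceSketch
    class ThriftBinaryCLIServiceSketch(cliService: CLIServiceSketch) extends ServiceSketch

    class HiveThriftServer2Sketch extends CompositeServiceSketch {
      override def init(conf: Map[String, String]): Unit = {
        val sparkSqlCliService = new SparkSQLCLIServiceSketch
        addService(sparkSqlCliService)
        // the thrift service is the plain Hive one, but it is handed the Spark CLIService
        addService(new ThriftBinaryCLIServiceSketch(sparkSqlCliService))
        super.init(conf)
      }
    }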
Inside SparkSQLCLIService, just like in Hive's CLIService, there is a SparkSQLSessionManager that inherits from Hive's SessionManager. There is also the logic for obtaining the server user name, and that code is the same as in CLIService.
In SparkSQLSessionManager's init method, Spark's own SparkSQLOperationManager service is installed, which inherits from Hive's OperationManager class.
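Continuing the same stub sketch, the pattern is simply to subclass the session manager and swap in the custom operation manager during init (again hypothetical names; the real code has to use reflection helpers because the corresponding field in Hive's SessionManager is not directly accessible):

    class OperationManagerSketch extends ServiceSketch
    class SparkSQLOperationManagerSketch extends OperationManagerSketch

    class SparkSQLSessionManagerSketch extends CompositeServiceSketch {
      override def init(conf: Map[String, String]): Unit = {
        // install the Spark SQL operation manager in place of Hive's
        addService(new SparkSQLOperationManagerSketch)
        super.init(conf)
      }
    }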
The classes above may be a bit dizzying, but in essence they are just wrappers with no big differences. What really matters is the SparkSQLOperationManager class, which defines how query operations are handled with Spark SQL.
SparkSQLOperationManager Key Logic
The CLI Operation parent class in Hive has the following subclass inheritance hierarchy, representing the different types of operations the Hive CLI handles:
The upper part of the hierarchy, the ExecuteStatementOperation subclasses, covers the operations actually related to queries; the lower part consists of metadata read operations. SparkSQLOperationManager only overrides the execution logic of the ExecuteStatementOperation subclasses, while the metadata-related operations follow Hive's original processing logic.
Hive's original dispatch logic for ExecuteStatementOperation looks like this:
    public static ExecuteStatementOperation newExecuteStatementOperation(
        HiveSession parentSession, String statement,
        Map<String, String> confOverlay, boolean runAsync) {
      String[] tokens = statement.trim().split("\\s+");
      String command = tokens[0].toLowerCase();
      if ("set".equals(command)) {
        return new SetOperation(parentSession, statement, confOverlay);
      } else if ("dfs".equals(command)) {
        return new DfsOperation(parentSession, statement, confOverlay);
      } else if ("add".equals(command)) {
        return new AddResourceOperation(parentSession, statement, confOverlay);
      } else if ("delete".equals(command)) {
        return new DeleteResourceOperation(parentSession, statement, confOverlay);
      } else {
        return new SQLOperation(parentSession, statement, confOverlay, runAsync);
      }
    }
ExecuteStatementOperation is divided into two branches: HiveCommandOperation and SQLOperation.
The different ExecuteStatementOperation subclasses end up using the corresponding CommandProcessor subclasses to complete the operation request.
So how does Spark rewrite ExecuteStatementOperation's execution logic?
The core logic is as follows:
    def run(): Unit = {
      logInfo(s"Running query '$statement'")
      setState(OperationState.RUNNING)
      try {
        result = hiveContext.sql(statement)
        logDebug(result.queryExecution.toString())
        val groupId = round(random * 1000000).toString
        hiveContext.sparkContext.setJobGroup(groupId, statement)
        iter = result.queryExecution.toRdd.toLocalIterator
        dataTypes = result.queryExecution.analyzed.output.map(_.dataType).toArray
        setHasResultSet(true)
      } catch {
        // Actually do need to catch Throwable as some failures don't inherit from Exception and
        // HiveServer will silently swallow them.
        case e: Throwable =>
          logError("Error executing query:", e)
          throw new HiveSQLException(e.toString)
      }
      setState(OperationState.FINISHED)
    }
statement is a string, the query itself. It is passed to HiveContext's sql() method, which returns a SchemaRDD. The logic in HiveContext is as follows:
    override def sql(sqlText: String): SchemaRDD = {
      // TODO: Create a framework for registering parsers instead of just hardcoding if statements.
      if (dialect == "sql") {
        super.sql(sqlText)
      } else if (dialect == "hiveql") {
        new SchemaRDD(this, HiveQl.parseSql(sqlText))
      } else {
        sys.error(s"Unsupported SQL dialect: $dialect. Try 'sql' or 'hiveql'")
      }
    }
After sql() finishes, it returns a SchemaRDD that carries the parsed, unanalyzed logical plan. The subsequent
    logDebug(result.queryExecution.toString())
step triggers the further processes of analyzing and optimizing the logical plan and turning it into a physical execution plan. After that,
    result.queryExecution.toRdd
is the step that triggers the computation and returns the result. These steps are the logic covered in the previous Spark SQL source analysis articles.
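Putting the chain together, a condensed standalone usage sketch (Spark 1.x-era API; the table name here is only illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object QueryExecutionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("query-execution-sketch"))
        val hiveContext = new HiveContext(sc)

        val result = hiveContext.sql("SELECT key, value FROM src")  // parsing only: a SchemaRDD holding the logical plan
        println(result.queryExecution.toString)                     // forces analysis, optimization and physical planning
        val iter = result.queryExecution.toRdd.toLocalIterator      // triggers the actual Spark job, streams rows back
        iter.take(10).foreach(println)

        sc.stop()
      }
    }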
Besides this part, there is also some schema and data type conversion logic: on the Catalyst side there is its own row representation and its own DataType, and the schema is converted once more when the SchemaRDD is produced. So when execution results are returned, there has to be logic for converting back to Hive's TableSchema and FieldSchema.
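A hedged sketch of what that conversion back to Hive can look like (Spark 1.x-era Catalyst types; the mapping is deliberately incomplete and the helper names are mine, not the actual Spark code):

    import org.apache.hadoop.hive.metastore.api.FieldSchema
    import org.apache.hive.service.cli.TableSchema
    import org.apache.spark.sql.catalyst.types._
    import scala.collection.JavaConversions._

    object SchemaConversionSketch {
      // map a few Catalyst DataTypes onto Hive type names
      def toHiveTypeName(dt: DataType): String = dt match {
        case IntegerType => "int"
        case LongType    => "bigint"
        case DoubleType  => "double"
        case StringType  => "string"
        case other       => sys.error(s"type $other not covered in this sketch")
      }

      // build the Hive-side TableSchema that the thrift layer hands back to clients
      def toTableSchema(fields: Seq[(String, DataType)]): TableSchema = {
        val hiveFields = fields.map { case (name, dt) => new FieldSchema(name, toHiveTypeName(dt), "") }
        new TableSchema(hiveFields)
      }
    }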
The above is how Spark SQL takes over the execution of a query.
Summary
Basically, the Spark SQL CLI implementation is very close to the CLI module in the Hive Service project; the main class inheritance hierarchy and execution logic are almost the same. The key modification in Spark SQL sits in the OperationManager inside CLIService's SessionManager: non-metadata queries are handed off to HiveContext.sql() in the Spark SQL Hive project, and the returned SchemaRDD is used to further obtain the result data and to get the schema information from the intermediate execution plan.
End of article :)