Spark SQL parsing (source reading 10)


How can we use and monitor Spark SQL more effectively? A deeper understanding of what it actually does helps with both. The differences between SQL parsing in a traditional database and in Spark SQL were covered in the previous summary, so let's get straight to the main topic.

Today's Spark supports querying and loading a wide variety of data sources, is Hive-compatible, and can be connected to over JDBC or ODBC. Judging from the architecture diagram on the official site, Spark SQL can reuse the metadata warehouse (Metastore), HiveQL, user-defined functions (UDFs), and the serialization/deserialization tools (SerDes) that Hive itself provides.

Next, let's drill into the SQLContext. The broad flow is this:

1. The SQL statement is parsed by SqlParser into an unresolved LogicalPlan;

2. The Analyzer, together with the data dictionary (Catalog), binds the plan to generate a resolved LogicalPlan;

3. The Optimizer optimizes the resolved LogicalPlan to generate an optimized LogicalPlan;

4. SparkPlanner converts the optimized LogicalPlan into a physical plan (SparkPlan);

5. prepareForExecution converts the physical plan into an executable physical plan;

6. execute() runs the executable physical plan, producing the results behind the DataFrame.

All of these stages can be observed through the monitoring page.
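As a rough, self-contained sketch of how those six steps chain together (the names here are illustrative stand-ins, not the actual classes; the real pipeline lives in SQLContext's QueryExecution):

```scala
// Simplified stand-ins for the real Catalyst/Spark SQL classes.
trait LogicalPlan
trait SparkPlan { def execute(): Seq[Any] } // the real execute() returns RDD[Row]

class QueryExecutionSketch(
    parse: String => LogicalPlan,         // step 1: SqlParser
    analyze: LogicalPlan => LogicalPlan,  // step 2: Analyzer + Catalog binding
    optimize: LogicalPlan => LogicalPlan, // step 3: Optimizer
    plan: LogicalPlan => SparkPlan,       // step 4: SparkPlanner
    prepare: SparkPlan => SparkPlan) {    // step 5: prepareForExecution

  def run(sql: String): Seq[Any] = {
    val unresolved = parse(sql)
    val resolved   = analyze(unresolved)
    val optimized  = optimize(resolved)
    val executable = prepare(plan(optimized))
    executable.execute()                  // step 6: run the physical plan
  }
}
```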

Let's start with the Catalog. What is the Catalog? It is the dictionary in which tables are registered; once a table is registered and cached, it is easy to look up.

This class is a trait that defines methods such as tableExists (does a table exist?), registerTable (register a table), and unregisterAllTables (clear all registered tables). When the SQLContext is created, the instance that gets newed up is the SimpleCatalog implementation class, which implements all of the interfaces in Catalog by putting table names and their LogicalPlans into the table cache. Earlier versions declared that cache as a mutable.HashMap[String, LogicalPlan]; it is now a ConcurrentHashMap[String, LogicalPlan].
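A minimal sketch of that shape (method names follow the description above; LogicalPlan here is a stand-in, not the real Catalyst class):

```scala
import java.util.concurrent.ConcurrentHashMap

trait LogicalPlan // stand-in for Catalyst's LogicalPlan

trait Catalog {
  def tableExists(tableName: String): Boolean
  def registerTable(tableName: String, plan: LogicalPlan): Unit
  def unregisterTable(tableName: String): Unit
  def unregisterAllTables(): Unit
}

class SimpleCatalog extends Catalog {
  // Older versions used mutable.HashMap; ConcurrentHashMap adds thread safety.
  private val tables = new ConcurrentHashMap[String, LogicalPlan]()

  override def tableExists(tableName: String): Boolean = tables.containsKey(tableName)
  override def registerTable(tableName: String, plan: LogicalPlan): Unit = {
    tables.put(tableName, plan)
  }
  override def unregisterTable(tableName: String): Unit = {
    tables.remove(tableName)
  }
  override def unregisterAllTables(): Unit = tables.clear()
}
```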

Next, let's look at the implementation of the lexical parser, SqlParser. In earlier versions, calling the sql method returned a SchemaRDD; the return type is now DataFrame.

You will find that parseSql is called, and that what comes back when parsing completes is a logical plan (still unresolved at this point), which the DataFrame wraps.
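A self-contained sketch of that entry point (class names are illustrative; the real sql() lives on SQLContext and also consults the configured SQL dialect):

```scala
trait LogicalPlan
case class Unresolved(sqlText: String) extends LogicalPlan

// The real DataFrame likewise just wraps a LogicalPlan until an action runs it.
class DataFrameSketch(val logicalPlan: LogicalPlan)

class SQLContextSketch {
  private def parseSql(sqlText: String): LogicalPlan = Unresolved(sqlText)

  def sql(sqlText: String): DataFrameSketch =
    new DataFrameSketch(parseSql(sqlText)) // parse eagerly, execute lazily
}
```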

Going deeper into the parse method, we find that the apply method is what is being invoked implicitly: writing parser(sql) is Scala sugar for parser.apply(sql).
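A tiny runnable illustration of that sugar (names are illustrative):

```scala
// obj(args) desugars to obj.apply(args), so a parser can be "called" directly.
class SqlParserSketch {
  def parse(input: String): String = s"LogicalPlan for: $input"
  def apply(input: String): String = parse(input)
}

object ApplyDemo extends App {
  val parser = new SqlParserSketch
  println(parser("SELECT 1")) // implicitly calls parser.apply("SELECT 1")
}
```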

Next, let's take a look at statement parsing. You'll find that it actually parses the statement into a plan and then pattern-matches on the result, for example to create a table.
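The shape of that dispatch, as a self-contained sketch (the case-class names here are illustrative, not the actual Catalyst/DDL classes):

```scala
sealed trait Plan
case class CreateTableUsingSketch(table: String, provider: String) extends Plan
case class QuerySketch(sql: String) extends Plan

// After parsing, the resulting node is pattern-matched; DDL nodes such as
// CREATE TABLE become commands, everything else stays a query plan.
def dispatch(plan: Plan): String = plan match {
  case CreateTableUsingSketch(table, provider) =>
    s"register table '$table' backed by data source '$provider'"
  case QuerySketch(sql) =>
    s"hand '$sql' on to the analyzer/optimizer pipeline"
}
```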

Finally, the run method in RefreshTable is invoked.
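RefreshTable follows the RunnableCommand pattern; here is a simplified, self-contained sketch of that pattern (the real run() invalidates and refreshes cached table metadata through the Catalog):

```scala
trait Row

class SQLContextSketch {
  def refreshTable(table: String): Unit =
    println(s"invalidating cached metadata for $table") // stand-in side effect
}

// Commands execute eagerly via run(), instead of going through the planner.
trait RunnableCommand {
  def run(sqlContext: SQLContextSketch): Seq[Row]
}

case class RefreshTableSketch(tableName: String) extends RunnableCommand {
  override def run(sqlContext: SQLContextSketch): Seq[Row] = {
    sqlContext.refreshTable(tableName)
    Seq.empty[Row] // a refresh produces no rows
  }
}
```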

So, with table creation out of the way, the painful SQL parsing begins... This is the legendary place where all the SQL operators and functions get parsed!

Look toward the end and you'll see the keywords; it is these Keyword definitions that actually drive the parsing of the SQL statement.
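In the 1.x sources the keywords are declared one per SQL word; a minimal sketch of the pattern (the real list is far longer, and is gathered up to seed the lexer's reserved-word set):

```scala
case class Keyword(str: String)

// One Keyword per SQL word; SqlParser's lexer treats these as reserved words.
val SELECT   = Keyword("SELECT")
val FROM     = Keyword("FROM")
val WHERE    = Keyword("WHERE")
val GROUP    = Keyword("GROUP")
val ORDER    = Keyword("ORDER")
val LIMIT    = Keyword("LIMIT")
val DISTINCT = Keyword("DISTINCT")
```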

Now take the parsing of a SELECT statement as an example. In essence, the parser matches the clauses of the SQL statement one by one and filters on them.

The steps for a SELECT include: reading the DISTINCT clause, the projected fields (projection), the table relations, the WHERE expression, the expressions after GROUP BY, the expression after HAVING, the sort fields (ordering), and the expression after LIMIT. The matched pieces are then wrapped into Filter, Aggregate, Project, Distinct, Sort, and Limit operations over the relation, eventually forming a LogicalPlan tree.
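Here is a runnable toy version of that clause-by-clause matching, built on the same Scala parser-combinator library that SqlParser uses (the grammar below is a tiny illustrative subset, not Spark's actual grammar):

```scala
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

object MiniSqlParser extends StandardTokenParsers {
  lexical.reserved   ++= Set("SELECT", "FROM", "WHERE")
  lexical.delimiters ++= Set("*", ",", "=")

  // SELECT <cols> FROM <table> [WHERE <col> = <number>]
  def select: Parser[String] =
    ("SELECT" ~> rep1sep(ident | "*", ",")) ~
    ("FROM" ~> ident) ~
    opt("WHERE" ~> ident ~ ("=" ~> numericLit)) ^^ {
      case cols ~ table ~ cond =>
        val relation = s"UnresolvedRelation($table)"
        val filtered = cond.fold(relation) {
          case col ~ value => s"Filter($col = $value, $relation)"
        }
        s"Project(${cols.mkString(", ")}, $filtered)"
    }

  def parse(input: String): String =
    phrase(select)(new lexical.Scanner(input)) match {
      case Success(plan, _) => plan
      case failure          => sys.error(failure.toString)
    }
}

// MiniSqlParser.parse("SELECT name, age FROM people WHERE age = 30")
// => Project(name, age, Filter(age = 30, UnresolvedRelation(people)))
```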

The JOIN handling likewise covers left outer joins, full outer joins, Cartesian products, and so on.
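A small sketch of how the parsed join keywords map onto join types (names are illustrative; Catalyst has its own JoinType hierarchy):

```scala
sealed trait JoinType
case object Inner     extends JoinType
case object LeftOuter extends JoinType
case object FullOuter extends JoinType
case object Cartesian extends JoinType

// Map the keywords matched during parsing onto a join type.
def joinTypeOf(keywords: List[String]): JoinType = keywords match {
  case List("LEFT", "OUTER", "JOIN") => LeftOuter
  case List("FULL", "OUTER", "JOIN") => FullOuter
  case List("CROSS", "JOIN")         => Cartesian
  case _                             => Inner
}
```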

Well, now that the SQL has been parsed into an execution plan, it's time to optimize it. Parsing turns the SQL into an unresolved LogicalPlan tree; next, the Analyzer and Optimizer apply a variety of analysis and optimization operations to that tree, such as column pruning and predicate pushdown.

The Analyzer binds the unresolved LogicalPlan against the data dictionary (Catalog) to generate a resolved LogicalPlan. And then? The Optimizer optimizes the resolved LogicalPlan to generate an optimized LogicalPlan.
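Both the Analyzer and the Optimizer are structured as batches of rewrite rules applied repeatedly to the tree. A self-contained sketch of that rule-batch idea (modeled loosely on Catalyst's RuleExecutor; all names here are illustrative):

```scala
trait LogicalPlan
trait Rule { def apply(plan: LogicalPlan): LogicalPlan }
case class Batch(name: String, maxIterations: Int, rules: Rule*)

// Apply each batch's rules over and over until the plan stops changing
// (a fixed point) or the iteration cap is reached.
def execute(plan: LogicalPlan, batches: Seq[Batch]): LogicalPlan =
  batches.foldLeft(plan) { (start, batch) =>
    var current: LogicalPlan  = start
    var previous: LogicalPlan = null
    var i = 0
    while (current != previous && i < batch.maxIterations) {
      previous = current
      current  = batch.rules.foldLeft(current)((p, rule) => rule(p))
      i += 1
    }
    current
  }
```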

The useCachedData method here replaces fragments of the LogicalPlan tree with their cached equivalents. I couldn't fully follow the specific filtering optimizations, so I'll let those go for now; on a first pass through the source, focus on getting all the way through.

Next, after that whole series of parsing, analysis, and optimization operations, the generated logical execution plan still cannot be submitted as an ordinary job. So, in order to treat it like any other job, the logical execution plan has to be converted into a physical execution plan.

For example, notice that the number of shuffle partitions from the configuration is passed in here.
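That value is the spark.sql.shuffle.partitions setting (default 200). Assuming an existing Spark 1.x SQLContext instance named sqlContext, it can be changed at runtime:

```scala
// Raise shuffle parallelism before running shuffle-heavy SQL (default is 200).
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
```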

What really does the heavy lifting here is BasicOperators. It handles the most commonly used SQL keywords, and each processing branch calls the planLater method, which applies the SparkPlanner to the child nodes of the LogicalPlan. This sets up a recursive process: SparkPlanner ultimately walks the entire LogicalPlan tree, completing the conversion and yielding the final physical plan to execute.
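A self-contained sketch of that strategy-plus-planLater recursion (illustrative names, not the real SparkStrategies code):

```scala
sealed trait LogicalPlan
case class FilterL(condition: String, child: LogicalPlan) extends LogicalPlan
case class ScanL(table: String) extends LogicalPlan

sealed trait SparkPlan
case class FilterP(condition: String, child: SparkPlan) extends SparkPlan
case class ScanP(table: String) extends SparkPlan

object PlannerSketch {
  // Each branch converts one logical node and calls planLater on its child,
  // so the whole tree is planned top-down, one node at a time.
  def plan(logical: LogicalPlan): SparkPlan = logical match {
    case FilterL(cond, child) => FilterP(cond, planLater(child)) // recurse
    case ScanL(table)         => ScanP(table)                    // leaf
  }

  // In Spark this hands the child back to SparkPlanner; here it just recurses.
  private def planLater(child: LogicalPlan): SparkPlan = plan(child)
}

// PlannerSketch.plan(FilterL("age = 30", ScanL("people")))
// => FilterP(age = 30, ScanP(people))
```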


Reference: "In-Depth Understanding of Spark: Core Ideas and Source Code Analysis"
