Spark SQL External DataSource (II): Source Code Analysis


Spark 1.2 was released just last week. With nothing else to do over the weekend, I decided to dig into this feature and analyze its source code along the way, to see how it is designed and implemented.

/** Spark SQL Source Code Analysis Series */

(PS: for how to use the external datasource API, see Spark SQL External DataSource (I): Demo, at http://blog.csdn.net/oopsoom/article/details/42061077)

I. The core of the sources package

Spark SQL provides the external DataSource API as of Spark 1.2. Based on these interfaces, developers can implement their own external data sources, such as Avro, CSV, JSON, Parquet, and so on.

In the org/apache/spark/sql/sources directory of the Spark SQL source code, we find the code related to external datasources.

A few of these deserve special attention:

1. DDLParser

A SqlParser dedicated to parsing external data source SQL. It parses statements of the form CREATE TEMPORARY TABLE xxx USING provider OPTIONS (key 'value', key 'value'), which create and load a table backed by an external data source.

protected lazy val createTable: Parser[LogicalPlan] =
  CREATE ~ TEMPORARY ~ TABLE ~> ident ~ (USING ~> className) ~ (OPTIONS ~> options) ^^ {
    case tableName ~ provider ~ opts =>
      CreateTableUsing(tableName, provider, opts)
  }
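To make the grammar concrete, here is a small sketch (the table name and path are placeholders I invented, not values from the original post) of a statement this rule accepts and the logical plan it is turned into; CreateTableUsing itself is shown in the next snippet:

// Placeholder DDL of the shape accepted by the createTable rule above.
val ddl =
  """CREATE TEMPORARY TABLE jsonTable
    |USING org.apache.spark.sql.json
    |OPTIONS (path '/tmp/people.json')""".stripMargin

// The grammar turns it into a logical plan equivalent to:
val plan = CreateTableUsing(
  "jsonTable",
  "org.apache.spark.sql.json",
  Map("path" -> "/tmp/people.json"))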

2. CreateTableUsing

A RunnableCommand that instantiates the relation from the external data source library by reflection and then registers it as a temporary table.

private[sql] case class CreateTableUsing(
    tableName: String,
    provider: String,  // e.g. org.apache.spark.sql.json
    options: Map[String, String]) extends RunnableCommand {

  def run(sqlContext: SQLContext) = {
    val loader = Utils.getContextOrSparkClassLoader
    val clazz: Class[_] = try loader.loadClass(provider) catch {  // do reflection
      case cnf: java.lang.ClassNotFoundException =>
        try loader.loadClass(provider + ".DefaultSource") catch {
          case cnf: java.lang.ClassNotFoundException =>
            sys.error(s"Failed to load class for data source: $provider")
        }
    }
    val dataSource = clazz.newInstance().asInstanceOf[org.apache.spark.sql.sources.RelationProvider]  // e.g. the json package's DefaultSource
    val relation = dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options))  // create the JsonRelation
    sqlContext.baseRelationToSchemaRDD(relation).registerTempTable(tableName)  // register as a temp table
    Seq.empty
  }
}
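Note the fallback in the reflection above: the provider named after USING is first loaded as a class, and only if that fails is the suffix .DefaultSource appended. This is what allows the DDL to say USING org.apache.spark.sql.json, a package name, and still end up at that package's DefaultSource (shown in section V below).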

3. DataSourceStrategy

In the article on Strategy, I already discussed the role of a Strategy: generating physical plans from a logical plan.

DataSourceStrategy is a strategy provided specifically for planning queries against external data sources.

Ultimately, different kinds of BaseRelation produce different PhysicalRDDs.

The scan strategies for the different kinds of BaseRelation are described later in this post.

private[sql] object DataSourceStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projectList, filters, l @ LogicalRelation(t: CatalystScan)) =>
      pruneFilterProjectRaw(
        l,
        projectList,
        filters,
        (a, f) => t.buildScan(a, f)) :: Nil
    ...
    case l @ LogicalRelation(t: TableScan) =>
      execution.PhysicalRDD(l.output, t.buildScan()) :: Nil

    case _ => Nil
  }
}

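In the snippet above, a relation that only implements TableScan is wrapped directly in a PhysicalRDD, with any projection and filtering applied afterwards by Spark itself, whereas a CatalystScan has the raw column attributes and filter expressions pushed into its buildScan through pruneFilterProjectRaw.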
4. interfaces.scala

This file defines a series of extensible interfaces for external data sources. To plug in a new external data source, we only need to implement these interfaces.

The most important of these, RelationProvider and BaseRelation, are described in detail below.

5. filters.scala

This file defines how filtering is done when loading data from an external data source. Note that the filtering happens while the external data is being loaded into the table, not afterwards through Spark's own Filter operator.

This is a bit like an HBase coprocessor: the filtering is done on the server side instead of on the client side.
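As a rough sketch of the idea (all class names, fields, and data below are invented for illustration; only the EqualTo and GreaterThan filter classes from org.apache.spark.sql.sources are handled, and PrunedFilteredScan is used as defined in Spark 1.2), a relation can evaluate the pushed-down filters while producing its rows:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._  // Row, SQLContext, StructType, StructField, StringType, IntegerType (Spark 1.2 layout; later versions moved the types to org.apache.spark.sql.types)
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan, PrunedFilteredScan}

// Hypothetical in-memory relation, only to show filtering at load time ("server side").
case class PeopleRelation(@transient sqlContext: SQLContext) extends PrunedFilteredScan {

  private val data = Seq(("alice", 29), ("bob", 17), ("carol", 41))

  override val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Evaluate the pushed-down filters while producing rows, instead of handing
    // everything back and letting Spark's own Filter operator do the work.
    def keep(name: String, age: Int): Boolean = filters.forall {
      case EqualTo("name", v)    => name == v
      case GreaterThan("age", v) => age > v.asInstanceOf[Int]
      case _                     => true  // filters we do not understand are left for Spark to re-check
    }
    val rows = data.collect {
      case (name, age) if keep(name, age) =>
        Row(requiredColumns.map {  // column pruning: emit only the requested columns
          case "name" => name
          case "age"  => age
        }: _*)
    }
    sqlContext.sparkContext.parallelize(rows)
  }
}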

6. LogicalRelation

Encapsulates a BaseRelation, extends Catalyst's LeafNode, and implements MultiInstanceRelation.

II. External DataSource registration process

Using Spark SQL's sql/json module as an example, I drew a flowchart of the registration process (see the original post for the diagram). The flow is as follows:


The flow of registering a table for an external data source:

1. Provide an external data source file, for example a JSON file.
2. Provide a library that implements the interfaces required by the external data source API, for example the sql/json package, which as of version 1.2 has been reimplemented on top of the external datasource API.
3. Bring in SQLContext and create the table with DDL, e.g. CREATE TEMPORARY TABLE xxx USING provider OPTIONS (key 'value', key 'value').
4. The external datasource DDLParser parses the SQL.
5. After parsing, the statement is encapsulated into a CreateTableUsing object. This class is a RunnableCommand whose run method directly executes the CREATE TABLE statement.
6. CreateTableUsing instantiates an org.apache.spark.sql.sources.RelationProvider by reflection. That trait defines createRelation; for JSON it creates a JsonRelation, for Avro it would create an AvroRelation, and so on.
7. With the external relation obtained, SQLContext's baseRelationToSchemaRDD is called to convert it into a SchemaRDD.
8. Finally, registerTempTable(tableName) registers it as a table that can be queried with SQL.
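A minimal end-to-end sketch of this flow from the Spark shell might look like the following (the file path, table name, and column name are placeholders; the json provider is the one analyzed in this post):

import org.apache.spark.sql.SQLContext

// Sketch only: sc is the SparkContext provided by the shell; the path is a placeholder.
val sqlContext = new SQLContext(sc)

// Steps 3-8: the DDL goes through DDLParser -> CreateTableUsing -> reflection ->
// JsonRelation -> baseRelationToSchemaRDD -> registerTempTable.
sqlContext.sql(
  """CREATE TEMPORARY TABLE jsonTable
    |USING org.apache.spark.sql.json
    |OPTIONS (path '/tmp/people.json')""".stripMargin)

// The registered table can now be queried with plain SQL.
sqlContext.sql("SELECT name FROM jsonTable").collect().foreach(println)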
III. External DataSource resolution process

The resolution process is shown in a diagram in the original post. In summary:

Spark SQL parses and resolves such a query as follows:

1. The Analyzer resolves by rule, resolving the UnresolvedRelation into a JsonRelation.
2. After the Analyzer and Optimizer have run, we end up with JsonRelation(file:///path/to/shengli.json, 1.0).
3. The LogicalPlan is mapped to the physical plan PhysicalRDD by the sources DataSourceStrategy.
4. PhysicalRDD contains the rules for querying the external data; calling its execute() method runs the Spark query.
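One way to watch this resolution happen (a sketch, assuming the jsonTable registered in the earlier example; queryExecution is the developer API object whose toString prints the parsed, analyzed, optimized, and physical plans):

// Sketch: inspect how the query against the external table is planned.
val q = sqlContext.sql("SELECT * FROM jsonTable")
// The analyzed/optimized plan should show the LogicalRelation wrapping JsonRelation,
// and the physical plan should show the PhysicalRDD produced by DataSourceStrategy.
println(q.queryExecution)
// Executing the plan (PhysicalRDD.execute() underneath) runs the actual Spark job.
q.collect().foreach(println)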
IV. External DataSource interfaces

As introduced in the first section, the key interfaces to look at are BaseRelation and RelationProvider.

Suppose we want to implement an external data source, for example an Avro data source that lets Spark SQL operate on Avro files.

Then we must define an AvroRelation that extends BaseRelation, and also implement a RelationProvider.
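A skeleton of that could look like the following (all class names here are hypothetical, not real spark-avro code; the actual Avro reading logic is elided, and only the simplest TableScan variant is sketched):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._  // Row, SQLContext, StructType (Spark 1.2 layout)
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Hypothetical skeleton of an Avro external data source.
case class AvroRelation(path: String)(@transient val sqlContext: SQLContext) extends TableScan {
  override val schema: StructType = ???     // derive a StructType from the Avro schema found at `path`
  override def buildScan(): RDD[Row] = ???  // read the Avro records and convert each one to a Row
}

class AvroDefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
    AvroRelation(path)(sqlContext)
  }
}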



BaseRelation: the abstraction of an external data source, carrying the schema of the data and the rules for how to scan it.
abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
}

abstract class PrunedFilteredScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

1. schema: if we define our own relation, we must override schema to describe the schema of the external data source.
2. buildScan: defines how to query the external data source. Four scan strategies are provided, corresponding to four kinds of BaseRelation.

Four kinds of BaseRelation are supported: TableScan, PrunedScan, PrunedFilteredScan, and CatalystScan.

1. TableScan: the default scan strategy.
2. PrunedScan: the required columns (requiredColumns) can be passed in, so that columns are pruned and unneeded columns are not loaded from the external data source.
3. PrunedFilteredScan: adds a filter mechanism on top of column pruning, so the data is filtered while it is being loaded rather than after it has been returned to the client.
4. CatalystScan: accepts Catalyst expressions for the scan; it supports both column pruning and filters.
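For reference, the buildScan signatures of the four variants look roughly as follows in Spark 1.2 (paraphrased from memory of interfaces.scala; imports, modifiers, and doc comments omitted, so treat interfaces.scala itself as authoritative):

// Paraphrased sketch of the four scan variants.
abstract class TableScan extends BaseRelation {
  def buildScan(): RDD[Row]
}

abstract class PrunedScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

abstract class PrunedFilteredScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

// CatalystScan is handed raw Catalyst attributes and expressions (the least stable API).
abstract class CatalystScan extends BaseRelation {
  def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
}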


RelationProvider: the interface we implement to take the parameters produced by parsing and generate the corresponding external relation. It is the interface through which an external data source relation is created by reflection.
trait RelationProvider {
  /**
   * Returns a new base relation with the given parameters.
   * Note: the parameters' keywords are case insensitive and this insensitivity is enforced
   * by the map that is passed to the function.
   */
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}

V. Example of defining an external data source

Since Spark 1.2, JSON and Parquet have also been reimplemented to use the external API for external data source queries. Below, the JSON external data source is used as an example to illustrate how one is implemented:

1. JsonRelation
Defines the handling of JSON files. Both the schema and the scan strategy are based on JsonRDD; read JsonRDD itself for the details.
private[sql] case class JsonRelation(fileName: String, samplingRatio: Double)(
    @transient val sqlContext: SQLContext)
  extends TableScan {

  private def baseRDD = sqlContext.sparkContext.textFile(fileName)  // read the JSON file

  override val schema =
    JsonRDD.inferSchema(  // JsonRDD's inferSchema method automatically infers the JSON schema and field types
      baseRDD,
      samplingRatio,
      sqlContext.columnNameOfCorruptRecord)

  override def buildScan() =
    JsonRDD.jsonStringToRow(baseRDD, schema, sqlContext.columnNameOfCorruptRecord)  // still JsonRDD: jsonStringToRow is called to produce the Rows for the query
}

2. DefaultSource

From parameters we can obtain the user-defined options, such as the path passed in the OPTIONS clause. Here the incoming parameters are accepted and used to construct a JsonRelation.

private[sql] class DefaultSource extends RelationProvider {
  /** Returns a new base relation with the given parameters. */
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val fileName = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
    val samplingRatio = parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)
    JsonRelation(fileName, samplingRatio)(sqlContext)
  }
}

VI. Summary

The External DataSource source code analysis can be summed up in three parts:

1. The registration process for an external data source.
2. The plan resolution process for queries against an external data source table.
3. How to define an external data source: override BaseRelation to define the external source's schema and scan rules, and define a RelationProvider to generate the external data source relation.

The External DataSource API may still change in future releases; for now it only covers queries, and other operations are not yet supported.

--EOF--

This is an original article. Please attribute when reprinting:

Reprinted from the blog of oopsoutofmemory (Shengli).

Article link: http://blog.csdn.net/oopsoom/article/details/42064075

Note: This article is licensed under the Attribution-NonCommercial-NoDerivatives 2.5 China (CC BY-NC-ND 2.5 CN) license. Reprinting, forwarding, and commenting are welcome, but please keep the author's attribution and the article link. Please contact me for commercial use or authorization questions.

