Part 11: Spark SQL Source Analysis: External DataSource (external data sources)


Spark 1.2 was released just last week. I had some spare time over the weekend, so I looked into this new feature and walked through its source code to see how it is designed and implemented.

/** Spark SQL Source Analysis series of articles */

(PS: for how to use External DataSource, see Spark SQL External DataSource (1): Examples, http://blog.csdn.net/oopsoom/article/details/42061077)

I. The core of the sources package

Spark 1.2 introduces the External DataSource API in Spark SQL, which lets developers implement their own external data sources, such as Avro, CSV, JSON or Parquet, against a small set of interfaces.

Under the org/apache/spark/sql/sources directory of the Spark SQL source tree you will find the External DataSource code. A few of the important pieces are introduced below:

1. DDLParser

The SqlParser dedicated to external data source DDL. It parses statements of the form CREATE TEMPORARY TABLE xxx USING className OPTIONS (key 'value', key 'value'), which create temporary tables backed by an external data source.

[Scala]
    protected lazy val createTable: Parser[LogicalPlan] =
      CREATE ~ TEMPORARY ~ TABLE ~> ident ~ (USING ~> className) ~ (OPTIONS ~> options) ^^ {
        case tableName ~ provider ~ opts =>
          CreateTableUsing(tableName, provider, opts)
      }
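
For orientation, this is the shape of statement the rule above matches. The table name and file path here are made up for illustration; the comments show which parser fragment binds to which piece:

[Scala]
    // CREATE TEMPORARY TABLE jsonTable                  -> ident:     "jsonTable"
    // USING org.apache.spark.sql.json                   -> className: "org.apache.spark.sql.json"
    // OPTIONS (path 'file:///path/to/shengli.json')     -> options:   Map("path" -> "file:///path/to/shengli.json")
    val createTableDDL =
      """CREATE TEMPORARY TABLE jsonTable
        |USING org.apache.spark.sql.json
        |OPTIONS (path 'file:///path/to/shengli.json')""".stripMargin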

2. CreateTableUsing

A RunnableCommand that instantiates a relation from the external data source library by reflection, and then registers it as a temporary table.

[Scala]
    private[sql] case class CreateTableUsing(
        tableName: String,
        provider: String,          // e.g. org.apache.spark.sql.json
        options: Map[String, String]) extends RunnableCommand {

      def run(sqlContext: SQLContext) = {
        val loader = Utils.getContextOrSparkClassLoader
        val clazz: Class[_] = try loader.loadClass(provider) catch {     // load the provider class by reflection
          case cnf: java.lang.ClassNotFoundException =>
            try loader.loadClass(provider + ".DefaultSource") catch {    // fall back to <provider>.DefaultSource
              case cnf: java.lang.ClassNotFoundException =>
                sys.error(s"Failed to load class for data source: $provider")
            }
        }
        val dataSource = clazz.newInstance().asInstanceOf[org.apache.spark.sql.sources.RelationProvider]  // e.g. the json package's DefaultSource
        val relation = dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options))             // create the JSONRelation
        sqlContext.baseRelationToSchemaRDD(relation).registerTempTable(tableName)                         // register as a temp table
        Seq.empty
      }
    }
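
To make the reflection fallback concrete, here is a small standalone sketch; it assumes the Spark SQL jar is on the classpath, and the variable names are mine rather than Spark's:

[Scala]
    // If the provider string itself is not a loadable class, CreateTableUsing retries with
    // ".DefaultSource" appended, which is how USING <package> resolves to <package>.DefaultSource.
    val provider = "org.apache.spark.sql.json"
    val loader = Thread.currentThread().getContextClassLoader
    val clazz: Class[_] =
      try loader.loadClass(provider)                      // fails: org.apache.spark.sql.json is a package
      catch {
        case _: ClassNotFoundException =>
          loader.loadClass(provider + ".DefaultSource")   // succeeds: the json package's DefaultSource
      }
    println(clazz.getName)                                // org.apache.spark.sql.json.DefaultSource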

3. DataSourceStrategy

In the Strategy article I already described the role of a Strategy: it turns a logical plan into physical plans. DataSourceStrategy is a strategy dedicated to planning scans over external data sources.

In the end, a different PhysicalRDD is produced for each kind of BaseRelation. The scan strategies for the different BaseRelation types are described below.

[Scala]
    private[sql] object DataSourceStrategy extends Strategy {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        case PhysicalOperation(projectList, filters, l @ LogicalRelation(t: CatalystScan)) =>
          pruneFilterProjectRaw(
            l,
            projectList,
            filters,
            (a, f) => t.buildScan(a, f)) :: Nil
        ......
        case l @ LogicalRelation(t: TableScan) =>
          execution.PhysicalRDD(l.output, t.buildScan()) :: Nil
        case _ => Nil
      }
    }

4. interfaces.scala

This file defines the set of extensible interfaces for external data sources; to plug in a new data source, these are all we need to implement. The most important ones are RelationProvider and BaseRelation, which are described in detail below.

5. filters.scala

This file defines how filtering is done when an external data source is loaded. Note that the filtering happens while the external data is being loaded into the table, not afterwards as a Filter operator inside Spark. It is a bit like an HBase coprocessor: the filtering is performed on the server side rather than on the client side.
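
filters.scala contains simple case classes, such as EqualTo and GreaterThan, that describe the predicates pushed down to the data source. As a rough, hypothetical sketch of how a PrunedFilteredScan implementation might honor them while loading data (the record representation and the satisfies helper are made up for illustration and are not part of Spark):

[Scala]
    import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}

    // Hypothetical helper: does one record (modelled as a column-name -> value map) pass a pushed-down filter?
    // Filters the source cannot evaluate are simply accepted; Spark re-applies the full predicate afterwards,
    // so skipping them only costs performance, never correctness.
    def satisfies(record: Map[String, Any], filter: Filter): Boolean = filter match {
      case EqualTo(attribute, value) =>
        record.get(attribute).exists(_ == value)
      case GreaterThan(attribute, value: Int) =>
        record.get(attribute).exists {
          case x: Int => x > value
          case _      => true
        }
      case _ => true
    }

    // A buildScan(requiredColumns, filters) implementation would then keep only the records
    // for which filters.forall(f => satisfies(record, f)) holds.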

6. LogicalRelation

LogicalRelation encapsulates a BaseRelation; it extends Catalyst's LeafNode and implements MultiInstanceRelation.
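
As a simplified sketch of what that looks like (not the verbatim source; in Spark the class lives in org.apache.spark.sql.sources, is private[sql], and has a few more members than shown here):

[Scala]
    import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
    import org.apache.spark.sql.catalyst.expressions.AttributeReference
    import org.apache.spark.sql.catalyst.plans.logical.LeafNode
    import org.apache.spark.sql.sources.BaseRelation

    // Sketch: LogicalRelation wraps a BaseRelation so that it can sit in a Catalyst logical plan.
    case class LogicalRelation(relation: BaseRelation)
      extends LeafNode
      with MultiInstanceRelation {

      // The plan's output attributes come directly from the relation's schema.
      override val output: Seq[AttributeReference] = relation.schema.toAttributes

      // Required by MultiInstanceRelation: return a copy with fresh expression ids.
      override def newInstance() = LogicalRelation(relation).asInstanceOf[this.type]
    }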

II. External DataSource registration flow

Using Spark SQL's sql/json package as an example, I drew a flowchart of the process. (Figure: flow of registering an external data source table.) The flow is as follows (a runnable end-to-end sketch is given at the end of Section V):

1. Provide an external data source file, for example a JSON file.
2. Provide a library that implements the interfaces an external data source requires, for example the json package under sql, which from version 1.2 on is implemented through the External DataSource API.
3. Bring in an SQLContext and create the table with DDL, e.g. CREATE TEMPORARY TABLE xxx USING className OPTIONS (key 'value', key 'value').
4. The DDLParser of the External DataSource module parses the SQL.
5. The statement is parsed into a CreateTableUsing object. This class is a RunnableCommand, and its run method directly executes the CREATE TABLE statement.
6. CreateTableUsing creates an org.apache.spark.sql.sources.RelationProvider by reflection. That trait defines createRelation: for JSON a JSONRelation is created; for Avro an AvroRelation would be created.
7. With the external relation in hand, SQLContext.baseRelationToSchemaRDD is called directly to convert it into a SchemaRDD.
8. Finally, registerTempTable(tableName) registers it as a table, which can then be queried with SQL.

III. External DataSource query parsing flow

(Figure: flow of parsing a query against an external data source.) Spark SQL parses such a query as follows:

1. The Analyzer resolves the UnresolvedRelation into a JSONRelation through its resolution rules.
2. Parsing, analysis and optimization ultimately yield JSONRelation(file:///path/to/shengli.json, 1.0).
3. The LogicalPlan is mapped to the physical plan PhysicalRDD by the sources DataSourceStrategy.
4. PhysicalRDD contains the rules for querying the external data; calling its execute() method runs the Spark query.

IV. External DataSource interfaces

As already mentioned in the first section, the main interfaces to look at are BaseRelation and RelationProvider.

If we want to implement an external data source, for example an Avro data source that lets Spark SQL operate on Avro files, we must define an AvroRelation that extends BaseRelation, and we must also implement a RelationProvider.

BaseRelation is the abstraction of an external data source; it carries the schema mapping and the rules for how to scan the data.

[Scala]
    abstract class BaseRelation {
      def sqlContext: SQLContext
      def schema: StructType
    }
[Scala]
    abstract class PrunedFilteredScan extends BaseRelation {
      def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
    }
1. schema: if we implement a custom relation, we must override schema, that is, we must describe the schema of the external data source.
2. buildScan: defines how the external data source is queried. Four scan strategies are provided, corresponding to four kinds of BaseRelation.

The four supported kinds of BaseRelation are TableScan, PrunedScan, PrunedFilteredScan and CatalystScan:

1. TableScan: the default scan strategy.
2. PrunedScan: the required columns (requiredColumns) are passed in, so columns are pruned and unnecessary columns are not loaded from the external data source.
3. PrunedFilteredScan: adds a filter mechanism on top of column pruning; the filters are applied while the data is being loaded, rather than after the results are returned to the client.
4. CatalystScan: accepts Catalyst expressions for the scan, and likewise supports column pruning and filtering.

RelationProvider: this is what we have to implement. It accepts the parameters produced after parsing and generates the corresponding external relation; it is the interface through which external data source relations are created by reflection. (A sketch that combines it with a custom BaseRelation follows the trait definition below.)

[Scala]
    trait RelationProvider {
      /**
       * Returns a new base relation with the given parameters.
       * Note: the parameters' keywords are case insensitive and this insensitivity is enforced
       * by the Map that is passed to the function.
       */
      def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
    }
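
Putting BaseRelation and RelationProvider together, here is a minimal, hypothetical external data source: it exposes a plain text file as a one-column table. The names TextLinesRelation, TextLinesSource and the value column are invented for this sketch; note that in Spark 1.2, Row, StructType, StructField and StringType are exported through the org.apache.spark.sql package object (from 1.3 they live in org.apache.spark.sql.types).

[Scala]
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._                       // SQLContext, Row, StructType, StructField, StringType (Spark 1.2)
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

    // Hypothetical relation: each line of a text file becomes a Row with a single "value" column.
    case class TextLinesRelation(path: String)(@transient val sqlContext: SQLContext)
      extends TableScan {

      // Describe the schema of the external data to Spark SQL.
      override val schema: StructType =
        StructType(StructField("value", StringType, nullable = false) :: Nil)

      // Describe how the data is actually read: one Row per line of the file.
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.textFile(path).map(Row(_))
    }

    // The provider named after USING (directly, or found as <package>.DefaultSource).
    class TextLinesSource extends RelationProvider {
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation = {
        val path = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
        TextLinesRelation(path)(sqlContext)
      }
    }

A table backed by this source could then be registered with CREATE TEMPORARY TABLE lines USING the provider's fully qualified class name, plus OPTIONS (path '...'), in the same way as the JSON example below.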

V. External DataSource definition example

Since Spark 1.2, JSON and Parquet have also been reimplemented on top of the External DataSource API for external data source queries. Below, the JSON external data source is used as an example of how to implement one.

1. JSONRelation

Defines the handling of JSON files; both the schema and the scan strategy are built on JsonRDD. For the details, read JsonRDD itself.

[Scala]
    private[sql] case class JSONRelation(fileName: String, samplingRatio: Double)(
        @transient val sqlContext: SQLContext)
      extends TableScan {

      private def baseRDD = sqlContext.sparkContext.textFile(fileName)  // read the JSON file

      override val schema =
        JsonRDD.inferSchema(          // JsonRDD's inferSchema automatically infers the JSON schema and field types
          baseRDD,
          samplingRatio,
          sqlContext.columnNameOfCorruptRecord)

      override def buildScan() =
        JsonRDD.jsonStringToRow(baseRDD, schema, sqlContext.columnNameOfCorruptRecord)  // still JsonRDD: jsonStringToRow turns the JSON strings into Rows
    }
2. DefaultSource

Custom parameters such as path are passed in through OPTIONS and can be read from the parameters map here. DefaultSource accepts the incoming parameters and uses them to construct a JSONRelation.

[Scala]
    private[sql] class DefaultSource extends RelationProvider {
      /** Returns a new base relation with the given parameters. */
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation = {
        val fileName = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
        val samplingRatio = parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)
        JSONRelation(fileName, samplingRatio)(sqlContext)
      }
    }
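
To tie the registration flow (Section II) and the query flow (Section III) together, here is a minimal end-to-end sketch using the JSON source. The application name, table name and file path are made up; the DDL issued through sqlContext.sql() is exactly the kind of statement DDLParser accepts.

[Scala]
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("external-datasource-demo").setMaster("local"))
    val sqlContext = new SQLContext(sc)

    // Registration flow: DDLParser -> CreateTableUsing -> reflection loads the json package's
    // DefaultSource -> createRelation builds a JSONRelation -> registered as a temp table.
    sqlContext.sql(
      """CREATE TEMPORARY TABLE jsonTable
        |USING org.apache.spark.sql.json
        |OPTIONS (path 'file:///path/to/shengli.json')""".stripMargin)

    // Query flow: the Analyzer resolves jsonTable to the JSONRelation, and DataSourceStrategy
    // plans a PhysicalRDD over relation.buildScan().
    sqlContext.sql("SELECT * FROM jsonTable").collect().foreach(println)
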
VI. Summary

Having walked through the External DataSource source code, the analysis can be summed up in three parts:

1. The registration flow for an external data source.
2. The query plan resolution flow for an external data source table.
3. How to implement a custom external data source: override BaseRelation to define the external data source's schema and scan rules, and define a RelationProvider to describe how the external relation is generated.

The External DataSource API may still change in subsequent releases; so far it only covers queries, and other operations have not yet been touched.

--EOF--

This is an original article; please credit the source when reposting:

Reposted from the blog of OopsOutOfMemory (Shengli).

Original article link: http://blog.csdn.net/oopsoom/article/details/42064075

Note: this article is published under the Attribution-NonCommercial-NoDerivatives 2.5 China (CC BY-NC-ND 2.5 CN) license. You are welcome to repost, share and comment on it, but please keep the author attribution and the link to the article. Please contact me for commercial use or anything beyond what the license permits.

