Spark 1.2 was released last week. With some free time over the weekend, I looked into this new feature and analyzed its source code to see how it is designed and implemented.
/** Spark SQL source code analysis series */
(PS: for a usage example of the external data source API, see Spark SQL External DataSource (1): Example, http://blog.csdn.net/oopsoom/article/details/42061077)
I. The core of the sources package
Spark SQL provides the external DataSource API in Spark 1.2, which lets developers implement their own external data sources, such as Avro, CSV, JSON, Parquet, etc., based on this interface.
In the org/apache/spark/sql/sources directory of the Spark SQL source code we can find the code related to external data sources. Several of the important pieces are introduced below:
1. DDLParser
A specialized SqlParser responsible for parsing external data source SQL. It parses statements of the form CREATE TEMPORARY TABLE xxx USING provider OPTIONS (key 'value', key 'value'), which create tables backed by external data sources.
protected lazy val createTable: Parser[LogicalPlan] =
  CREATE ~ TEMPORARY ~ TABLE ~> ident ~ (USING ~> className) ~ (OPTIONS ~> options) ^^ {
    case tableName ~ provider ~ opts =>
      CreateTableUsing(tableName, provider, opts)
  }
2. CreateTableUsing
A RunnableCommand that instantiates the relation from the external data source library by reflection, then registers it as a temporary table.
private[sql] case class CreateTableUsing(
    tableName: String,
    provider: String,          // e.g. org.apache.spark.sql.json
    options: Map[String, String]) extends RunnableCommand {

  def run(sqlContext: SQLContext) = {
    val loader = Utils.getContextOrSparkClassLoader
    val clazz: Class[_] = try loader.loadClass(provider) catch {  // load the provider class by reflection
      case cnf: java.lang.ClassNotFoundException =>
        try loader.loadClass(provider + ".DefaultSource") catch {
          case cnf: java.lang.ClassNotFoundException =>
            sys.error(s"Failed to load class for data source: $provider")
        }
    }
    val dataSource = clazz.newInstance().asInstanceOf[org.apache.spark.sql.sources.RelationProvider]  // e.g. the JSON package's DefaultSource
    val relation = dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options))              // create a JSONRelation
    sqlContext.baseRelationToSchemaRDD(relation).registerTempTable(tableName)                          // register the temp table
    Seq.empty
  }
}
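To see where this command fits end to end, here is a minimal usage sketch (the table name, file path and query are made up for illustration):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)   // sc is an existing SparkContext

// The DDL below is parsed by DDLParser into a CreateTableUsing command, whose run method
// loads org.apache.spark.sql.json.DefaultSource by reflection and registers the temp table.
sqlContext.sql(
  """CREATE TEMPORARY TABLE people
    |USING org.apache.spark.sql.json
    |OPTIONS (path '/path/to/people.json')""".stripMargin)

// the registered table can now be queried with ordinary SQL
sqlContext.sql("SELECT * FROM people").collect().foreach(println)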
3. DataSourceStrategy
In the article on Strategy I discussed the role of a Strategy: to plan the generation of physical plans. Here a dedicated strategy is provided for scanning external data sources.
In the end, a different PhysicalRDD is produced for each kind of BaseRelation; the scan strategies for the different kinds of BaseRelation are described later.
private[sql] object DataSourceStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projectList, filters, l @ LogicalRelation(t: CatalystScan)) =>
      pruneFilterProjectRaw(
        l,
        projectList,
        filters,
        (a, f) => t.buildScan(a, f)) :: Nil
    // ...... (other cases elided)
    case l @ LogicalRelation(t: TableScan) =>
      execution.PhysicalRDD(l.output, t.buildScan()) :: Nil

    case _ => Nil
  }
}
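The PhysicalRDD produced here is a very thin wrapper. Roughly (a simplified sketch of execution.PhysicalRDD as I read it in Spark 1.2, not the complete class):

private[sql] case class PhysicalRDD(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode {
  // execute() simply hands back the RDD[Row] that the relation's buildScan() produced,
  // so the external data source's rows flow directly into the rest of the Spark plan
  override def execute() = rdd
}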
4. interfaces.scala
This file defines a series of extensible interfaces for external data sources; to plug in an external data source we only need to implement them. The most important ones are the trait RelationProvider and BaseRelation, described in detail below.
5. filters.scala
This file defines the filters that can be applied while an external data source is being loaded. Note that the filtering happens when the external data is loaded into the table, not afterwards inside Spark. This is somewhat like an HBase coprocessor: the query filter is applied on the server side rather than on the client side after the results come back. The filter classes look roughly as sketched below.
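A sketch of the filters defined there (based on my reading of Spark 1.2's filters.scala; the exact set of case classes may differ in other versions):

abstract class Filter

// each filter names an attribute of the relation and the value(s) to compare against;
// the data source receives these and can push them down into its own scan
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter
case class GreaterThanOrEqual(attribute: String, value: Any) extends Filter
case class LessThan(attribute: String, value: Any) extends Filter
case class LessThanOrEqual(attribute: String, value: Any) extends Filter
case class In(attribute: String, values: Array[Any]) extends Filter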
6. LogicalRelation
Wraps a BaseRelation; it extends Catalyst's LeafNode and implements MultiInstanceRelation, so the relation can appear in a logical plan.
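A heavily simplified sketch of its shape (the real class also overrides things such as sameResult, statistics and newInstance):

private[sql] case class LogicalRelation(relation: BaseRelation)
  extends LeafNode
  with MultiInstanceRelation {

  // the logical plan's output attributes come straight from the BaseRelation's schema
  override val output: Seq[AttributeReference] = relation.schema.toAttributes
}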
II. External DataSource registration process

Using Spark SQL's sql/json package as the example, the registration process (summarized from the flowchart in the original post) is as follows:
The process for registering an external data source table:
1. Provide an external data source file, for example a JSON file.
2. Provide a library that implements the interfaces required by the external data source, for example the json package under sql, which was reimplemented on top of the external data source API after version 1.2.
3. Through SQLContext, create the table with DDL, for example: CREATE TEMPORARY TABLE xxx USING provider OPTIONS (key 'value', key 'value').
4. The external data source's DDLParser parses the SQL.
5. The statement is parsed into a CreateTableUsing class. The class is a RunnableCommand whose run method directly executes the CREATE TABLE statement.
6. That class creates an org.apache.spark.sql.sources.RelationProvider by reflection. The trait defines createRelation; for JSON a JSONRelation is created, and for Avro an AvroRelation would be created.
7. With the external relation obtained, SQLContext's baseRelationToSchemaRDD converts it to a SchemaRDD.
8. Finally, registerTempTable(tableName) registers it as a table that can be queried with SQL.

III. External DataSource query parsing process

First look at the flow (the diagram from the original post is omitted here). Spark SQL parses a query against such a table as follows:
1. The Analyzer resolves the UnresolvedRelation to a JSONRelation through its rules.
2. Parse, Analyzer and Optimizer finally yield JSONRelation(file:///path/to/shengli.json, 1.0).
3. The LogicalPlan is mapped to the physical plan PhysicalRDD by DataSourceStrategy in sources.
4. PhysicalRDD contains the rules for querying the external data; calling its execute() method runs the Spark query.

IV. External DataSource interfaces

As introduced in the first section, the main interfaces to look at are BaseRelation and RelationProvider. If we want to implement an external data source, for example an Avro data source that lets Spark SQL operate on Avro files, we must define an AvroRelation that inherits from BaseRelation, and we must also implement a RelationProvider (a hypothetical skeleton is sketched after the interface descriptions below).
BaseRelation: the abstraction of an external data source. It carries the schema mapping and the rules for how to scan the data.
abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
}
abstract class PrunedFilteredScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
1. schema: if we define a custom relation, we must override schema, i.e. we must describe the schema of the external data source.
2. buildScan: defines how to query the external data source. Four scan strategies are provided, corresponding to four kinds of BaseRelation: TableScan, PrunedScan, PrunedFilteredScan and CatalystScan (their buildScan signatures are sketched after this list).
   1. TableScan: the default scan strategy.
   2. PrunedScan: the required columns (requiredColumns) can be passed in, so columns that are not needed are not loaded from the external data source (column pruning).
   3. PrunedFilteredScan: adds a filter mechanism on top of column pruning; the filters are applied while the data is loaded rather than after the results are returned to the client.
   4. CatalystScan: supports passing Catalyst expressions into the scan; it supports both column pruning and filters.
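For reference, the buildScan signatures of the four scan interfaces look roughly like this (a sketch based on Spark 1.2's interfaces.scala):

abstract class TableScan extends BaseRelation {
  def buildScan(): RDD[Row]                                    // scan the whole table
}

abstract class PrunedScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String]): RDD[Row]      // column pruning
}

abstract class PrunedFilteredScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]   // pruning plus pushed-down filters
}

abstract class CatalystScan extends BaseRelation {
  def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]  // raw Catalyst expressions
}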
RelationProvider: we must implement this as well. It accepts the parameters passed in after parsing and produces the corresponding external relation; it is the interface through which the external data source relation is produced by reflection.
trait RelationProvider {
  /**
   * Returns a new base relation with the given parameters.
   * Note: the parameters' keywords are case insensitive and this insensitivity is enforced
   * by the Map that is passed to the function.
   */
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}
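To make the Avro scenario from section IV concrete, here is a hypothetical skeleton (AvroRelation, the package name and the Avro reading logic are all made up for illustration; only the shape of the two pieces matters):

package com.example.avro   // made-up package; CREATE TEMPORARY TABLE ... USING com.example.avro would find the DefaultSource below

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StructType}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// a hypothetical relation: describes the Avro file's schema and how to scan it
case class AvroRelation(path: String)(@transient val sqlContext: SQLContext)
  extends TableScan {

  override def schema: StructType = ???       // map the Avro schema to a Catalyst StructType
  override def buildScan(): RDD[Row] = ???    // read the Avro file and produce Rows
}

// the RelationProvider that CreateTableUsing instantiates by reflection
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
    AvroRelation(path)(sqlContext)
  }
}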
V. Defining an External DataSource: the JSON example

Since Spark 1.2, JSON and Parquet queries have also been reimplemented on top of the external data source API. The JSON external data source is used below as an example of how such a source is implemented:
1. JSONRelation
Defines the handling of JSON files. Both the schema and the scan strategy are based on JsonRDD; read JsonRDD itself for the details.
private[sql] case class JSONRelation(fileName: String, samplingRatio: Double)(
    @transient val sqlContext: SQLContext)
  extends TableScan {

  private def baseRDD = sqlContext.sparkContext.textFile(fileName)  // read the JSON file

  override val schema =
    JsonRDD.inferSchema(          // JsonRDD's inferSchema automatically recognizes the JSON schema and the field types
      baseRDD,
      samplingRatio,
      sqlContext.columnNameOfCorruptRecord)

  override def buildScan() =
    JsonRDD.jsonStringToRow(baseRDD, schema, sqlContext.columnNameOfCorruptRecord)  // again delegates to JsonRDD: jsonStringToRow returns the Rows for the query
}
2. DefaultSource
Custom parameters such as path that are passed in OPTIONS can be obtained from parameters.
Here the incoming parameters are accepted and used to construct the JSONRelation.
private[sql] class DefaultSource extends RelationProvider {
  /** Returns a new base relation with the given parameters. */
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val fileName = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
    val samplingRatio = parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)
    JSONRelation(fileName, samplingRatio)(sqlContext)
  }
}
VI. Summary

Having gone through the External DataSource source code, it can be summed up in three parts:
1. The registration process for an external data source.
2. The query plan resolution process for an external data source table.
3. How to define a custom external data source: override BaseRelation to define the external data source's schema and scan rules, and define a RelationProvider that produces the external data source relation.

The External DataSource API may still change in subsequent releases; so far it only covers queries, and other operations are not yet addressed.

--EOF--
Original article; please credit the source when reproducing:
Reprinted from: Shengli's blog, oopsoutofmemory
This article link address: http://blog.csdn.net/oopsoom/article/details/42064075
Note: This article is licensed under the Attribution-NonCommercial-NoDerivatives 2.5 China (CC BY-NC-ND 2.5 CN) license. You are welcome to repost, forward and comment on it, but please keep the author's attribution and the link to the article. Please contact me for commercial use or anything not covered by the license.
Reposted from: http://blog.csdn.net/oopsoom/article/details/42064075
Part 11 of the Spark SQL source code analysis series: External DataSource.