Spark 1.2 was released last week. I had some free time at home over the weekend, so I dug into this feature and analyzed its source code to see how it is designed and implemented.
/** Spark SQL Source Code Analysis series */
(PS: for a usage walkthrough of the external data source API, see Spark SQL External DataSource (I) Example: http://blog.csdn.net/oopsoom/article/details/42061077)
I. The core of the sources package
Spark 1.2 introduces the external DataSource API in Spark SQL, which lets developers implement their own external data sources, such as Avro, CSV, JSON, Parquet, and so on, against this interface.
In the org/apache/spark/sql/sources directory of the Spark SQL source code we can find the code related to external data sources. A few of the important pieces are introduced below:
1. DDLParser
A SqlParser dedicated to parsing external data source SQL. It parses statements of the form CREATE TEMPORARY TABLE xxx USING className OPTIONS (key 'value', key 'value'), which create tables backed by an external data source.
protected lazy val createTable: Parser[LogicalPlan] =
  CREATE ~ TEMPORARY ~ TABLE ~> ident ~ (USING ~> className) ~ (OPTIONS ~> options) ^^ {
    case tableName ~ provider ~ opts =>
      CreateTableUsing(tableName, provider, opts)
  }
2. CreateTableUsing
A RunnableCommand that instantiates the relation from the external data source library via reflection and then registers it as a temporary table.
private[sql] case class CreateTableUsing(
    tableName: String,
    provider: String,  // e.g. org.apache.spark.sql.json
    options: Map[String, String]) extends RunnableCommand {

  def run(sqlContext: SQLContext) = {
    val loader = Utils.getContextOrSparkClassLoader
    val clazz: Class[_] = try loader.loadClass(provider) catch {  // load the provider class by reflection
      case cnf: java.lang.ClassNotFoundException =>
        try loader.loadClass(provider + ".DefaultSource") catch {
          case cnf: java.lang.ClassNotFoundException =>
            sys.error(s"Failed to load class for data source: $provider")
        }
    }
    val dataSource = clazz.newInstance().asInstanceOf[org.apache.spark.sql.sources.RelationProvider]  // e.g. the json package's DefaultSource
    val relation = dataSource.createRelation(sqlContext, new CaseInsensitiveMap(options))  // create the JsonRelation
    sqlContext.baseRelationToSchemaRDD(relation).registerTempTable(tableName)  // register the temp table
    Seq.empty
  }
}
3. DataSourceStrategy
In the Strategy article I described the role of a Strategy: turning a logical plan into a physical plan. This file provides a strategy dedicated to planning queries over external data sources.
In the end, a different PhysicalRDD is produced for each kind of BaseRelation. The scan strategies of the different BaseRelations are described later.
private[sql] object DataSourceStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projectList, filters, l @ LogicalRelation(t: CatalystScan)) =>
      pruneFilterProjectRaw(
        l,
        projectList,
        filters,
        (a, f) => t.buildScan(a, f)) :: Nil
    ...
    case l @ LogicalRelation(t: TableScan) =>
      execution.PhysicalRDD(l.output, t.buildScan()) :: Nil

    case _ => Nil
  }
}
4. interfaces.scala
This file defines the set of extensible external data source interfaces; to plug in an external data source we only need to implement them. The most important ones are the traits RelationProvider and BaseRelation, described in detail below.
5. filters.scala
This file defines how filtering is done when the external data source is loaded. Note that the filters are applied while loading the external data source into the table, not afterwards inside Spark. This is a bit like HBase coprocessors: the query filter is evaluated on the server side rather than on the client. A sketch of these filter classes follows.
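As an illustration, the pushed-down filters are simple, serializable predicates over a named attribute. The case classes below are a sketch from memory of what filters.scala defines in Spark 1.2, not copied from the source; check the file for the authoritative list.

// Sketch of the Spark 1.2 filter predicates handed to a PrunedFilteredScan (reproduced from memory).
abstract class Filter

case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter
case class GreaterThanOrEqual(attribute: String, value: Any) extends Filter
case class LessThan(attribute: String, value: Any) extends Filter
case class LessThanOrEqual(attribute: String, value: Any) extends Filter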
6. LogicalRelation
Wraps a BaseRelation, extends Catalyst's LeafNode, and implements MultiInstanceRelation.
II. External DataSource registration process
Using Spark SQL's sql/json as an example, I drew a flowchart of the process, as follows:
The process of registering an external data source table (a minimal usage sketch follows the list):
1. Provide an external data source file, for example a JSON file.
2. Provide a library that implements the interfaces required by the external data source API, for example the json package under sql, which was reimplemented on top of the external data source API after version 1.2.
3. Through SQLContext, use DDL to create the table, e.g. CREATE TEMPORARY TABLE xxx USING className OPTIONS (key 'value', key 'value').
4. The external data source DDLParser parses the SQL into a CreateTableUsing object.
5. That class is a RunnableCommand, and its run method directly executes the CREATE TABLE statement.
6. The command creates an org.apache.spark.sql.sources.RelationProvider via reflection; this trait defines createRelation, which creates a JsonRelation for JSON, or would create an AvroRelation for Avro.
7. With the external relation in hand, SQLContext's baseRelationToSchemaRDD is called to convert it to a SchemaRDD.
8. Finally, registerTempTable(tableName) registers it as a table, which can then be queried with SQL.
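Driving the whole registration flow from user code looks roughly like this. A minimal sketch assuming Spark 1.2; the file path /path/to/people.json and the table name people are made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ExternalDataSourceExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("external-ds").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Parsed by DDLParser into CreateTableUsing, which loads org.apache.spark.sql.json's
    // DefaultSource by reflection and registers the resulting JsonRelation as a temp table.
    sqlContext.sql(
      """CREATE TEMPORARY TABLE people
        |USING org.apache.spark.sql.json
        |OPTIONS (path '/path/to/people.json', samplingRatio '1.0')""".stripMargin)

    sqlContext.sql("SELECT * FROM people").collect().foreach(println)
  }
}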
III. External DataSource parsing process
First, look at the flowchart:
Spark SQL parses such a query as follows (a small sketch for inspecting the plans follows the list):
1. The Analyzer resolves the UnresolvedRelation to a JsonRelation through its resolution rules.
2. After parsing, analysis, and optimization we end up with JsonRelation(file:///path/to/shengli.json, 1.0).
3. The LogicalPlan is mapped to the physical plan PhysicalRDD by the sources DataSourceStrategy.
4. The PhysicalRDD contains the rules for querying the external data source; calling its execute() method runs the Spark query.
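To see this concretely, you can inspect the plans of a query against the registered table. A small sketch that reuses the hypothetical people table from the example above; the exact plan output depends on the Spark version.

val schemaRDD = sqlContext.sql("SELECT name FROM people WHERE age > 20")
// The analyzed logical plan contains the LogicalRelation wrapping the JsonRelation.
println(schemaRDD.queryExecution.analyzed)
// The executed physical plan contains the PhysicalRDD produced by DataSourceStrategy.
println(schemaRDD.queryExecution.executedPlan)
// collect() actually runs the Spark job over the external data.
schemaRDD.collect().foreach(println)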
IV. External DataSource interfaces
As introduced in the first section, the main interfaces to look at are BaseRelation and RelationProvider. If we want to implement an external data source, for example an Avro data source so that Spark SQL can operate on Avro files, we must define an AvroRelation that extends BaseRelation, and we must also implement a RelationProvider.
BaseRelation: the abstraction of an external data source. It carries the schema mapping and the rules for how to scan the data.
abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
}
abstract class PrunedFilteredScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
1. schema: if we implement a custom relation, we must override schema, i.e. describe the schema of the external data source.
2. buildScan: defines how the external data source is queried. Four scan strategies are provided, corresponding to four kinds of BaseRelation.
The four supported kinds of BaseRelation are TableScan, PrunedScan, PrunedFilteredScan, and CatalystScan (a sketch of their signatures follows this list):
1. TableScan: the default scan strategy.
2. PrunedScan: the required columns (requiredColumns) are passed in for column pruning, so unneeded columns are not loaded from the external data source.
3. PrunedFilteredScan: adds a filter mechanism on top of column pruning; filters are applied while the data is loaded, instead of filtering after the results are returned to the client.
4. CatalystScan: accepts Catalyst expressions for the scan; supports both column pruning and filters.
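For reference, alongside the PrunedFilteredScan shown above, the other scan interfaces look roughly as follows in Spark 1.2. These signatures are reproduced from memory; see interfaces.scala for the authoritative definitions.

abstract class TableScan extends BaseRelation {
  def buildScan(): RDD[Row]
}

abstract class PrunedScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

abstract class CatalystScan extends BaseRelation {
  def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]
}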
RelationProvider: we also have to implement this. It receives the parameters parsed from the DDL and produces the corresponding external relation; it is the interface through which the external data source relation is created by reflection.
trait RelationProvider {
  /**
   * Returns a new base relation with the given parameters.
   * Note: the parameters' keywords are case insensitive and this insensitivity is enforced
   * by the Map that is passed to the function.
   */
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}
V. External DataSource definition example
After Spark 1.2, JSON and Parquet have also been reimplemented on top of the external API to serve external data source queries. Below, the JSON external data source is used as an example to describe how it is implemented:
1. JsonRelation
Defines the handling of JSON files. Both the schema and the scan strategy are based on JsonRDD; read JsonRDD yourself for the details.
private[sql] case class JsonRelation(fileName: String, samplingRatio: Double)(
    @transient val sqlContext: SQLContext)
  extends TableScan {

  private def baseRDD = sqlContext.sparkContext.textFile(fileName)  // read the JSON file

  override val schema =
    JsonRDD.inferSchema(  // JsonRDD's inferSchema automatically recognizes the JSON schema and field types
      baseRDD,
      samplingRatio,
      sqlContext.columnNameOfCorruptRecord)

  override def buildScan() =
    JsonRDD.jsonStringToRow(baseRDD, schema, sqlContext.columnNameOfCorruptRecord)  // again via JsonRDD: jsonStringToRow returns the Rows for the query
}
2. DefaultSource
Custom parameters such as path passed in the OPTIONS clause can be obtained from parameters.
It accepts the incoming parameters and uses them to construct the JsonRelation.
private[sql] class DefaultSource extends RelationProvider {
  /** Returns a new base relation with the given parameters. */
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val fileName = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
    val samplingRatio = parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)

    JsonRelation(fileName, samplingRatio)(sqlContext)
  }
}
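Putting the pieces together, a complete custom external data source can be as small as the sketch below. Everything here is made up for illustration (the package com.example.simpletext, the relation SimpleTextRelation, the single "line" column); it simply exposes a text file as a one-column table against the Spark 1.2 sources API, and it assumes the Row/StructType aliases exported by the org.apache.spark.sql package in that version.

package com.example.simpletext

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StringType, StructField, StructType}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Each line of the text file becomes a Row with a single string column named "line".
case class SimpleTextRelation(path: String)(@transient val sqlContext: SQLContext)
  extends TableScan {

  override val schema: StructType =
    StructType(StructField("line", StringType, nullable = false) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(Row(_))
}

// CreateTableUsing will load this as "com.example.simpletext.DefaultSource" by reflection.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("Option 'path' must be specified"))
    SimpleTextRelation(path)(sqlContext)
  }
}

It could then be registered with CREATE TEMPORARY TABLE logs USING com.example.simpletext OPTIONS (path '/path/to/file.txt').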
VI. Summary
Working through the External DataSource source code boils down to three parts:
1. The registration process for external data sources.
2. The query plan resolution process for external data source tables.
3. How to write a custom external data source: override BaseRelation to define the external data source's schema and scan rules, and define a RelationProvider that produces the external data source relation.
This part of the External DataSource API may still change in later builds; at present it only covers queries, and other operations are not yet supported.
--EOF--
Original article; when reposting, please credit: http://blog.csdn.net/oopsoom/article/details/42064075