I. Introduction to the Spark SQL External DataSource
With the release of Spark 1.2, Spark SQL formally began to support external data sources. Spark SQL exposes a set of interfaces for accessing external data sources, which developers can implement.
This allows Spark SQL to support more types of data sources, such as JSON, Parquet, Avro, and CSV. If we want, we can develop arbitrary external data sources and plug them into Spark SQL. Data stores that previously required separate support, such as HBase and Cassandra, can now be integrated seamlessly through the external data source mechanism.
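To make "implementing the interfaces" concrete, below is a minimal sketch of a custom data source written against the org.apache.spark.sql.sources API. The package and class names (com.example.textsource, DefaultSource, LineRelation) are made up for illustration, and the traits are shown in the shape they took in later releases (Spark 1.3+), so treat it as an outline of the mechanism rather than exact Spark 1.2 code.

// Illustrative sketch only: package and class names are invented,
// and the interfaces are shown in their later (Spark 1.3+) form.
package com.example.textsource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point that Spark looks up when a DDL statement says: USING com.example.textsource
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    LineRelation(parameters("path"))(sqlContext)
}

// A relation that exposes every line of a text file as a single-column row.
case class LineRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override val schema: StructType =
    StructType(StructField("line", StringType, nullable = true) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(Row(_))
}

With such classes on the classpath, a table could then be declared with CREATE TEMPORARY TABLE ... USING com.example.textsource OPTIONS (path '...'), just like the JSON example that follows.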
(PS: For the source-code analysis of the external DataSource mechanism, see: Spark SQL External DataSource (II) source analysis http://blog.csdn.net/oopsoom/article/details/42064075)
II. External DataSource
Take JSON in Spark 1.2 as an example: its support has been reimplemented on top of the external data source interface. So in addition to the previous API for manipulating JSON, there is now a DDL way to create an external data source table.
The operation for Parquet files is similar and is not covered in detail here; a sketch of the analogous DDL appears after the JSON syntax below.
2.1 SQL mode: CREATE TEMPORARY TABLE USING OPTIONS
Since Spark 1.2, creating a table backed by an external data source is supported through the DDL syntax CREATE TEMPORARY TABLE ... USING ... OPTIONS.
CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path '/path/to/data.json'
)
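As noted earlier, the Parquet case is analogous. Assuming the Parquet source is exposed through the same mechanism, the DDL would presumably look like the following; the provider name and path are illustrative rather than taken from the original article:

CREATE TEMPORARY TABLE parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/path/to/data.parquet'
)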
1. Operation Example:
Let's use the people.json file from the Spark examples directory as our example.
shengli-mac$ cat /Users/shengli/git_repos/spark/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
2. Use DDL to create the external data source table jsonTable:
14/12/21 16:32:14 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.

scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@...

scala> import sqlContext._
import sqlContext._

// Create the external data source table jsonTable, specify that its data source file is the JSON file people.json,
// and specify org.apache.spark.sql.json as the class handling this type (described in the follow-up article).
scala> val jsonDDL = s"""
     | |CREATE TEMPORARY TABLE jsonTable
     | |USING org.apache.spark.sql.json
     | |OPTIONS (
     | |  path 'file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json'
     | |)""".stripMargin
jsonDDL: String =
"CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path 'file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json'
)"

scala> sqlContext.sql(jsonDDL).collect() // create the external data source table jsonTable
14/12/21 16:44:27 INFO scheduler.DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:57, took 0.204461 s
res0: Array[org.apache.spark.sql.Row] = Array()
Let's take a look at the SchemaRDD:
scala> val jsonSchema = sqlContext.sql(jsonDDL)
jsonSchema: org.apache.spark.sql.SchemaRDD =
SchemaRDD[7] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
ExecutedCommand (CreateTableUsing jsonTable, org.apache.spark.sql.json, Map(path -> file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json))
ExecutedCommand loads the data from the given path into jsonTable via org.apache.spark.sql.json. The class involved is CreateTableUsing, which the follow-up source analysis article will cover.
The execution plan at each phase:
Scala> sqlcontext.sql ("SELECT * from Jsontable"). queryexecutionres6:org.apache.spark.sql.sqlcontext# Queryexecution = = = = Parsed Logical plan = = ' Project [*] ' unresolvedrelation None, jsontable, none== Analyzed Logical plan = =project [age#0,name#1] relation[age#0,name#1] Jsonrelation (file:///Users/shengli/git_repos/spark/examples/src/ main/resources/people.json,1.0) = = Optimized Logical Plan ==relation[age#0,name#1] Jsonrelation (file:///Users/ shengli/git_repos/spark/examples/src/main/resources/people.json,1.0) = = Physical Plan ==PhysicalRDD [age#0,name#1], MAPPARTITIONSRDD[27] at map at Jsonrdd.scala:47code generation:false== RDD = =
At this point, creating and loading the external data source table into Spark SQL is complete.
We can now query it in whatever way we like.
3. Query with SQL:
Scala> sqlcontext.sql ("SELECT * from jsontable") 16:52:13 INFO Spark. Sparkcontext:created broadcast 6 from textfile at jsonrelation.scala:39res2:org.apache.spark.sql.schemardd = SchemaRDD At RDD at schemardd.scala:108== Query plan = physical plan ==physicalrdd [age#2,name#3], mappartitionsrdd[24] at M AP at jsonrdd.scala:47
To execute a query:
Scala> sqlcontext.sql ("SELECT * from Jsontable"). Collect () Res1:array[org.apache.spark.sql.row] = Array ([NULL, Michael], [30,andy], [19,justin])
2.2 API Mode
sqlContext.jsonFile
scala> val json = sqlContext.jsonFile("file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json")
scala> json.registerTempTable("jsonFile")
scala> sql("select * from jsonFile").collect()
res2: Array[org.apache.spark.sql.Row] = Array([null,Michael], [30,Andy], [19,Justin])
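For completeness, here is a rough sketch of what the same API-mode flow might look like in a standalone application rather than the REPL; the object name is illustrative and not from the original article.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative standalone version of the REPL session above; the object name is made up.
object JsonFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonFileExample"))
    val sqlContext = new SQLContext(sc)

    // API mode: load the JSON file and register it as a temporary table.
    val json = sqlContext.jsonFile(
      "file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json")
    json.registerTempTable("jsonFile")

    // Query it with SQL, exactly as in the REPL.
    sqlContext.sql("SELECT * FROM jsonFile").collect().foreach(println)

    sc.stop()
  }
}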
III. Summary
In general, Spark SQL is working to get closer to a wide variety of data sources, with the goal of integrating Spark SQL with many more kinds of them.
Spark SQL provides a DDL syntax for creating tables that load external data sources: CREATE TEMPORARY TABLE ... USING ... OPTIONS.
Spark SQL exposes a set of extension interfaces; by implementing these interfaces, access to different data sources such as Avro, CSV, Parquet, and JSON can be added.
--the end--
Original article; when reproducing, please credit the source: http://blog.csdn.net/oopsoom/article/details/42061077