Spark SQL External DataSource (1): Example


I. Introduction to the Spark SQL External DataSource

With the release of Spark 1.2, Spark SQL began to formally support external data sources. Spark SQL opens up a set of interfaces for accessing external data sources, which developers can implement.

This allows Spark SQL to support more types of data sources, such as JSON, Parquet, Avro, and CSV. If we want, we can develop arbitrary external data sources and connect them to Spark SQL. Data stores such as HBase and Cassandra, which previously needed dedicated support, can now be integrated seamlessly through external data sources.

(PS: For the source-code analysis of the external DataSource, see Spark SQL External DataSource (2): Source Analysis, http://blog.csdn.net/oopsoom/article/details/42064075)
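To give a concrete feel for those interfaces, here is a minimal sketch of a custom data source written against the Spark 1.2 org.apache.spark.sql.sources API (RelationProvider, BaseRelation, TableScan). The package com.example.simplecsv, the class names, and the naive two-column CSV parsing are illustrative assumptions rather than an existing Spark module, and the exact imports and base types may differ in other Spark versions.

// A minimal, illustrative custom data source for Spark 1.2 (not a real Spark module).
package com.example.simplecsv

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._  // Row, SQLContext, StructType, StructField, StringType (Spark 1.2 aliases)
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Spark resolves "USING com.example.simplecsv" to the DefaultSource class in that package.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    SimpleCsvRelation(parameters("path"))(sqlContext)
}

// The relation declares a schema and knows how to produce rows for a full table scan.
case class SimpleCsvRelation(path: String)(@transient val sqlContext: SQLContext)
  extends TableScan {

  // A fixed two-column schema, purely for illustration.
  override def schema: StructType =
    StructType(StructField("c0", StringType, true) :: StructField("c1", StringType, true) :: Nil)

  // buildScan returns every row; Spark SQL takes care of the rest of the query.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(_.split(",")).map(cols => Row(cols(0), cols(1)))
}

With such a class on the classpath, the USING clause would simply name the package, e.g. CREATE TEMPORARY TABLE csvTable USING com.example.simplecsv OPTIONS (path '/path/to/data.csv').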

II. External DataSource

Take JSON in Spark 1.2 as an example: its support has been reworked to implement the external data source interface. So in addition to the existing API for working with JSON, there is now a DDL way to create an external data source table.

Operating on Parquet files is similar and is not listed here (a hedged Parquet sketch follows the JSON DDL below).

2.1 SQL mode: CREATE TEMPORARY TABLE ... USING ... OPTIONS

Since Spark 1.2, creating a table backed by an external data source is supported via the DDL syntax CREATE TEMPORARY TABLE ... USING ... OPTIONS, for example:

CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path '/path/to/data.json'
)
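For reference, the Parquet equivalent mentioned above would look roughly like the following; this assumes the Parquet data source can likewise be named by its package, org.apache.spark.sql.parquet, and the path is a placeholder:

CREATE TEMPORARY TABLE parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/path/to/data.parquet'
)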

1. Operation Example:

Let's use the people.json file from Spark's examples as our example.

shengli-mac$ cat /Users/shengli/git_repos/spark/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
2. Use DDL to create the external data source table jsonTable:

14/12/21 16:32:14 INFO repl.SparkILoop: Created spark context. Spark context available as sc.

scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@...

scala> import sqlContext._
import sqlContext._

// Create the jsonTable external data source table: its data source file is the people.json
// JSON file, and org.apache.spark.sql.json is specified as the implementation that handles
// this type (described in the follow-up article).
scala> val jsonDDL = s"""
     | |CREATE TEMPORARY TABLE jsonTable
     | |USING org.apache.spark.sql.json
     | |OPTIONS (
     | |  path 'file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json'
     | |)""".stripMargin
jsonDDL: String =
CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path 'file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json'
)

scala> sqlContext.sql(jsonDDL).collect()  // create the external data source table jsonTable
14/12/21 16:44:27 INFO scheduler.DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:57, took 0.204461 s
res0: Array[org.apache.spark.sql.Row] = Array()

Let's take a look at the SchemaRDD:

scala> val jsonSchema = sqlContext.sql(jsonDDL)
jsonSchema: org.apache.spark.sql.SchemaRDD =
SchemaRDD[7] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
ExecutedCommand (CreateTableUsing jsonTable, org.apache.spark.sql.json, Map(path -> file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json))

ExecutedCommand loads the data from the given path into jsonTable, using org.apache.spark.sql.json. The class involved here is CreateTableUsing, which the follow-up source-analysis article will cover.
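As a rough, simplified sketch (not the actual Spark source) of what such a command has to do: resolve the provider named in the USING clause, ask it to build a relation for the given OPTIONS, and register the result as a temporary table. The helper below assumes the Spark 1.2 RelationProvider API and SQLContext.baseRelationToSchemaRDD; the real CreateTableUsing also handles class-loading fallbacks and case-insensitive options.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.RelationProvider

// Hypothetical helper mimicking, in simplified form, what CreateTableUsing does.
def createTableUsing(
    sqlContext: SQLContext,
    tableName: String,
    provider: String,                   // e.g. "org.apache.spark.sql.json"
    options: Map[String, String]): Unit = {
  // The USING value names a package; Spark falls back to loading provider + ".DefaultSource".
  val clazz = Class.forName(provider + ".DefaultSource")
  val dataSource = clazz.newInstance().asInstanceOf[RelationProvider]
  // The provider turns the OPTIONS map (here, the JSON path) into a BaseRelation.
  val relation = dataSource.createRelation(sqlContext, options)
  // Wrap the relation as a SchemaRDD and register it under the table name.
  sqlContext.baseRelationToSchemaRDD(relation).registerTempTable(tableName)
}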

The query plan at each phase:

scala> sqlContext.sql("select * from jsonTable").queryExecution
res6: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [*]
 'UnresolvedRelation None, jsonTable, None
== Analyzed Logical Plan ==
Project [age#0,name#1]
 Relation[age#0,name#1] JSONRelation(file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json,1.0)
== Optimized Logical Plan ==
Relation[age#0,name#1] JSONRelation(file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json,1.0)
== Physical Plan ==
PhysicalRDD [age#0,name#1], MapPartitionsRDD[27] at map at JsonRDD.scala:47
Code Generation: false
== RDD ==

At this point, creating and loading the external data source into Spark SQL is complete.

We can now query it in whatever way we like.

3. SQL query:

scala> sqlContext.sql("select * from jsonTable")
14/12/21 16:52:13 INFO spark.SparkContext: Created broadcast 6 from textFile at JSONRelation.scala:39
res2: org.apache.spark.sql.SchemaRDD =
SchemaRDD at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
PhysicalRDD [age#2,name#3], MapPartitionsRDD[24] at map at JsonRDD.scala:47

To execute a query:

scala> sqlContext.sql("select * from jsonTable").collect()
res1: Array[org.apache.spark.sql.Row] = Array([null,Michael], [30,Andy], [19,Justin])

2.2 API Mode

sqlContext.jsonFile

scala> val json = sqlContext.jsonFile("file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json")
scala> json.registerTempTable("jsonFile")
scala> sql("select * from jsonFile").collect()
res2: Array[org.apache.spark.sql.Row] = Array([null,Michael], [30,Andy], [19,Justin])
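As a quick follow-up, the registered temp table can be filtered like any other table; the expected result is inferred from the sample data above, where only Andy (age 30) satisfies the predicate:

// Query the temp table registered above with a predicate on the inferred age column.
// With the sample people.json, this should return Array([Andy]).
sql("select name from jsonFile where age > 20").collect()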

III. Summary

In general, Spark SQL is trying to move closer to a variety of data sources, so that it can integrate with many more kinds of them.

Spark SQL provides a DDL syntax for creating tables that load external data sources: CREATE TEMPORARY TABLE ... USING ... OPTIONS.

Spark SQL exposes a set of extension interfaces; by implementing them, access to different data sources such as Avro, CSV, Parquet, and JSON can be added.

--the end--

Original article; when reproducing, please credit the source: http://blog.csdn.net/oopsoom/article/details/42061077

