Spark SQL External DataSource (1): Example


I. Introduction to the Spark SQL External DataSource

With the release of Spark 1.2, Spark SQL began to formally support external data sources. Spark SQL opens up a set of interfaces for accessing external data sources, which developers can implement.

This allows Spark SQL to support more types of data sources, such as JSON, Parquet, Avro, and CSV. If we want, we can develop arbitrary external data sources and connect them to Spark SQL. Data stores such as HBase and Cassandra, which previously needed dedicated support, can now be integrated seamlessly through external data sources.

(PS: For the source-code analysis of the external DataSource, see Spark SQL External DataSource (2): Source Analysis, http://blog.csdn.net/oopsoom/article/details/42064075)
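To give a concrete feel for those interfaces, here is a minimal sketch of a custom data source written against the Spark 1.2 org.apache.spark.sql.sources API (RelationProvider, BaseRelation, TableScan). The package com.example.simplecsv, the class names, and the naive two-column CSV parsing are illustrative assumptions rather than an existing Spark module, and the exact imports and base types may differ in other Spark versions.

// A minimal, illustrative custom data source for Spark 1.2 (not a real Spark module).
package com.example.simplecsv

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._  // Row, SQLContext, StructType, StructField, StringType (Spark 1.2 aliases)
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Spark resolves "USING com.example.simplecsv" to the DefaultSource class in that package.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    SimpleCsvRelation(parameters("path"))(sqlContext)
}

// The relation declares a schema and knows how to produce rows for a full table scan.
case class SimpleCsvRelation(path: String)(@transient val sqlContext: SQLContext)
  extends TableScan {

  // A fixed two-column schema, purely for illustration.
  override def schema: StructType =
    StructType(StructField("c0", StringType, true) :: StructField("c1", StringType, true) :: Nil)

  // buildScan returns every row; Spark SQL takes care of the rest of the query.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(_.split(",")).map(cols => Row(cols(0), cols(1)))
}

With such a class on the classpath, the USING clause would simply name the package, e.g. CREATE TEMPORARY TABLE csvTable USING com.example.simplecsv OPTIONS (path '/path/to/data.csv').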

II. External DataSource

Take JSON in Spark 1.2 as an example: its support has been reworked to implement the external data source interface. So in addition to the existing API for working with JSON, there is now a DDL way to create an external data source table.

Operating on Parquet files is similar and is not listed here (a hedged Parquet sketch follows the JSON DDL below).

2.1 SQL mode: CREATE TEMPORARY TABLE ... USING ... OPTIONS

Since Spark 1.2, creating a table backed by an external data source is supported via the DDL syntax CREATE TEMPORARY TABLE ... USING ... OPTIONS, for example:

CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path '/path/to/data.json'
)
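For reference, the Parquet equivalent mentioned above would look roughly like the following; this assumes the Parquet data source can likewise be named by its package, org.apache.spark.sql.parquet, and the path is a placeholder:

CREATE TEMPORARY TABLE parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/path/to/data.parquet'
)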

1. Operation Example:

Let's use the people.json file from Spark's examples as our example.

shengli-mac$ cat /Users/shengli/git_repos/spark/examples/src/main/resources/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
2. Use DDL to create the external data source table jsonTable:

14/12/21 16:32:14 INFO repl.SparkILoop: Created spark context. Spark context available as sc.

scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> val sqlContext = new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@...

scala> import sqlContext._
import sqlContext._

// Create the jsonTable external data source table: its data source file is the people.json
// JSON file, and org.apache.spark.sql.json is specified as the implementation that handles
// this type (described in the follow-up article).
scala> val jsonDDL = s"""
     | |CREATE TEMPORARY TABLE jsonTable
     | |USING org.apache.spark.sql.json
     | |OPTIONS (
     | |  path 'file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json'
     | |)""".stripMargin
jsonDDL: String =
CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path 'file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json'
)

scala> sqlContext.sql(jsonDDL).collect()  // create the external data source table jsonTable
14/12/21 16:44:27 INFO scheduler.DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:57, took 0.204461 s
res0: Array[org.apache.spark.sql.Row] = Array()

Let's take a look at the SchemaRDD:

scala> val jsonSchema = sqlContext.sql(jsonDDL)
jsonSchema: org.apache.spark.sql.SchemaRDD =
SchemaRDD[7] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
ExecutedCommand (CreateTableUsing jsonTable, org.apache.spark.sql.json, Map(path -> file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json))

ExecutedCommand loads the data from the given path into jsonTable, using org.apache.spark.sql.json. The class involved here is CreateTableUsing, which the follow-up source-analysis article will cover.
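As a rough, simplified sketch (not the actual Spark source) of what such a command has to do: resolve the provider named in the USING clause, ask it to build a relation for the given OPTIONS, and register the result as a temporary table. The helper below assumes the Spark 1.2 RelationProvider API and SQLContext.baseRelationToSchemaRDD; the real CreateTableUsing also handles class-loading fallbacks and case-insensitive options.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.RelationProvider

// Hypothetical helper mimicking, in simplified form, what CreateTableUsing does.
def createTableUsing(
    sqlContext: SQLContext,
    tableName: String,
    provider: String,                   // e.g. "org.apache.spark.sql.json"
    options: Map[String, String]): Unit = {
  // The USING value names a package; Spark falls back to loading provider + ".DefaultSource".
  val clazz = Class.forName(provider + ".DefaultSource")
  val dataSource = clazz.newInstance().asInstanceOf[RelationProvider]
  // The provider turns the OPTIONS map (here, the JSON path) into a BaseRelation.
  val relation = dataSource.createRelation(sqlContext, options)
  // Wrap the relation as a SchemaRDD and register it under the table name.
  sqlContext.baseRelationToSchemaRDD(relation).registerTempTable(tableName)
}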

The query plan at each phase:

scala> sqlContext.sql("select * from jsonTable").queryExecution
res6: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [*]
 'UnresolvedRelation None, jsonTable, None
== Analyzed Logical Plan ==
Project [age#0,name#1]
 Relation[age#0,name#1] JSONRelation(file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json,1.0)
== Optimized Logical Plan ==
Relation[age#0,name#1] JSONRelation(file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json,1.0)
== Physical Plan ==
PhysicalRDD [age#0,name#1], MapPartitionsRDD[27] at map at JsonRDD.scala:47
Code Generation: false
== RDD ==

At this point, creating and loading the external data source into Spark SQL is complete.

We can now query it in whatever way we like.

3. SQL query:

scala> sqlContext.sql("select * from jsonTable")
14/12/21 16:52:13 INFO spark.SparkContext: Created broadcast 6 from textFile at JSONRelation.scala:39
res2: org.apache.spark.sql.SchemaRDD =
SchemaRDD at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
PhysicalRDD [age#2,name#3], MapPartitionsRDD[24] at map at JsonRDD.scala:47

To execute a query:

scala> sqlContext.sql("select * from jsonTable").collect()
res1: Array[org.apache.spark.sql.Row] = Array([null,Michael], [30,Andy], [19,Justin])

2.2 API Mode

sqlContext.jsonFile

scala> val json = sqlContext.jsonFile("file:///Users/shengli/git_repos/spark/examples/src/main/resources/people.json")
scala> json.registerTempTable("jsonFile")
scala> sql("select * from jsonFile").collect()
res2: Array[org.apache.spark.sql.Row] = Array([null,Michael], [30,Andy], [19,Justin])
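As a quick follow-up, the registered temp table can be filtered like any other table; the expected result is inferred from the sample data above, where only Andy (age 30) satisfies the predicate:

// Query the temp table registered above with a predicate on the inferred age column.
// With the sample people.json, this should return Array([Andy]).
sql("select name from jsonFile where age > 20").collect()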

III. Summary

In general, Spark SQL is trying to move closer to a variety of data sources, so that it can integrate with many more kinds of them.

Spark SQL provides a DDL syntax for creating tables that load external data sources: CREATE TEMPORARY TABLE ... USING ... OPTIONS.

Spark SQL exposes a set of extension interfaces; by implementing them, access to different data sources such as Avro, CSV, Parquet, and JSON can be added.

--the end--

Original article; when reproducing, please credit the source: http://blog.csdn.net/oopsoom/article/details/42061077

