Spark SQL data sources: creating a DataFrame from a variety of data sources
Because SQL, DataFrames, and Datasets are all built on the Spark SQL library, all three share the same code optimization, generation, and execution process, and the entry point for all of them is SQLContext.
There are a number of data sources you can use to create a Spark DataFrame:
Spark SQL Data Source: RDD
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and convert it to a DataFrame with toDF().
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Alternatively, build the RDD first and let createDataFrame() do the conversion.
val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
sqlContext.createDataFrame(peopleRDD)
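The comment above mentions registering the result as a table; a minimal sketch of that step using the standard registerTempTable and sql calls (the table name and the age filter below are just illustrative):

// Register the DataFrame as a temporary table and query it with SQL.
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)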
Spark SQL Data Source: Hive
When reading data from Hive, Spark SQL supports any storage format that Hive supports (via SerDes), including text files, RCFiles, ORC, Parquet, Avro, and Protocol Buffers (of course Spark SQL can also read these files directly).
To connect to an already deployed Hive, copy hive-site.xml, core-site.xml, and hdfs-site.xml into Spark's ./conf/ directory.
If you do not want to connect to an existing Hive, you can still use HiveContext directly without doing anything:
Spark SQL creates its own Hive metadata warehouse (metastore) in the current working directory, named metastore_db.
If you create a table using a CREATE TABLE (not CREATE EXTERNAL TABLE) statement in HiveQL, the table is placed in the /user/hive/warehouse directory of your default file system (if hdfs-site.xml is on your classpath, the default file system is HDFS; otherwise it is the local file system). An external table, by contrast, keeps its data at the location you specify, as sketched below.
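As a rough illustration of the difference (the table names, columns, and the /data/external_src path are made up for this sketch; hiveContext is a HiveContext created as in the read/write example in the next section):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Managed table: data is stored under the warehouse directory (/user/hive/warehouse by default).
hiveContext.sql("CREATE TABLE IF NOT EXISTS managed_src (key INT, value STRING)")

// External table: data stays at the location you specify; dropping the table does not delete it.
hiveContext.sql(
  "CREATE EXTERNAL TABLE IF NOT EXISTS external_src (key INT, value STRING) " +
  "LOCATION '/data/external_src'")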
Spark SQL Data Source: Hive Read/Write
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL.
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
Spark SQL Data Source: Accessing Different Versions of the Hive Metastore
Starting with Spark 1.4, Spark SQL can query different versions of the Hive metastore simply by changing configuration, with no recompilation required.
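A minimal sketch of that configuration, using the spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars properties (the version string and jars value below are placeholders to replace with your own; in spark-shell these would go in spark-defaults.conf instead, since they must be set before the HiveContext is created):

// Set the metastore properties on the SparkConf before creating the HiveContext.
val conf = new org.apache.spark.SparkConf()
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "maven")  // "builtin", "maven", or a classpath of Hive jars
val sc = new org.apache.spark.SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)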
Spark SQL Data Source: Parquet
Parquet (http://parquet.apache.org/) is a popular columnar storage format that efficiently stores records with nested fields.
The Parquet format is widely used in the Hadoop ecosystem, and it supports all of Spark SQL's data types. Spark SQL provides methods to read and write Parquet files directly.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create a DataFrame of Person records and save it as a Parquet file.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.write.parquet("people.parquet")

// Read the Parquet file back; the result is also a DataFrame.
val parquetFile = sqlContext.read.parquet("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Spark SQL Data Source: Parquet - Partition Discovery
Partitioned tables are commonly used in Hive to optimize performance; a typical on-disk layout is sketched below.
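As an illustration (the partition columns gender and country and the path are only example names, not anything specific to this article), a partitioned Parquet table might be laid out like this:

path
└── to
    └── table
        ├── gender=male
        │   ├── country=US
        │   │   └── data.parquet
        │   └── country=CN
        │       └── data.parquet
        └── gender=female
            ├── country=US
            │   └── data.parquet
            └── country=CN
                └── data.parquet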
If you simply point SQLContext.read.parquet or SQLContext.read.load at path/to/table, Spark SQL automatically extracts the partition information from the paths, and the schema of the returned DataFrame becomes the one shown below.
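For the layout sketched above, the inferred schema would look roughly like this (the partition columns gender and country are appended to the columns found in the data files, which are assumed here to contain name and age):

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)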
Of course, you can also read the data the Hive way:
hiveContext.sql("FROM src SELECT key, value")
Spark SQL Data Source: JSON
Spark SQL supports reading data from JSON files or from an RDD of JSON strings.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// The path can be a single file or a directory of JSON files.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)

// The inferred schema can be visualized using the printSchema() method.
people.printSchema()

// Register this DataFrame as a table.
people.registerTempTable("people")

// SQL statements can be run using the sql method provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// Alternatively, a DataFrame can be created from a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
Spark SQL Data Source: JDBC
val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql:dbserver",
    "dbtable" -> "schema.tablename"))
  .load()
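One practical note: the JDBC driver for your database must be visible to both the driver and the executors. A sketch of how you might launch spark-shell for the PostgreSQL example above (the jar name is a placeholder; use your actual driver jar):

bin/spark-shell --driver-class-path postgresql-<version>.jar --jars postgresql-<version>.jar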
Supported parameters:
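The options commonly documented for the JDBC data source are (this list follows the standard Spark documentation rather than anything specific to this article):

url - the JDBC URL to connect to
dbtable - the table to read, or a subquery in parentheses used as a table
driver - the JDBC driver class name, which must be available on the driver and executors
partitionColumn, lowerBound, upperBound, numPartitions - together describe how to split the read into parallel partitions; partitionColumn must be a numeric column of the table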