Spark SQL data source


Spark SQL data sources: creating DataFrames from a variety of data sources

Because SQL queries, DataFrames, and Datasets all go through the Spark SQL library, all three share the same optimization, code generation, and execution pipeline, and they share the same entry point: SQLContext.

There are a number of data sources you can use to create a Spark DataFrame:

Spark SQL data source: RDD

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and convert it to a DataFrame with toDF().
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Alternatively, build the RDD first and pass it to createDataFrame().
val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
sqlContext.createDataFrame(peopleRDD)

Spark SQL data source: Hive

When reading data from Hive, Spark SQL supports any storage format (SerDe) that Hive supports, including text files, RCFiles, ORC, Parquet, Avro, and Protocol Buffers (Spark SQL can also read several of these formats directly).

To connect to an existing Hive deployment, copy hive-site.xml, core-site.xml, and hdfs-site.xml into Spark's ./conf/ directory.

If you do not have an existing Hive deployment to connect to, you can still use HiveContext without any extra configuration:

Spark SQL creates its own Hive metastore (metadata database), called metastore_db, in the current working directory.

If you create tables using a CREATE TABLE (not CREATE EXTERNAL TABLE) statement in HiveQL, those tables are placed in the /user/hive/warehouse directory of your default file system (HDFS if an hdfs-site.xml is on your classpath, otherwise the local file system).

Spark SQL data source: Hive read/write

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL.
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Spark SQL data source: accessing different versions of the Hive metastore

Starting with Spark 1.4, Spark SQL can talk to different versions of the Hive metastore simply by changing configuration, with no recompilation required.
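A minimal sketch of that configuration (the property names are the standard Spark SQL settings for this; the version value 0.13.1 is only an example, so check which metastore versions your Spark release supports):

# In conf/spark-defaults.conf, or passed via --conf on spark-submit
spark.sql.hive.metastore.version   0.13.1
# "maven" downloads matching Hive client jars; a classpath to existing jars also works
spark.sql.hive.metastore.jars      maven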

Spark SQL data source: Parquet

Parquet (http://parquet.apache.org/) is a popular columnar storage format that can efficiently store records with nested fields.

The Parquet format is widely used in the Hadoop ecosystem, and it supports all of the data types of Spark SQL. Spark SQL provides methods for reading and writing Parquet files directly.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and convert it to a DataFrame.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Write the DataFrame out as a Parquet file.
people.write.parquet("people.parquet")

// Read it back in. The result is also a DataFrame.
val parquetFile = sqlContext.read.parquet("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Spark SQL data source: Parquet partition discovery

Partitioned tables are commonly used in Hive to improve query performance: the values of the partition columns are encoded in the directory layout rather than in the data files themselves.
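For illustration (gender and country are just example partition column names), a Parquet table partitioned by two columns might be laid out like this:

path/to/table/gender=male/country=US/data.parquet
path/to/table/gender=male/country=CN/data.parquet
path/to/table/gender=female/country=US/data.parquet
path/to/table/gender=female/country=CN/data.parquet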

By simply passing path/to/table to SQLContext.read.parquet or SQLContext.read.load, Spark SQL automatically extracts the partition information from the paths, and the partition columns are appended to the schema of the returned DataFrame:
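A minimal sketch, assuming the example layout above and data files that contain name and age columns:

val df = sqlContext.read.parquet("path/to/table")
df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: long (nullable = true)
//  |-- gender: string (nullable = true)
//  |-- country: string (nullable = true)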

Of course, you can also read the table the Hive way, through HiveContext:

hiveContext.sql("FROM src SELECT key, value")

Spark SQL data source: JSON

Spark SQL supports reading data from JSON files or from an RDD of JSON strings.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// The path can point to a single JSON file or to a directory of JSON files.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)

// The inferred schema can be visualized using the printSchema() method.
people.printSchema()

// Register this DataFrame as a table.
people.registerTempTable("people")

// SQL statements can be run using the sql method provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// Alternatively, a DataFrame can be created from a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
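For reference, the nested address field in anotherPeople above is inferred as a struct; its printed schema would look roughly like this:

anotherPeople.printSchema()
// root
//  |-- address: struct (nullable = true)
//  |    |-- city: string (nullable = true)
//  |    |-- state: string (nullable = true)
//  |-- name: string (nullable = true)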

Spark SQL data source: JDBC

val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql:dbserver",
    "dbtable" -> "schema.tablename"))
  .load()

Supported parameters:
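As a sketch of the most commonly used options (url, dbtable, driver, and the partitioned-read options partitionColumn, lowerBound, upperBound, and numPartitions are standard Spark SQL JDBC options; the table and column names below are hypothetical), a partitioned read might look like:

val partitionedDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url"             -> "jdbc:postgresql:dbserver",
    "dbtable"         -> "schema.tablename",
    "driver"          -> "org.postgresql.Driver",  // JDBC driver class to load
    "partitionColumn" -> "id",                     // numeric column to split on
    "lowerBound"      -> "1",                      // minimum value of partitionColumn
    "upperBound"      -> "100000",                 // maximum value of partitionColumn
    "numPartitions"   -> "10"))                    // number of parallel reads
  .load()

The result, jdbcDF, is an ordinary DataFrame and can be registered as a table or queried like any other data source.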
