Spark SQL data sources: creating a DataFrame from a variety of data sources
Because SQL, DataFrames, and Datasets are all built on the Spark SQL library, all three share the same code optimization, generation, and execution process, and the entry point for all of them is SQLContext.
There are a number of data sources you can use to create a Spark DataFrame:
Spark SQL Data Source: RDD
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and convert it to a DataFrame with toDF().
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Alternatively, build the RDD first and let createDataFrame() do the conversion.
val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
sqlContext.createDataFrame(peopleRDD)
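The comment above mentions registering the result as a table; a minimal sketch of that step using the standard registerTempTable and sql calls (the table name and the age filter below are just illustrative):

// Register the DataFrame as a temporary table and query it with SQL.
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)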
Spark SQL Data Source: Hive
When reading data from Hive, Spark SQL supports any storage format that Hive supports (via SerDes), including text files, RCFiles, ORC, Parquet, Avro, and Protocol Buffers (of course Spark SQL can also read these files directly).
To connect to an already deployed Hive, copy hive-site.xml, core-site.xml, and hdfs-site.xml into Spark's ./conf/ directory.
If you do not want to connect to an existing Hive, you can still use HiveContext directly without doing anything:
Spark SQL creates its own Hive metadata warehouse (metastore) in the current working directory, named metastore_db.
If you create a table using a CREATE TABLE (not CREATE EXTERNAL TABLE) statement in HiveQL, the table is placed in the /user/hive/warehouse directory of your default file system (if hdfs-site.xml is on your classpath, the default file system is HDFS; otherwise it is the local file system). An external table, by contrast, keeps its data at the location you specify, as sketched below.
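As a rough illustration of the difference (the table names, columns, and the /data/external_src path are made up for this sketch; hiveContext is a HiveContext created as in the read/write example in the next section):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Managed table: data is stored under the warehouse directory (/user/hive/warehouse by default).
hiveContext.sql("CREATE TABLE IF NOT EXISTS managed_src (key INT, value STRING)")

// External table: data stays at the location you specify; dropping the table does not delete it.
hiveContext.sql(
  "CREATE EXTERNAL TABLE IF NOT EXISTS external_src (key INT, value STRING) " +
  "LOCATION '/data/external_src'")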
Spark SQL Data Source: Hive Read/Write
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL.
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
Spark SQL Data Source: Accessing Different Versions of the Hive Metastore
Starting with Spark 1.4, Spark SQL can query different versions of the Hive metastore simply by changing configuration, with no recompilation required.
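A minimal sketch of that configuration, using the spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars properties (the version string and jars value below are placeholders to replace with your own; in spark-shell these would go in spark-defaults.conf instead, since they must be set before the HiveContext is created):

// Set the metastore properties on the SparkConf before creating the HiveContext.
val conf = new org.apache.spark.SparkConf()
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "maven")  // "builtin", "maven", or a classpath of Hive jars
val sc = new org.apache.spark.SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)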
Spark SQL Data Source: Parquet
Parquet (http://parquet.apache.org/) is a popular columnar storage format that efficiently stores records with nested fields.
The Parquet format is widely used in the Hadoop ecosystem, and it supports all of Spark SQL's data types. Spark SQL provides methods to read and write Parquet files directly.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create a DataFrame of Person records and save it as a Parquet file.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.write.parquet("people.parquet")

// Read the Parquet file back; the result is also a DataFrame.
val parquetFile = sqlContext.read.parquet("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Spark SQL Data Source: Parquet - Partition Discovery
Partitioned tables are commonly used in Hive to optimize performance; a typical on-disk layout is sketched below.
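As an illustration (the partition columns gender and country and the path are only example names, not anything specific to this article), a partitioned Parquet table might be laid out like this:

path
└── to
    └── table
        ├── gender=male
        │   ├── country=US
        │   │   └── data.parquet
        │   └── country=CN
        │       └── data.parquet
        └── gender=female
            ├── country=US
            │   └── data.parquet
            └── country=CN
                └── data.parquet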
If you simply point SQLContext.read.parquet or SQLContext.read.load at path/to/table, Spark SQL automatically extracts the partition information from the paths, and the schema of the returned DataFrame becomes the one shown below.
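For the layout sketched above, the inferred schema would look roughly like this (the partition columns gender and country are appended to the columns found in the data files, which are assumed here to contain name and age):

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)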
Of course, you can also read the data the Hive way:
hiveContext.sql("FROM src SELECT key, value")
Spark SQL Data Source: JSON
Spark SQL supports reading data from JSON files or from an RDD of JSON strings.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// The path can be a single file or a directory of JSON files.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)

// The inferred schema can be visualized using the printSchema() method.
people.printSchema()

// Register this DataFrame as a table.
people.registerTempTable("people")

// SQL statements can be run using the sql method provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// Alternatively, a DataFrame can be created from a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
Spark SQL Data Source: JDBC
val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql:dbserver",
    "dbtable" -> "schema.tablename"))
  .load()
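One practical note: the JDBC driver for your database must be visible to both the driver and the executors. A sketch of how you might launch spark-shell for the PostgreSQL example above (the jar name is a placeholder; use your actual driver jar):

bin/spark-shell --driver-class-path postgresql-<version>.jar --jars postgresql-<version>.jar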
Supported parameters:
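The options commonly documented for the JDBC data source are (this list follows the standard Spark documentation rather than anything specific to this article):

url - the JDBC URL to connect to
dbtable - the table to read, or a subquery in parentheses used as a table
driver - the JDBC driver class name, which must be available on the driver and executors
partitionColumn, lowerBound, upperBound, numPartitions - together describe how to split the read into parallel partitions; partitionColumn must be a numeric column of the table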