Spark Learning Notes: (iii) Spark SQL
Reference: https://spark.apache.org/docs/latest/sql-programming-guide.html#overview
http://www.csdn.net/article/2015-04-03/2824407
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
1) In Spark, a DataFrame is a distributed data set built on top of an RDD, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the former carries schema metadata: each column of the two-dimensional dataset represented by a DataFrame has a name and a type. This gives Spark SQL insight into the structure of the data, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, ultimately improving runtime efficiency significantly. With a plain RDD, Spark Core can only perform simple, general-purpose pipeline optimizations at the stage level, because it cannot know the internal structure of the data elements being stored.
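For example (a minimal sketch, assuming an existing sqlContext and the people.json file shipped with the Spark examples), the schema carried by a DataFrame can be inspected directly, which a plain RDD of opaque objects cannot offer:
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.printSchema()
// expected output, roughly:
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)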
2) A DataFrame can be operated on like a normal RDD and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data.
3) The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.
val df = sqlContext.sql("SELECT * FROM table") // SQL interface
Creating DataFrames:
With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.
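A rough sketch of the three creation paths (table and file names here are illustrative; assumes sc, sqlContext and import sqlContext.implicits._ as set up in the next snippet):
// 1) From an existing RDD, via the implicit toDF conversion.
val dfFromRdd = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")
// 2) From a Hive table (requires a HiveContext instead of a plain SQLContext).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val dfFromHive = hiveContext.sql("SELECT * FROM some_hive_table")
// 3) From an external data source, e.g. a JSON file.
val dfFromJson = sqlContext.jsonFile("examples/src/main/resources/people.json")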
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
The case class defines the schema of the table => the names of the arguments to the case class are read using reflection and become the names of the columns => this RDD can be implicitly converted to a DataFrame and then registered as a table => the table can be used in subsequent SQL statements.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
- A reflection-based approach that infers the schema from an RDD of case class objects, as in the example above.
- A programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime (a sketch follows below).
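A sketch of the programmatic route, following the programming guide's StructType example (assumes the same sc, sqlContext and people.txt as above; the schema string and field types are illustrative):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Create an RDD of raw text lines.
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string; build a StructType from it.
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD to Rows, then apply the schema.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// The resulting DataFrame can be registered and queried like any other.
peopleDataFrame.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").collect().foreach(println)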
DataFrames can also be loaded from and saved to external data sources with the generic load/save functions:
val df = sqlContext.load("people.json", "json")
df.select("name", "age").save("namesAndAges.parquet", "parquet")
Or, using the JSON-specific method:
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
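The Parquet file written by the save call above can then be read back the same way (a sketch; the file and table names are taken from that example):
val parquetDF = sqlContext.load("namesAndAges.parquet", "parquet")
parquetDF.registerTempTable("namesAndAges")
sqlContext.sql("SELECT name FROM namesAndAges").collect().foreach(println)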