Spark Learning Notes: (iii) Spark SQL
Reference: https://spark.apache.org/docs/latest/sql-programming-guide.html#overview
http://www.csdn.net/article/2015-04-03/2824407
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
1) In Spark, a DataFrame is a distributed data set built on top of an RDD, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the former carries schema metadata: each column of the two-dimensional dataset represented by a DataFrame has a name and a type. This gives Spark SQL insight into the structure of the data, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, ultimately improving runtime efficiency significantly. With a plain RDD, Spark Core can only perform simple, general-purpose pipeline optimizations at the stage level, because it cannot know the internal structure of the data elements being stored.
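For example (a minimal sketch, assuming an existing sqlContext and the people.json file shipped with the Spark examples), the schema carried by a DataFrame can be inspected directly, which a plain RDD of opaque objects cannot offer:
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.printSchema()
// expected output, roughly:
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)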
2) A DataFrame can be operated on like a normal RDD and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data.
3) The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.
val df = sqlContext.sql("SELECT * FROM table") // SQL interface
Creating DataFrames:
With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.
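A rough sketch of the three creation paths (table and file names here are illustrative; assumes sc, sqlContext and import sqlContext.implicits._ as set up in the next snippet):
// 1) From an existing RDD, via the implicit toDF conversion.
val dfFromRdd = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")
// 2) From a Hive table (requires a HiveContext instead of a plain SQLContext).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val dfFromHive = hiveContext.sql("SELECT * FROM some_hive_table")
// 3) From an external data source, e.g. a JSON file.
val dfFromJson = sqlContext.jsonFile("examples/src/main/resources/people.json")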
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
The case class defines the schema of the table => the names of the arguments to the case class are read using reflection and become the names of the columns => this RDD can be implicitly converted to a DataFrame and then registered as a table => the table can be used in subsequent SQL statements.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
- A reflection-based approach that infers the schema from an RDD of case class objects, as in the example above.
- A programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime (a sketch follows below).
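A sketch of the programmatic route, following the programming guide's StructType example (assumes the same sc, sqlContext and people.txt as above; the schema string and field types are illustrative):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Create an RDD of raw text lines.
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string; build a StructType from it.
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD to Rows, then apply the schema.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// The resulting DataFrame can be registered and queried like any other.
peopleDataFrame.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").collect().foreach(println)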
DataFrames can also be loaded from and saved to external data sources with the generic load/save functions:
val df = sqlContext.load("people.json", "json")
df.select("name", "age").save("namesAndAges.parquet", "parquet")
Or, using the JSON-specific method:
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
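The Parquet file written by the save call above can then be read back the same way (a sketch; the file and table names are taken from that example):
val parquetDF = sqlContext.load("namesAndAges.parquet", "parquet")
parquetDF.registerTempTable("namesAndAges")
sqlContext.sql("SELECT name FROM namesAndAges").collect().foreach(println)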