Spark Learning Notes: (iii) Spark SQL


References:

https://spark.apache.org/docs/latest/sql-programming-guide.html#overview

http://www.csdn.net/article/2015-04-03/2824407

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

1) In Spark, a DataFrame is a distributed data set built on top of an RDD, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the former carries schema metadata: each column of the two-dimensional table represented by a DataFrame has a name and a type. This gives Spark SQL insight into the structure of the data, letting it optimize both the data sources behind a DataFrame and the transformations applied to it, which can significantly improve runtime efficiency. With an RDD, by contrast, Spark core can only perform simple, general-purpose pipeline optimizations at the stage level, because it cannot know the internal structure of the stored data elements.
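For instance, the schema a DataFrame carries can be inspected directly; a plain RDD exposes no such metadata. A minimal sketch, assuming a SQLContext named sqlContext (created as shown below) and the people.json file that ships with the Spark examples:

// A DataFrame knows the name and type of each column; an RDD does not.
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.printSchema()  // prints each column's name and type, e.g. age: long, name: string
df.explain()      // prints the physical plan Spark SQL builds to read this data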

2) A DataFrame can be operated on as a normal RDD and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data.

3) The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.

val df = sqlContext.sql("SELECT * FROM table")  // SQL interface

Creating DataFrames:

With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
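For the Hive-table case mentioned above, a HiveContext (a subclass of SQLContext) is used instead of a plain SQLContext; a minimal sketch, assuming a Hive deployment with the standard example table src:

// HiveContext adds support for HiveQL and for tables in the Hive metastore.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val hiveDF = hiveContext.sql("SELECT key, value FROM src")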
    • Different methods for converting existing RDDs into DataFrames:

      • The first method uses reflection to infer the schema of an RDD that contains specific types of objects.

The case class defines the schema of the table => the names of the arguments to the case class are read via reflection and become the names of the columns => this RDD can be implicitly converted to a DataFrame and then registered as a table => tables can be used in subsequent SQL statements.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql method provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

      • A programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime (see the sketch after this list).
    • From data sources:

val df = sqlContext.load("people.json", "json")
df.select("name", "age").save("namesAndAges.parquet", "parquet")

Or

val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
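The programmatic interface mentioned earlier (the second method for converting existing RDDs) can be sketched as follows, adapted from the Spark SQL programming guide referenced above; it assumes sc, sqlContext, and the people.txt file used in the reflection example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// The schema is encoded in a string; in a real application it might only be known at runtime.
val schemaString = "name age"

// Generate the schema from the string.
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert the records of the raw RDD to Rows.
val rowRDD = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD of Rows.
val peopleDF = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table and query it as before.
peopleDF.registerTempTable("people")
sqlContext.sql("SELECT name FROM people").map(t => "Name: " + t(0)).collect().foreach(println)

Because every field is declared as StringType here, the age column stays a string; the reflection-based approach above inferred Int from the case class instead.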
