Summary of Spark SQL and Dataframe Learning


1. DataFrame
A DataFrame is a distributed dataset organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. Before Spark 1.3 this core type was called SchemaRDD; it has since been renamed DataFrame. Spark can work with a large number of data sources through DataFrames, including external files (such as JSON, Avro, Parquet, SequenceFile, and so on), Hive, relational databases, Cassandra, and more.

Differences between a DataFrame and an RDD:
An RDD treats each record as an opaque object; Spark cannot look inside a record to optimize the query in depth, which limits Spark SQL performance. A DataFrame, by contrast, carries schema (metadata) describing each record, so the optimizer can work at the level of individual columns.

(1) Creating a DataFrame
The entry point for all Spark SQL functionality is the SQLContext class or one of its subclasses; creating a SQLContext requires nothing more than a SparkContext.

val sc: SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
In addition to the basic SQLContext, you can also create a HiveContext, which supports a superset of the functionality provided by the basic SQLContext. Its additional features include the ability to write queries with the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. A HiveContext does not require an existing Hive installation, and every data source available to SQLContext is also available to HiveContext.
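
For example, a minimal sketch of creating a HiveContext (the variable name hiveContext and the SHOW TABLES query are illustrative additions, not from the original text):

// HiveContext is constructed from an existing SparkContext, just like SQLContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// HiveQL queries and Hive tables are then available through this context.
hiveContext.sql("SHOW TABLES").show()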

With a SQLContext, applications can create DataFrames from an existing RDD, a Hive table, or external data sources.
Example: creating a DataFrame from a local JSON file:

val df = sqlContext.jsonFile("file:///home/hdfs/people.json")
df.show()
age  name
null Michael
30   Andy
19   Justin
df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

(2) Operating on a DataFrame
DataFrames support a series of RDD-like operations that let you filter tables and join multiple tables together.

df.select("name").show()
name
Michael
Andy
Justin

df.select(df("name"), df("age") + 1).show()
name    (age + 1)
Michael null
Andy    31
Justin  20

df.filter(df("age") > 21).select("name").show()
name
Andy

df.groupBy("age").count().show()
age  count
null 1
19   1
30   1
Joins between tables use the triple-equals (===) column comparison operator:
df.join(df2, df("name") === df2("name"), "left").show()

import org.apache.spark.sql.functions._  // provides avg and max
df.filter("age > 30")
  .join(department, df("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(df("salary")), max(df("age")))

2. Data sources in Spark SQL

Spark SQL supports operating on a variety of data sources through the SchemaRDD interface. A SchemaRDD can be manipulated like an ordinary RDD, or it can be registered as a temporary table. Registering a SchemaRDD as a table lets you run SQL queries over its data.
Data can be loaded into a SchemaRDD from a variety of sources, including RDDs, Parquet files (columnar storage), JSON datasets, and Hive tables. The following mainly describes two methods for converting RDDs to SchemaRDDs.
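
As a sketch of that workflow (the Parquet path and table name below are placeholders, not taken from the original examples):

// Load a Parquet file as a SchemaRDD/DataFrame; the path is hypothetical.
val parquetPeople = sqlContext.parquetFile("file:///home/hdfs/people.parquet")
// Register it as a temporary table so it can be queried with SQL.
parquetPeople.registerTempTable("parquetPeople")
sqlContext.sql("SELECT name FROM parquetPeople WHERE age >= 18").collect().foreach(println)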
(1) Inferring the schema using reflection
Reflection is used to infer the schema of an RDD that contains a particular type of object. This approach works well when you already know the schema while writing the Spark program; reflection keeps the code concise. The field names of a case class are read via reflection and become the column names. Such an RDD can be implicitly converted to a SchemaRDD and then registered as a table, and the table can be used in subsequent SQL statements.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Int)
val people = sc.textFile("file:///home/hdfs/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 30")
teenagers.map(t => "Name:" + t(0)).collect().foreach(println)
teenagers.map(t => "Name:" + t.getAs[String]("name")).collect().foreach(println)
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)

(2) Specifying the schema programmatically
The schema is constructed through a programmatic interface and then applied to an existing RDD. This is suitable when the schema is not known until runtime.
A SchemaRDD can be created in three steps:

Create an RDD of Rows from the original RDD.
Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in the first step.
Apply the schema to the RDD of Rows via the applySchema method.

val people = sc.textFile("file:///home/hdfs/people.txt")
val schemaString = "name age"
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people")
// DataFrames support all the normal RDD operations
results.map(t => "Name:" + t(0)).collect().foreach(println)

Result output

Name:andy
Name:justin
Name:johnsmith
Name:bob

3. Performance Tuning
Performance can be improved and workload reduced mainly by caching data in memory or by setting experimental options.
(1) Cache data in memory
Spark SQL can cache tables in an in-memory columnar format by calling sqlContext.cacheTable("tableName"). Spark will then scan only the required columns and automatically compress the data to reduce memory usage and garbage-collection pressure.
You can also configure the memory cache by using the setConf method on SQLContext, or by running a SET key=value command in SQL.
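
A minimal sketch of both approaches, assuming the people table registered earlier (the configuration key shown is just one example of the in-memory columnar options):

// Cache the registered table in the in-memory columnar format.
sqlContext.cacheTable("people")
// Configure the cache programmatically ...
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
// ... or with a SQL SET command.
sqlContext.sql("SET spark.sql.inMemoryColumnarStorage.compressed=true")
// Release the memory when the table is no longer needed.
sqlContext.uncacheTable("people")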
(2) Configuration options
Query execution performance can be tuned with options such as spark.sql.shuffle.partitions, spark.sql.codegen, and so on.
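
For example, a sketch of setting these options (the values are illustrative only, not tuning recommendations):

// Number of partitions used when shuffling data for joins or aggregations (default 200).
sqlContext.setConf("spark.sql.shuffle.partitions", "100")
// Experimental: compile parts of each query to Java bytecode at runtime.
sqlContext.setConf("spark.sql.codegen", "true")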

4. Other
Spark SQL also provides an interface for running SQL queries directly, without writing any code. Run the following command from the Spark directory to launch the Spark SQL CLI:

./bin/spark-sql
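
Once the CLI starts, HiveQL queries can be typed directly at its prompt. A hypothetical session (assuming a people table already exists in the Hive metastore) might look like:

spark-sql> SELECT name, age FROM people WHERE age > 21;
spark-sql> exit;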

