Preface: Some logic is troublesome to write directly with Spark Core; expressing it in SQL is much more convenient.
First, what is Spark SQL
Spark SQL is a Spark component designed specifically for processing structured data
Spark SQL provides two ways to manipulate data:
SQL query
DataFrame/Dataset API
Spark SQL = Schema + RDD
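As a minimal sketch of these two ways (assuming an existing sqlContext and a people.json file with name and age columns; both are illustrative rather than taken from the original post):

val df = sqlContext.read.json("people.json")   // a DataFrame: schema + RDD

// 1. SQL query: register a temporary table and query it with SQL text
df.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()

// 2. DataFrame/Dataset API: express the same query with method calls
df.select("name", "age").filter("age > 21").show()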
Second, the main motivation for introducing Spark SQL
Write and run Spark programs faster
Write less code, read less data, and let the optimizer optimize the program automatically, freeing the programmer from that work
Third, Spark SQL overall architecture
At the bottom of the Spark SQL stack is Spark Core; above it sits Catalyst, an execution-plan optimizer that refines queries
Above Catalyst are two components, SQL and DataFrame/Dataset. Their interfaces differ: the SQL component takes pure SQL statements as input, while the DataFrame/Dataset component takes input generated through its API
Whether a query comes in as SQL or through the DataFrame/Dataset API, it is fed into the Catalyst optimizer, and the optimized plan is handed to Spark Core to run
The remaining layers in the figure make up the Spark SQL suite, together with some higher-level APIs such as machine learning
Fourth, SQL and DataFrame/Dataset
Spark provides two APIs for writing Spark SQL programs: SQL queries, or the DataFrame/Dataset API
Using SQL
If you are very familiar with SQL syntax, use SQL
Using DataFrame/Dataset
DSL (Domain Specific Language): the DataFrame/Dataset API is a DSL; the table, avg, and groupBy calls in the previous figure are examples (a sketch follows this list; search online for more on DSLs)
Use a general-purpose language (Scala, Python) to express your query needs
Catch errors faster with DataFrame: SQL strings are only checked at run time, while DataFrame operations are checked at compile time, for example whether a column exists and whether the column type is correct
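As a hedged sketch of that DSL style (the groupBy/avg calls mentioned above), assuming an illustrative employees.json file with dept and salary columns:

import org.apache.spark.sql.functions.avg

val employees = sqlContext.read.json("employees.json")
employees.groupBy("dept")
  .agg(avg("salary").as("avg_salary"))   // DSL calls expressed in Scala
  .show()

// The equivalent SQL string is only validated when the query is executed:
// sqlContext.sql("SELECT dept, avg(salary) FROM employees GROUP BY dept")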
Fifth, Spark SQL API evolution
Spark 1.3 introduced the DataFrame, but some limitations were found later, so the Dataset was introduced. Originally these were two separate APIs, but because DataFrame and Dataset are interoperable, in Spark 2.0 DataFrame became a special case of Dataset (a Dataset of Row)
1. RDD API (2011)
Distributed data collection consisting of JVM objects
Immutable and fault-tolerant
Can handle structured and unstructured data
Functional-style transformations
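A minimal sketch of this style (assuming an existing SparkContext sc; the Person class and data are illustrative): plain JVM objects plus functional transformations, with no schema attached.

case class Person(name: String, age: Int)

val rdd = sc.makeRDD(Seq(Person("A", 10), Person("B", 25), Person("C", 40)))
val adults = rdd.filter(_.age >= 18)       // functional transformation
  .map(p => (p.name, p.age))               // Spark only sees opaque objects
adults.collect().foreach(println)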
2. The limitations of the RDD API
No schema
The user has to optimize the program themselves
Reading data from different data sources is difficult
Merging data from multiple data sources is also difficult
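A hedged sketch of the "no schema / user-optimized" points above, using an illustrative people.csv file: the user parses text by hand, refers to fields by position, and chooses the operation order, because Spark has no schema to optimize against.

val lines = sc.textFile("people.csv")      // e.g. one "A,10" record per line
val people = lines.map { line =>
  val fields = line.split(",")
  (fields(0), fields(1).toInt)             // positions instead of column names
}
// Filtering/projection order is chosen by the programmer, not by an optimizer
val adultNames = people.filter(_._2 > 18).map(_._1)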
3. DataFrame API (2013)
A distributed collection of data made up of Row objects: a DataFrame holds many records, each record is a Row object, and the DataFrame also carries the schema, i.e. which columns it contains, what the column names are, and what data type each column has
Immutable and fault-tolerant
Working with structured data
Self-optimizing catalyst, which automatically optimizes the program
A more convenient data source API: compared with the RDD API, DataFrame provides a data source API that lets users easily read data from a variety of data sources
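A minimal sketch of that data source API (the file paths and the join column are illustrative): the same DataFrame operations work across formats, and the schema travels with the data.

val jsonDF    = sqlContext.read.json("people.json")
val parquetDF = sqlContext.read.parquet("people.parquet")

jsonDF.printSchema()                  // column names and types are known
jsonDF.join(parquetDF, "name")        // merging data from different sources
  .groupBy("name")
  .count()
  .show()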
4. Limitations of DataFrame API
Run-Time type checking
Cannot operate on domain objects directly
Loses the functional programming style of the RDD API
Example:
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 10000").show()
// Limitation: throws a runtime exception
// org.apache.spark.sql.AnalysisException: cannot resolve 'salary' given input columns age, name;

// Create RDD[Person]
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
// Create DataFrame from an RDD[Person]
val personDF = sqlContext.createDataFrame(personRDD)
// Limitation: we get back RDD[Row], not RDD[Person]; converting the DataFrame
// back to an RDD loses the domain type information
personDF.rdd
Note: for the differences between Spark RDD, DataFrame, and Dataset, search online
5. Dataset
Dataset extends the DataFrame API, providing a compile-time type-safe, object-oriented style API
import sqlContext.implicits._

case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds = dataframe.as[Person]
// Compute histogram of age by name
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)
    people.foreach(p => buckets((p.age / 10).toInt) += 1)
    (name, buckets)
}
Dataset API
Type safety: Works directly on domain objects
// Create RDD[Person]
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
// Create Dataset from the RDD
val personDS = sqlContext.createDataset(personRDD)
personDS.rdd  // RDD[Person], not RDD[Row] as with DataFrame
Efficient: code-generated encoders give more efficient serialization
Interoperability: Dataset and DataFrame can be converted to each other (see the conversion sketch at the end of this section)
Compile-Time type checking
case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds = dataframe.as[Person]
ds.filter(_.salary > 12500)
// Compile-time error: value salary is not a member of Person
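A hedged sketch of the DataFrame/Dataset interoperability mentioned above, reusing the Person case class and sqlContext from the earlier examples:

import sqlContext.implicits._          // provides the encoders needed by as[...]

val df = sqlContext.read.json("people.json")    // DataFrame (rows + schema)
val ds = df.as[Person]                          // DataFrame -> Dataset[Person]
val backToDF = ds.toDF()                        // Dataset[Person] -> DataFrame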