Preface: Some logic is troublesome to write directly with Spark Core; expressing it in SQL is much more convenient.
First, what is Spark SQL
Spark SQL is a Spark component designed specifically for processing structured data
Spark SQL provides two ways to manipulate data:
SQL query
DataFrame/Dataset API
Spark SQL = Schema + RDD
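As a minimal sketch of these two ways (assuming an existing sqlContext and a people.json file with name and age columns; both are illustrative rather than taken from the original post):

val df = sqlContext.read.json("people.json")   // a DataFrame: schema + RDD

// 1. SQL query: register a temporary table and query it with SQL text
df.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()

// 2. DataFrame/Dataset API: express the same query with method calls
df.select("name", "age").filter("age > 21").show()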
Second, the main motivation for introducing Spark SQL
Write and run Spark programs faster
Write less code, read less data, and let the optimizer optimize the program automatically, freeing the programmer from that work
Third, Spark SQL overall architecture
At the bottom of the Spark SQL stack is Spark Core; above it sits Catalyst, an execution-plan optimizer that refines queries
Above Catalyst are two components, SQL and DataFrame/Dataset. Their interfaces differ: the SQL component takes pure SQL statements as input, while the DataFrame/Dataset component takes input generated through its API
Whether a query comes in as SQL or through the DataFrame/Dataset API, it is fed into the Catalyst optimizer, and the optimized plan is handed to Spark Core to run
The remaining layers in the figure make up the Spark SQL suite, together with some higher-level APIs such as machine learning
Fourth, SQL and DataFrame/Dataset
Spark provides two APIs for writing Spark SQL programs: SQL queries, or the DataFrame/Dataset API
Using SQL
If you are very familiar with SQL syntax, use SQL
Using DataFrame/Dataset
DSL (Domain Specific Language): the DataFrame/Dataset API is a DSL; the table, avg, and groupBy calls in the previous figure are examples (a sketch follows this list; search online for more on DSLs)
Use a general-purpose language (Scala, Python) to express your query needs
Catch errors faster with DataFrame: SQL strings are only checked at run time, while DataFrame operations are checked at compile time, for example whether a column exists and whether the column type is correct
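As a hedged sketch of that DSL style (the groupBy/avg calls mentioned above), assuming an illustrative employees.json file with dept and salary columns:

import org.apache.spark.sql.functions.avg

val employees = sqlContext.read.json("employees.json")
employees.groupBy("dept")
  .agg(avg("salary").as("avg_salary"))   // DSL calls expressed in Scala
  .show()

// The equivalent SQL string is only validated when the query is executed:
// sqlContext.sql("SELECT dept, avg(salary) FROM employees GROUP BY dept")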
Fifth, Spark SQL API evolution
Spark 1.3 introduced the DataFrame, but some limitations were found later, so the Dataset was introduced. Originally these were two separate APIs, but because DataFrame and Dataset are interoperable, in Spark 2.0 DataFrame became a special case of Dataset (a Dataset of Row)
1. RDD API (2011)
Distributed data collection consisting of JVM objects
Immutable and fault-tolerant
Can handle structured and unstructured data
Functional-style transformations
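A minimal sketch of this style (assuming an existing SparkContext sc; the Person class and data are illustrative): plain JVM objects plus functional transformations, with no schema attached.

case class Person(name: String, age: Int)

val rdd = sc.makeRDD(Seq(Person("A", 10), Person("B", 25), Person("C", 40)))
val adults = rdd.filter(_.age >= 18)       // functional transformation
  .map(p => (p.name, p.age))               // Spark only sees opaque objects
adults.collect().foreach(println)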
2. The limitations of the RDD API
No schema
The user has to optimize the program themselves
Reading data from different data sources is difficult
Merging data from multiple data sources is also difficult
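A hedged sketch of the "no schema / user-optimized" points above, using an illustrative people.csv file: the user parses text by hand, refers to fields by position, and chooses the operation order, because Spark has no schema to optimize against.

val lines = sc.textFile("people.csv")      // e.g. one "A,10" record per line
val people = lines.map { line =>
  val fields = line.split(",")
  (fields(0), fields(1).toInt)             // positions instead of column names
}
// Filtering/projection order is chosen by the programmer, not by an optimizer
val adultNames = people.filter(_._2 > 18).map(_._1)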
3. DataFrame API (2013)
A distributed collection of data made up of Row objects: a DataFrame holds many records, each record is a Row object, and the DataFrame also carries the schema, i.e. which columns it contains, what the column names are, and what data type each column has
Immutable and fault-tolerant
Working with structured data
Self-optimizing catalyst, which automatically optimizes the program
A more convenient data source API: compared with the RDD API, DataFrame provides a data source API that lets users easily read data from a variety of data sources
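A minimal sketch of that data source API (the file paths and the join column are illustrative): the same DataFrame operations work across formats, and the schema travels with the data.

val jsonDF    = sqlContext.read.json("people.json")
val parquetDF = sqlContext.read.parquet("people.parquet")

jsonDF.printSchema()                  // column names and types are known
jsonDF.join(parquetDF, "name")        // merging data from different sources
  .groupBy("name")
  .count()
  .show()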
4. Limitations of DataFrame API
Run-Time type checking
Cannot operate on domain objects directly
Loses the functional programming style of the RDD API
Example:
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 10000").show()
// Limitation: throws a runtime exception
// org.apache.spark.sql.AnalysisException: cannot resolve 'salary' given input columns age, name;

// Create RDD[Person]
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
// Create DataFrame from an RDD[Person]
val personDF = sqlContext.createDataFrame(personRDD)
// Limitation: we get back RDD[Row], not RDD[Person]; converting the DataFrame
// back to an RDD loses the domain type information
personDF.rdd
Note: for the differences between Spark RDD, DataFrame, and Dataset, search online
5. Dataset
Dataset extends the DataFrame API, providing a compile-time type-safe, object-oriented style API
import sqlContext.implicits._

case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds = dataframe.as[Person]
// Compute histogram of age by name
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)
    people.foreach(p => buckets((p.age / 10).toInt) += 1)
    (name, buckets)
}
Dataset API
Type safety: Works directly on domain objects
// Create RDD[Person]
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
// Create Dataset from the RDD
val personDS = sqlContext.createDataset(personRDD)
personDS.rdd  // RDD[Person], not RDD[Row] as with DataFrame
Efficient: code-generated encoders give more efficient serialization
Interoperability: Dataset and DataFrame can be converted to each other (see the conversion sketch at the end of this section)
Compile-Time type checking
case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds = dataframe.as[Person]
ds.filter(_.salary > 12500)
// Compile-time error: value salary is not a member of Person
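A hedged sketch of the DataFrame/Dataset interoperability mentioned above, reusing the Person case class and sqlContext from the earlier examples:

import sqlContext.implicits._          // provides the encoders needed by as[...]

val df = sqlContext.read.json("people.json")    // DataFrame (rows + schema)
val ds = df.as[Person]                          // DataFrame -> Dataset[Person]
val backToDF = ds.toDF()                        // Dataset[Person] -> DataFrame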