A detailed explanation of Spark's data analysis engine: Spark SQL


Welcome to the big data and AI technical articles published by the public account Qing Research Academy, where you can read the carefully organized notes of the author (pen name: "night white"). Let us make a little progress every day, so that excellence becomes a habit!

One. Spark SQL: a data analysis engine, similar to Hive

What is Spark SQL?

Spark SQL processes structured data only.

Under the hood it relies on RDDs: SQL statements are translated into RDD operations, which run on the different worker nodes.

Characteristics:

1. Easy integration: queries can be written as SQL statements

2. Unified access to different data sources: the DataFrame API shields the differences between data sources

3. Compatible with Hive

Outline:

Core concept: DataFrame (used as a table): the Spark SQL abstraction over a collection of structured data

Underlying representation: RDD

Table = table structure + data

DataFrame = schema + RDD

Dataset (a newer API that, like DataFrame, is used as a table)
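As a quick illustration of "DataFrame = schema + RDD" and of the Dataset API, here is a minimal spark-shell sketch with made-up sample data (the Person class and its rows are not from the original notes; a standalone program would also need import spark.implicits._):

case class Person(name: String, age: Int)

val personDF = Seq(Person("Tom", 20), Person("Mary", 25)).toDF   // DataFrame = schema + data
personDF.printSchema                                              // the schema part
personDF.rdd                                                      // the underlying RDD[Row]

val personDS = personDF.as[Person]                                // Dataset[Person]: the same table, but strongly typed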

How do I create a DataFrame?

1. Method one: create a DataFrame from a case class

Define the table structure:

case class Emp(empno: Int, ename: String, job: String, mgr: String, hiredate: String, sal: Int, comm: String, deptno: Int)

Read the emp.csv file and split each line on the comma delimiter:

val lines = sc.textFile("/root/temp/emp.csv").map(_.split(","))

lines.collect

Associate the table structure with the data:

val allEmp = lines.map(x => Emp(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toInt, x(6), x(7).toInt))

Create the DataFrame:

val empDF = allEmp.toDF

Operate on the DataFrame:

empDF.show: display the contents of the DataFrame

empDF.printSchema: print the table structure (schema) of the DataFrame
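The lines above are meant for the spark shell, where sc and the toDF implicits are already available. As a minimal sketch of the same flow in a standalone program (the object name and appName are illustrative assumptions; a master must still be supplied through spark-submit):

import org.apache.spark.sql.SparkSession

object CreateDFByCaseClass {

  case class Emp(empno: Int, ename: String, job: String, mgr: String, hiredate: String, sal: Int, comm: String, deptno: Int)

  def main(args: Array[String]): Unit = {
    // the unified entry point since Spark 2.0
    val spark = SparkSession.builder().appName("CreateDFByCaseClass").getOrCreate()
    import spark.implicits._   // required for toDF outside the spark shell

    val lines = spark.sparkContext.textFile("/root/temp/emp.csv").map(_.split(","))
    val allEmp = lines.map(x => Emp(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toInt, x(6), x(7).toInt))
    val empDF = allEmp.toDF()

    empDF.show()
    empDF.printSchema()
    spark.stop()
  }
}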

2. Method two: create a DataFrame with SparkSession.createDataFrame()

What is SparkSession?

Since Spark 2.0, SparkSession is the unified entry point for accessing Spark's individual modules.

Define the table structure using the StructType class:

import org.apache.spark.sql._

import org.apache.spark.sql.types._

val myschema = StructType(List(StructField("empno", IntegerType), StructField("ename", StringType), StructField("job", StringType), StructField("mgr", StringType), StructField("hiredate", StringType), StructField("sal", IntegerType), StructField("comm", StringType), StructField("deptno", IntegerType)))

Read the emp.csv file and split each line on the comma delimiter:

val lines = sc.textFile("/root/temp/emp.csv").map(_.split(","))

Associate the table structure with the data: map each line read from emp.csv to a Row; at this point the data still has no table structure attached.

import org.apache.spark.sql._

val rowRDD = lines.map(x => Row(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toInt, x(6), x(7).toInt))

Create the table with SparkSession.createDataFrame():

val df = spark.createDataFrame(rowRDD, myschema)
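Because the schema built with StructType is an ordinary Scala value, it can also be constructed dynamically, for example from a string of column names when the structure is not known at compile time. A small sketch (all columns are read as strings here; the variable names are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schemaString = "empno ename job mgr hiredate sal comm deptno"
val dynSchema = StructType(schemaString.split(" ").map(name => StructField(name, StringType, nullable = true)))

val dynRowRDD = sc.textFile("/root/temp/emp.csv").map(_.split(",")).map(a => Row.fromSeq(a.toSeq))
val dynDF = spark.createDataFrame(dynRowRDD, dynSchema)

dynDF.printSchema   // eight string columns named after schemaString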

3. Method three: directly read a file that already has a format (for example a JSON file) as a DataFrame

val peopleDF = spark.read.json("/root/training/")

4. Operating on a DataFrame: DSL statements and SQL statements

DSL statements: empDF.show

empDF.printSchema

Show the information of all employees: df.show

Query the names of all employees: df.select("ename").show

or df.select($"ename").show

Query employee information: name, salary, and salary + 100

df.select($"ename", $"sal", $"sal" + 100).show

Query employees with a salary greater than 2000:

df.filter($"sal" > 2000).show

Group by department and count:

df.groupBy("deptno").count.show

SQL statements: the DataFrame must first be registered as a temporary view

df.createOrReplaceTempView("emp")

spark.sql("select * from emp").show

spark.sql("select * from emp where deptno=10").show

5. Temporary views: two kinds

1. Valid only in the current session: temporary view, df.createOrReplaceTempView("emp")

2. Valid across all sessions of the same application: global temporary view, df.createGlobalTempView("empg")

Example: in the current session

spark.sql("select * from emp").show

spark.sql("select * from global_temp.empg").show   // a global temporary view must be qualified with the global_temp database

Example: in a new session

spark.newSession.sql("select * from emp").show   // fails: an ordinary temporary view is not visible in a new session

spark.newSession.sql("select * from global_temp.empg").show   // works: the global temporary view is shared across sessions

Two. Working with data sources

1. The load function loads from a data source and the save function saves to a data source

The default data source for the load function is the Parquet file format

The json function reads JSON files

val usersDF = spark.read.load("/root/training/spark-2.1.0-bin-hadoop2.7/examples/")

usersDF.select("name", "favorite_color").show

usersDF.select("name", "favorite_color").write.save("/root/temp/result")

2. Parquet files: the data source loaded by default by the Spark SQL load function; files are stored by column

How do I convert other file formats to Parquet files?

Example: JSON file ----> Parquet file

val empJSON = spark.read.json("/root/temp/emp.json")   // directly read a file that already has a format as a DataFrame

empJSON.write.parquet("/root/temp/empparquet")   // the /empparquet directory must not exist beforehand

or empJSON.write.mode("overwrite").parquet("/root/temp/result")   // the /result directory may already exist

Feature: support for schema merging

First file: val df1 = sc.makeRDD(1 to 5).map(i => (i, i*2)).toDF("single", "double")

df1.write.parquet("/root/temp/test_table/key=1")

Second file: val df2 = sc.makeRDD(6 to 10).map(i => (i, i*3)).toDF("single", "triple")

df2.write.parquet("/root/temp/test_table/key=2")

Merge the two files: val df3 = spark.read.option("mergeSchema", "true").parquet("/root/temp/test_table")
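If the merge succeeds, df3 should contain the union of the columns from both files plus the partition column key taken from the directory names. A sketch of what to expect:

df3.printSchema   // columns: single, double, triple, key
df3.show          // rows written under key=1 have triple = null; rows written under key=2 have double = null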

3. JSON files:

spark.read.json("/root/training/spark-2.1.0-bin-hadoop2.7/examples/src/main/resources/people.json")

spark.read.format("json").load("/root/training/spark-2.1.0-bin-hadoop2.7/examples/src/main/resources/people.json")

4. RDBMS: the JDBC driver of the RDBMS must be added to the spark shell classpath

Spark.read.format ("jdbc"). Option ("url", "jdbc:oracle:thin:@192.168.182.11:1521/orcl.example.com"). Option (" DBTable "," scott.emp "). Option (" User "," Scott "). Option (" Password "," Tiger "). Load

Or use the Properties class:

import java.util.Properties

val prop = new Properties()

prop.setProperty("user", "scott")

prop.setProperty("password", "tiger")

val oracleDF1 = spark.read.jdbc("jdbc:oracle:thin:@192.168.182.11:1521/orcl", "scott.emp", prop)
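Once loaded, the JDBC DataFrame behaves like any other DataFrame. A small usage sketch following the lines above:

oracleDF1.printSchema   // schema is taken from the scott.emp table

oracleDF1.createOrReplaceTempView("emp_oracle")
spark.sql("select deptno, count(*) from emp_oracle group by deptno").show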

Li Jinze (Allen Li), master's student at Tsinghua University. Research direction: big data and artificial intelligence.

