Welcome to the big data and AI technical articles published by the public account Qing Research Academy, where you can read notes carefully organized by Night White (the author's pen name). Let's make a little progress every day, so that excellence becomes a habit!
One, Spark SQL: similar to Hive, it is a data analysis engine
What is Spark SQL?
Spark SQL processes only structured data.
Under the hood it relies on RDDs: SQL statements are converted into RDD operations that run on the different workers.
Characteristics:
1. Easy integration: SQL statements can be mixed directly with Spark programs
2. Unified access to different data sources: the DataFrame abstraction hides the differences between data sources
3. Compatible with Hive
Outline:
Core concept: DataFrame (acts as a table): it is Spark SQL's abstraction over a collection of structured data
Underlying representation: RDD
Table = table structure + data
DataFrame = schema + RDD
Dataset (a newer API, also table-like)
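To make "DataFrame = schema + RDD" concrete, here is a minimal sketch, assuming a spark-shell session where spark and its implicits are already in scope (the Person case class and the sample rows are made up for illustration):
case class Person(name: String, age: Int)
val ds = Seq(Person("Tom", 20), Person("Jerry", 25)).toDS   // Dataset[Person]: a typed, table-like collection
val df = ds.toDF                                            // DataFrame = Dataset[Row]: a schema plus distributed rows
df.printSchema                                              // prints the inferred schema: name (string), age (int)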
How do you create a DataFrame?
1. Method one: create a DataFrame from a case class
Define the table structure:
case class Emp(empno: Int, ename: String, job: String, mgr: String, hiredate: String, sal: Int, comm: String, deptno: Int)
Read the emp.csv file and split on the delimiter:
val lines = sc.textFile("/root/temp/emp.csv").map(_.split(","))
lines.collect
Associate the table structure with the data:
val allEmp = lines.map(x => Emp(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toInt, x(6), x(7).toInt))
Create the DataFrame:
val empDF = allEmp.toDF
Operate on the DataFrame:
empDF.show: display the DataFrame
empDF.printSchema: print the DataFrame's table structure (schema)
2. Method two: create a DataFrame with SparkSession.createDataFrame()
What is SparkSession?
Since Spark 2.0, SparkSession provides a unified entry point to Spark's individual modules.
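In spark-shell the session is already created as the variable spark; in a standalone application you build one yourself. A minimal sketch (the app name and master URL are placeholders):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSQLDemo").master("local[*]").getOrCreate()
import spark.implicits._   // enables toDF / toDS and the $"col" syntax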
Define the table structure using the StructType class:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val myschema = StructType(List(StructField("empno", IntegerType), StructField("ename", StringType),
  StructField("job", StringType), StructField("mgr", StringType), StructField("hiredate", StringType),
  StructField("sal", IntegerType), StructField("comm", StringType), StructField("deptno", IntegerType)))
Read the emp.csv file and split on the delimiter:
val lines = sc.textFile("/root/temp/emp.csv").map(_.split(","))
Associate the table structure with the data: map each line read from emp.csv to a Row (a Row by itself carries no table structure)
import org.apache.spark.sql._
val rowRDD = lines.map(x => Row(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toInt, x(6), x(7).toInt))
Create the DataFrame with SparkSession.createDataFrame():
val df = spark.createDataFrame(rowRDD, myschema)
3. Method three: read a self-describing data file (e.g. a JSON file) directly as a DataFrame
val peopleDF = spark.read.json("/root/training/")
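A quick check of the result (a sketch; the path above should point at a JSON file or a directory of JSON files):
peopleDF.printSchema   // the schema is inferred from the JSON structure
peopleDF.show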
4. Operating on a DataFrame: DSL statements and SQL statements
DSL statements: empDF.show
empDF.printSchema
Query all employee information: df.show
Query the names of all employees: df.select("ename").show
or df.select($"ename").show
Query employee information: name, salary, salary + 100
df.select($"ename", $"sal", $"sal" + 100).show
Query employees whose salary is greater than 2000:
df.filter($"sal" > 2000).show
Group by department and count:
df.groupBy("deptno").count.show
SQL statements: the DataFrame must first be registered as a temporary view
df.createOrReplaceTempView("emp")
spark.sql("select * from emp").show
spark.sql("select * from emp where deptno=10").show
5. Temporary views: two kinds
1. Valid only in the current session (local temporary view): df.createOrReplaceTempView("emp")
2. Valid across all sessions (global temporary view): df.createGlobalTempView("empg")
Example: in the current session
spark.sql("select * from emp").show
spark.sql("select * from global_temp.empg").show
Example: in a new session
spark.newSession.sql("select * from emp").show   // fails: a local temporary view is not visible in a new session
spark.newSession.sql("select * from global_temp.empg").show   // works: a global temporary view is visible across sessions
Two, Using data sources:
1. The load function loads a data source and the save function saves to a data source
The default data source (format) for the load function is the Parquet file;
the json function reads JSON files. (A sketch of naming the format explicitly follows the example below.)
val usersDF = spark.read.load("/root/training/spark-2.1.0-bin-hadoop2.7/examples/")
usersDF.select("name", "favorite_color").show
usersDF.select("name", "favorite_color").write.save("/root/temp/result")
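As noted above, the default format is Parquet; here is a short sketch of naming the format and save mode explicitly (the paths are placeholders):
val usersDF2 = spark.read.format("parquet").load("/root/temp/users.parquet")
usersDF2.write.mode("overwrite").format("json").save("/root/temp/result_json")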
2. Parquet files: the default data source of Spark SQL's load function; data is stored by column
How do you convert other file formats to Parquet?
Example: JSON file ----> Parquet file
val empJSON = spark.read.json("/root/temp/emp.json")   // read a self-describing data file directly as a DataFrame
empJSON.write.parquet("/root/temp/empparquet")   // the /empparquet directory must not already exist
or empJSON.write.mode("overwrite").parquet("/root/temp/result")   // the /result directory may already exist
Feature: supports schema merging (see the check after the example below)
First file: val df1 = sc.makeRDD(1 to 5).map(i => (i, i*2)).toDF("single", "double")
df1.write.parquet("/root/temp/test_table/key=1")
Second file: val df2 = sc.makeRDD(6 to 10).map(i => (i, i*3)).toDF("single", "triple")
df2.write.parquet("/root/temp/test_table/key=2")
Merge the two files: val df3 = spark.read.option("mergeSchema", "true").parquet("/root/temp/test_table")
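For reference, the merged result should contain the union of the two schemas plus the partition column taken from the key=1 / key=2 directory names (a quick check, assuming the two writes above succeeded):
df3.printSchema   // expect columns: single, double, triple, key; double/triple are null for rows from the other file
df3.show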
3. JSON files:
spark.read.json("/root/training/spark-2.1.0-bin-hadoop2.7/examples/src/main/resources/people.json")
spark.read.format("json").load("/root/training/spark-2.1.0-bin-hadoop2.7/examples/src/main/resources/people.json")
4. RDBMS: the RDBMS JDBC driver must be added to spark-shell (e.g. start spark-shell with the --jars option pointing at the driver jar)
spark.read.format("jdbc").option("url", "jdbc:oracle:thin:@192.168.182.11:1521/orcl.example.com").option("dbtable", "scott.emp").option("user", "scott").option("password", "tiger").load
Or use the Properties class
import java.util.Properties
val prop = new Properties()
prop.setProperty("user", "scott")
prop.setProperty("password", "tiger")
val oracleDF1 = spark.read.jdbc("jdbc:oracle:thin:@192.168.182.11:1521/orcl", "scott.emp", prop)   // jdbc(url, table, connection properties)
Li Jinze (Allenli), master's student at Tsinghua University. Research direction: big data and artificial intelligence.