Spark SQL and DataFrame
1. Why use Spark SQL
Originally we used Hive, which converts Hive SQL into MapReduce jobs and submits them to the cluster for execution. This greatly simplifies writing MapReduce programs, but the MapReduce execution model is slow. Spark SQL was created to address this: it converts the SQL into RDD operations and submits them to the cluster, so execution is much more efficient.
Advantages of Spark SQL: 1. easy to integrate; 2. unified data access; 3. compatible with Hive; 4. standard data connectivity.
2. DataFrames
What is a DataFrame?
Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: in addition to the data itself, it also records the structural information of the data, the schema. Like Hive, DataFrame supports nested data types (struct, array, and map). In terms of API usability, the DataFrame API offers high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Similar to the DataFrames of R and Pandas, Spark DataFrame inherits the development experience of traditional single-machine data analysis.
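To make the idea of a schema with nested types concrete, here is a minimal sketch, assuming it runs in the spark-shell where sc and sqlContext already exist; the Employee and Address case classes are made up for illustration:

import sqlContext.implicits._

// A nested struct (Address) and an array column (phones) inside one schema
case class Address(city: String, zip: String)
case class Employee(id: Int, name: String, address: Address, phones: Array[String])

val employeeDF = sc.parallelize(Seq(
  Employee(1, "zhangsan", Address("beijing", "100000"), Array("123", "456"))
)).toDF()

// printSchema shows the nested struct and array fields alongside the flat columns
employeeDF.printSchema()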
Creating DataFrames
In Spark SQL, SQLContext is the entry point for creating DataFrames and executing SQL. In Spark 1.5.2, a sqlContext is already built into the spark-shell.
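Outside the shell (for example in a standalone program), the entry point has to be created from a SparkContext yourself; a minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setMaster("local").setAppName("sql-entry")
val sc = new SparkContext(conf)
// SQLContext is the entry point for DataFrames and SQL in Spark 1.x
val sqlContext = new SQLContext(sc)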
1. Create a file locally with three columns (id, name, age) separated by spaces, and upload it to HDFS:
hdfs dfs -put person.txt /
2. In the spark-shell, execute the following command to read the data and split each row on the column delimiter:
val lineRDD = sc.textFile("hdfs://hadoop01:9000/person.txt").map(_.split(" "))
3. Define a case class (equivalent to the table schema):
case class Person(id: Int, name: String, age: Int)
4. Associate the RDD with the case class:
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
5. Convert the RDD to a DataFrame:
val personDF = personRDD.toDF
6. Work with the DataFrame:
personDF.show
Code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlTest {

  // Case class that defines the schema, equivalent to the table structure
  case class Person(id: Int, name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("SQL-1")
    val sc = new SparkContext(conf)
    fun1(sc)
  }

  def fun1(sc: SparkContext): Unit = {
    val sqlContext = new SQLContext(sc)
    // The path is normally an HDFS file path; a local path is used here for testing
    val lineRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
    val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
    import sqlContext.implicits._
    val personDF = personRDD.toDF
    // Register the DataFrame as a temporary table
    personDF.registerTempTable("person_df")
    // Pass in SQL
    val df = sqlContext.sql("select * from person_df order by age desc")
    // Store the result as JSON at the specified location
    df.write.json("D:\\data\\personout")
    sc.stop()
  }
}
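The JSON written above can be read back into a DataFrame with sqlContext.read.json; a short sketch, reusing the output path from the code above:

// Read the JSON result back into a DataFrame and inspect it
val personJsonDF = sqlContext.read.json("D:\\data\\personout")
personJsonDF.printSchema()
personJsonDF.show()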
Common DataFrame operations
DSL-style syntax (as I understand it, this means short and concise)
// View selected columns of the DataFrame
personDF.select(personDF.col("name"), personDF.col("age"), personDF.col("id")).show()
// Print the DataFrame schema information
personDF.printSchema()
// Query all name and age columns, adding 2 to age
personDF.select(personDF("id"), personDF("name"), personDF("age") + 2).show()
// Query everyone whose age is greater than 20
personDF.filter(personDF("age") > 20).show()
// Group by age and count the number of people of each age
personDF.groupBy("age").count().show()
SQL-style syntax (prerequisite: the DataFrame must be registered as a table)
// Register the DataFrame as a table
personDF.registerTempTable("person_df")
// Query the two oldest people and receive the result with a val
val persons = sqlContext.sql("select * from person_df order by age desc limit 2")
persons.foreach(x => println(x(0) + " " + x(1) + " " + x(2)))
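Note that foreach here runs on the executors, so on a real cluster the printed output does not appear in the driver console. A common alternative is to collect the (small) result back to the driver first; a minimal sketch:

// Bring the two rows back to the driver and print them there
val topTwo = sqlContext.sql("select * from person_df order by age desc limit 2").collect()
topTwo.foreach(row => println(row.getInt(0) + " " + row.getString(1) + " " + row.getInt(2)))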
Specifying the schema directly via StructType
/* Specify the schema directly via StructType */
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

def fun2(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  val personRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
  // Specify the schema for each field directly through StructType
  val schema = StructType(List(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)))
  // Map the RDD[Array[String]] to an RDD[Row]
  val rowRDD = personRDD.map(x => Row(x(0).toInt, x(1).trim, x(2).toInt))
  // Apply the schema information to the RowRDD to get a DataFrame
  val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
  // Register as a table
  dataFrame.registerTempTable("person_struct")

  sqlContext.sql("select * from person_struct").show()

  sc.stop()
}
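StructType is most useful when the columns are not known at compile time, because the schema can be built programmatically. A small sketch under that assumption; the header string here is made up for illustration:

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Build a schema dynamically from a comma-separated header line
val header = "id,name,age"
val dynamicSchema = StructType(header.split(",").toList.map(fieldName => StructField(fieldName, StringType, nullable = true)))
dynamicSchema.printTreeString()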
Connecting to a data source
/* Connect to a MySQL data source */
def fun3(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  val jdbcDF = sqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:mysql://192.168.180.100:3306/bigdata",
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "person",
    "user" -> "root",
    "password" -> "123456")).load()
  jdbcDF.show()
  sc.stop()
}
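Note that the MySQL connector jar has to be on the classpath of both the driver and the executors (for example by passing it with --jars when launching spark-shell or spark-submit). Once loaded, the JDBC-backed DataFrame can be used like any other; the temporary table name below is just an example:

// Register the JDBC DataFrame and query it with SQL
jdbcDF.registerTempTable("person_mysql")
sqlContext.sql("select name, age from person_mysql where age > 20").show()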
And writing data back into the database:
// Write to the database
import java.util.Properties

val personTextRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
  .map(x => Row(x(0).toInt, x(1), x(2).toInt))

val schema = StructType(List(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType)))

val personDataFrame = sqlContext.createDataFrame(personTextRDD, schema)

val prop = new Properties()
prop.put("user", "root")
prop.put("password", "123456")
// Write the DataFrame to the MySQL table in append mode
personDataFrame.write.mode("append").jdbc("jdbc:mysql://192.168.180.100:3306/bigdata", "bigdata.person", prop)

sc.stop()
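The mode("append") call controls what happens when the target table already exists; other accepted values are "overwrite", "ignore" and "error" (the default). To verify the write, the table can be read back through the same connection, reusing the prop object defined above:

// Read the table back over JDBC to check the inserted rows
val writtenDF = sqlContext.read.jdbc("jdbc:mysql://192.168.180.100:3306/bigdata", "bigdata.person", prop)
writtenDF.show()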