Spark SQL Study Notes (includes local code written in IDEA)


Spark SQL and DataFrame

1. Why use Spark SQL

Originally we used Hive, which converts Hive SQL into MapReduce jobs and submits them to the cluster for execution. This greatly simplifies writing MapReduce programs, but the MapReduce execution model is slow. Spark SQL came into being to address this: it converts Spark SQL into RDD operations and submits them to the cluster, which executes much more efficiently.

Advantages of Spark SQL: 1. easy to integrate; 2. unified data access; 3. compatible with Hive; 4. standard data connectivity.

2. DataFrames

What is a DataFrame?

Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: in addition to the data, it also records the data's structural information, the schema. Like Hive, DataFrame supports nested data types (struct, array, and map). In terms of API usability, the DataFrame API provides high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Similar to the DataFrames of R and Pandas, Spark DataFrame inherits the development experience of traditional single-machine data analysis.
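As a minimal sketch of the "schema" point, the hypothetical case classes below show how a DataFrame records nested structure (struct and array) that a plain RDD does not; it assumes a SparkContext sc and SQLContext sqlContext are already available, as set up in the examples later in these notes.

// hypothetical nested case classes, only to illustrate that a DataFrame carries a schema
case class Address(city: String, zip: String)
case class Contact(name: String, tags: Array[String], addr: Address)

val contactRDD = sc.parallelize(Seq(Contact("tom", Array("vip"), Address("beijing", "100000"))))
import sqlContext.implicits._
val contactDF = contactRDD.toDF()
// printSchema shows roughly: name: string, tags: array<string>, addr: struct<city:string, zip:string>
contactDF.printSchema()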

Creating DataFrames

In Spark SQL, SQLContext is the entry point for creating DataFrames and executing SQL. In Spark 1.5.2, a sqlContext is already built into the spark shell.

1. Create a file locally with three columns, id, name, and age, separated by spaces, and upload it to HDFS.
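For example (the names and values here are made up), person.txt might look like this:

1 zhangsan 19
2 lisi 20
3 wangwu 28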

hdfs dfs -put person.txt /

2. In the spark shell, run the following command to read the data and split each row on the column delimiter.

val lineRDD = sc.textFile("hdfs://hadoop01:9000/person.txt").map(_.split(" "))

3. Define a case class (equivalent to the table's schema).

case class Person(id: Int, name: String, age: Int)

4. Associate the RDD with the case class.

val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

5. Convert the RDD to a DataFrame.

val personDF = personRDD.toDF

6. Work with the DataFrame.

personDF.show
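With the made-up sample data above, show would print something like:

+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|zhangsan| 19|
|  2|    lisi| 20|
|  3|  wangwu| 28|
+---+--------+---+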

Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlTest {

  // define the case class, equivalent to the table's schema
  case class Person(id: Int, name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("SQL-1")
    val sc = new SparkContext(conf)
    fun1(sc)
  }

  def fun1(sc: SparkContext): Unit = {
    val sqlContext = new SQLContext(sc)
    // the path is normally an HDFS file path; a local path is used here
    val lineRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
    val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

    import sqlContext.implicits._
    val personDF = personRDD.toDF()

    // register as a temporary table
    personDF.registerTempTable("person_df")
    // run the SQL
    val df = sqlContext.sql("select * from person_df order by age desc")
    // store the result as JSON at the specified location
    df.write.json("D:\\data\\personout")

    sc.stop()
  }
}

Common DataFrame operations

DSL-style syntax (my understanding: short and concise)

// view the contents of selected DataFrame columns
df.select(df.col("name"), df.col("age"), df.col("id")).show()
// print the DataFrame's schema information
df.printSchema()
// query all id, name and age values, adding 2 to age
df.select(df("id"), df("name"), df("age") + 2).show()
// query all rows whose age is greater than 20
df.filter(df("age") > 20).show()
// group by age and count the number of people with the same age
df.groupBy("age").count().show()

SQL-style syntax (prerequisite: the DataFrame must first be registered as a table)

// register as a table
personDF.registerTempTable("person_df")
// query the two oldest people and receive the result as objects
val persons = sqlContext.sql("select * from person_df order by age desc limit 2")
persons.foreach(x => println(x(0), x(1), x(2)))
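As a side note, Row fields can also be read by name with typed getters instead of by position; a small sketch:

// equivalent to x(0), x(1), x(2), but reading each field by name
persons.foreach(x => println(x.getAs[Int]("id"), x.getAs[String]("name"), x.getAs[Int]("age")))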

 

Assigning the schema directly via StructType

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

/* assign the schema directly via StructType */
def fun2(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  val personRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
  // specify the schema for each field directly through StructType
  val schema = StructType(List(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)))
  // map the RDD to an RDD of Row
  val rowRDD = personRDD.map(x => Row(x(0).toInt, x(1).trim, x(2).toInt))
  // apply the schema information to the Row RDD
  val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
  // register as a temporary table
  dataFrame.registerTempTable("person_struct")

  sqlContext.sql("select * from person_struct").show()

  sc.stop()
}

Connecting to a data source

/* connect to a MySQL data source */
def fun3(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  val jdbcDF = sqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:mysql://192.168.180.100:3306/bigdata",
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "person",
    "user" -> "root",
    "password" -> "123456")).load()
  jdbcDF.show()
  sc.stop()
}
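Note that reading from MySQL requires the MySQL JDBC driver on the classpath. When running locally from IDEA, one way is to add a dependency like the following (the version number here is only an example), e.g. in sbt:

libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38"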

And writing data back into the database:

import java.util.Properties

// read the source data and map it to an RDD of Row
val personTextRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
  .map(x => Row(x(0).toInt, x(1), x(2).toInt))

val schema = StructType(List(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType)))

val personDataFrame = sqlContext.createDataFrame(personTextRDD, schema)

val prop = new Properties()
prop.put("user", "root")
prop.put("password", "123456")
// write to the database
personDataFrame.write.mode("append").jdbc("jdbc:mysql://192.168.180.100:3306/bigdata", "bigdata.person", prop)

sc.stop()
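When submitting this to a cluster instead of running it locally, the JDBC driver jar can be passed along with spark-submit; a sketch with made-up paths and jar names:

spark-submit --class SparkSqlTest \
  --master spark://hadoop01:7077 \
  --jars /path/to/mysql-connector-java-5.1.38.jar \
  --driver-class-path /path/to/mysql-connector-java-5.1.38.jar \
  sparksql-test.jar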
