Spark SQL Study Notes (includes local code written in IDEA)


Spark SQL and DataFrame

1. Why use Spark SQL

Originally we used Hive, which converts Hive SQL into MapReduce jobs and submits them to the cluster for execution. This greatly simplifies writing MapReduce programs, but the MapReduce execution model is slow. Spark SQL came into being to address this: it converts Spark SQL into RDD operations and submits them to the cluster, which executes much more efficiently.

Advantages of Spark SQL: 1. easy to integrate; 2. unified data access; 3. compatible with Hive; 4. standard data connectivity.

2. DataFrames

What is a DataFrame?

Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: in addition to the data, it also records the data's structural information, the schema. Like Hive, DataFrame supports nested data types (struct, array, and map). In terms of API usability, the DataFrame API provides high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Similar to the DataFrames of R and Pandas, Spark DataFrame inherits the development experience of traditional single-machine data analysis.
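As a minimal sketch of the "schema" point, the hypothetical case classes below show how a DataFrame records nested structure (struct and array) that a plain RDD does not; it assumes a SparkContext sc and SQLContext sqlContext are already available, as set up in the examples later in these notes.

// hypothetical nested case classes, only to illustrate that a DataFrame carries a schema
case class Address(city: String, zip: String)
case class Contact(name: String, tags: Array[String], addr: Address)

val contactRDD = sc.parallelize(Seq(Contact("tom", Array("vip"), Address("beijing", "100000"))))
import sqlContext.implicits._
val contactDF = contactRDD.toDF()
// printSchema shows roughly: name: string, tags: array<string>, addr: struct<city:string, zip:string>
contactDF.printSchema()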

Creating DataFrames

In Spark SQL, SQLContext is the entry point for creating DataFrames and executing SQL. In Spark 1.5.2, a sqlContext is already built into the spark shell.

1. Create a file locally with three columns, id, name, and age, separated by spaces, and upload it to HDFS.
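For example (the names and values here are made up), person.txt might look like this:

1 zhangsan 19
2 lisi 20
3 wangwu 28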

hdfs dfs -put person.txt /

2. In the spark shell, run the following command to read the data and split each row on the column delimiter.

val lineRDD = sc.textFile("hdfs://hadoop01:9000/person.txt").map(_.split(" "))

3. Define a case class (equivalent to the table's schema).

case class Person(id: Int, name: String, age: Int)

4. Associate the RDD with the case class.

val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

5. Convert the RDD to a DataFrame.

val personDF = personRDD.toDF

6. Work with the DataFrame.

personDF.show
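With the made-up sample data above, show would print something like:

+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|zhangsan| 19|
|  2|    lisi| 20|
|  3|  wangwu| 28|
+---+--------+---+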

Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlTest {

  // define the case class, equivalent to the table's schema
  case class Person(id: Int, name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("SQL-1")
    val sc = new SparkContext(conf)
    fun1(sc)
  }

  def fun1(sc: SparkContext): Unit = {
    val sqlContext = new SQLContext(sc)
    // the path is normally an HDFS file path; a local path is used here
    val lineRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
    val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

    import sqlContext.implicits._
    val personDF = personRDD.toDF()

    // register as a temporary table
    personDF.registerTempTable("person_df")
    // run the SQL
    val df = sqlContext.sql("select * from person_df order by age desc")
    // store the result as JSON at the specified location
    df.write.json("D:\\data\\personout")

    sc.stop()
  }
}

Common DataFrame operations

DSL-style syntax (my understanding: short and concise)

// view the contents of selected DataFrame columns
df.select(df.col("name"), df.col("age"), df.col("id")).show()
// print the DataFrame's schema information
df.printSchema()
// query all id, name and age values, adding 2 to age
df.select(df("id"), df("name"), df("age") + 2).show()
// query all rows whose age is greater than 20
df.filter(df("age") > 20).show()
// group by age and count the number of people with the same age
df.groupBy("age").count().show()

SQL-style syntax (prerequisite: the DataFrame must first be registered as a table)

// register as a table
personDF.registerTempTable("person_df")
// query the two oldest people and receive the result as objects
val persons = sqlContext.sql("select * from person_df order by age desc limit 2")
persons.foreach(x => println(x(0), x(1), x(2)))
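As a side note, Row fields can also be read by name with typed getters instead of by position; a small sketch:

// equivalent to x(0), x(1), x(2), but reading each field by name
persons.foreach(x => println(x.getAs[Int]("id"), x.getAs[String]("name"), x.getAs[Int]("age")))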

 

Assigning the schema directly via StructType

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

/* assign the schema directly via StructType */
def fun2(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  val personRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
  // specify the schema for each field directly through StructType
  val schema = StructType(List(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)))
  // map the RDD to an RDD of Row
  val rowRDD = personRDD.map(x => Row(x(0).toInt, x(1).trim, x(2).toInt))
  // apply the schema information to the Row RDD
  val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
  // register as a temporary table
  dataFrame.registerTempTable("person_struct")

  sqlContext.sql("select * from person_struct").show()

  sc.stop()
}

Connecting to a data source

/* connect to a MySQL data source */
def fun3(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  val jdbcDF = sqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:mysql://192.168.180.100:3306/bigdata",
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "person",
    "user" -> "root",
    "password" -> "123456")).load()
  jdbcDF.show()
  sc.stop()
}
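Note that reading from MySQL requires the MySQL JDBC driver on the classpath. When running locally from IDEA, one way is to add a dependency like the following (the version number here is only an example), e.g. in sbt:

libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38"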

And writing data back into the database:

import java.util.Properties

// read the source data and map it to an RDD of Row
val personTextRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
  .map(x => Row(x(0).toInt, x(1), x(2).toInt))

val schema = StructType(List(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType)))

val personDataFrame = sqlContext.createDataFrame(personTextRDD, schema)

val prop = new Properties()
prop.put("user", "root")
prop.put("password", "123456")
// write to the database
personDataFrame.write.mode("append").jdbc("jdbc:mysql://192.168.180.100:3306/bigdata", "bigdata.person", prop)

sc.stop()
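When submitting this to a cluster instead of running it locally, the JDBC driver jar can be passed along with spark-submit; a sketch with made-up paths and jar names:

spark-submit --class SparkSqlTest \
  --master spark://hadoop01:7077 \
  --jars /path/to/mysql-connector-java-5.1.38.jar \
  --driver-class-path /path/to/mysql-connector-java-5.1.38.jar \
  sparksql-test.jar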
