Spark (IX) -- Spark SQL API Programming


The Spark version used in this article is 1.3.1.

Text file test:

A simple Person.txt file contains:

JChubby,13
Looky,14
LL,15

Each line holds a name and an age.

Create a new object in IDEA with the following skeleton code:

object TextFile {
  def main(args: Array[String]) {
  }
}

Spark SQL programming model:

Step 1:
Build an SQLContext object, which is the entry point for Spark SQL operations; constructing an SQLContext requires a SparkContext.
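A minimal sketch of this step (the application name is a hypothetical placeholder; the full program below simply uses the no-argument SparkContext constructor and lets spark-submit supply the configuration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("SparkSQLDemo")  // hypothetical app name
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)  // the entry point for Spark SQL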

Step 2:
After building the entry object, import its implicit conversions so that the data read in can be converted to a DataFrame, the unified data type that Spark SQL operates on.
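On the sqlContext built above, the import is a single line (a sketch):

// brings in toDF() on RDDs of case classes, plus the $"col" and 'col column syntax
import sqlContext.implicits._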

Step 3:
Define a case class that matches the format of the data. It gives the rows a schema, so that the raw input can be converted into a uniform, strongly typed format that is convenient to program against.
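For the Person.txt file above, the case class looks like this; it is usually defined outside the method that later calls toDF, so that Scala reflection can derive the schema:

case class Person(name: String, age: Int)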

Step 4:
Use the SQLContext object to read the file and convert it to a DataFrame.
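A sketch of this step for the text file, assuming the Person case class and the implicits import from the previous steps (the path is a placeholder):

// read the raw lines, split each line on the comma, build Person objects, then convert to a DataFrame
val people = sc.textFile("path/to/Person.txt")
  .map(_.split(","))
  .map { case Array(name, age) => Person(name, age.trim.toInt) }
  .toDF()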

Step 5:
Operate on the data. There are three ways to do it (contrasted in the sketch after this list):

1. DataFrame's built-in operators. DataFrame provides many methods for manipulating data, such as where and select.

2. DSL style. The DSL ultimately calls the same DataFrame methods, but makes it easy to refer to a column by prefixing its name with a single quote, e.g. 'name.

3. Register the data as a table and operate on it with SQL statements.
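A quick side-by-side of the three styles, sketched against the people DataFrame built in step 4 (the temporary table name "people" is simply the one used in the full program below):

// 1. DataFrame operators
people.where(people("age") > 10).select(people("name")).show()

// 2. DSL style: columns referenced as Scala symbols
people.where('age > 10).select('name).show()

// 3. SQL on a registered temporary table
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 10").show()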

Putting it all together, the complete program:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Step 3: the case class is defined outside main so that toDF can derive the schema via reflection
case class Person(name: String, age: Int)

object TextFile {
  def main(args: Array[String]) {
    // Step 1
    // Build the SparkContext object; use new to call the constructor,
    // otherwise you would be calling a companion object's apply method
    val sc = new SparkContext()
    // Build the SQLContext object
    val sqlContext = new SQLContext(sc)

    // Step 2
    import sqlContext.implicits._

    // Step 4: textFile reads the file from the given path (an HDFS address in cluster mode).
    // Two map operations turn each line into a Person object; toDF converts the result to a DataFrame
    val people = sc.textFile("file path")
      .map(_.split(","))
      .map { case Array(name, age) => Person(name, age.trim.toInt) }
      .toDF()

    // Step 5
    // DataFrame methods
    println("------------------------DataFrame------------------------------------")
    // Select the records with age > 10, keep only the name column, and output them with show
    people.where(people("age") > 10).select(people("name")).show()

    // DSL
    println("---------------------------DSL---------------------------------")
    people.where('age > 10).select('name).show()

    // SQL
    println("-----------------------------SQL-------------------------------")
    // Register people as a temporary table named "people"
    people.registerTempTable("people")
    // Use SQLContext's sql method to run the SQL statement,
    // then collect the result and print each row
    sqlContext.sql("SELECT name FROM people WHERE age > 10").collect().foreach(println)

    // Save as a Parquet file; it will be used in the Parquet demo below
    people.saveAsParquetFile("save path")
  }
}

Parquet format file test:

val sc = new SparkContext()
val sql = new SQLContext(sc)
import sql.implicits._

// parquetFile reads a Parquet file; here the path is passed in as the first program argument
val parquet = sql.parquetFile(args(0))

println("------------------------DataFrame------------------------------------")
parquet.where(parquet("age") > 10).select(parquet("name")).show()

println("---------------------------DSL---------------------------------")
parquet.where('age > 10).select('name).show()

println("-----------------------------SQL-------------------------------")
parquet.registerTempTable("parquet")
sql.sql("SELECT name FROM parquet WHERE age > 10")
  .map(p => "name: " + p(0))
  .collect()
  .foreach(println)

JSON format test:

val sc = new SparkContext()
val sql = new SQLContext(sc)
import sql.implicits._

// jsonFile reads a JSON file; here the path is passed in as the first program argument
val json = sql.jsonFile(args(0))

println("------------------------DataFrame------------------------------------")
json.where(json("age") > 10).select(json("name")).show()

println("---------------------------DSL---------------------------------")
json.where('age > 10).select('name).show()

println("-----------------------------SQL-------------------------------")
json.registerTempTable("json")
sql.sql("SELECT name FROM json WHERE age > 10")
  .map(p => "name: " + p(0))
  .collect()
  .foreach(println)

As you can see, this code is almost identical to the text file version: apart from using the SQLContext's parquetFile/jsonFile methods to read the file, the subsequent operations are exactly the same.
Because Parquet and JSON data already carry a schema, they are automatically converted to a DataFrame when read, so the case class definition and the toDF step can be omitted.
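For reference, jsonFile expects each line of the input to be a complete JSON object; a hypothetical input matching the earlier Person data might look like:

{"name":"JChubby","age":13}
{"name":"Looky","age":14}
{"name":"LL","age":15}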

The above is a simple tour of the Spark SQL API.
