The Spark version tested in this article is 1.3.1.
Text file test:
A simple Person.txt file contains:
JChubby,13
Looky,14
LL,15
Name and age, respectively.
Create a new object in IDEA with the following initial code:
object TextFile {
  def main(args: Array[String]) {
  }
}
Spark SQL programming model:
Step one:
Create a SQLContext object, which is the entry point for Spark SQL operations; building a SQLContext in turn requires a SparkContext.
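A minimal sketch of this step (the application name and master URL are assumptions used only for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Build a SparkContext first, then the Spark SQL entry point on top of it
val conf = new SparkConf().setAppName("SparkSQLDemo").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)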
Step two:
After building the entry object, import its implicit conversions. They convert the data read from various sources into DataFrames, the data type that Spark SQL manipulates uniformly.
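Assuming the SQLContext from step one is named sqlContext, the import looks like this:

// Brings in the implicit conversions that make .toDF() available on RDDs of case classes
import sqlContext.implicits._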
Step three:
Define a case class that matches the format of the data. It describes the structure of the rows being read, so that data of various raw types ends up in one unified format that is easy to program against.
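For the Person.txt file above, a matching case class could look like this (the field names are chosen here to mirror the file's two columns):

// One Person per line of Person.txt: a name and an age
case class Person(name: String, age: Int)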
Step four:
Use the SQLContext object to read the file and convert it to a DataFrame.
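A minimal sketch of this step, assuming the file path is passed in as args(0) and the Person case class from step three:

// Read the text file, split each line on commas, build Person objects, then convert to a DataFrame
val people = sc.textFile(args(0))
  .map(_.split(","))
  .map { case Array(name, age) => Person(name, age.toInt) }
  .toDF()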
Step five:
Operate on the data. There are three ways to do this, all shown in the code below:
1. DataFrame's built-in methods. DataFrame provides many methods for manipulating data, such as where and select.
2. DSL style. The DSL uses the same methods provided by DataFrame, but lets you refer to a column concisely by prefixing its name with a single quote, e.g. 'name.
3. Register the data as a table and manipulate it with SQL statements.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object TextFile {
  def main(args: Array[String]) {
    // Step one
    // Build the SparkContext object; use new to call the constructor,
    // otherwise it would be a call to a companion object's apply method
    val sc = new SparkContext()
    // Build the SQLContext object
    val sqlContext = new SQLContext(sc)
    // Step two
    import sqlContext.implicits._
    // Step three
    case class Person(name: String, age: Int)
    // Step four: textFile reads the file from the given path (an HDFS address in cluster mode).
    // Two map operations turn each line into a Person object, one object per row; toDF converts the result to a DataFrame
    val people = sc.textFile("file path")
      .map(_.split(","))
      .map { case Array(name, age) => Person(name, age.toInt) }
      .toDF()
    // Step five
    // DataFrame methods
    println("------------------------DataFrame------------------------------------")
    // Select the records with age > 10, keep only the name column, and output with show
    people.where(people("age") > 10).select(people("name")).show()
    // DSL
    println("---------------------------DSL---------------------------------")
    people.where('age > 10).select('name).show()
    // SQL
    println("-----------------------------SQL-------------------------------")
    // Register people as a table named "people"
    people.registerTempTable("people")
    // Use SQLContext's sql method to run a SQL statement,
    // then collect the result and print each row
    sqlContext.sql("select name from people where age > 10").collect().foreach(println)
    // Save as a Parquet file, which the Parquet demo below will use
    people.saveAsParquetFile("save path")
  }
}
Parquet file test:
val sc = new SparkContext()
val sql = new SQLContext(sc)
import sql.implicits._
// parquetFile reads a Parquet file (here, the one saved by the previous example) into a DataFrame
val parquet = sql.parquetFile(args(0))
println("------------------------DataFrame------------------------------------")
parquet.where(parquet("age") > 10).select(parquet("name")).show()
println("---------------------------DSL---------------------------------")
parquet.where('age > 10).select('name).show()
println("-----------------------------SQL-------------------------------")
parquet.registerTempTable("parquet")
sql.sql("select name from parquet where age > 10")
  .map(p => "name: " + p(0))
  .collect()
  .foreach(println)
JSON file test:
val sc = new SparkContext()
val sql = new SQLContext(sc)
import sql.implicits._
// jsonFile reads a JSON file into a DataFrame
val json = sql.jsonFile(args(0))
println("------------------------DataFrame------------------------------------")
json.where(json("age") > 10).select(json("name")).show()
println("---------------------------DSL---------------------------------")
json.where('age > 10).select('name).show()
println("-----------------------------SQL-------------------------------")
json.registerTempTable("json")
sql.sql("select name from json where age > 10")
  .map(p => "name: " + p(0))
  .collect()
  .foreach(println)
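For reference, jsonFile expects one complete JSON object per line. An input file matching the earlier Person data might look like this (a hypothetical people.json, shown only for illustration):

{"name": "JChubby", "age": 13}
{"name": "Looky", "age": 14}
{"name": "LL", "age": 15}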
As you can see, this code is almost identical to the text-file example: apart from the parquetFile/jsonFile methods called on the SQLContext to read the files, the operations are exactly the same.
Because Parquet and JSON data already carry a usable schema and are converted to DataFrames automatically when read, the case class definition and the toDF step are omitted.
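You can confirm the inferred schema with printSchema; for the JSON data above the output would look roughly like this (the exact types and formatting may differ):

json.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)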
That covers basic usage of the Spark SQL API.