Developing Spark applications in Scala


(Original article: Dong's blog, http://www.dongxicheng.org)

The Spark kernel is written in Scala, so it is natural to develop Spark applications in Scala. If you are unfamiliar with Scala, you can work through the web tutorial "A Scala Tutorial for Java Programmers" or related Scala books.

This article presents three Scala Spark programming examples, WordCount, TopK, and SparkJoin, which represent three typical kinds of Spark applications.

1. WordCount Programming Example

WordCount is one of the simplest examples of a distributed application; its main function is to count the number of occurrences of each word in the input directory. The program is written in the following steps:

Step 1:

Create a SparkContext object. Its constructor takes four parameters: the Spark master location, the application name, the Spark installation directory, and the location of the application jar(s). For Spark on YARN, the first two parameters are the most important: the first is "yarn-standalone" and the second is a custom application name. For example:

val sc = new SparkContext(args(0), "WordCount",
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
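As a concrete sketch, the four arguments can be labeled as follows, assuming the job is submitted in yarn-standalone mode as described above (the environment variable names follow the example):

val sc = new SparkContext(
  "yarn-standalone",                     // Spark master location (here: Spark on YARN)
  "WordCount",                           // application name
  System.getenv("SPARK_HOME"),           // Spark installation directory
  Seq(System.getenv("SPARK_TEST_JAR")))  // jar(s) shipped with the application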

Step 2:

Read the input data. To read text data from HDFS, convert the input files to an RDD using the textFile function of SparkContext, which uses Hadoop's TextInputFormat to parse the input data. For example:

val textFile = sc.textFile(args(1))

Of course, Spark allows you to use any Hadoop InputFormat, such as the binary input format SequenceFileInputFormat. In that case, you can use the hadoopRDD function of SparkContext, for example:

val inputFormatClass = classOf[SequenceFileInputFormat[Text, Text]]
var hadoopRdd = sc.hadoopRDD(conf, inputFormatClass, classOf[Text], classOf[Text])

Or create a HadoopRDD object directly:

var hadoopRdd = new HadoopRDD(sc, conf,
  classOf[SequenceFileInputFormat[Text, Text]], classOf[Text], classOf[Text])
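Both SequenceFile variants above assume that the relevant Hadoop classes are imported and that a JobConf named conf has already been prepared. A minimal sketch of that setup, using the same names as above (taking the input path from args(1) is an assumption for illustration):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, SequenceFileInputFormat}

val conf = new JobConf()
FileInputFormat.setInputPaths(conf, args(1))  // directory containing the SequenceFiles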

Step 3:

Transform the RDD using RDD transformation operators. For WordCount, first parse the words out of each line of the input data, then put identical words into the same bucket, and finally count the frequency of each word in each bucket. For example:

val result = hadoopRdd.flatMap {
  case (key, value) => value.toString().split("\\s+")
}.map(word => (word, 1)).reduceByKey(_ + _)

Here, the flatMap function converts one record into multiple records (a one-to-many relationship), the map function converts one record into another record (a one-to-one relationship), and the reduceByKey function puts records with the same key into one bucket and performs the computation per key. For the precise meaning of these functions, refer to: Spark transformations.
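To make the data flow concrete, here is a small sketch of how a single (key, value) record passes through the three operators. It assumes the SparkContext sc from Step 1; the sample line and the use of sc.parallelize are illustrative assumptions, not part of the original program:

// one (offset, line) record, similar to what TextInputFormat produces
val sample = sc.parallelize(Seq((0L, "hello world hello")))
val counts = sample
  .flatMap { case (offset, line) => line.split("\\s+") } // one record -> many words
  .map(word => (word, 1))                                // each word -> (word, 1)
  .reduceByKey(_ + _)                                    // same word into one bucket, summed
counts.collect().foreach(println)                        // prints (hello,2) and (world,1), in either order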

Step 4:

Save the resulting RDD to HDFS. You can use the saveAsTextFile method on the RDD to save the dataset to an HDFS directory; by default it uses Hadoop's TextOutputFormat and writes each record in "(key, value)" form. You can also use the saveAsSequenceFile method to save the data in SequenceFile format, and so on. For example:

result.saveAsSequenceFile(args(2))
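If plain text output is preferred instead, the same RDD can be written with the saveAsTextFile method mentioned above (same output argument, different file format):

result.saveAsTextFile(args(2))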

Of course, when writing a Spark program, we also need to include the following two imports:

import org.apache.spark._
import SparkContext._

The complete WordCount program was presented in "Apache Spark Learning: Using Eclipse to Build a Spark Integrated Development Environment," so it is not repeated here.

Note that when specifying the input and output files, you need to use HDFS URIs, for example an input directory of hdfs://hadoop-test/tmp/input and an output directory of hdfs://hadoop-test/tmp/output. Here "hdfs://hadoop-test" is determined by the fs.default.name parameter in the Hadoop configuration file core-site.xml; replace it with your own configuration.
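For example, with the configuration above the paths could be passed in (or, for a quick test, hard-coded) as full HDFS URIs; the host name hadoop-test comes from the fs.default.name example and should be replaced with your own:

val textFile = sc.textFile("hdfs://hadoop-test/tmp/input")
// ... transformations as in Step 3 ...
result.saveAsSequenceFile("hdfs://hadoop-test/tmp/output")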

2. TopK Programming Example

The task of the TopK program is to compute word frequencies over a pile of text and return the K most frequent words. With MapReduce you would need to write two jobs, WordCount and TopK, whereas Spark needs only one job; the WordCount part has already been implemented above, and we now build on that implementation to find the top K words. Note that the implementation in this article is not optimal; there is plenty of room for improvement.

Step 1:

First, sort all the words by word frequency, as follows:

val sorted = result.map {
  case (key, value) => (value, key) // exchange key and value
}.sortByKey(true, 1)

Step 2: Return the top K:

val topK = sorted.top(args(3).toInt)

Step 3: Print the K words:

topK.foreach(println)

Note that whatever the application writes to standard output is saved by YARN into the container's stdout log. In YARN, each container has three log files: stdout, stderr, and syslog. The first two capture the program's standard output and standard error, and the third holds the log4j output; usually only the third contains log content.
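Putting the three steps together with the WordCount part from Section 1 (here using the textFile variant), a rough end-to-end TopK driver could look like the following sketch. It follows the argument layout args(0)..args(3) used in this article and is, as noted, not optimized; the object name TopK is only an assumption for illustration:

import org.apache.spark._
import SparkContext._

object TopK {
  def main(args: Array[String]) {
    // args(0): Spark master, args(1): input path, args(2): output path (unused here), args(3): K
    val sc = new SparkContext(args(0), "TopK",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))

    // WordCount part (Section 1)
    val result = sc.textFile(args(1))
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // TopK part (this section)
    val sorted = result.map {
      case (key, value) => (value, key) // exchange key and value
    }.sortByKey(true, 1)

    val topK = sorted.top(args(3).toInt)
    topK.foreach(println)
  }
}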

The complete code for this program, a compiled jar package, and a run script can be downloaded from here. After downloading it, follow "Apache Spark Learning: Building a Spark Integrated Development Environment Using Eclipse" to run it.

3. SparkJoin Programming Example

In the recommendation field there is a well-known open test set, MovieLens (download link: http://grouplens.org/datasets/movielens/). The test set contains three files, ratings.dat, users.dat, and movies.dat; for details see its README.txt. In the ml-1m dataset, each line of ratings.dat has the form UserID::MovieID::Rating::Timestamp and each line of movies.dat has the form MovieID::Title::Genres, which is why the code below splits on "::". The SparkJoin example in this section joins the ratings.dat and movies.dat files to obtain the list of movies whose average rating is above 4.0, using the ml-1m dataset. The program code is as follows:

import org.apache.spark._
import SparkContext._

object SparkJoin {
  def main(args: Array[String]) {
    if (args.length != 4) {
      println("Usage is org.test.WordCount")
      return
    }

    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))

    // Read rating from HDFS file
    val textFile = sc.textFile(args(1))

    // extract (movieId, rating)
    val rating = textFile.map(line => {
      val fileds = line.split("::")
      (fileds(1).toInt, fileds(2).toDouble)
    })

    // compute the average rating per movie
    val movieScores = rating
      .groupByKey()
      .map(data => {
        val avg = data._2.sum / data._2.size
        (data._1, avg)
      })

    // Read movie from HDFS file
    val movies = sc.textFile(args(2))
    val movieskey = movies.map(line => {
      val fileds = line.split("::")
      (fileds(0).toInt, fileds(1))
    }).keyBy(tup => tup._1)

    // by join, we get (movie, averageRating, movieName)
    val result = movieScores
      .keyBy(tup => tup._1)
      .join(movieskey)
      .filter(f => f._2._1._2 > 4.0)
      .map(f => (f._1, f._2._1._2, f._2._2._2))

    result.saveAsTextFile(args(3))
  }
}
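After keyBy and join, each element of the joined RDD has the nested shape (movieId, ((movieId, averageRating), (movieId, movieName))), which is what the f._2._1._2-style accessors above navigate. As a purely stylistic alternative with the same semantics, the last step could also be written with pattern matching, which some readers may find easier to follow:

val result = movieScores
  .keyBy(tup => tup._1)
  .join(movieskey)
  .map { case (movieId, ((_, avgRating), (_, movieName))) => (movieId, avgRating, movieName) }
  .filter { case (_, avgRating, _) => avgRating > 4.0 }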

You can download the code, the compiled jar package, and the run script from here.

Writing this program directly in Spark is somewhat cumbersome; you could instead implement it in HQL on Shark. Shark is an interactive query engine on top of Spark, similar to Hive; for details, see: Shark.

4. Summary

Spark programming does not demand much of the Scala language, just as Hadoop programming does not demand much of Java: as long as you know the most basic syntax you can write programs, and only a few common syntactic constructs and expressions are needed. Typically, one starts by imitating the official examples, which are provided in Scala, Java, and Python.
