Developing Spark applications in Scala


(Original article: Dong's blog, http://www.dongxicheng.org)

The Spark kernel is written in Scala, so it is natural to develop Spark applications in Scala. If you are unfamiliar with Scala, you can work through the web tutorial "A Scala Tutorial for Java Programmers" or related Scala books.

This article presents three Scala Spark programming examples, WordCount, TopK, and SparkJoin, which represent three typical kinds of Spark applications.

1. WordCount Programming Example

WordCount is one of the simplest examples of a distributed application; its main function is to count the number of occurrences of each word in the input directory. The program is written in the following steps:

Step 1:

Create a SparkContext object. Its constructor takes four parameters: the Spark master location, the application name, the Spark installation directory, and the location of the application jar(s). For Spark on YARN, the first two parameters are the most important: the first is "yarn-standalone" and the second is a custom application name. For example:

val sc = new SparkContext(args(0), "WordCount",
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
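As a concrete sketch, the four arguments can be labeled as follows, assuming the job is submitted in yarn-standalone mode as described above (the environment variable names follow the example):

val sc = new SparkContext(
  "yarn-standalone",                     // Spark master location (here: Spark on YARN)
  "WordCount",                           // application name
  System.getenv("SPARK_HOME"),           // Spark installation directory
  Seq(System.getenv("SPARK_TEST_JAR")))  // jar(s) shipped with the application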

Step 2:

Read the input data. To read text data from HDFS, convert the input files to an RDD using the textFile function of SparkContext, which uses Hadoop's TextInputFormat to parse the input data. For example:

val textFile = sc.textFile(args(1))

Of course, Spark allows you to use any Hadoop InputFormat, such as the binary input format SequenceFileInputFormat. In that case, you can use the hadoopRDD function of SparkContext, for example:

val inputFormatClass = classOf[SequenceFileInputFormat[Text, Text]]
var hadoopRdd = sc.hadoopRDD(conf, inputFormatClass, classOf[Text], classOf[Text])

Or create a HadoopRDD object directly:

var hadoopRdd = new HadoopRDD(sc, conf,
  classOf[SequenceFileInputFormat[Text, Text]], classOf[Text], classOf[Text])
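Both SequenceFile variants above assume that the relevant Hadoop classes are imported and that a JobConf named conf has already been prepared. A minimal sketch of that setup, using the same names as above (taking the input path from args(1) is an assumption for illustration):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, SequenceFileInputFormat}

val conf = new JobConf()
FileInputFormat.setInputPaths(conf, args(1))  // directory containing the SequenceFiles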

Step 3:

Transform the RDD using RDD transformation operators. For WordCount, first parse the words out of each line of the input data, then put identical words into the same bucket, and finally count the frequency of each word in each bucket. For example:

val result = hadoopRdd.flatMap {
  case (key, value) => value.toString().split("\\s+")
}.map(word => (word, 1)).reduceByKey(_ + _)

Here, the flatMap function converts one record into multiple records (a one-to-many relationship), the map function converts one record into another record (a one-to-one relationship), and the reduceByKey function puts records with the same key into one bucket and performs the computation per key. For the precise meaning of these functions, refer to: Spark transformations.
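To make the data flow concrete, here is a small sketch of how a single (key, value) record passes through the three operators. It assumes the SparkContext sc from Step 1; the sample line and the use of sc.parallelize are illustrative assumptions, not part of the original program:

// one (offset, line) record, similar to what TextInputFormat produces
val sample = sc.parallelize(Seq((0L, "hello world hello")))
val counts = sample
  .flatMap { case (offset, line) => line.split("\\s+") } // one record -> many words
  .map(word => (word, 1))                                // each word -> (word, 1)
  .reduceByKey(_ + _)                                    // same word into one bucket, summed
counts.collect().foreach(println)                        // prints (hello,2) and (world,1), in either order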

Step 4:

Save the resulting RDD to HDFS. You can use the saveAsTextFile method on the RDD to save the dataset to an HDFS directory; by default it uses Hadoop's TextOutputFormat and writes each record in "(key, value)" form. You can also use the saveAsSequenceFile method to save the data in SequenceFile format, and so on. For example:

result.saveAsSequenceFile(args(2))
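If plain text output is preferred instead, the same RDD can be written with the saveAsTextFile method mentioned above (same output argument, different file format):

result.saveAsTextFile(args(2))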

Of course, when writing a Spark program, we also need to include the following two imports:

import org.apache.spark._
import SparkContext._

The complete WordCount program was presented in "Apache Spark Learning: Using Eclipse to Build a Spark Integrated Development Environment," so it is not repeated here.

Note that when specifying the input and output files, you need to use HDFS URIs, for example an input directory of hdfs://hadoop-test/tmp/input and an output directory of hdfs://hadoop-test/tmp/output. Here "hdfs://hadoop-test" is determined by the fs.default.name parameter in the Hadoop configuration file core-site.xml; replace it with your own configuration.
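For example, with the configuration above the paths could be passed in (or, for a quick test, hard-coded) as full HDFS URIs; the host name hadoop-test comes from the fs.default.name example and should be replaced with your own:

val textFile = sc.textFile("hdfs://hadoop-test/tmp/input")
// ... transformations as in Step 3 ...
result.saveAsSequenceFile("hdfs://hadoop-test/tmp/output")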

2. TopK Programming Example

The task of the TopK program is to compute word frequencies over a pile of text and return the K most frequent words. With MapReduce you would need to write two jobs, WordCount and TopK, whereas Spark needs only one job; the WordCount part has already been implemented above, and we now build on that implementation to find the top K words. Note that the implementation in this article is not optimal; there is plenty of room for improvement.

Step 1:

First, sort all the words by word frequency, as follows:

val sorted = result.map {
  case (key, value) => (value, key) // exchange key and value
}.sortByKey(true, 1)

Step 2: Return the top K:

val topK = sorted.top(args(3).toInt)

Step 3: Print the K words:

topK.foreach(println)

Note that whatever the application writes to standard output is saved by YARN into the container's stdout log. In YARN, each container has three log files: stdout, stderr, and syslog. The first two capture the program's standard output and standard error, and the third holds the log4j output; usually only the third contains log content.
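Putting the three steps together with the WordCount part from Section 1 (here using the textFile variant), a rough end-to-end TopK driver could look like the following sketch. It follows the argument layout args(0)..args(3) used in this article and is, as noted, not optimized; the object name TopK is only an assumption for illustration:

import org.apache.spark._
import SparkContext._

object TopK {
  def main(args: Array[String]) {
    // args(0): Spark master, args(1): input path, args(2): output path (unused here), args(3): K
    val sc = new SparkContext(args(0), "TopK",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))

    // WordCount part (Section 1)
    val result = sc.textFile(args(1))
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // TopK part (this section)
    val sorted = result.map {
      case (key, value) => (value, key) // exchange key and value
    }.sortByKey(true, 1)

    val topK = sorted.top(args(3).toInt)
    topK.foreach(println)
  }
}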

The complete code for this program, a compiled jar package, and a run script can be downloaded from here. After downloading it, follow "Apache Spark Learning: Building a Spark Integrated Development Environment Using Eclipse" to run it.

3. SparkJoin Programming Example

In the recommendation field there is a well-known open test set, MovieLens (download link: http://grouplens.org/datasets/movielens/). The test set contains three files, ratings.dat, users.dat, and movies.dat; for details see its README.txt. In the ml-1m dataset, each line of ratings.dat has the form UserID::MovieID::Rating::Timestamp and each line of movies.dat has the form MovieID::Title::Genres, which is why the code below splits on "::". The SparkJoin example in this section joins the ratings.dat and movies.dat files to obtain the list of movies whose average rating is above 4.0, using the ml-1m dataset. The program code is as follows:

import org.apache.spark._
import SparkContext._

object SparkJoin {
  def main(args: Array[String]) {
    if (args.length != 4) {
      println("Usage is org.test.WordCount")
      return
    }

    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))

    // Read rating from HDFS file
    val textFile = sc.textFile(args(1))

    // extract (movieId, rating)
    val rating = textFile.map(line => {
      val fileds = line.split("::")
      (fileds(1).toInt, fileds(2).toDouble)
    })

    // compute the average rating per movie
    val movieScores = rating
      .groupByKey()
      .map(data => {
        val avg = data._2.sum / data._2.size
        (data._1, avg)
      })

    // Read movie from HDFS file
    val movies = sc.textFile(args(2))
    val movieskey = movies.map(line => {
      val fileds = line.split("::")
      (fileds(0).toInt, fileds(1))
    }).keyBy(tup => tup._1)

    // by join, we get (movie, averageRating, movieName)
    val result = movieScores
      .keyBy(tup => tup._1)
      .join(movieskey)
      .filter(f => f._2._1._2 > 4.0)
      .map(f => (f._1, f._2._1._2, f._2._2._2))

    result.saveAsTextFile(args(3))
  }
}
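After keyBy and join, each element of the joined RDD has the nested shape (movieId, ((movieId, averageRating), (movieId, movieName))), which is what the f._2._1._2-style accessors above navigate. As a purely stylistic alternative with the same semantics, the last step could also be written with pattern matching, which some readers may find easier to follow:

val result = movieScores
  .keyBy(tup => tup._1)
  .join(movieskey)
  .map { case (movieId, ((_, avgRating), (_, movieName))) => (movieId, avgRating, movieName) }
  .filter { case (_, avgRating, _) => avgRating > 4.0 }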

You can download the code, the compiled jar package, and the run script from here.

Writing this program directly in Spark is somewhat cumbersome; you could instead implement it in HQL on Shark. Shark is an interactive query engine on top of Spark, similar to Hive; for details, see: Shark.

4. Summary

Spark programming does not demand much of the Scala language, just as Hadoop programming does not demand much of Java: as long as you know the most basic syntax you can write programs, and only a few common syntactic constructs and expressions are needed. Typically, one starts by imitating the official examples, which are provided in Scala, Java, and Python.
