Apache Spark Learning: Developing Spark Applications in Scala

The Spark kernel is developed in the Scala language, so it is natural to develop Spark applications in Scala as well. If you are unfamiliar with Scala, you can read the web tutorial "A Scala Tutorial for Java Programmers" or related Scala books to learn it.

This article introduces three Scala Spark programming examples: WordCount, TopK, and SparkJoin, each representing a typical class of Spark application.

1. WordCount Programming Example

WordCount is one of the simplest distributed applications: its main function is to count the total number of occurrences of each word in an input directory. It can be written in the following steps:

Step 1: Create a SparkContext object. Its constructor takes four parameters: the Spark master location, the application name, the Spark installation directory, and the locations of the application jars. For Spark on YARN, the first two parameters are the most important: the first is specified as "yarn-standalone" and the second is a custom string, for example:

    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))

Step 2: Read the input data. To read text data from HDFS, use the textFile function of SparkContext to convert the input files into an RDD; it parses the input with Hadoop's TextInputFormat, for example:

    val textFile = sc.textFile(args(1))

Of course, Spark allows you to use any Hadoop InputFormat, such as the binary input format SequenceFileInputFormat. In that case you can use the hadoopRDD function of SparkContext, for example:

    val inputFormatClass = classOf[SequenceFileInputFormat[Text, Text]]
    var hadoopRdd = sc.hadoopRDD(conf, inputFormatClass, classOf[Text], classOf[Text])

Or create a HadoopRDD object directly:

    var hadoopRdd = new HadoopRDD(sc, conf,
      classOf[SequenceFileInputFormat[Text, Text]], classOf[Text], classOf[Text])
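Note that the two snippets above reference a JobConf named conf that is not defined in this excerpt. A minimal sketch of how it might be set up is shown below; the imports and the choice of args(1) as the input path are assumptions of this sketch, not part of the original example:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, SequenceFileInputFormat}

    // Build a Hadoop JobConf and point it at the input directory (assumed to be args(1)).
    val conf = new JobConf()
    FileInputFormat.setInputPaths(conf, args(1))

    // Read the SequenceFile as an RDD of (Text, Text) pairs.
    val hadoopRdd = sc.hadoopRDD(conf,
      classOf[SequenceFileInputFormat[Text, Text]], classOf[Text], classOf[Text])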

Step 3: Transform the RDD with conversion operators. For WordCount, first parse the words out of each line of input, then put identical words into the same bucket, and finally count the frequency of each word in its bucket, for example:

    val result = hadoopRdd.flatMap {
        case (key, value) => value.toString().split("\\s+")
      }.map(word => (word, 1)).reduceByKey(_ + _)

Here the flatMap function converts one record into multiple records (a one-to-many relationship), the map function converts one record into another record (a one-to-one relationship), and the reduceByKey function groups records with the same key into one bucket and computes a result per key. The specific meaning of these functions can be found in the Spark transformation documentation.
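To make the behavior of these three operators concrete, here is a minimal sketch run against a small in-memory collection; the sample lines, the object name OperatorDemo, and the use of local mode are illustrative assumptions only:

    import org.apache.spark._
    import SparkContext._

    object OperatorDemo {
      def main(args: Array[String]) {
        // Local mode is used here only to illustrate the operators.
        val sc = new SparkContext("local", "OperatorDemo")

        // A tiny in-memory dataset standing in for lines read from HDFS.
        val lines = sc.parallelize(Seq("spark is fast", "spark is simple"))

        val counts = lines
          .flatMap(line => line.split("\\s+"))  // one line -> many words
          .map(word => (word, 1))               // one word -> one (word, 1) pair
          .reduceByKey(_ + _)                   // sum the counts per distinct word

        // Expected output (in some order): (spark,2), (is,2), (fast,1), (simple,1)
        counts.collect().foreach(println)
        sc.stop()
      }
    }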

Step 4: Save the resulting RDD to HDFS. You can use the saveAsTextFile function of the RDD to save the dataset to an HDFS directory; by default it uses Hadoop's TextOutputFormat and writes each record in the form "(key, value)". You can also use the saveAsSequenceFile function to save the data in SequenceFile format, and so on, for example:

    result.saveAsSequenceFile(args(2))

Of course, when writing a Spark program, we need to include the following two imports:

    import org.apache.spark._
    import SparkContext._

The complete WordCount program was already introduced in "Apache Spark Learning: Using Eclipse to Build a Spark Integrated Development Environment", so it is not repeated in detail here.
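For convenience, a minimal self-contained sketch that assembles the steps above might look like the following. The snippets (SparkContext construction, textFile, the transformations, and the save) are taken from this article; the surrounding object structure and argument handling are assumptions rather than the exact program from the referenced article:

    import org.apache.spark._
    import SparkContext._

    object WordCount {
      def main(args: Array[String]) {
        if (args.length != 3) {
          println("usage: WordCount <master> <input> <output>")
          return
        }

        // args(0): master (e.g. "yarn-standalone"), args(1): input dir, args(2): output dir
        val sc = new SparkContext(args(0), "WordCount",
          System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))

        val textFile = sc.textFile(args(1))

        val result = textFile
          .flatMap(line => line.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Save as plain text; saveAsSequenceFile could be used instead.
        result.saveAsTextFile(args(2))
      }
    }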

Note that when specifying the input and output files, you need to use HDFS URIs, for example an input directory of hdfs://hadoop-test/tmp/input and an output directory of hdfs://hadoop-test/tmp/output. The prefix "hdfs://hadoop-test" is determined by the parameter fs.default.name in the Hadoop configuration file core-site.xml; replace it with your own configuration.

2. TopK Programming Example

The task of the TopK program is to count the word frequencies of a pile of text and return the K most frequent words. If implemented with MapReduce, this requires writing two jobs: WordCount and TopK; with Spark, only one job is needed. The WordCount part has already been implemented above, and top K is found on the basis of that result. Note that the implementation in this article is not optimal; there is plenty of room for improvement.

Step 1: First, sort all words by word frequency, as follows:

    val sorted = result.map {
        case (key, value) => (value, key) // exchange key and value
      }.sortByKey(true, 1)

Step 2: Return the top K elements.
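The code for this step is not preserved in this copy of the article. A minimal sketch consistent with the sorted RDD from Step 1 might look like the following; passing K as the fourth command-line argument (args(3)) is an assumption of this sketch:

    // K is assumed to be passed on the command line as args(3).
    val k = args(3).toInt

    // sorted contains (count, word) pairs, so take the K largest pairs by count.
    val topK = sorted.top(k)

    // Print each (count, word) pair of the K most frequent words.
    topK.foreach(println)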
