The Spark kernel is written in Scala, so it is natural to develop Spark applications in Scala. If you are unfamiliar with Scala, you can read the online tutorial "A Scala Tutorial for Java Programmers" or related Scala books.
This article introduces three Spark programming examples in Scala: WordCount, TopK, and SparkJoin, each representing a typical class of Spark application.
1. WordCount Programming Example
WordCount is one of the simplest examples of a distributed application. Its main function is to count the number of occurrences of each word in an input directory. The program is written in the following steps:
Step 1: Create a SparkContext object, which takes four parameters: the Spark master location, the application name, the Spark installation directory, and the locations of the jars the application depends on. For Spark on YARN, the first two parameters are the most important: the first is specified as "yarn-standalone" and the second is a custom string, for example:

```scala
val sc = new SparkContext(args(0), "WordCount",
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
```
Step 2: Read the input data. To read text data from HDFS, you can use the textFile function of SparkContext, which converts the input file into an RDD. It uses Hadoop's TextInputFormat to parse the input data, for example:

```scala
val textFile = sc.textFile(args(1))
```
Of course, Spark allows you to use any Hadoop InputFormat, such as the binary input format SequenceFileInputFormat. In that case, you can use the hadoopRDD function of SparkContext, for example:

```scala
val inputFormatClass = classOf[SequenceFileInputFormat[Text, Text]]
var hadoopRdd = sc.hadoopRDD(conf, inputFormatClass, classOf[Text], classOf[Text])
```
Or create a HadoopRDD object directly:

```scala
var hadoopRdd = new HadoopRDD(sc, conf,
  classOf[SequenceFileInputFormat[Text, Text]], classOf[Text], classOf[Text])
```
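Both snippets above rely on a conf object, which is a Hadoop JobConf describing the input. It is not shown in the original example, so here is a minimal sketch of how it might be built, assuming the input path arrives as the second program argument:

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

// Hypothetical setup (not part of the original example):
// point a JobConf at the SequenceFile input directory.
val conf = new JobConf()
FileInputFormat.setInputPaths(conf, args(1))
```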
Step 3: Transform the RDD through conversion operators. For WordCount, first parse the words out of each line of the input data, then put identical words into the same bucket, and finally count the frequency of each word per bucket, for example:

```scala
val result = hadoopRdd.flatMap {
  case (key, value) => value.toString().split("\\s+")
}.map(word => (word, 1)).reduceByKey(_ + _)
```
Here, the flatMap function converts one record into multiple records (a one-to-many relationship), the map function converts one record into another record (a one-to-one relationship), and the reduceByKey function puts data with the same key into one bucket and computes per key. For the specific meaning of these functions, see: Spark transformations.
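To make the one-to-many and one-to-one distinction concrete, here is a small illustration on an in-memory collection (sc.parallelize is a standard SparkContext method; the sample strings are made up):

```scala
val lines  = sc.parallelize(Seq("to be", "or not to be"))
val words  = lines.flatMap(line => line.split("\\s+")) // one line -> many words
val pairs  = words.map(word => (word, 1))              // one word -> one (word, 1) pair
val counts = pairs.reduceByKey(_ + _)                  // same-word pairs summed, e.g. ("to", 2)
```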
Step 4: Save the resulting RDD dataset to HDFS. You can use the saveAsTextFile function on the RDD to save the dataset to an HDFS directory; it defaults to the TextOutputFormat provided by Hadoop, printing each record in the form "(key,value)". You can also use the saveAsSequenceFile function to save the data in SequenceFile format, and so on, for example:

```scala
result.saveAsSequenceFile(args(2))
```
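The plain-text alternative mentioned above is a one-liner; it writes each record as a "(key,value)" line via TextOutputFormat:

```scala
result.saveAsTextFile(args(2))
```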
Of course, when writing a Spark program, we need to include the following two imports:

```scala
import org.apache.spark._
import SparkContext._
```
The complete WordCount program has already been introduced in "Apache Spark Learning: Using Eclipse to Build a Spark Integrated Development Environment" and is not repeated here.
Note that when specifying the input and output files, you need to specify HDFS URIs, for example an input directory of hdfs://hadoop-test/tmp/input and an output directory of hdfs://hadoop-test/tmp/output, where the authority "hdfs://hadoop-test" is specified by the parameter fs.default.name in the Hadoop configuration file core-site.xml; replace it with your own configuration.
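For illustration, a short sketch using the example URIs above (sc is the SparkContext from Step 1; adjust the authority to match your fs.default.name):

```scala
val input  = "hdfs://hadoop-test/tmp/input"   // full HDFS URI for the input directory
val output = "hdfs://hadoop-test/tmp/output"  // full HDFS URI for the output directory
val lines  = sc.textFile(input)
// ... transformations as in Step 3 ...
// result.saveAsSequenceFile(output)
```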
2. TopK Programming Example
The task of the TopK program is to count the word frequencies of a pile of text and return the K most frequent words. Implemented with MapReduce, this requires writing two jobs: WordCount and TopK; with Spark, a single job suffices. The WordCount part has already been implemented above, and we now find the top K following that implementation. Note that the implementation in this article is not optimal; there is much room for improvement.
Step 1: First, sort all the words by word frequency, as follows:

```scala
val sorted = result.map {
  case (key, value) => (value, key) // exchange key and value
}.sortByKey(true, 1)
```
Step 2: Return the first K words.
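A minimal sketch of this step, assuming K is passed as the fourth program argument (args(3) is an assumption here, not part of the original): the RDD top function returns the largest elements under the natural tuple ordering, so the highest-count (count, word) pairs come first.

```scala
// Minimal sketch: assumes K arrives as the fourth program argument.
val k = args(3).toInt
val topK = sorted.top(k) // the k largest (count, word) pairs, ordered by count
topK.foreach(println)
```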