Spark Getting Started: A Few of the Best MapReduce Examples

Source: Internet
Author: User
Tags: scala, ide, language pack
Installing the Scala IDE

Setting up a Scala development environment is easy: download the appropriate version of the Scala IDE from its website and unzip it to complete the installation. The version used in this article is 4.1.0.

Installing the Scala Language Pack

If the Scala language pack bundled with the Scala IDE does not match the Scala version (2.10.x) used by Spark 1.3.1, you will need to download the matching version so that the Scala programs you write do not fail at runtime because of a version mismatch.

Please download and install Scala version 2.10.5.

Installing the JDK

If you do not have a JDK installed on your machine, download and install JDK 1.6 or later.

Creating and Configuring the Spark Project

Open the Scala IDE and create a Scala project called spark-exercise.

Figure 1. Creating the Scala project

Create a lib folder under the project directory and copy the spark-assembly jar from your Spark installation package into the lib directory.

Figure 2. Spark development jar package

Then add the jar package to the project's classpath and configure the project to use the Scala 2.10.5 version you just installed. The project directory structure is shown below.

Figure 3. Adding the jar package to the classpath

Runtime Environment Introduction

To avoid confusion about the environment of this case, this section gives a brief introduction to the cluster environment used in this article. All instance data in this article is stored on an 8-machine Hadoop cluster with a total filesystem capacity of 1.12 TB; the NameNode is called hadoop036166 and its service port is 9000. The reader need not care about the specific node distribution, as it does not affect the reading of the rest of the article. The Spark cluster used to run the example programs is a four-node Standalone-mode cluster containing one Master node (listening on port 7077) and three Worker nodes, distributed as follows:

Server Name      Role
hadoop036166     Master
hadoop036187     Worker
hadoop036188     Worker
hadoop036227     Worker
Spark provides a web UI for viewing cluster information and monitoring execution results; the default address is http://<spark_master_ip>:8080. After an instance is submitted, we can check its execution results on this web page, or find them by looking at the logs.

Figure 4. Spark Web Console

Case Analysis and Programming Implementation

Case One

A. Case description

Word count (word frequency counting) should be familiar to everyone: it counts how many times each word appears in one or more files. This article takes it as an entry-level case to open the door to writing Spark big-data processing programs in Scala.

B. Case analysis

For word frequency counting with the operators provided by Spark, we first need to convert each line of the text file into individual words, then count each occurrence of a word as one, and finally add up the counts of all identical words to obtain the final result.

For the first step, we naturally think of the flatMap operator to split a line of text into multiple words. For the second step, we use the map operator to convert each single word into a counting key-value pair, that is, word -> (word, 1). For the last step, counting the occurrences of the same word, we use the reduceByKey operator to add up the counts of identical words to get the final result.
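This transformation chain can also be tried out interactively before writing the full program. The following is only a minimal sketch: it assumes a running spark-shell session (where sc is already defined) and uses a small made-up in-memory sample instead of a real text file.

// Tiny made-up sample, used only to check the flatMap -> map -> reduceByKey chain.
val lines = sc.parallelize(Seq("apple banana apple", "banana cherry"))
val counts = lines.flatMap(line => line.split(" "))   // split each line into words
                  .map(word => (word, 1))             // word -> (word, 1)
                  .reduceByKey((a, b) => a + b)       // sum the counts of the same word
counts.collect().foreach(println)                     // expected pairs (order may vary): (apple,2), (banana,2), (cherry,1)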
C. Programming implementation

Listing 1. SparkWordCount class source code

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SparkWordCount {
  def FILE_NAME: String = "word_count_results_"

  def main(args: Array[String]) {
    if (args.length < 1) {
      println("Usage: SparkWordCount FileName")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("Spark Exercise: Spark Version Word Count Program")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(args(0))
    val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    // print the results, for debug use.
    println("Word Count program running results:")
    wordCounts.collect().foreach(e => {
      val (k, v) = e
      println(k + "=" + v)
    })
    wordCounts.saveAsTextFile(FILE_NAME + System.currentTimeMillis())
    println("Word Count program running results are successfully saved.")
  }
}

D. Submitting to cluster execution

In this example, we will count the word frequency of all txt files in the /user/fams directory of the HDFS file system. spark-exercise.jar is the packaged jar of the Spark project; it is uploaded to the /home/fams directory of the target server before execution. The specific command to run this instance is as follows:

Listing 2. SparkWordCount class execute command

./spark-submit \
--class com.ibm.spark.exercise.basic.SparkWordCount \
--master spark://hadoop036166:7077 \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 2g \
--executor-cores 2 \
/home/fams/spark-exercise.jar \
hdfs://hadoop036166:9000/user/fams/*.txt

E. Monitoring execution status

This instance stores the final results on HDFS, so if the program runs properly we can find the generated result files on HDFS.

Figure 5. Case one output result
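The result directories can also be listed from the command line with the HDFS shell. This is only an illustrative sketch: it assumes the job was submitted as user fams, so that the relative path passed to saveAsTextFile resolves under /user/fams; adjust the path to wherever your results are actually written.

# Illustrative only: assumes the submitting user's HDFS home directory is /user/fams
hdfs dfs -ls /user/fams/word_count_results_*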

Open the web UI of the Spark cluster to see the results of the job you just submitted.

Figure 6. Case one completion status

If the program has not finished running, we can find it in the Running Applications list.

Case Two

A. Case description

In this case we assume that we need to count the average age of 10 million people. If you want to test Spark's ability to handle big data, you can of course use a larger population, such as 100 million people, depending on the storage capacity of the cluster used for testing. Suppose the age information is stored in a file with the following format: the first column is the ID and the second column is the age.

Figure 7. Case two test data format preview

Now we need to write a Scala program to generate the age data file for 10 million people. The source program is as follows:

Listing 3. Age information file generation class source code

import java.io.FileWriter
import java.io.File
import scala.util.Random

object SampleDataFileGenerator {

  def main(args: Array[String]) {
    val writer = new FileWriter(new File("C:\\sample_age_data.txt"), false)
    val rand = new Random()
    for (i <- 1 to 10000000) {
      // write one "id age" line per person; ages drawn from 0-99
      writer.write(i + " " + rand.nextInt(100))
      writer.write(System.getProperty("line.separator"))
    }
    writer.flush()
    writer.close()
  }
}

B. Case analysis

To calculate the average age, we first need to process the RDD corresponding to the source file and convert it into an RDD that contains only the age information; we then count the number of elements, add up all the ages, and finally compute average age = total age / number of people.

For the first step, we use the map operator to map the RDD of the source file into a new RDD that contains only the age data; obviously, in the function passed to map we need to use the split method and take the second element of the resulting array, which is the age information. For the second step, we call the count operator on the RDD produced by the first step to get the total number of elements. For the third step, we use the reduce operator to sum all the elements of the age-only RDD, and finally use division to calculate the average age.

Because the output of this example is simple, we just print it to the console.
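These three steps can also be checked interactively on a tiny sample first. The sketch below assumes a spark-shell session (where sc is already defined) and uses a few made-up "id age" lines in place of the generated file.

// Tiny in-memory sample in "id age" format; the values are made up for illustration.
val sample = sc.parallelize(Seq("1 25", "2 31", "3 40", "4 22"))
val count = sample.count()                               // number of people: 4
val ageData = sample.map(line => line.split(" ")(1))     // keep only the age column
val totalAge = ageData.map(age => Integer.parseInt(age)).reduce((a, b) => a + b)
val avgAge = totalAge.toDouble / count.toDouble          // (25+31+40+22)/4 = 29.5
println("Average Age is " + avgAge)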

C. Programming implementation

Listing 4. AvgAgeCalculator class source code

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object AvgAgeCalculator {
  def main(args: Array[String]) {
    if (args.length < 1) {
      println("Usage: AvgAgeCalculator datafile")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("Spark Exercise: Average Age Calculator")
    val sc = new SparkContext(conf)
    val dataFile = sc.textFile(args(0), 5)
    val count = dataFile.count()
    val ageData = dataFile.map(line => line.split(" ")(1))
    val totalAge = ageData.map(age => Integer.parseInt(String.valueOf(age))).collect().reduce((a, b) => a + b)
    println("Total Age: " + totalAge + "; Number of People: " + count)
    val avgAge: Double = totalAge.toDouble / count.toDouble
    println("Average Age is " + avgAge)
  }
}

D. Submitting to cluster execution

To execute this instance's program, you need to upload the age information file you just generated to HDFS. Assume you have just run the Scala class that generates the age information file on the target machine, and that the file has been placed in the /home/fams directory.

Then you need to run an HDFS shell command to copy the file to the /user/fams directory on HDFS.

Listing 5. Command to copy the age information file to the HDFS directory
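As an illustration only, a copy command under the assumptions above (a file named sample_age_data.txt in /home/fams, copied to the /user/fams directory on HDFS) might look like the following; adjust the file name and paths to your own setup.

# Illustrative only: the file name and paths are assumptions based on the text above.
hdfs dfs -put /home/fams/sample_age_data.txt hdfs://hadoop036166:9000/user/fams/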
