An illustrated analysis of MapReduce and the WordCount example for Hadoop beginners

The core design of the Hadoop framework consists of HDFS and MapReduce. HDFS provides storage for massive amounts of data, and MapReduce provides computation over massive amounts of data. HDFS is an open-source implementation of the Google File System (GFS), and MapReduce is an open-source implementation of Google's MapReduce. The HDFS and MapReduce implementations are completely decoupled; it is not the case that MapReduce cannot run without HDFS.

This article is mainly organized from the following three blog posts:

1. Hadoop Sample Program WordCount detailed and examples
2. Hadoop Learning Note: A detailed description of the MapReduce framework
3. Hadoop Sample Program WordCount analysis

1, The whole process of MapReduce

The simplest MapReduce application contains at least three parts: a Map function, a Reduce function, and a main function. When a MapReduce compute task runs, it is divided into two phases, the map phase and the reduce phase, each of which takes key/value pairs as input and output. The main function combines job control with file input/output.
    • Read the contents of the text file in parallel, then perform the MapReduce operation.

    • Map process: read the text in parallel and perform a map operation on each word read, emitting each word in the form <key, value>.

My understanding:

Suppose a file with three lines of text undergoes a MapReduce operation.

Read the first line, Hello World Bye World, and split it into words to form the map output:

<Hello,1> <World,1> <Bye,1> <World,1>

Read the second line, Hello Hadoop Bye Hadoop, and split it into words to form the map output:

<Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>

Read the third line, Bye Hadoop Hello Hadoop, and split it into words to form the map output:

<Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>

  
    • Reduce process: sort and merge the results of the map phase, and finally obtain the frequency of each word.

My understanding:

After further processing (the combiner), the map output is grouped by key, collecting the values for each key into an array:

<Bye,1,1,1> <Hadoop,1,1,1,1> <Hello,1,1,1> <World,1,1>

Reduce(k, v[]) is then called for each group to count the occurrences of each word:

<Bye,3> <Hadoop,4> <Hello,3> <World,2>
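
To make this key/value flow concrete before looking at the real Hadoop code, here is a minimal plain-Java sketch (no Hadoop involved; the class name WordCountSketch is made up for illustration) that mimics the map, group, and reduce steps for the three lines above:

    import java.util.*;

    public class WordCountSketch {
      public static void main(String[] args) {
        String[] lines = {
          "Hello World Bye World",
          "Hello Hadoop Bye Hadoop",
          "Bye Hadoop Hello Hadoop"
        };

        // "Map" step: emit a <word, 1> pair for every token.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
          for (String word : line.split("\\s+")) {
            mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
          }
        }

        // "Shuffle" step: group the values by key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
          grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // "Reduce" step: sum the values for each key.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
          int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
          System.out.println("<" + entry.getKey() + "," + sum + ">");
        }
      }
    }

Running it prints the four pairs above, one per line.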

   2, WordCount source
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Description: WordCount explained by York
 * @author Hadoop Dev Group
 */
public class WordCount {

  /**
   * TokenizerMapper inherits from the generic Mapper class.
   * Mapper class: the base class that implements the Map function.
   * WritableComparable interface: classes that implement WritableComparable can be compared
   * with each other; all classes used as keys should implement this interface.
   * Reporter can be used to report the running progress of the application; it is not used in this example.
   */
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    /**
     * IntWritable and Text are classes implemented in Hadoop to encapsulate Java data types.
     * They implement the WritableComparable interface and are serializable, which makes data
     * exchange easy in a distributed environment; you can regard them as substitutes for
     * int and String respectively.
     * Declare the constant one and the variable word used to hold each word.
     */
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /**
     * The map method in Mapper:
     * void map(K1 key, V1 value, Context context)
     * Maps a single input k/v pair to intermediate k/v pairs. The output pairs are not
     * required to have the same types as the input pair; an input pair can be mapped to
     * zero or more output pairs.
     * Context: collects the mapper's output <k,v> pairs; context.write(k, v) adds one (k,v) pair.
     * The programmer writes the map and reduce functions. This map function splits the line
     * with StringTokenizer and writes (word, 1) tuples into the context.
     */
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    /**
     * The reduce method in the Reducer class:
     * void reduce(Text key, Iterable<IntWritable> values, Context context)
     * The k/v pairs come from the map function via the context, possibly after further
     * processing (combiner), and are also output through the context.
     */
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    /**
     * Configuration: the Map/Reduce job configuration class, which describes the work
     * performed by Map-Reduce to the Hadoop framework.
     */
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");                       // set a user-defined job name
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);                   // set the Mapper class for the job
    job.setCombinerClass(IntSumReducer.class);                   // set the Combiner class for the job
    job.setReducerClass(IntSumReducer.class);                    // set the Reducer class for the job
    job.setOutputKeyClass(Text.class);                           // set the key class for the job's output data
    job.setOutputValueClass(IntWritable.class);                  // set the value class for the job's output data
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));   // set the input path for the job
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); // set the output path for the job
    System.exit(job.waitForCompletion(true) ? 0 : 1);            // run the job
  }
}
3, WordCount step-by-step analysis
    • The signature of the map function:
public void map(Object key, Text value, Context context) throws IOException, InterruptedException { ... }

There are three parameters here. The first two, Object key and Text value, are the input key and value. The third parameter, Context context, is where the output key and value are recorded, for example context.write(word, one). The context also records the status of the map operation.

    • The signature of the reduce function:
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { ... }

The input of the reduce function is also in key/value form, but its value is an iterator, Iterable<IntWritable> values. That is, the input of reduce is a key with a corresponding collection of values. reduce also has a Context, which serves the same purpose as the context in map.

As for the computation logic itself, the programmer has to write it.

    • The calls in the main function.

The first is:

Configuration conf = new Configuration();

The Configuration is initialized before the MapReduce program runs. It mainly reads the MapReduce system configuration, which covers both HDFS and MapReduce, that is, the information in the configuration files written when installing Hadoop, such as core-site.xml, hdfs-site.xml and mapred-site.xml. Some readers may not understand why this is necessary; the reason lies in the design of the MapReduce framework. As programmers developing MapReduce, we are only filling in the blanks: we write the actual business logic in the map and reduce functions, and the rest of the work is handed over to the MapReduce framework to run on its own. But we at least have to tell it what it needs to know, such as where HDFS is and where the MapReduce JobTracker is, and this information is in the configuration files read through the conf object.
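
As a small aside (not part of the original WordCount code), the Configuration object can also be inspected or overridden in code. A minimal sketch, assuming the Hadoop 2.x property names fs.defaultFS and mapreduce.job.reduces; older releases use different names:

    import org.apache.hadoop.conf.Configuration;

    public class ConfDemo {
      public static void main(String[] args) {
        // Loads core-site.xml, hdfs-site.xml, ... found on the classpath.
        Configuration conf = new Configuration();
        // Read the default file system (e.g. hdfs://namenode:9000); falls back to file:/// if not configured.
        System.out.println(conf.get("fs.defaultFS", "file:///"));
        // A value can also be overridden in code, which takes precedence over the XML files.
        conf.set("mapreduce.job.reduces", "2");
      }
    }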

The following code is:

    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }

The if statement is easy to understand: running the WordCount program requires exactly two arguments, and if they are missing the program reports an error and exits. As for the GenericOptionsParser class in the first line, it is used to interpret common Hadoop command-line options and set the corresponding values on the Configuration object as needed. In practice we usually do not use it directly in development; instead, we let the driver class implement the Tool interface and run the program with ToolRunner in the main function, and ToolRunner internally calls GenericOptionsParser.
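
For reference, here is a minimal sketch of that Tool/ToolRunner style of driver; the class name WordCountDriver is made up for illustration, and the job setup simply mirrors the WordCount main function above:

    package org.apache.hadoop.examples;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCountDriver extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        if (args.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          return 2;
        }
        // getConf() already holds the options parsed by ToolRunner/GenericOptionsParser.
        Job job = new Job(getConf(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options (-D, -fs, -jt, ...) before calling run().
        System.exit(ToolRunner.run(new WordCountDriver(), args));
      }
    }

With this structure, generic options such as -D mapreduce.job.reduces=2 are handled by ToolRunner before run() is invoked.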

The following code is:

    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

The first line builds a job. In the MapReduce framework a MapReduce task is also called a MapReduce job, while the individual map and reduce operations are called tasks. Here we construct a job with two parameters: one is conf, which needs no further explanation, and the other is the name of the job.

The second line loads the computation program written by the programmer, in our case the class named WordCount. A correction here: although writing a MapReduce program only requires implementing the map function and the reduce function, in actual development we implement three classes; the third class configures how MapReduce runs the map and reduce functions, or more precisely, builds a job that MapReduce can execute, such as the WordCount class.

The third and fifth lines load the classes implementing the map function and the reduce function. There is also a fourth line, which loads the Combiner class; this class is related to the MapReduce execution mechanism. In fact, removing the fourth line does not affect this example, but keeping it theoretically makes the job run more efficiently, because the combiner pre-aggregates the map output locally before it is sent to the reducers.

The following code:

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

This defines the key/value types of the output, that is, the key/value types of the result file that is ultimately stored on HDFS.
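
One additional note that is not in the original article: setOutputKeyClass and setOutputValueClass describe the final (reducer) output, and in WordCount the mapper happens to emit the same types, so nothing more is needed. If the mapper's intermediate types differed, they would have to be declared separately. A fragment in the same style as the snippets above, with hypothetical types:

    // Hypothetical job where the mapper emits <Text, LongWritable> but the
    // reducer writes <Text, DoubleWritable>; the intermediate types must be set explicitly.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);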

The final code is:

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);

The first line sets the path of the input data, the second line sets the path of the output data, and the last line waits for the job to finish: if it completes successfully, our program exits normally with status 0, otherwise with status 1.
