The first MapReduce application: WordCount

MapReduce applies the "divide and conquer" idea: under the management of a master node, operations on a large-scale dataset are distributed to the nodes that hold its shards, and the intermediate results from each node are then combined into the final result. In short, MapReduce is "the decomposition of the task and the summarization of the results".

In Hadoop, two machine roles execute MapReduce tasks: the JobTracker and the TaskTracker. The JobTracker schedules jobs and the TaskTrackers execute them. A Hadoop cluster has only one JobTracker.

In distributed computing, the MapReduce framework handles the complex issues of parallel programming such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, and abstracts the processing into two functions: map and reduce. Map is responsible for decomposing the task into multiple sub-tasks, and reduce is responsible for summarizing the results of those sub-tasks after they have been processed.

Note that a dataset (or task) processed by MapReduce must have the following property: it can be divided into many small datasets, and each small dataset can be processed completely in parallel.

I. Processing flow

In Hadoop, each MapReduce task is initialized as a job, and each job can be divided into two stages: the map stage and the reduce stage. The two stages are represented by two functions, the map function and the reduce function. The map function receives an input in the form <key, value> and produces intermediate output, also in the form <key, value>. The reduce function receives input in the form <key, (list of values)> and processes that value set. Each reduce call produces 0 or 1 output, again in the form <key, value>.
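As a short worked example, suppose (hypothetically) that the words "hello world" and "hello hadoop" appear in the input. The two stages would then produce the following pairs:

map output:    <"hello", 1>, <"world", 1>, <"hello", 1>, <"hadoop", 1>
reduce input:  <"hello", (1, 1)>, <"world", (1)>, <"hadoop", (1)>
reduce output: <"hello", 2>, <"world", 1>, <"hadoop", 1>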

 

II. Preparation: create two sample files on Hadoop

Put these two files in the HDFS directory:
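A minimal sketch of this step, assuming two hypothetical local files file1.txt and file2.txt and the HDFS input directory input20120828 that is used later in this article:

# create two small local files (contents are only an example)
echo "hello world" > file1.txt
echo "hello hadoop" > file2.txt
# create the input directory in HDFS and upload both files
hadoop fs -mkdir input20120828
hadoop fs -put file1.txt file2.txt input20120828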

 

III. Write the MapReduce code

package hadoop;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class LibinWordCount {

    // Mapper: split each input line into words and emit <word, 1> for each word.
    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts collected for each word.
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LibinWordCount.class);
        conf.setJobName("LibinWordCount");

        // Output key/value types of the job.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Mapper, combiner and reducer classes.
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // Input/output formats and paths (paths are taken from the command line).
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

First, consider how the job is initialized. The main function creates a JobConf object to initialize the MapReduce job and then calls setJobName() to name it. Naming a job makes it easier to locate and monitor on the JobTracker and TaskTracker pages.

JobConf conf = new JobConf(LibinWordCount.class);
conf.setJobName("LibinWordCount");

Next, set the data types of the keys and values in the job's <key, value> output. Because the result is <word, count>, the key type is set to Text, which corresponds to Java's String type, and the value type is set to IntWritable, which corresponds to Java's int type.

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

Then set the classes that handle the map (split), combine (merge intermediate results), and reduce (merge) steps of the job. Here the Reduce class is also used as the combiner to merge the intermediate results produced by map, which reduces the amount of data transferred over the network.

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

Next, set the input and output formats with setInputFormat() and setOutputFormat(), and the input and output paths with FileInputFormat.setInputPaths() and FileOutputFormat.setOutputPath().

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

(1) InputFormat and InputSplit

An InputSplit is the unit of data that Hadoop defines for sending to each individual map task. An InputSplit does not store the data itself; it stores the length of the split and an array of locations where the data resides. How InputSplits are generated can be controlled through the InputFormat.

When data is passed to the map task, the input split is handed to the InputFormat, the InputFormat calls its getRecordReader() method to produce a RecordReader, and the RecordReader in turn uses its createKey() and createValue() methods to create the <key, value> pairs that the map processes. In short, the InputFormat is what generates the <key, value> pairs consumed by map.
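A minimal sketch of that sequence in the old mapred API, assuming the framework supplies the InputFormat, the InputSplit, and the JobConf (this helper class is illustrative only and is not part of the WordCount program):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class RecordReaderSketch {
    // Simplified view of how an InputSplit becomes <key, value> pairs for map().
    static void readSplit(InputFormat<LongWritable, Text> format,
                          InputSplit split, JobConf conf) throws IOException {
        RecordReader<LongWritable, Text> reader =
                format.getRecordReader(split, conf, Reporter.NULL);
        LongWritable key = reader.createKey();   // byte offset of the line
        Text value = reader.createValue();       // content of the line
        while (reader.next(key, value)) {
            // in a real job the framework passes each (key, value) pair to map()
        }
        reader.close();
    }
}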

TextInputFormat is Hadoop's default input format. It inherits from FileInputFormat, and with it each file (or part of a file) becomes a separate map input. A record is then generated for each line of data, and each record is expressed as a <key, value> pair: the key is the byte offset of the line within the data split, with type LongWritable; the value is the content of the line, with type Text.
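For example, assuming a split whose first two lines are "hello world" and "hello hadoop" (with Unix line endings), TextInputFormat would produce the records:

<0, "hello world">
<12, "hello hadoop">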

(2) OutputFormat

Every input format has a corresponding output format. The default output format is TextOutputFormat, which, like the default input format, writes plain text files. Its keys and values, however, can be of any type, because the framework calls toString() to convert each key and value to a string before writing it.
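With TextOutputFormat, each reduce output record becomes one line of text with the key and value separated by a tab. For the hypothetical sample input used earlier, the output file would contain lines such as:

hadoop	1
hello	2
world	1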

(3) Analysis of the map method in the Map class

The Map class inherits from MapReduceBase and implements the Mapper interface. This interface is generic, with four type parameters that specify the map's input key type, input value type, output key type, and output value type. In this example, TextInputFormat is used, so the keys it produces are of type LongWritable and the values of type Text; the input type of map is therefore <LongWritable, Text>. The example emits <word, 1> pairs, so the output key type is Text and the output value type is IntWritable.

 

A class implementing this interface must also implement the map method, which operates on the input. In this example, the map method splits each input line on whitespace (the no-argument StringTokenizer constructor splits on spaces by default) and uses the OutputCollector to emit each <word, 1> pair.

(4) Analysis of the reduce method in the Reduce class

The Reduce class also inherits from MapReduceBase and must implement the Reducer interface. The Reduce class takes the map output as its input, so the reduce input type is <Text, IntWritable>. The output of reduce is a word and its count, so its output type is also <Text, IntWritable>. The Reduce class must likewise implement the reduce method, in which the reduce function uses the input key as the output key and sums the input values to produce the output value.

IV. Package the MapReduce program into a JAR

Compiling with javac -d . automatically creates the package directory structure for the class under the current path; that directory tree can then be packaged into a JAR.
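A minimal sketch of this step, assuming the source file is LibinWordCount.java, the Hadoop core jar of your installation is on the compile classpath, and the output JAR is named wordcount.jar (all of these names are assumptions, not taken from the original article):

# compile; -d . creates the hadoop/ package directory automatically
javac -classpath hadoop-core.jar -d . LibinWordCount.java
# package the compiled classes into a jar
jar -cvf wordcount.jar hadoop/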

 

V. Run the MapReduce program

hadoop.LibinWordCount is the fully qualified class name (including the package name), input20120828 is the input directory, and output20120828 is the output directory. The complete command and output are as follows:
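A sketch of the launch command, assuming the JAR from the previous step is named wordcount.jar:

hadoop jar wordcount.jar hadoop.LibinWordCount input20120828 output20120828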

View the output file results:
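With a single reduce task the result ends up in a file conventionally named part-00000, which can be printed with, for example:

hadoop fs -cat output20120828/part-00000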

The final result is:

The hadoop command starts a JVM to run the MapReduce program, automatically picks up the Hadoop configuration, and adds the class path (and its dependencies) to the Hadoop libraries. The above is the running record of the Hadoop job. From it we can see that the job was assigned the ID job_2012022921__0002, that there were two input files (Total input paths to process: 2), and we can also read the map and reduce input/output records (number of records and number of bytes). In this example there are two map tasks and one reduce task; map has two input records and four output records.

 

VI. The MapReduce processing flow

1) Split the input files into splits. Because the test files are small, each file forms one split, and the file is then divided by line into <key, value> pairs. This step is completed automatically by the MapReduce framework; the offset used as the key includes the characters occupied by the line ending (which may differ between Windows and Linux environments).

2) The split <key, value> pairs are handed to the user-defined map method, which produces new <key, value> pairs.

3) After the <key, value> pairs output by the map method are obtained, the Mapper sorts them by key and runs the combine step, which accumulates the values with the same key, yielding the Mapper's final output.

4) The Reducer first sorts the data received from the Mappers, then hands it to the user-defined reduce method, which produces new <key, value> pairs; these are the output of WordCount. An illustration of the combine step in 3) follows below.
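As an illustration of step 3), assume (hypothetically) that one map task emits <"hello", 1> twice and <"world", 1> once. After sorting, the combiner (the Reduce class in this job) merges the values locally before anything is sent over the network:

map output before combine: <"hello", 1>, <"hello", 1>, <"world", 1>
map output after combine:  <"hello", 2>, <"world", 1>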

 

VII. The new Java API of MapReduce

The 0.20.0 release of Hadoop includes a brand new MapReduce Java API, sometimes called the "context objects" API. The new API is not type-compatible with the old one, so existing applications must be rewritten to take advantage of it. There are several obvious differences between the new and old APIs:

The new API favors abstract classes over interfaces, because they are easier to evolve: a method with a default implementation can be added to an abstract class without breaking existing implementations. In the new API, Mapper and Reducer are abstract classes.

The new API lives in the org.apache.hadoop.mapreduce package (and its sub-packages). The old API lives in org.apache.hadoop.mapred.

The new API makes extensive use of context objects, which allow user code to communicate with the MapReduce system. For example, the MapContext essentially plays the roles of the old JobConf, OutputCollector, and Reporter.

The new API supports both "push" and "pull" styles of iteration. In both the old and the new API, key/value record pairs are pushed to the Mapper, but in addition the new API allows records to be pulled from within the map() method; the same applies to the Reducer. A useful example of the pull style is processing records in batches rather than one at a time.

Configuration is unified in the new API. The old API has a special JobConf object for job configuration, an extension of Hadoop's general Configuration object. In the new API there is no such distinction; job configuration is done through a Configuration. Job control is handled by the Job class rather than JobClient, which no longer exists in the new API.
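For comparison, here is a minimal sketch of WordCount written against the new org.apache.hadoop.mapreduce API. The class name NewApiWordCount is chosen for illustration and does not come from the original article; the input/output handling mirrors the old program.

package hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiWordCount {

    // Mapper is an abstract class in the new API; the Context object replaces
    // OutputCollector and Reporter.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer is also an abstract class; the values arrive as an Iterable.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // plain Configuration, no JobConf
        Job job = new Job(conf, "NewApiWordCount"); // Job replaces JobClient for job control
        job.setJarByClass(NewApiWordCount.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note how the Context object replaces OutputCollector and Reporter, a plain Configuration replaces JobConf, and job control goes through Job.waitForCompletion() instead of JobClient.runJob().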
