Analyzing the MapReduce execution process


When a MapReduce job runs, the Mapper task reads the data files from HDFS, processes the records by calling its map method, and emits its output. The Reducer task then takes the output of the Mapper tasks as its input, processes it by calling its reduce method, and finally writes the result to an HDFS file.

The execution process of a Mapper task

Each Mapper task is a Java process. It reads files from HDFS, parses them into many key-value pairs, processes them with the map method we have overridden, and outputs many new key-value pairs.

The operation of a Mapper task is divided into six stages.

  • The first stage splits the input file into input splits (InputSplit) according to a certain standard; the size of each input split is fixed. By default, the size of an input split is the same as the size of an HDFS block. Suppose the block size is the default 64 MB and there are two input files, one of 32 MB and one of 72 MB. The small file forms one input split, and the large file is split into two, so three input splits are produced in total. Each input split is processed by one Mapper process; with three input splits here, there will be three Mapper processes.
  • The second stage parses the records in each input split into key-value pairs according to certain rules. The default rule is to parse each line of text into one key-value pair: the key is the starting byte offset of the line within the file, and the value is the text content of that line.
  • The third stage calls the map method of the Mapper class. The map method is called once for each key-value pair produced in the second stage; if there are 1,000 key-value pairs, the map method is called 1,000 times. Each call to the map method outputs zero or more key-value pairs.
  • The fourth stage partitions the key-value pairs output by the third stage according to certain rules. The partitioning is based on the key. For example, if the key represents a province (such as Beijing, Shanghai, or Shandong), the pairs can be partitioned by province, with all pairs for the same province going to the same partition. By default there is only one partition. The number of partitions equals the number of Reducer tasks that will run, and by default there is only one Reducer task. A minimal custom Partitioner sketch follows this list.
  • The fifth stage sorts the key-value pairs within each partition. Pairs are first sorted by key, and pairs with the same key are then sorted by value. For example, the three pairs <2,2>, <1,3>, <2,1>, where both keys and values are integers, sort to <1,3>, <2,1>, <2,2>. If there is a sixth stage, processing continues there; otherwise the sorted data is written directly to a local Linux file.
  • The sixth stage is local reduce processing of the data on the map side. Key-value pairs with equal keys are passed to one call of the reduce method, which reduces the amount of data, and the result is written to a local Linux file. This stage does not run by default; the user has to add the code for it.
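To make the fourth stage concrete, here is a minimal sketch of a custom Partitioner that sends all pairs with the same key (for example, a province name) to the same partition, and therefore to the same Reducer task. The class name ProvincePartitioner is hypothetical; the default HashPartitioner used by the driver code later in this article computes its partition the same way.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner for stage four: pairs with the same key always land in the same partition.
    public class ProvincePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // numPartitions equals the number of reducer tasks configured on the job
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }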
The execution process of a Reducer task

Each Reducer task is a Java process. The Reducer task receives the output of the Mapper tasks and, after processing, writes its result to HDFS. Its operation can be divided into the following stages.

    • In the first stage, the Reducer task proactively copies the output key-value pairs from the Mapper tasks. There may be many Mapper tasks, so the Reducer copies the output of multiple Mappers.
    • In the second stage, all the copied data is merged in the Reducer, combining the scattered pieces into one large data set, which is then sorted.
    • In the third stage, the reduce method is called on the sorted key-value pairs. Pairs with equal keys are passed to one call of the reduce method, and each call produces zero or more key-value pairs. Finally, these output key-value pairs are written to an HDFS file.
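For example, suppose (hypothetically) the Mapper tasks emit the pairs <hello,1>, <world,1>, and <hello,1>. After the copy and merge-sort stages, the Reducer sees them grouped as <hello,[1,1]> and <world,[1]>, and the reduce method is then called once for each distinct key.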

In developing a MapReduce program, our greatest effort goes into overriding the map method and the reduce method.

Numbering the key-value pairs

In the analysis of the Mapper and Reducer tasks above, key-value pairs appear in many stages, which is easy to confuse. Here the key-value pairs are numbered to make it easier to follow how they change.

The key-value pairs that the Mapper task receives as input are defined as key1 and value1. After processing in the map method, the output key-value pairs are defined as key2 and value2. The reduce method receives key2 and value2 and, after processing, outputs key3 and value3. In the discussion below, key1 and value1 may be abbreviated to <k1,v1>, key2 and value2 to <k2,v2>, and key3 and value3 to <k3,v3>.
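In the word count example below, these are concretely: <k1,v1> is <LongWritable line offset, Text line content>, <k2,v2> is <Text word, IntWritable 1>, and <k3,v3> is <Text word, IntWritable total count>.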

Example: Word Count

The business requirement is to count the number of occurrences of every word in a given file.

The file content is simple: two lines of text, with the words in each line separated by spaces.
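For illustration in the rest of this article, assume the two lines are the following (hypothetical content):

    hello world
    hello hadoop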

Analysis: The most intuitive idea is to use a map data structure, parsing each word that appears in the file and using the word as the key and its number of occurrences as the value. There is nothing wrong with this idea, but it does not work in a big data environment; there we need to use MapReduce. From the run stages of the Mapper and Reducer tasks described above, we know that the second stage of the Mapper task converts each line of the file into a key-value pair, so the map method in the third stage receives the content of each line. In the map method we can split the line into words and output each word with a count of 1 as a new key-value pair. In the second stage of the Reducer task, the key-value pairs output by the Mapper tasks are sorted by key, and pairs with equal keys are passed to one call of the reduce method. Here the key is the word and the value is the number of occurrences, so in the reduce method we can add up all the counts for a word from different lines to get its total number of occurrences, and finally output this result.

Take a look at how to override the map method:

static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // key2 represents a word in the line
    final Text key2 = new Text();
    // value2 indicates the number of occurrences of a word in that line (always 1 here)
    final IntWritable value2 = new IntWritable(1);

    // key is the starting position (byte offset) of the line of text
    // value is the content of the line of text
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        final String[] splited = value.toString().split(" ");
        for (String word : splited) {
            key2.set(word);
            // write key2 and value2 to the context
            context.write(key2, value2);
        }
    }
}

In the code above, note that the generic parameters of the Mapper class are not basic Java types but Hadoop data types: LongWritable, Text, and IntWritable. The reader can simply treat them as equivalents of the Java classes Long, String, and Integer; they are data types designed specifically for Hadoop.
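A minimal sketch of how these wrapper types relate to the plain Java types, using only the standard get(), set(), and toString() methods of the Hadoop io classes:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class WritableDemo {
        public static void main(String[] args) {
            Text word = new Text("hello");               // Text wraps a String
            String s = word.toString();                  // "hello"

            IntWritable count = new IntWritable();       // IntWritable wraps an int
            count.set(5);
            int n = count.get();                         // 5

            LongWritable offset = new LongWritable(64L); // LongWritable wraps a long
            long o = offset.get();                       // 64
        }
    }

These classes implement Hadoop's Writable interface, so their values can be serialized efficiently when they are shuffled between the map and reduce stages.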

The four generic parameters of the Mapper class correspond, in order, to <k1,v1,k2,v2>. The second parameter of the map method is the text content of the line, which is what we care about. The core code splits the line of text on spaces and writes each word to the context as a new key, with the constant 1 as the new value. Because every word occurrence is output individually, the count is always 1; if "hello" appears twice in one line, it is output twice.

Now look at how to override the reduce method:

static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // value3 represents the total number of occurrences of a word
    final IntWritable value3 = new IntWritable(0);

    /**
     * key     represents the word
     * values  represents the collection of 1s output by the map method
     * context is the context object
     */
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws java.io.IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // at this point, sum is the total number of occurrences of the word
        // key3 represents the word, which is the key of the final output
        final Text key3 = key;
        // value3 represents the total number of occurrences of the word, which is the value of the final output
        value3.set(sum);
        context.write(key3, value3);
    }
}

In the code above, the four generic parameters of the Reducer class correspond, in order, to <k2,v2,k3,v3>. Note that the second parameter of the reduce method has the java.lang.Iterable type, and what it iterates over is v2; that is, all the v2 values that share the same k2 can be iterated out.

These are the map and reduce methods we have overridden. To run our code, we also need to write the driver code, as follows:

/**
 * Driver code
 */
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    // input path
    final String INPUT_PATH = "hdfs://hadoop0:9000/input";
    // output path, which must not exist yet
    final String OUTPUT_PATH = "hdfs://hadoop0:9000/output";
    // create a Job object that encapsulates all the information required at run time
    final Job job = new Job(new Configuration(), "WordCountApp");
    // if the job needs to run from a jar, the following line is required
    job.setJarByClass(WordCountApp.class);
    // tell the job the input path of the files to process
    FileInputFormat.setInputPaths(job, INPUT_PATH);
    // set the class that parses the input files into key-value pairs
    job.setInputFormatClass(TextInputFormat.class);
    // set the custom mapper class
    job.setMapperClass(MyMapper.class);
    // set the types of k2 and v2 output by the map method
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    // set the class that partitions on k2
    job.setPartitionerClass(HashPartitioner.class);
    // set the number of reducer tasks to run
    job.setNumReduceTasks(1);
    // set the custom reducer class
    job.setReducerClass(MyReducer.class);
    // set the types of k3 and v3 output by the reduce method
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // tell the job the output path to write to
    FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
    // run the job and wait until it finishes before the program exits
    job.waitForCompletion(true);
}

In the code above, we created a Job object that encapsulates our task and can be submitted to Hadoop to run. The last statement, job.waitForCompletion(true), submits the Job object to Hadoop and waits until the job has finished running.

There are two ways to run the code above: run it from the Eclipse environment on the host machine, or package it as a jar and run it on Linux.

The first way requires that the host machine can access the Linux server and that the hostname hadoop0 used in the input and output paths is bound in the host's hosts file; on the author's machine the hosts file is located in the C:\WINDOWS\system32\drivers\etc folder.

For the second way, you need to package the code into a jar and run it on Linux with the command hadoop jar xxx.jar.

After the run finishes, the result file is at hdfs://hadoop0:9000/output/part-r-00000.
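With the hypothetical two-line input assumed earlier, the part-r-00000 file would contain one word and its total count per line, separated by a tab (the default layout of TextOutputFormat):

    hadoop	1
    hello	2
    world	1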
