A detailed description of the MapReduce process (using WordCount as an example)

Source: Internet
Author: User

This article is reposted from http://www.cnblogs.com/npumenglei/

First, create two text files as input for our example:

File 1 Content:

My Name is Tony

My Company is Pivotal

File 2 Content:

My Name is Lisa

My Company is EMC

1. The first step, Map

As the name implies, Map means breaking the input apart.

First of all, our input is two files, which by default become two splits: Split 0 and Split 1.

By default, the two splits are assigned to two mappers. The WordCount example handles this step quite crudely: it simply breaks the file contents into words and pairs each word with a 1 (note: the literal number 1, not a count). The word is the key, and the number that follows is the corresponding value.

The output of the two corresponding mappers is then:

Split 0

My 1

Name 1

is 1

Tony 1

My 1

Company 1

is 1

Pivotal 1

Split 1

My 1

Name 1

is 1

Lisa 1

My 1

Company 1

is 1

EMC 1
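Hadoop mappers are normally written in Java; purely as an illustration, the map step above can be sketched in Python (the split contents follow the example, and `word_count_map` is a made-up helper name):

```python
# A toy simulation of the WordCount map phase: every word in a split
# becomes a (word, 1) pair, where 1 is the literal number one.
def word_count_map(split_text):
    return [(word, 1) for word in split_text.split()]

split0 = "My Name is Tony My Company is Pivotal"
split1 = "My Name is Lisa My Company is EMC"

mapped0 = word_count_map(split0)
mapped1 = word_count_map(split1)
print(mapped0[:2])  # [('My', 1), ('Name', 1)]
```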

2. Partition

What is Partition? Partitioning means dividing the data into groups.

Why partition? Because there may be more than one reducer. Partitioning pre-processes the map output, grouping it according to the reducer that will eventually handle it, so that each reducer only needs to process its own share of the data.

How is the data partitioned? It is divided by key. The crucial property is that each key maps to exactly one partition: the reduce steps may later run on different nodes, and if the same key showed up on two nodes at once, the reduce result would be wrong. The common partitioning method is therefore to hash the key.

Returning to our example, assume there are two reducers. After the two splits are partitioned, the result is as follows:

Split 0

Partition 1:
Company 1
is 1
is 1


Partition 2:
My 1
My 1
Name 1
Pivotal 1
Tony 1

Split 1

Partition 1:
Company 1
is 1
is 1
EMC 1


Partition 2:
My 1
My 1
Name 1
Lisa 1

Partition 1 is prepared for Reducer 1 to process, and Partition 2 for Reducer 2.

Notice that partitioning merely groups all the entries by key and does nothing else; a key that appears in one partition will never appear in the other.
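Hadoop's default partitioner uses the key's Java `hashCode()` modulo the number of reducers. As a hedged sketch, here is the same idea in Python with a deterministic toy hash (`toy_hash` and `partition` are made-up names, and the toy hash will not reproduce the exact partition assignments shown above; only the grouping property matters):

```python
# Partition (word, 1) pairs into NUM_REDUCERS buckets by hashing the key.
# The only property that matters: equal keys always land in the same bucket.
NUM_REDUCERS = 2

def toy_hash(key):
    # Deterministic stand-in for Java's String.hashCode()
    return sum(ord(c) for c in key)

def partition(pairs):
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for key, value in pairs:
        buckets[toy_hash(key) % NUM_REDUCERS].append((key, value))
    return buckets

split0_pairs = [("My", 1), ("Name", 1), ("is", 1), ("Tony", 1),
                ("My", 1), ("Company", 1), ("is", 1), ("Pivotal", 1)]
buckets = partition(split0_pairs)
```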

3. Sort

Sort means sorting. Honestly, in my view this step is not strictly necessary and could be left to the user's own program. So why sort at all? Perhaps the MapReduce authors reasoned: "Most reduce programs will want their input already sorted by key, and if so, we may as well do it for you. Just call me Lei Feng!" (Lei Feng being the proverbial selfless helper.) ...All right, you're Lei Feng.

So suppose the previous data is now sorted; the results are as follows:

Split 0

Partition 1:
Company 1
is 1
is 1


Partition 2:
My 1
My 1
Name 1
Pivotal 1
Tony 1

Split 1

Partition 1:
Company 1
EMC 1
is 1
is 1

Partition 2:
Lisa 1
My 1
My 1
Name 1

As you can see, the entries in each partition are now sorted by key.
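Within each partition the sort is simply an ordering by key; a minimal Python sketch:

```python
# Sort one partition's (word, 1) pairs by key, as the framework does
# before handing the data onward.
partition2_split1 = [("My", 1), ("My", 1), ("Name", 1), ("Lisa", 1)]
sorted_pairs = sorted(partition2_split1, key=lambda kv: kv[0])
# [('Lisa', 1), ('My', 1), ('My', 1), ('Name', 1)]
```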

4. Combine

What is combine? Combine can be understood as a mini reduce that runs right after each map's output. Its purpose is to do a first round of aggregation before the results are sent to the reducers, shrinking the files and making the subsequent transfer cheaper. It is an optional step.

Executing combine on the previous output:

Split 0

Partition 1:
Company 1
is 2

Partition 2:
My 2
Name 1
Pivotal 1
Tony 1

Split 1

Partition 1:
Company 1
EMC 1
is 2

Partition 2:
Lisa 1
My 2
Name 1

Compared with the previous output, the occurrences of "is" and "My" have been partially summed, reducing the size of the output files.
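Since the pairs are already key-sorted, a combiner only has to sum runs of identical keys. A sketch (`combine` is a made-up name; in Hadoop's WordCount, the reduce function itself is typically reused as the combiner):

```python
from itertools import groupby

# Mini-reduce on one mapper's sorted partition output: sum the 1s for
# each run of identical keys to shrink the data before transfer.
def combine(sorted_pairs):
    return [(key, sum(v for _, v in group))
            for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])]

combined = combine([("Company", 1), ("is", 1), ("is", 1)])
# [('Company', 1), ('is', 2)]
```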

5. Copy

Now we are ready to send the output to the reducers. This stage is called copy, although "download" would arguably be a better name: in the implementation, each reducer node uses HTTP to download the data belonging to its own partition from every mapper node.

So according to the previous partition, the results of the download are as follows:

Reducer Node 1 receives a total of two files:

Partition 1:
Company 1
is 2

Partition 1:
Company 1
EMC 1
is 2

Reducer Node 2 also receives two files:

Partition 2:
My 2
Name 1
Pivotal 1
Tony 1

Partition 2:
Lisa 1
My 2
Name 1

As you can see, after the copy, data from the same partition ends up on the same node.

6. Merge

As shown in the previous step, the files a reducer receives are downloaded from different mappers and need to be merged into a single file. So the next step is the merge, and the result is as follows:

Reducer Node 1

Company 1
Company 1
EMC 1
is 2
is 2

Reducer Node 2

Lisa 1
My 2
My 2
Name 1
Name 1
Pivotal 1
Tony 1
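Because every downloaded file is already sorted by key, the merge is a classic k-way merge of sorted streams; Python's `heapq.merge` illustrates the idea (the file contents follow the example for Reducer Node 1):

```python
import heapq

# Each reducer merges the sorted files it downloaded from the mappers
# into one key-sorted stream (tuples compare by key first).
file_from_mapper0 = [("Company", 1), ("is", 2)]
file_from_mapper1 = [("Company", 1), ("EMC", 1), ("is", 2)]

merged = list(heapq.merge(file_from_mapper0, file_from_mapper1))
# [('Company', 1), ('Company', 1), ('EMC', 1), ('is', 2), ('is', 2)]
```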

7. Reduce

Finally, we can perform the reduce itself. This step is quite simple: each reducer makes a final tally of the contents of its file. The results are as follows:

Reducer Node 1

Company 2
EMC 1
is 4

Reducer Node 2

Lisa 1
My 4
Name 2
Pivotal 1
Tony 1
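The final reduce is the same grouping-and-summing as the combine, applied to the merged stream. A sketch for Reducer Node 1 (`reduce_counts` is a made-up name):

```python
from itertools import groupby

# Final tally: sum the partial counts for each key in the merged,
# key-sorted stream on one reducer node.
def reduce_counts(merged):
    return [(key, sum(v for _, v in group))
            for key, group in groupby(merged, key=lambda kv: kv[0])]

node1 = [("Company", 1), ("Company", 1), ("EMC", 1), ("is", 2), ("is", 2)]
result = reduce_counts(node1)
# [('Company', 2), ('EMC', 1), ('is', 4)]
```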

That's it! We have successfully counted the occurrences of each word in the two files and stored the results in two output files: the legendary part-r-00000 and part-r-00001. Look at the contents of the two files, then look back at the partition step, and the mystery should be clear.

If running WordCount in your own environment produces only a single file, part-r-00000, it is because you are using the default settings: by default, a job has only one reducer.

If you want two reducers, you can:

1. Add job.setNumReduceTasks(2) to your source code, setting this job's number of reducers to two,
Or
2. Set the following parameter in mapred-site.xml and restart the service:
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>

This way, the entire cluster will use two reducers by default.

Conclusion:

This article has outlined the entire MapReduce flow and what is done at each stage. It does not cover job scheduling and resource management, because those are the main differences between the first-generation MapReduce framework and the YARN-based framework; the MapReduce principles described above are much the same in both generations. I hope it is helpful.

