Explaining the principle of MapReduce in plain, easy-to-understand English

About Hadoop

Hadoop is an open-source system that implements Google's cloud computing stack, including the parallel computing model MapReduce, the distributed file system HDFS, and the distributed database HBase, along with a wide range of related projects such as ZooKeeper, Pig, Chukwa, Hive, Mahout, Flume, and so on.

Here's a detailed breakdown of the concepts in this article, so that you know exactly what Hadoop is:



1. What is MapReduce? Look at the various explanations below:

(1) MapReduce is one of the core components of Hadoop. A Hadoop distribution includes two indispensable parts: one is the distributed file system HDFS, the other is the distributed computing framework MapReduce. In other words, distributed computational programs can easily be written on the Hadoop platform via MapReduce.

(2) MapReduce is a programming model, a programming method, an abstraction.

(3) Here is the story of a programmer explaining to his wife what MapReduce is. The story is quite long, so please read it patiently.

I asked my wife, "Do you really want to figure out what MapReduce is?" She replied firmly, "Yes." So I asked:

Me: How did you prepare the onion chili sauce? (What follows is not an accurate recipe; do not try it at home.)

Wife: I'll take an onion, chop it up, mix in salt and water, and then grind the mixture in a mixer grinder. That's how you get onion chili sauce.


Wife: But what does this have to do with MapReduce?

Me: Wait a minute. Let me tell the complete story so that you can understand MapReduce within 15 minutes.

Wife: All right.

Me: Now, suppose you want a bottle of mixed chili sauce with mint, onion, tomato, chili, and garlic. What would you do?

Wife: I'll take a pinch of mint leaves, one onion, one tomato, one chili, and one clove of garlic, chop them up, add salt and water, and then grind the mixture in a mixer grinder. That's how you get a bottle of mixed chili sauce.

Me: Right. Now let's apply the concept of MapReduce to the recipe. Map and reduce are actually two kinds of operations; I'll give you a detailed explanation.
Map: Chopping onions, tomatoes, chilies, and garlic is a map operation that acts on each of these objects individually. If you give map an onion, map will chop the onion. Similarly, you can hand the chili, the garlic, and the tomato to map one by one, and you will get a variety of chopped pieces. So when you are chopping a vegetable like an onion, you are performing a map operation. The map operation applies to every input item and produces one or more output fragments; in our case it produces vegetable pieces. The map operation can also filter: if an onion is rotten, you simply throw it away. So if a rotten onion shows up, the map operation filters it out without producing any onion pieces at all.


Reduce (reduction): At this stage, you put all the chopped vegetables into the grinder and grind them to get a bottle of chili sauce. That is, to make a bottle of chili sauce, all the ingredients must be ground together. The grinder therefore aggregates the vegetable pieces produced by the map operation.

Wife: So, this is MapReduce?

Me: You can say yes, and you can say no. In fact this is only part of MapReduce; the real power of MapReduce lies in distributed computing.

Wife: Distributed computing? What is that? Please explain it to me.

Me: No problem.

Me: Let's say you entered a chili sauce competition and your recipe won the Best Chili Sauce award. After the award your chili sauce recipe became popular, so you want to start selling homemade chili sauce. Suppose you need to produce 10,000 bottles of chili sauce every day. What will you do?

Wife: I will find a supplier who can provide me with a lot of raw materials.

Me: Yes, exactly. But can you finish the production by yourself? In other words, can you chop all the raw materials alone, and can a single grinder keep up with the demand? On top of that, we now also need to supply different kinds of chili sauce, such as onion chili sauce, green pepper chili sauce, tomato chili sauce, and so on.

Wife: Of course not. I will hire more workers to chop vegetables. I will also need more grinders so that I can produce chili sauce more quickly.
Me: Yes, so now you have to divide up the work: you will need several people to chop vegetables together. Each person handles a bag full of vegetables, and each person effectively performs a simple map operation. Each person keeps taking vegetables out of the bag and processes only one vegetable at a time, chopping it up, until the bag is empty.
In this way, when all the workers are finished, the work station (where everyone works) holds onion pieces, tomato pieces, garlic pieces, and so on.

Wife: But how can I make the different kinds of chili sauce?

Me: Now you'll see the phase that MapReduce hides from you: the stirring (shuffle) phase. MapReduce mixes together all the vegetable pieces produced by the map operations, and every piece is produced under a key. The stirring is done automatically; you can assume that the key is the name of a raw material, such as "onion". So all the pieces with the key "onion" are gathered together and transferred to the grinder that grinds onions, and that is how you get onion chili sauce. In the same way, all the tomato pieces are transferred to the grinder labeled "tomato", and tomato chili sauce is made.
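
To make the analogy concrete, here is a minimal Python sketch of the three phases. It is only an illustration of the idea, not Hadoop code, and the names chop, shuffle, and grind are invented for this example:

    from collections import defaultdict

    # map: one vegetable in, zero or more (key, piece) pairs out;
    # a rotten vegetable is filtered out and produces nothing
    def chop(vegetable):
        if vegetable.startswith("rotten"):
            return []
        return [(vegetable, vegetable + " piece")] * 3

    # shuffle/stir: group all pieces by their key (the vegetable name)
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, piece in pairs:
            groups[key].append(piece)
        return groups

    # reduce: one grinder per key turns all its pieces into one sauce
    def grind(key, pieces):
        return "%s chili sauce (from %d pieces)" % (key, len(pieces))

    bag = ["onion", "tomato", "rotten onion", "garlic", "onion"]
    pairs = [pair for veg in bag for pair in chop(veg)]
    for key, pieces in shuffle(pairs).items():
        print(grind(key, pieces))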

(4) The above is a theoretical explanation of what MapReduce is; now let's understand it through the process and code of an actual MapReduce program.
Suppose you want to count the most frequently used words in the computer science papers of the past 10 years, to see what everyone has been working on. What do you do once the papers are collected?

Method One:
Write a small program that traverses all the papers in order, counts the number of occurrences of each word it encounters, and finally reports which words are most popular. This method is very effective when the data set is comparatively small, it is the easiest to implement, and in that case it is the appropriate solution.
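
A minimal sketch of Method One in Python (assuming the papers are plain-text files under a hypothetical papers/ directory):

    import glob
    from collections import Counter

    counts = Counter()
    for path in glob.glob("papers/*.txt"):  # hypothetical location of the papers
        with open(path, encoding="utf-8") as f:
            counts.update(f.read().lower().split())

    print(counts.most_common(10))  # the ten most popular words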

Method Two:
Write a multithreaded program and traverse the papers concurrently.
In theory this problem can be highly concurrent, because counting one file does not affect the statistics of another file. When our machines have multiple cores or multiple processors, Method Two is certainly more efficient than Method One. But writing a multithreaded program is much harder than Method One: we have to synchronize the shared data ourselves, for example to prevent two threads from counting the same file twice.
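
A sketch of Method Two, under the same assumption of plain-text files under papers/: a thread-safe queue hands out files so no two threads count the same file twice, and a lock guards the shared counter:

    import glob
    import queue
    import threading
    from collections import Counter

    files = queue.Queue()
    for path in glob.glob("papers/*.txt"):  # hypothetical location of the papers
        files.put(path)

    counts = Counter()
    lock = threading.Lock()

    def worker():
        while True:
            try:
                path = files.get_nowait()  # each file is handed out exactly once
            except queue.Empty:
                return
            with open(path, encoding="utf-8") as f:
                local = Counter(f.read().lower().split())
            with lock:  # synchronize updates to the shared counter
                counts.update(local)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counts.most_common(10))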

Method Three:
Hand the job to multiple computers.
We can take the program from Method One, deploy it to N machines, divide the papers into N parts, and have each machine run one job. This method runs fast enough, but deployment is cumbersome: we have to copy the program to the other machines by hand and split the collection of papers by hand, and the most painful part is integrating the results of the N runs (although, of course, we can also write a program for that).

Method Four:
Let MapReduce come and help us.

MapReduce is essentially Method Three, but how to split the file set, how to copy the program, and how to integrate the results are all defined by the framework. All we have to do is define the task (the user program) and hand it to MapReduce.


Map function and reduce function


The map function and the reduce function are implemented by the user; these two functions define the task itself.

Map function: accepts a key-value pair and produces a set of intermediate key-value pairs. The MapReduce framework passes all the values that share the same intermediate key generated by the map functions to one reduce function.

Reduce function: accepts a key together with the set of values associated with it, and merges that set of values to produce a smaller set of values (usually zero or one output value).
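
In the notation of the original MapReduce paper, the two type signatures are:

    map    (k1, v1)       -> list(k2, v2)
    reduce (k2, list(v2)) -> list(v2)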

The core code of a MapReduce program that counts word frequency is very brief; it mainly implements these two functions:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

In the word-frequency example, the key accepted by the map function is the file name and the value is the contents of the document. Map scans the document word by word; each time it encounters a word w, it produces an intermediate key-value pair <w, "1">, which means "we found another occurrence of the word w". MapReduce passes all key-value pairs with the same key (here, the same word w) to the reduce function, so the key accepted by the reduce function is the word w and the values are a list of "1"s (this is the most basic implementation, which can be optimized), whose length equals the number of key-value pairs with key w. Accumulating these "1"s gives the number of occurrences of the word w. Finally, the counts of all the words are written to a user-defined location in the underlying distributed storage system (GFS or HDFS).
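
A runnable Python sketch of this flow, simulating what the framework does around the two user functions (the document names and contents here are made up):

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name, value: document contents
        return [(w, 1) for w in value.split()]

    def reduce_fn(key, values):
        # key: a word, values: a list of counts
        return (key, sum(values))

    docs = {"doc1": "map reduce map", "doc2": "reduce reduce map"}

    # the framework's job: run map, then group intermediate pairs by key
    intermediate = defaultdict(list)
    for name, contents in docs.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)

    # run reduce once per distinct key
    results = [reduce_fn(w, counts) for w, counts in intermediate.items()]
    print(results)  # [('map', 3), ('reduce', 3)]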



Working principle



[Figure omitted: the execution-overview flowchart from the MapReduce paper.] Everything starts with the user program, which links in the MapReduce library and implements the most basic map and reduce functions. The steps below follow the numbered order of execution in that figure.

1. The MapReduce library first divides the input files of the user program into M splits (M is user-defined), each usually 16 MB to 64 MB, shown as split 0~4 on the left side of the figure; it then uses fork to copy the user process onto the other machines in the cluster.

2. One copy of the user program is called the master and the rest are called workers. The master is responsible for scheduling: it assigns jobs (map jobs or reduce jobs) to idle workers. The number of workers can also be specified by the user.

3. A worker assigned a map job reads the input data of the corresponding split; the number of map jobs is determined by M and corresponds one-to-one with the splits. The map job extracts key-value pairs from the input data and passes each pair to the map function as its arguments. The intermediate key-value pairs produced by the map function are buffered in memory.

4. The buffered intermediate key-value pairs are periodically written to local disk, partitioned into R regions (R is user-defined); each region will later correspond to one reduce job. The locations of these intermediate key-value pairs are reported to the master, which is responsible for forwarding this information to the reduce workers.

5. The master notifies each worker assigned a reduce job where the partition it is responsible for lives (certainly more than one place: the intermediate key-value pairs produced by each map job may fall into all R different partitions). When the reduce worker has read all the intermediate key-value pairs it is responsible for, it sorts them so that pairs with the same key are grouped together. Sorting is necessary because different keys may map to the same partition, that is, the same reduce job (there are far fewer partitions than keys).

6. The reduce worker traverses the sorted intermediate key-value pairs; for each unique key, it passes the key and the associated set of values to the reduce function, and the output of the reduce function is appended to the output file of this partition.

7. When all the map and reduce jobs are completed, the master wakes up the genuine user program, and the MapReduce call returns to the user program's code.

After everything has run, the MapReduce output sits in the output files of the R partitions (one per reduce job). Users usually do not need to merge these R files; instead they are handed as input to another MapReduce program. Note that throughout the process the input data come from the underlying distributed file system (GFS), the intermediate data are placed on the local file system, and the final output is written back to the underlying distributed file system (GFS). Also note the difference between a map/reduce job and the map/reduce function: a map job processes one input split and may call the map function many times, once for each input key-value pair; a reduce job processes one partition of intermediate key-value pairs, calls the reduce function once for each distinct key, and ultimately corresponds to one output file.
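
Which of the R partitions an intermediate key lands in is decided by a partitioning function; the default in the paper is a hash of the key modulo R. A small sketch of that idea (R and the data here are made up, and the simple byte-sum hash is only a deterministic stand-in):

    R = 3  # number of reduce jobs / partitions

    def partition(key):
        # stand-in for the paper's default partitioner: hash(key) mod R
        return sum(key.encode()) % R

    # each map job appends its pairs to one of R regions on local disk;
    # a list of R dicts simulates those regions here
    regions = [dict() for _ in range(R)]
    for word, count in [("map", 2), ("reduce", 1), ("hadoop", 1)]:
        r = partition(word)
        regions[r][word] = regions[r].get(word, 0) + count

    for r, region in enumerate(regions):
        print("reduce job", r, "handles", region)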

Summary:

Having read all of the above, do you now understand what MapReduce is, what a key is, how to filter valid data, and how to get the data you want?
MapReduce is a programming idea that can be implemented in languages such as Java and C++. The role of map is to filter and transform the raw data, and the role of reduce is to aggregate the processed data into the result we want, just as when you made tomato chili sauce. That is, we use Hadoop in this way for, say, log processing, to extract the data we care about.
