The fundamentals of MapReduce


Transferred from: http://blog.csdn.net/opennaive/article/details/7514146

Legend has it that Google technology has "three treasures": GFS, MapReduce and BigTable!

Between 2003 and 2006 Google published three influential papers: the GFS paper at SOSP 2003, the MapReduce paper at OSDI 2004, and the BigTable paper at OSDI 2006. SOSP and OSDI are top conferences in the operating systems field, rated Class A in the CCF recommended conference list; SOSP is held in odd-numbered years and OSDI in even-numbered years.

This post introduces MapReduce.

1. What does MapReduce do?

Since I could not find a diagram from Google itself, I will borrow the structure diagram of the Hadoop project (not reproduced here) to show where MapReduce sits. Hadoop is in fact an open-source implementation of Google's three treasures: Hadoop MapReduce corresponds to Google MapReduce, HBase to BigTable, and HDFS to GFS. HDFS (or GFS) provides efficient unstructured storage services to the layers above it; HBase (or BigTable) is a distributed database offering structured data services; and Hadoop MapReduce (or Google MapReduce) is a programming model for parallel computing that also takes care of job scheduling.

GFS and BigTable already give us high-performance, highly concurrent services, but parallel programming is not something every programmer can do, and if our applications themselves cannot run in parallel, GFS and BigTable are of little use. The beauty of MapReduce is that programmers who are unfamiliar with parallel programming can still unleash the power of a distributed system.

In a nutshell, MapReduce is a framework for splitting a large job into many small jobs (the large job and the small jobs should be essentially the same, differing only in scale); all the user has to do is decide how many pieces to split the work into and define the job itself.

Here is an example that will run through the rest of this article to explain how MapReduce works.

2. Example: counting word frequencies

Suppose I want to count the most frequently used words in all computer science papers from the past 10 years, to see what everyone has been studying. What should I do once I have collected the papers?

Method one: I can write a small program that goes through all the papers in order, counting the number of occurrences of each word it encounters; at the end I will know which words are the most popular.

This method is very effective when the dataset is relatively small, and it is the easiest to implement, so it is the right choice for that case.
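As a rough sketch, assuming the papers are plain-text files whose paths are passed on the command line (an assumption of this illustration, not something from the original post), method one might look like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class SequentialWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (String paper : args) {                              // go through every paper in order
            for (String line : Files.readAllLines(Path.of(paper))) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
                }
            }
        }
        counts.entrySet().stream()                               // show the most popular words first
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(20)
              .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}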

Method two: write a multithreaded program and traverse the papers concurrently.

This problem can, in theory, be made highly concurrent, because counting one file does not affect the counting of another. When our machine has multiple cores or multiple processors, method two is certainly more efficient than method one. But writing a multithreaded program is much harder than method one: we have to synchronize shared data ourselves, for example to prevent two threads from counting the same file twice.
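A sketch of method two, under the same assumptions as before (plain-text files passed on the command line): a ConcurrentHashMap takes care of synchronizing the shared counts, and submitting each file exactly once guarantees that no paper is counted twice.

import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.*;

public class ThreadedWordCount {
    public static void main(String[] args) throws Exception {
        ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        for (String paper : args) {          // each file is submitted exactly once,
            pool.submit(() -> {              // so no two threads count the same paper
                try {
                    for (String line : Files.readAllLines(Path.of(paper))) {
                        for (String word : line.toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum); // atomic update
                        }
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println(counts);
    }
}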

Method Three: Give the job to multiple computers to complete.

We can take the program from method one, deploy it to N machines, divide the papers into N parts, and have each machine run one job. This runs fast enough, but deployment is cumbersome: we have to copy the program to the other machines by hand, split the collection of papers by hand, and, most painfully of all, merge the results of the N runs (though of course we could write yet another program for that).

Method Four: Let MapReduce help us!

MapReduce is essentially method three, except that how to split the file set, how to copy the program, and how to merge the results are all defined by the framework. All we have to do is define the task (the user program) and hand it over to MapReduce.

Before describing how MapReduce works, let's talk about the two core functions, map and reduce, and the MapReduce pseudocode.

3. Map function and reduce function

The map function and the reduce function are left for the user to implement; these two functions define the task itself.

    • Map function: accepts a key-value pair and produces a set of intermediate key-value pairs. The MapReduce framework passes all the intermediate values that share the same intermediate key to one reduce function.
    • Reduce function: accepts a key together with its associated set of values, and merges those values to produce a smaller set of values (usually zero or one value).

The core code of a MapReduce program that counts word frequencies is very short; it mainly implements these two functions.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

In the word-frequency example, the key that the map function receives is the file name and the value is the contents of the document. map scans the document and, each time it encounters a word w, produces an intermediate key-value pair <w, "1">, meaning "we have found another occurrence of the word w". MapReduce passes key-value pairs with the same key (here, the word w) to the same reduce function, so the key that the reduce function receives is the word w and the values are a list of "1"s (this is the most basic implementation, and it can be optimized), whose length equals the number of key-value pairs with key w. Adding these "1"s together gives the number of occurrences of the word w. Finally, these word counts are written to a user-defined location and stored in the underlying distributed storage system (GFS or HDFS).
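In Hadoop MapReduce, the same two functions can be written against the Java API. The sketch below follows the structure of the standard Hadoop WordCount example (class names and the "word count" job name are illustrative, and a recent Hadoop 2.x/3.x release is assumed):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: (offset, line of text) -> <word, 1> for every word in the line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);            // emit the intermediate pair <w, 1>
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner optimization, see section 6
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would typically be run with something like hadoop jar wordcount.jar WordCount <input dir in HDFS> <output dir in HDFS>.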

4. How MapReduce Works

The figure given in the paper (not reproduced here) shows the overall flow. Everything starts with the user program at the top: the user program links the MapReduce library and implements the most basic map and reduce functions. The order of execution in the figure is marked with numbers.

    1. The MapReduce library first splits the input files of the user program into M pieces (M is user-defined), each typically 16 MB to 64 MB; on the left of the figure these are split0 through split4. It then uses fork to copy the user process onto the other machines in the cluster.
    2. One copy of the user program becomes the master and the rest become workers. The master is responsible for scheduling: it assigns jobs (map jobs or reduce jobs) to idle workers. The number of workers can also be specified by the user.
    3. A worker assigned a map job starts reading the input data of the corresponding split; the number of map jobs is determined by M and corresponds one-to-one with the splits. The map job extracts key-value pairs from the input data and passes each pair as an argument to the map function; the intermediate key-value pairs produced by the map function are buffered in memory.
    4. The buffered intermediate key-value pairs are periodically written to local disk, partitioned into R regions (R is user-defined); later on, each region corresponds to one reduce job. The locations of these intermediate key-value pairs are reported to the master, which is responsible for forwarding the information to the reduce workers.
    5. The master tells each worker assigned a reduce job where the partition it is responsible for lives (certainly in more than one place, since the intermediate key-value pairs produced by every map job may map to all R partitions). Once a reduce worker has read all of the intermediate key-value pairs it is responsible for, it sorts them so that pairs with the same key are grouped together. The sorting is necessary because different keys can map to the same partition, i.e. the same reduce job (after all, there are far fewer partitions than keys).
    6. The reduce worker iterates over the sorted intermediate key-value pairs and, for each distinct key, passes the key and its associated values to the reduce function; the output produced by the reduce function is appended to the output file of this partition.
    7. When all map and reduce jobs have completed, the master wakes up the original user program, and the MapReduce call returns to the user program's code.

After everything has finished, the MapReduce output is left in the output files of the R partitions (one file per reduce job). Users usually do not need to merge these R files; instead they are often handed as input to another MapReduce program. Throughout the process, the input data comes from the underlying distributed file system (GFS), the intermediate data is kept on the local file system, and the final output is written back to the underlying distributed file system (GFS). Also note the difference between a map/reduce job and the map/reduce function: a map job processes one split of input data and may call the map function many times, once per input key-value pair; a reduce job processes the intermediate key-value pairs of one partition, calling the reduce function once for each distinct key; and each reduce job ultimately corresponds to one output file.

I prefer to divide the process into three stages. The first is the preparation stage, covering steps 1 and 2; the protagonist is the MapReduce library, which splits the job and copies the user program, among other tasks. The second is the running stage, covering steps 3, 4, 5 and 6; the protagonists are the user-defined map and reduce functions, and each small job runs independently. The third is the completion stage, when the jobs are finished and the results sit in the output files; what happens next depends on what the user wants to do with the output.

5. How the word frequencies are counted

Combining this with section 4, we can see how the code from section 3 runs. Suppose we set M = 5 and R = 3, and that there are 6 machines, one of which is the master.

The picture in the original post (not reproduced here) depicts how MapReduce handles the word-frequency statistics. Because there are not enough map workers, splits 1, 3 and 4 are processed first and produce intermediate key-value pairs; once all the intermediate values are ready, the reduce jobs start reading their corresponding partitions and output the statistical results.
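To make the data flow concrete, here is a minimal single-process sketch of the same pipeline (split into M pieces, map, partition into R regions, group and sort by key, reduce), with M = 5 and R = 3 as above. It only illustrates the mechanics; it is not how Hadoop or Google's implementation is structured internally, and the sample documents are made up for the illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    static final int R = 3;     // number of partitions, i.e. reduce jobs

    public static void main(String[] args) {
        // M = 5 splits; here each "split" is simply one small document.
        List<String> splits = List.of(
            "the quick brown fox", "the lazy dog", "quick quick fox",
            "lazy brown dog", "the fox and the dog");

        // One region of intermediate key-value pairs per future reduce job.
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int r = 0; r < R; r++) partitions.add(new ArrayList<>());

        // Steps 3-4: each "map job" turns its split into intermediate <word, 1> pairs,
        // and each pair is assigned to one of the R regions by hashing its key.
        for (String split : splits) {                          // one iteration ~ one map job
            for (String word : split.split("\\s+")) {
                int region = (word.hashCode() & Integer.MAX_VALUE) % R;
                partitions.get(region).add(Map.entry(word, 1));
            }
        }

        // Steps 5-6: each "reduce job" groups and sorts its partition so that equal keys
        // sit together, then sums the "1"s for every distinct key.
        for (int r = 0; r < R; r++) {
            Map<String, Integer> output = new TreeMap<>();     // sorted by key
            for (Map.Entry<String, Integer> kv : partitions.get(r)) {
                output.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
            System.out.println("output file for partition " + r + ": " + output);
        }
    }
}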

6. What the user can control

The most important task for the user is to implement the map and reduce interfaces, but there are a few other useful interfaces open to the user as well.

    • An input reader. This function splits the input into M pieces and defines how to extract the initial key-value pairs from the data; in the word-frequency example, for instance, it defines the key-value pair whose key is the file name and whose value is the file contents.
    • A partition function. This function maps an intermediate key-value pair produced by the map function to a partition; the simplest implementation is to hash the key and take it modulo R (see the sketch after this list).
    • A compare function. This function is used by a reduce job when sorting; it defines the ordering between keys.
    • An output writer. It is responsible for writing the results to the underlying distributed file system.
    • A combiner function. It is really a reduce function, used for the optimization mentioned earlier: in the word-frequency example, if every <w, "1"> had to be read individually, a lot of time would be wasted, because the reduce worker and the map worker are usually not the same machine; so we can first run a combiner on the machine where map ran, and reduce then only needs to read <w, "n">.
    • The map and reduce functions themselves, about which there is not much more to say.
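As an example, the default partitioning behaviour mentioned above (hash the key, then take it modulo R) could be sketched as a Hadoop Partitioner; this mirrors what Hadoop's built-in HashPartitioner does, and the class name here is only illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Maps each intermediate key to one of the R reduce partitions:
// hash the key, clear the sign bit, then take it modulo the number of reduce tasks.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be plugged into the driver shown in section 3 with job.setPartitionerClass(WordPartitioner.class).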
7. MapReduce implementations

MapReduce has been implemented in many ways. Besides Google's own implementation there is the famous Hadoop; one difference is that Google's version is written in C++ while Hadoop is written in Java. In addition, Stanford University has implemented a MapReduce for multi-core/multi-processor, shared-memory environments called Phoenix; its paper was published at HPCA 2007 and was the best paper of that year!
