Google technology "sanbao": MapReduce

Tags: hadoop, mapreduce

As the legend goes: Google's technology has its "sanbao" (three treasures): GFS, MapReduce, and BigTable!

Google published three very influential papers between 2003 and 2006: the GFS paper at SOSP 2003, the MapReduce paper at OSDI 2004, and the BigTable paper at OSDI 2006. SOSP and OSDI are both top conferences in the operating systems field, rated Class A on the China Computer Federation's list of recommended conferences. SOSP is held in odd-numbered years, while OSDI is held in even-numbered years.

This blog post introduces MapReduce.

1. MapReduce is not found only at Google, so I will use the Hadoop project structure as an example to describe where MapReduce sits.
Hadoop is essentially an open-source implementation of Google's sanbao: Hadoop MapReduce corresponds to Google MapReduce, HBase corresponds to BigTable, and HDFS corresponds to GFS. HDFS (or GFS) provides efficient unstructured storage services for the layers above it, HBase (or BigTable) is a distributed database that provides structured data services, and Hadoop MapReduce (or Google MapReduce) is a parallel-computing programming model used for job scheduling.

GFS and BigTable already provide us with high-performance, high-concurrency services, but parallel programming is not something every programmer can handle. If our applications cannot run concurrently, GFS and BigTable are meaningless. The greatness of MapReduce is that programmers who are unfamiliar with parallel programming can still harness the power of a distributed system.

To put it simply, MapReduce is a framework that splits a large job into many small jobs (the large job and the small jobs should be essentially the same, differing only in scale). What you need to do is decide how many pieces to split the work into and define the job itself.
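
As a rough illustration of that split-and-merge idea, here is a toy single-process sketch with made-up inputs; in the real framework each small job would run on a different machine, while here we simply loop over the pieces.

def split(items, m):
    # cut the input into m roughly equal pieces (the user picks m)
    size = (len(items) + m - 1) // m
    return [items[i:i + size] for i in range(0, len(items), size)]

def small_job(piece):
    # the job itself, defined by the user; here it just counts characters
    return sum(len(s) for s in piece)

inputs = ["paper one", "paper two", "paper three", "paper four"]
partial_results = [small_job(piece) for piece in split(inputs, m=2)]
print(sum(partial_results))   # merge the small results into the final answer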

The following example illustrates how MapReduce works.

2. Example: Count Word Frequency

Suppose I want to count the most frequently used words in computer science papers from the past 10 years, to see what everyone has been studying. After collecting the papers, what should I do?

Method 1: I can write a small program that traverses all the papers in order, counts how many times each word appears, and finally tells me which words are the most popular.

This method is very effective when the dataset is relatively small, and since it is the simplest to implement, it is a suitable way to solve this problem.
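
A minimal sketch of method 1, assuming the papers are plain-text files sitting in a directory (the "papers/" path is made up):

import os
from collections import Counter

def count_words(paper_dir):
    # traverse every paper in order and tally how often each word occurs
    counts = Counter()
    for name in os.listdir(paper_dir):
        with open(os.path.join(paper_dir, name), encoding="utf-8") as f:
            counts.update(f.read().lower().split())
    return counts

# counts = count_words("papers/")      # hypothetical directory of papers
# print(counts.most_common(10))        # the most popular words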

Method 2: Write a multi-threaded program and traverse the papers concurrently.

In theory this problem can be handled with a high degree of concurrency, because counting one file never affects counting another. On a multi-core or multi-processor machine, method 2 should certainly be more efficient than method 1. However, writing a multi-threaded program is much harder: we have to synchronize access to shared data ourselves, for example to prevent two threads from counting the same file twice.
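
A sketch of method 2 under the same assumptions as before. The thread pool hands each file to exactly one worker, which is what prevents a file from being counted twice; note that in CPython the global interpreter lock limits the speedup for pure-Python counting, so this only illustrates the structure of the multi-threaded version.

import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_one(path):
    # counting one file never affects counting another file
    with open(path, encoding="utf-8") as f:
        return Counter(f.read().lower().split())

def count_words_concurrently(paper_dir, workers=4):
    paths = [os.path.join(paper_dir, name) for name in os.listdir(paper_dir)]
    totals = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # merging the per-file results is the shared-data step we must get right
        for partial in pool.map(count_one, paths):
            totals.update(partial)
    return totals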

Method 3: Submit the job to multiple computers.

We can take the program, deploy it to N machines, split the papers into N piles, and have each machine run one job. This approach is fast enough, but deployment is painful: we have to copy the program to the other machines by hand and split the papers by hand, and the most painful part is merging the N partial results (although we could, of course, write yet another program for that).
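
A sketch of only the merge step, the part described as the most painful, assuming each of the N machines wrote its partial counts to a text file of "word count" lines (the file names are made up):

from collections import Counter

def merge_partial_counts(result_files):
    # merge the N partial word-count files, one produced by each machine
    totals = Counter()
    for path in result_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, count = line.split()
                totals[word] += int(count)
    return totals

# totals = merge_partial_counts(["counts_0.txt", "counts_1.txt", "counts_2.txt"])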

Method 4: Let MapReduce help us!

MapReduce is essentially method 3, but the framework takes care of splitting the file set, copying the program, and merging the results. All we need to do is define the task (the user program) and hand everything else to MapReduce.

Before introducing how MapReduce works, let's first look at its two core functions, map and reduce, together with their pseudocode.

3. Map and reduce Functions

The map function and reduce function are implemented by the user. These two functions define the task itself.

  • Map function: takes a key-value pair and produces a set of intermediate key-value pairs. The MapReduce framework passes all intermediate values that share the same key to one call of the reduce function.
  • Reduce function: takes a key and the set of values associated with it, and merges these values to form a possibly smaller set of values (typically just zero or one output value).

The core MapReduce code for counting word frequency is very short; it mainly consists of implementing these two functions.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

In the word-frequency example, the map function takes the file name as the key and the file content as the value. It traverses the words one by one, and every time it encounters a word w it emits the intermediate key-value pair <w, "1">, meaning "we have found one more w". MapReduce groups the intermediate pairs that share the same key (the same word w) and passes them to a single reduce call: the reduce function receives the word w together with a list of values that are all "1" (this is the basic implementation; it can be optimized, as discussed later), and the length of that list equals the number of intermediate pairs whose key is w. Summing up the "1"s gives the number of occurrences of the word w. The occurrence counts of all words are finally written to a user-defined location in the underlying distributed storage system (GFS or HDFS).
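
To make the pseudocode concrete, here is a minimal single-process Python sketch of the same logic. It is not the real framework: the grouping that MapReduce performs between the map and reduce phases is simulated with an in-memory dictionary, and the input documents are made up.

from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"          # emit the intermediate pair <w, "1">

def reduce_fn(key, values):
    # key: a word, values: a list of counts (all "1" here)
    return str(sum(int(v) for v in values))

documents = {"doc1": "map reduce map", "doc2": "reduce"}   # made-up input

# "shuffle": group the intermediate pairs by key, as the framework would
intermediate = defaultdict(list)
for name, contents in documents.items():
    for word, one in map_fn(name, contents):
        intermediate[word].append(one)

results = {word: reduce_fn(word, ones) for word, ones in intermediate.items()}
print(results)   # {'map': '2', 'reduce': '2'}
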
4. How does MapReduce work?

The flowchart below, taken from the paper, shows the whole process. Everything starts with the user program at the top: the user program links in the MapReduce library and implements the map and reduce functions. The numbers in the figure mark the order of execution.

  1. The MapReduce library splits the input files of the user program into M pieces (M is user-defined), each typically 16 MB to 64 MB, shown on the left of the figure as split 0 through split 4; it then uses fork to copy the user program onto other machines in the cluster.
  2. One copy of the user program becomes the master and the rest become workers. The master is responsible for scheduling and assigns jobs (map jobs or reduce jobs) to idle workers; the number of workers can also be specified by the user.
  3. A worker assigned a map job starts reading the input data of its split. The number of map jobs is determined by M, one per split. The map job extracts key-value pairs from the input data and passes each pair as an argument to the map function; the intermediate key-value pairs produced by the map function are buffered in memory.
  4. The buffered intermediate key-value pairs are periodically written to the local disk, partitioned into R regions (R is user-defined, and each region will later correspond to one reduce job). The locations of these intermediate pairs are reported to the master, which forwards the information to the reduce workers.
  5. The master tells a worker assigned a reduce job where the partition it is responsible for lives (there is normally more than one location, since the intermediate pairs produced by every map job may fall into all R different partitions). After the reduce worker has read all the intermediate key-value pairs it is responsible for, it sorts them so that pairs with the same key sit next to each other. Sorting is necessary because different keys can map to the same partition, that is, to the same reduce job (after all, there are far fewer partitions than keys).
  6. The reduce worker traverses the sorted intermediate key-value pairs; for each distinct key, it passes the key and the associated list of values to the reduce function, and the output of the reduce function is appended to the output file of this partition.
  7. When all map and reduce jobs have completed, the master wakes up the genuine user program, and the MapReduce call in the user program returns to the user's code. (A minimal single-process simulation of these steps is sketched below.)
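
The following is a minimal single-process simulation of steps 1 through 7, assuming the "cluster" is nothing but Python lists and dictionaries: there is no real master, no workers, and no fork, so it only illustrates the data flow, not the real framework.

from collections import defaultdict

def map_fn(doc_name, contents):
    for word in contents.split():
        yield word, "1"

def reduce_fn(word, counts):
    return str(sum(int(c) for c in counts))

def run_mapreduce(documents, M=5, R=3):
    docs = list(documents.items())
    # step 1: split the input into M pieces
    splits = [docs[i::M] for i in range(M)]
    # steps 3-4: each "map job" handles one split and writes its intermediate
    # pairs into R regions, chosen by hash(key) % R
    # (Python salts hash() per run, so which region a word lands in varies)
    regions = [[] for _ in range(R)]
    for split in splits:
        for doc_name, contents in split:
            for key, value in map_fn(doc_name, contents):
                regions[hash(key) % R].append((key, value))
    # steps 5-6: each "reduce job" sorts its region so that equal keys are
    # adjacent, then calls reduce once per distinct key
    outputs = []
    for region in regions:
        grouped = defaultdict(list)
        for key, value in sorted(region):
            grouped[key].append(value)
        outputs.append({k: reduce_fn(k, v) for k, v in grouped.items()})
    # step 7: the R output "files" are handed back to the user code
    return outputs

print(run_mapreduce({"d1": "map reduce map", "d2": "reduce gfs"}))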

After everything has finished, the MapReduce output sits in R output files, one per partition (that is, one per reduce job). Usually there is no need to merge these R files; they are often handed as input to another MapReduce program for further processing. During the whole process, the input data comes from the underlying distributed file system (GFS), the intermediate data is kept on the local file system, and the final output is written back to the underlying distributed file system (GFS).

Note the difference between map/reduce jobs and the map/reduce functions: a map job processes one input split and may call the map function many times, once for every input key-value pair; a reduce job processes the intermediate key-value pairs of one partition and calls the reduce function once for every distinct key. Each reduce job ultimately corresponds to one output file.

I prefer to divide the whole process into three stages. The first is the preparation stage, covering steps 1 and 2; the protagonist is the MapReduce library, which splits the job and copies the user program. The second is the execution stage, covering steps 3 to 6; the protagonists are the user-defined map and reduce functions, and every small job runs independently. The third is the wrap-up stage: the job is finished and the results sit in the output files, and what happens next depends on how the user wants to process the output.

5. How is the word frequency counted?

Combining this with Section 4, we can see how the code in Section 3 runs. Suppose we set M = 5 and R = 3, and we have 6 machines, one of which acts as the master.


This figure shows how MapReduce handles the word-frequency statistics. Because there are not enough map workers, splits 1, 3, and 4 are processed first and produce intermediate key-value pairs; once all the intermediate values are ready, the reduce jobs read their corresponding partitions and output the final counts.

6. The user's rights

The user's most basic task is to implement the map and reduce interfaces, but several other useful interfaces are also open to the user.

  • An input reader. It splits the input into M pieces and defines how to extract the initial key-value pairs from the raw data; in the word-frequency example, for instance, the key is the file name and the value is the file contents.
  • A partition function. It maps an intermediate key-value pair produced by the map function to one of the partitions; the simplest implementation is to hash the key and take it modulo R (see the sketch after this list).
  • A compare function. It is used when a reduce job sorts its input; it defines the ordering of the keys.
  • An output writer. It writes the results to the underlying distributed file system.
  • A combiner function. It is essentially the same as the reduce function and is used for the optimization mentioned earlier. In the word-frequency example, if every single <w, "1"> pair had to be shipped to and read by the reduce worker it would waste a lot of time, because reduce and map usually do not run on the same machine; so we can first run a combiner where the map runs, and the reduce worker then only needs to read one <w, "n"> pair per word from each map job.
  • The map and reduce functions: enough has been said about these already.
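
A small sketch of two of these hooks, a hash-based partition function and a combiner, reusing the toy word-count setup from earlier; the function names are made up for illustration and are not part of any real MapReduce API.

from collections import Counter

R = 3  # number of partitions, i.e. number of reduce jobs

def partition(key, r=R):
    # the simplest partition function: hash the key and take it modulo R
    return hash(key) % r

def combine(intermediate_pairs):
    # a reduce-like step run on the map side: many <w, "1"> pairs become a
    # single <w, "n"> per word, so the reduce worker has less data to fetch
    local_counts = Counter()
    for word, one in intermediate_pairs:
        local_counts[word] += int(one)
    return [(word, str(n)) for word, n in local_counts.items()]

pairs = [("map", "1"), ("reduce", "1"), ("map", "1")]
print(combine(pairs))                  # [('map', '2'), ('reduce', '1')]
print(partition("map"), partition("reduce"))
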
7. Implementations of MapReduce

MapReduce now has multiple implementations. Apart from Google's own, there is the famous Hadoop; the main difference is that Google's is written in C++ while Hadoop uses Java. In addition, Stanford University has implemented a MapReduce that runs in a multi-core/multi-processor, shared-memory environment, called Phoenix (see reference [3] for an introduction); the related paper was published at HPCA 2007 and won the best paper award that year!

References
[1] MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI '04.
[2] Wikipedia. http://en.wikipedia.org/wiki/MapReduce
[3] Phoenix. http://mapreduce.stanford.edu/
[4] Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of HPCA '07.
