Legend has it that Google's technology rests on "three treasures" (sanbao): GFS, MapReduce, and BigTable!
Between 2003 and 2006 Google published three very influential papers: the GFS paper at SOSP 2003, the MapReduce paper at OSDI 2004, and the BigTable paper at OSDI 2006. SOSP and OSDI are both top conferences in the operating systems field and are rated Class A in the CCF (China Computer Federation) recommended conference list. SOSP is held in odd-numbered years, while OSDI is held in even-numbered years.
This post introduces MapReduce.
1. Where MapReduce fits. I could not find a diagram of how MapReduce sits inside Google's own stack, so I will use the Hadoop project structure to describe MapReduce's position, as shown in the figure below.
Hadoop is essentially an open-source implementation of Google's three treasures: Hadoop MapReduce corresponds to Google MapReduce, HBase corresponds to BigTable, and HDFS corresponds to GFS. HDFS (or GFS) provides efficient unstructured storage to the layers above it; HBase (or BigTable) is a distributed database that provides structured data services; and Hadoop MapReduce (or Google MapReduce) is a parallel-computing programming model used for job scheduling.
GFS and BigTable already give us high-performance, highly concurrent services, but parallel programming is not something every programmer can handle; if our applications cannot run in parallel, GFS and BigTable are of little use. The greatness of MapReduce is that programmers who are unfamiliar with parallel programming can still exploit the full power of a distributed system.
Simply put, MapReduce is a framework that splits one large job into many small jobs (the large job and the small jobs should be essentially the same kind of work, just at different scales); what you need to do is decide how many pieces to split it into and define the job itself.
The rest of this post uses one running example to explain how MapReduce works.
2. Example: Count Word Frequency
Suppose I want to count the most frequently used words in computer science papers from the past 10 years, to see what everyone has been studying. After collecting the papers, what should I do?
Method 1: I can write a small program that traverses all the papers in order, counts how many times each word appears, and in the end reports which words are the most popular.
This method is very effective when the dataset is relatively small, and is the easiest to implement. It is suitable for solving this problem.
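As a rough illustration of method 1, here is a minimal sequential sketch in Python; the papers/ directory of plain-text files and the word tokenization are assumptions made only for this example.

import glob
import re
from collections import Counter

# Method 1 sketch: one process counts words across all papers in order.
counts = Counter()
for path in glob.glob("papers/*.txt"):
    with open(path, encoding="utf-8") as f:
        # Split the document into lowercase words and count every occurrence.
        counts.update(re.findall(r"[a-z]+", f.read().lower()))

# Report the ten most frequent words.
for word, n in counts.most_common(10):
    print(word, n)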
Method 2: Write a multi-threaded program that traverses the papers in parallel.
In theory this problem can be highly parallel, because counting one file does not affect counting another. When our machine has multiple cores or processors, method 2 is certainly more efficient than method 1. However, writing a multi-threaded program is much harder than method 1: we have to synchronize and share data ourselves, for example to make sure that no two threads count the same file twice.
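A sketch of method 2 under the same assumptions (the hypothetical papers/ directory). Each file is handed to exactly one worker, which sidesteps the double-counting problem, but the partial counts still have to be merged at the end. A process pool is used instead of raw threads here only because, in CPython, threads would not speed up this CPU-bound counting.

import glob
import re
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_file(path):
    # Each file is handled by exactly one worker, so no two workers
    # ever count the same file and no shared state is mutated.
    with open(path, encoding="utf-8") as f:
        return Counter(re.findall(r"[a-z]+", f.read().lower()))

if __name__ == "__main__":
    total = Counter()
    with ProcessPoolExecutor(max_workers=4) as pool:
        for partial in pool.map(count_file, glob.glob("papers/*.txt")):
            total += partial          # merge the per-file counts
    print(total.most_common(10))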
Method 3: Submit the job to multiple computers for completion.
We can take the program from method 1, deploy it on N machines, split the papers into N parts, and let each machine run one job. This runs fast enough, but deployment is painful: we have to copy the program to the machines by hand, split the papers by hand, and, most painfully of all, merge the N sets of results (we could of course write yet another program for that).
Method 4: Let MapReduce help us!
MapReduce is essentially method 3, but the framework takes care of splitting the file set, copying the program around, and integrating the results. We only need to define the task (the user program) and leave everything else to MapReduce.
Before describing how MapReduce works, let us first look at its two core functions, map and reduce, together with MapReduce's pseudocode.
3. Map and reduce Functions
The map function and reduce function are implemented by the user. These two functions define the task itself.
- Map function: takes a key-value pair and produces a set of intermediate key-value pairs. The MapReduce framework collects all intermediate values that share the same intermediate key and passes them to one reduce function.
- Reduce function: takes an intermediate key and the set of values associated with it, and merges those values to produce a smaller set of values (usually just one value, or none).
The core code of a MapReduce program for counting word frequency is very short; it mainly implements these two functions.
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
In the word-frequency example, the map function receives a file name as the key and the file's contents as the value. It traverses the words one by one, and every time it encounters a word w it emits the intermediate key-value pair <w, "1">, meaning one more occurrence of w has been found. MapReduce passes all intermediate pairs that share the same key (the same word w) to the reduce function. The reduce function receives the word w and a list of values that are all "1" (that is the most basic implementation, but it can be optimized); the length of the list equals the number of key-value pairs whose key is w, so summing the "1"s gives the number of occurrences of w. These word counts are finally written to a user-defined location in the underlying distributed storage system (GFS or HDFS).
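A minimal runnable rendering of the pseudocode above, written in Python; the grouping that the MapReduce framework performs between map and reduce is simulated here with an in-memory dictionary, and the tokenization and sample documents are illustrative assumptions.

import re
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents.
    for w in re.findall(r"[a-z]+", value.lower()):
        yield (w, "1")              # emit the intermediate pair <w, "1">

def reduce_fn(key, values):
    # key: a word, values: a list of "1" strings, one per occurrence.
    yield str(sum(int(v) for v in values))

documents = {"doc1": "map reduce map", "doc2": "reduce"}

# The framework's job: group all intermediate values that share the same key.
intermediate = defaultdict(list)
for name, contents in documents.items():
    for k, v in map_fn(name, contents):
        intermediate[k].append(v)

for word in sorted(intermediate):
    for count in reduce_fn(word, intermediate[word]):
        print(word, count)          # prints "map 2" and "reduce 2"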
4. How MapReduce works
The figure below is the flow chart from the paper. Everything starts with the user program at the top: the user program links against the MapReduce library and implements the two most basic functions, map and reduce. The numbers in the figure mark the order of execution.
- The MapReduce library splits the user program's input files into M pieces (M is user-defined), each typically 16 MB to 64 MB; in the figure, the input on the left is divided into split0~split4. The library then uses fork to start copies of the user program on the other machines of the cluster.
- One copy of the user program becomes the master and the others become workers. The master is responsible for scheduling: it assigns jobs (map jobs or reduce jobs) to idle workers. The number of workers can also be specified by the user.
- A worker assigned a map job reads the input data of the corresponding split. The number of map jobs is determined by M and corresponds one-to-one with the splits. The map job parses key-value pairs out of the input data and passes each pair as arguments to the map function; the intermediate key-value pairs produced by the map function are buffered in memory.
- The buffered intermediate key-value pairs are periodically written to the local disk, divided into R partitions (R is user-defined; each partition will later correspond to one reduce job). The locations of these intermediate pairs are reported to the master, which forwards the information to the reduce workers.
- The master tells each worker assigned a reduce job where the partition it is responsible for lives (certainly in more than one place, since the intermediate key-value pairs produced by every map job may map to any of the R partitions). When a reduce worker has read all the intermediate key-value pairs it is responsible for, it first sorts them so that pairs with the same key sit together. The sorting is necessary because different keys may map to the same partition, that is, the same reduce job (there are simply fewer partitions than keys).
- The reduce worker traverses the sorted intermediate key-value pairs; for each distinct key it passes the key and the associated set of values to the reduce function, and the output produced by the reduce function is appended to the output file of this partition.
- When all map and reduce jobs have finished, the master wakes up the original user program, and the MapReduce call returns to the user program's code.
After everything finishes, the MapReduce output sits in R output files (one per reduce job, i.e. one per partition). Users usually do not need to merge these R files; instead they are often fed as input to yet another MapReduce program. Throughout the process, the input data come from the underlying distributed file system (GFS), the intermediate data live on the local file system, and the final output is written back to the underlying distributed file system (GFS). Note the difference between map/reduce jobs and the map/reduce functions: a map job processes one input split and may call the map function many times, once per input key-value pair; a reduce job processes the intermediate key-value pairs of one partition and calls the reduce function once per distinct key; each reduce job ultimately corresponds to one output file.
I prefer to divide the whole process into three stages. The first is the preparation stage, steps 1 and 2, where the main actor is the MapReduce library, which splits the job and copies the user program around. The second is the running stage, steps 3, 4, 5, and 6, where the main actors are the user-defined map and reduce functions, and each small job runs independently. The third is the wrap-up stage: the job is finished and the results sit in the output files; what happens next depends on how the user wants to process the output.
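To make the running stage concrete, here is a toy single-process simulation of steps 3 through 6: M map tasks, R partitions, partitioning by hashing the key modulo R, and a sort before reducing. The names and the tiny input are illustrative assumptions, not the paper's actual API.

from collections import defaultdict

M, R = 3, 2                                      # M input splits, R reduce partitions
splits = ["map reduce", "big table", "map map"]  # stand-ins for the M input shards

def map_fn(split):
    return [(w, 1) for w in split.split()]

def partition(key):
    # The usual default: hash the intermediate key and take it modulo R.
    return hash(key) % R

# Steps 3-4: every map task writes its intermediate pairs into R partitions.
partitions = [defaultdict(list) for _ in range(R)]
for split in splits:
    for k, v in map_fn(split):
        partitions[partition(k)][k].append(v)

# Steps 5-6: each reduce task sorts its partition by key, then reduces each key.
for r in range(R):
    for k in sorted(partitions[r]):
        print("partition", r, k, sum(partitions[r][k]))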
5. How is the word frequency counted?
Combining this with section 4, we can see how the code in section 3 works. Suppose we define M = 5 and R = 3, and we have 6 machines and one master.
The figure describes how MapReduce handles the word-frequency job. Because there are not enough map workers, splits 1, 3, and 4 are processed first and produce intermediate key-value pairs; once all the intermediate values are ready, the reduce jobs start reading their corresponding partitions and output the statistics.
6. What the user can customize. The user's main task is to implement the map and reduce interfaces, but several other useful interfaces are also open to the user.
- An input reader. It divides the input into M splits and defines how to extract the initial key-value pairs from the raw data; in the word-frequency example, the key is the file name and the value is the file contents.
- A partition function. It maps the intermediate key-value pairs produced by the map function to partitions; the simplest implementation is to hash the key and take it modulo R (see the sketch after this list).
- A compare function. It is used when a reduce job sorts its input; it defines the ordering of keys.
- An output writer. Writes the results to the underlying distributed file system.
- A combiner function. It is effectively a reduce function run early, and it enables the optimization mentioned above: when counting word frequency, if every <w, "1"> pair had to be read individually it would waste a lot of time, because reduce and map usually do not run on the same machine. So a combiner can be run where the map runs, and the reduce worker then only needs to read a single <w, "n"> pair per word (see the sketch after this list).
- The map and reduce functions themselves need no further explanation.
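A short sketch of the default-style partition function and of a combiner for the word-frequency job, assuming the combiner takes the same arguments as reduce; the function names here are illustrative.

# Default-style partition function: hash the intermediate key, take it modulo R.
def partition(key, R):
    return hash(key) % R

# Combiner for word frequency: pre-aggregate <w, "1"> pairs on the map side,
# so the reduce worker reads one <w, "n"> per word per map task instead of n pairs.
def combiner(word, ones):
    yield str(sum(int(v) for v in ones))

print(partition("map", 3) in range(3))          # True: always lands in a valid partition
print(list(combiner("map", ["1", "1", "1"])))   # ['3']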
7. MapReduce has been implemented in many ways. Besides Google's own implementation, there is the famous Hadoop; the difference is that Google's version is written in C++ while Hadoop uses Java. In addition, Stanford has implemented a MapReduce for multi-core/multi-processor, shared-memory environments, called Phoenix; the related paper was published at HPCA '07 and won the best paper award that year!
References
[1] MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI '04.
[2] Wikipedia. http://en.wikipedia.org/wiki/mapreduce
[3] Phoenix. http://mapreduce.stanford.edu/
[4] Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of HPCA '07.