This chapter provides a guide to designing MapReduce algorithms. In particular, we present a number of design patterns that solve common problems. In summary, they are:
"In-mapper combining" (merge in map), The combiner function is moved to Mapper, and mapper aggregates partial results through multiple input records, then, an intermediate key-value pair is sent only after a certain amount of partial aggregation, instead of the intermediate output of each input key-value pair.
The related patterns "pairs" and "stripes", which keep track of joint events across a large number of observations. In the pairs approach, we track each joint event separately; in the stripes approach, we track all events that co-occur with the same event. Although the stripes approach is considerably more efficient, it needs enough memory to hold all the events that co-occur with a given event, which can become a scalability bottleneck.
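The difference between the two approaches can be sketched with word co-occurrence counting. The code below is an illustration under stated assumptions: the `neighbors` helper, the window size of 1, and the in-process "reducers" are all hypothetical stand-ins for what a real job would do.

```python
from collections import Counter, defaultdict


def neighbors(words, i, window=1):
    """Words within `window` positions of index i (hypothetical helper)."""
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    return [words[j] for j in range(lo, hi) if j != i]


def map_pairs(words):
    # Pairs: one intermediate record per joint event (w, u).
    for i, w in enumerate(words):
        for u in neighbors(words, i):
            yield (w, u), 1


def map_stripes(words):
    # Stripes: one associative array per word, holding its co-occurring words.
    for i, w in enumerate(words):
        yield w, Counter(neighbors(words, i))


def reduce_pairs(records):
    totals = Counter()
    for pair, count in records:
        totals[pair] += count
    return dict(totals)


def reduce_stripes(records):
    totals = defaultdict(Counter)
    for w, stripe in records:
        totals[w].update(stripe)  # element-wise sum of the stripes
    return {w: dict(stripe) for w, stripe in totals.items()}


sentence = "a b a c".split()
pair_counts = reduce_pairs(map_pairs(sentence))
stripe_counts = reduce_stripes(map_stripes(sentence))
```

Both variants produce the same counts. The pairs mapper emits more, smaller records, while each stripe must fit in memory, which is where the scalability concern arises.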
"Order inversion" (Reverse Order), the main idea is to convert the order of calculation into a sorting problem. Through careful arrangement, we send the computation results (for example, a clustering statistics) of the data before it meets the data required in this computation to Cer CER for processing.
"Value-to-key conversion", which provides a scalability solution for secondary sorting. By moving the part score to the key, we can use the mapreduce method to sort it.
Ultimately, controlling synchronization in the MapReduce programming model comes down to the effective use of the following techniques:
1. Construct complex keys and values that bring together the data needed for a computation. This is used in all of the design patterns above.
(Footnote 16: To achieve good performance when accessing distributed key-value stores, it is often necessary to batch queries before making synchronous requests, or to rely on asynchronous requests.)
2. Execute user-specified initialization and termination code in the mapper or reducer. For example, in-mapper combining relies on emitting intermediate key-value pairs in the termination code of the map task.
3. Preserve state across multiple inputs in the mapper and reducer. This is used in in-mapper combining, order inversion, and value-to-key conversion.
4. Control the sort order of intermediate keys. This is used in order inversion and value-to-key conversion.
5. Control the partitioning of the intermediate key space. This is used in order inversion and value-to-key conversion.
This concludes our overview of MapReduce algorithm design. It should be clear by now that although the programming model forces us to express algorithms in terms of a small set of rigidly defined components, there are still many tools available for shaping the flow of computation. In the following chapters, we focus on specific classes of MapReduce algorithms: inverted indexing in Chapter 4, graph processing in Chapter 5, and expectation-maximization in Chapter 6.