20.4 The MapReduce Algorithm and a Disk Indexing Program
Now we turn from theory to practice. First we'll look at the higher-order function mapreduce; then we'll use it in a simple indexing engine. The goal is not to build the fastest or best indexing engine in the world, but to work through the design problems that arise in a realistic application.
20.4.1 The MapReduce Algorithm
Figure 20-2 shows the basic idea of the map-reduce (mapping-merging) algorithm. A number of map processes are spawned, and each one produces a stream of {Key, Value} pairs. The map processes send these pairs to a single reduce process, which merges them by combining all the values that share the same key.
Warning
The word map, as used in the context of mapreduce, means something completely different from the map function that appears elsewhere in this book. Do not confuse the two.
mapreduce is a parallel higher-order function proposed by Jeffrey Dean and Sanjay Ghemawat of Google. It is said to be in daily use on Google's clusters.
Figure 20-2 The map-reduce algorithm
We can implement mapreduce in many different ways and with many different semantics. It is less a single algorithm than a family of algorithms.
mapreduce takes four arguments, referred to here as F1, F2, Acc0, and L:
F1(Pid, X) is the map function. Its job is to send a stream of {Key, Value} messages to the process Pid and then terminate. mapreduce spawns a fresh process for each X in the list L.
F2(Key, [Value], Acc0) -> Acc is the reduce function. By the time all the map processes have terminated, the reduce process has merged together all the values belonging to each key. It then calls F2(Key, [Value], Acc) for each {Key, [Value]} pair it has collected. Acc is an accumulator whose initial value is Acc0, and F2 returns a new accumulator. (Another way of describing this is that F2 performs a fold over the {Key, [Value]} pairs it has collected.)
Acc0 is the initial value of the accumulator, used in the first call to F2.
L is a list of X values. F1(Pid, X) is applied to every X in L, where Pid is the process identifier of the reduce process that mapreduce creates.
mapreduce is defined in the module phofs (short for parallel higher-order functions):
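One way a function with the behavior described above could be written is sketched here. This is an illustration under stated assumptions, not the phofs listing itself: the module name mr_sketch and the helpers reduce/5 and collect/2 are hypothetical, and the collected pairs are held in an Erlang map, which is only one of several reasonable choices.

```erlang
-module(mr_sketch).   % hypothetical module name, for illustration only
-export([mapreduce/4]).

%% F1(Pid, X): sends {Key, Value} messages to Pid, then returns.
%% F2(Key, [Value], AccIn) -> AccOut.
mapreduce(F1, F2, Acc0, L) ->
    Parent = self(),
    Pid = spawn(fun() -> reduce(Parent, F1, F2, Acc0, L) end),
    receive
        {Pid, Result} -> Result
    end.

%% The reduce process: starts one map process per element of L,
%% gathers their messages, then folds F2 over the gathered pairs.
reduce(Parent, F1, F2, Acc0, L) ->
    %% Trap exits so that each map process's termination
    %% arrives here as an ordinary {'EXIT', Pid, Why} message.
    process_flag(trap_exit, true),
    ReducePid = self(),
    lists:foreach(
      fun(X) -> spawn_link(fun() -> F1(ReducePid, X) end) end,
      L),
    Collected = collect(length(L), #{}),
    Acc = maps:fold(F2, Acc0, Collected),
    Parent ! {self(), Acc}.

%% Wait until N map processes have terminated, merging every
%% {Key, Value} message into a map of Key => [Value].
collect(0, Map) ->
    Map;
collect(N, Map) ->
    receive
        {'EXIT', _Pid, _Why} ->
            collect(N - 1, Map);
        {Key, Value} ->
            %% Values are prepended, so each list ends up in
            %% reverse order of arrival.
            Prepend = fun(Values) -> [Value | Values] end,
            collect(N, maps:update_with(Key, Prepend, [Value], Map))
    end.
```

Under this sketch, mr_sketch:mapreduce(fun(Pid, X) -> Pid ! {sq, X * X} end, fun(_K, Vs, Acc) -> Acc + lists:sum(Vs) end, 0, [1, 2, 3]) returns 14. The sketch relies on Erlang delivering signals between a pair of processes in order, so a map process's exit signal reaches the reduce process only after all of that process's key-value messages.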
Before we go any further, let's test mapreduce so that we better understand how it works.
We'll write a small program that counts the frequency of every word in the code that accompanies this book. Here is the program:
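A word-counting program along these lines might look like the following sketch. Everything in it is an assumption made for illustration: the module name wc_sketch, the function names, the use of filelib:wildcard/1 to find *.erl files, and the regular expression that decides what counts as a word. A compact local mapreduce/4 is included only so that the module compiles on its own.

```erlang
-module(wc_sketch).   % hypothetical module name, for illustration only
-export([wc_dir/1, mapreduce/4]).

%% Count word frequencies across every .erl file in Dir.
%% Returns [{Count, Word}], most frequent word first.
wc_dir(Dir) ->
    Files = filelib:wildcard(filename:join(Dir, "*.erl")),
    Counts = mapreduce(fun generate_words/2, fun count_words/3, [], Files),
    lists:reverse(lists:sort(Counts)).

%% Map function: send {Word, 1} to the reduce process Pid
%% for every word found in File.
generate_words(Pid, File) ->
    {ok, Bin} = file:read_file(File),
    %% Assumption: a "word" is a run of letters, digits, or underscores.
    Words = re:split(Bin, "[^A-Za-z0-9_]+", [{return, list}, trim]),
    lists:foreach(fun(W) -> Pid ! {W, 1} end,
                  [W || W <- Words, W =/= []]).

%% Reduce function: the value list holds one 1 per occurrence,
%% so its length is the word's frequency.
count_words(Word, Ones, Acc) ->
    [{length(Ones), Word} | Acc].

%% Compact mapreduce: one linked process per element of L.
mapreduce(F1, F2, Acc0, L) ->
    Parent = self(),
    Pid = spawn(fun() ->
                    process_flag(trap_exit, true),
                    Me = self(),
                    lists:foreach(
                      fun(X) -> spawn_link(fun() -> F1(Me, X) end) end,
                      L),
                    Collected = collect(length(L), #{}),
                    Parent ! {Me, maps:fold(F2, Acc0, Collected)}
                end),
    receive
        {Pid, Result} -> Result
    end.

%% Gather {Key, Value} messages until N map processes have exited.
collect(0, Map) ->
    Map;
collect(N, Map) ->
    receive
        {'EXIT', _Pid, _Why} ->
            collect(N - 1, Map);
        {K, V} ->
            Prepend = fun(Vs) -> [V | Vs] end,
            collect(N, maps:update_with(K, Prepend, [V], Map))
    end.
```

Calling wc_sketch:wc_dir(".") would spawn one map process per .erl file in the current directory. Because each map process sends the value 1 for every word it sees, the reduce function needs nothing more than length/1 to produce a count.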
When this program was run, there were 102 Erlang modules in the code directory, so mapreduce created 102 parallel processes, each sending a stream of key-value pairs to the reduce process. This should run well on a 100-core processor (if the disk can keep up).
Now that we know what mapreduce is, it's time to go back to the indexing engine.