Mapping-Merging algorithm

Last Update: 2015-03-17 Source: Internet

Author: User


20.4 Mapping-Merging algorithms and disk indexing programs

Now we turn from theory to practice. First we'll look at the higher-order function mapreduce, and then we'll use it in a simple indexing engine. The goal is not to build the fastest and best indexing engine in the world, but to work through the design problems that come up in a realistic application.

20.4.1 Mapping-Merging algorithm

Figure 20-2 shows the basic idea of the mapping-merging (map-reduce) algorithm. A number of mapping processes are started, each producing a stream of key-value pairs of the form {Key, Value}. The mapping processes send these pairs to a merge process, which combines the values that share the same key.
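The merge step on its own can be illustrated without any processes. The following is a minimal sketch, assuming the pairs have already arrived as a plain list; the module name merge_sketch and the function merge/1 are hypothetical names chosen for this illustration and do not come from the book's code.

```erlang
-module(merge_sketch).
-export([merge/1]).

%% Group a list of {Key, Value} pairs by key.
%% Returns a map from each key to the list of its values,
%% in the order in which the values appeared in the input.
merge(Pairs) ->
    Grouped = lists:foldl(
        fun({Key, Value}, Acc) ->
            %% Prepend each value; the lists are reversed once at the end.
            maps:update_with(Key,
                             fun(Vs) -> [Value | Vs] end,
                             [Value],
                             Acc)
        end,
        #{},
        Pairs),
    maps:map(fun(_Key, Vs) -> lists:reverse(Vs) end, Grouped).
```

For example, merge_sketch:merge([{a,1},{b,2},{a,3}]) groups the two values for the key a into one list and leaves b with a single value.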

Warning

The word map as used here, in the context of MapReduce, means something completely different from the map function that appears elsewhere in this book. Take care not to confuse the two.

MapReduce (the mapping-merging algorithm) is a parallel higher-order function proposed by Jeffrey Dean and Sanjay Ghemawat of Google, and it is said to be in daily use on Google's clusters.

Fig. 20-2 Mapping-Merging algorithm

The mapping-merging algorithm can be implemented in many different ways and with many different semantics. It is better thought of as a family of algorithms than as one specific algorithm.

mapreduce is defined in terms of the following arguments:

F1(Pid, X) is the mapping function. Its job is to send a stream of {Key, Value} messages to the process Pid and then terminate. mapreduce spawns a new process for each value of X in the list L.

F2(Key, [Value], Acc0) -> Acc is the merge function. Once all the mapping functions have terminated, the merge process will have combined all the values that belong to each key. It then calls F2(Key, [Value], Acc) for every {Key, [Value]} pair it has collected. Acc is an accumulator whose initial value is Acc0, and F2 returns a new accumulator. (Another way of describing this is to say that F2 performs a fold over the {Key, [Value]} pairs that have been collected.)

Acc0 is the initial value of the accumulator, used when F2 is first called.

L is a list of X values. F1(Pid, X) is applied to every X in L, where Pid is the process identifier of the merge process created by mapreduce.

mapreduce is defined in the module phofs (short for parallel higher-order functions).
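One way such a function could be written is sketched below. This is a minimal illustration of the behaviour described above, not the phofs source. The module name mapreduce_sketch, the argument order (F1, F2, Acc0, L), the helper functions reduce/4 and collect/2, and the demo/0 function are all assumptions made for this example.

```erlang
-module(mapreduce_sketch).
-export([mapreduce/4, demo/0]).

%% mapreduce(F1, F2, Acc0, L) -> Acc
%% The argument order is an assumption made for this sketch.
mapreduce(F1, F2, Acc0, L) ->
    Parent = self(),
    Ref = make_ref(),
    %% The merge process does the work and reports the final accumulator.
    spawn(fun() -> Parent ! {Ref, reduce(F1, F2, Acc0, L)} end),
    receive
        {Ref, Result} -> Result
    end.

reduce(F1, F2, Acc0, L) ->
    %% Trap exits so that the termination of each mapper arrives as a message.
    process_flag(trap_exit, true),
    MergePid = self(),
    %% One linked mapping process per element of L.
    lists:foreach(
        fun(X) -> spawn_link(fun() -> F1(MergePid, X) end) end,
        L),
    Grouped = collect(length(L), #{}),
    %% Fold F2 over every {Key, [Value]} pair that was collected.
    maps:fold(F2, Acc0, Grouped).

%% Gather {Key, Value} messages until N mappers have exited.
%% The order of the values under a key is not specified.
collect(0, Grouped) ->
    Grouped;
collect(N, Grouped) ->
    receive
        {'EXIT', _Pid, _Why} ->
            collect(N - 1, Grouped);
        {Key, Value} ->
            Grouped1 = maps:update_with(Key,
                                        fun(Vs) -> [Value | Vs] end,
                                        [Value],
                                        Grouped),
            collect(N, Grouped1)
    end.

%% Small demonstration: count how often each atom occurs.
demo() ->
    %% Each mapper gets a list of atoms and emits {Atom, 1} per element.
    F1 = fun(Pid, Xs) -> lists:foreach(fun(X) -> Pid ! {X, 1} end, Xs) end,
    %% The merge function records how many values each key received.
    F2 = fun(Key, Vals, Acc) -> [{Key, length(Vals)} | Acc] end,
    lists:sort(mapreduce(F1, F2, [], [[a, b], [b, c], [a, b, b]])).
```

The sketch relies on the fact that the messages a mapper sends to the merge process are delivered before that mapper's exit signal, so counting exit messages is enough to know when every pair has been received. A mapper that crashes is counted as finished, and whatever it sent before crashing is kept.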

Before we go any further, let's test mapreduce to get a better feel for how it works.

We'll write a small program that counts the frequency of every word in the code that accompanies this book.
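The two functions that such a word-counting program would hand to mapreduce can be sketched as follows. This is an illustration under stated assumptions rather than the book's listing: the module name wordfreq_sketch and the functions words/1, emit_words/2 and count_words/3 are hypothetical, and a word is taken to be any run of ASCII letters, compared case-insensitively.

```erlang
-module(wordfreq_sketch).
-export([words/1, emit_words/2, count_words/3]).

%% Split a string or binary into lowercase words.
%% Assumption: a word is a run of ASCII letters.
words(Text) ->
    case re:run(Text, "[A-Za-z]+", [global, {capture, first, list}]) of
        {match, Matches} -> [string:lowercase(W) || [W] <- Matches];
        nomatch -> []
    end.

%% Mapping function (the role of F1): read one file and send
%% {Word, 1} to the merge process Pid for every word found.
emit_words(Pid, File) ->
    case file:read_file(File) of
        {ok, Bin} ->
            lists:foreach(fun(W) -> Pid ! {W, 1} end, words(Bin));
        {error, _Reason} ->
            %% Unreadable files simply contribute no words.
            ok
    end.

%% Merge function (the role of F2): one value was sent per occurrence,
%% so the length of the value list is the frequency of the word.
count_words(Word, Vals, Acc) ->
    [{Word, length(Vals)} | Acc].
```

Assuming a mapreduce function that takes its arguments in the order (F1, F2, Acc0, L), these would be passed as fun wordfreq_sketch:emit_words/2 and fun wordfreq_sketch:count_words/3, with [] as the initial accumulator and a list of file names as L. The standard library call filelib:wildcard("*.erl", Dir) is one way to build that list.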

When this program was run, there were 102 Erlang modules in the code directory, so mapreduce created 102 concurrent processes, each sending a stream of key-value pairs to the merge process. This should run well on a 100-core processor (if the disk can keep up).

Now that we know what mapreduce is, we can return to the indexing engine.

