Mapping-Merging algorithm


20.4 Mapping-Merging algorithms and disk indexing programs

Now we turn from theory to practice. First we'll look at the higher-order function MapReduce, and then we'll use this technique in a simple indexing engine. The goal here is not to build the fastest or best indexing engine in the world, but to work through the design problems that arise in a realistic application.

20.4.1 Mapping-Merging algorithm

Figure 20-2 shows the basic idea of the mapping-merging (map-reduce) algorithm. A number of mapping processes are started, each producing a stream of key-value pairs of the form {Key, Value}. The mapping processes send these pairs to a merge process, which merges them by combining the values that share the same key.
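The merging step described above can be illustrated with a minimal Python sketch. It is not the code from the book (which is written in Erlang); the function name `merge_pairs` and the sample data are hypothetical, and the mappers are represented simply as lists of pairs they have already produced.

```python
from collections import defaultdict

def merge_pairs(streams):
    """Combine the values that share a key across several mapper outputs."""
    merged = defaultdict(list)
    for stream in streams:              # one stream per mapping process
        for key, value in stream:
            merged[key].append(value)   # values with the same key end up together
    return dict(merged)

# Hypothetical output of three mapping processes.
streams = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("c", 5)],
]
merged = merge_pairs(streams)
print(merged)
```

Each key appears once in the result, paired with the list of every value that any mapper sent for it.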

Warning

The word map as used here, in the context of MapReduce, means something quite different from the map function mentioned elsewhere in this book. Take care not to confuse the two.

MapReduce (the mapping-merging algorithm) is a parallel higher-order function proposed by Jeffrey Dean and Sanjay Ghemawat of Google, and is said to be in daily use on Google's clusters.

Fig. 20-2 Mapping-Merging algorithm

The mapping-merging algorithm can be implemented in many different ways and with many different semantics. It is better thought of as a family of algorithms than as one specific algorithm.

MapReduce is defined as:

 

F1(Pid, X) is the mapping function. Its task is to send a stream of {Key, Value} messages to the process Pid and then terminate. MapReduce spawns a new process for each X in the list L.

F2(Key, [Value], Acc0) -> Acc is the merge function. Once all the mapping processes have terminated, the merge process will have collected all the values associated with each key. It then calls F2(Key, [Value], Acc) for each {Key, [Value]} pair it has collected. Acc is an accumulator whose initial value is Acc0. F2 returns a new accumulator (put another way, F2 performs a fold over all the {Key, [Value]} pairs it has collected).

Acc0 is the initial value of the accumulator, used when F2 is first called.

L is a list of X values. F1(Pid, X) is applied to every X in L, where Pid is the process identifier of the merge process created by MapReduce.
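To make the roles of F1, F2, Acc0, and L concrete, here is a rough Python sketch of a function with the same shape. It is an illustration under stated assumptions, not the book's Erlang implementation: threads stand in for Erlang processes, a `queue.Queue` stands in for the merge process's mailbox, and `f1` receives a `send` callable in place of a Pid. The names `mapreduce`, `square_by_parity`, and `sum_per_key` are hypothetical.

```python
import queue
import threading

def mapreduce(f1, f2, acc0, items):
    """Sketch of mapreduce(F1, F2, Acc0, L) with threads in place of processes."""
    mailbox = queue.Queue()      # stands in for the merge process's mailbox
    done = object()              # sentinel meaning "one mapper has finished"

    def job(x):
        try:
            f1(mailbox.put, x)   # f1 sends (key, value) tuples, then returns
        finally:
            mailbox.put(done)    # always report termination, even on error

    workers = [threading.Thread(target=job, args=(x,)) for x in items]
    for w in workers:
        w.start()

    grouped = {}                 # key -> list of values collected so far
    remaining = len(workers)
    while remaining:
        msg = mailbox.get()
        if msg is done:
            remaining -= 1
        else:
            key, value = msg
            grouped.setdefault(key, []).append(value)
    for w in workers:
        w.join()

    acc = acc0                   # fold f2 over the collected pairs
    for key, values in grouped.items():
        acc = f2(key, values, acc)
    return acc

def square_by_parity(send, x):
    """Mapping function: send the square of x under the key 'even' or 'odd'."""
    send(("even" if x % 2 == 0 else "odd", x * x))

def sum_per_key(key, values, acc):
    """Merge function: return a new accumulator with the sum for this key."""
    return {**acc, key: sum(values)}

result = mapreduce(square_by_parity, sum_per_key, {}, [1, 2, 3, 4])
print(result)
```

The merging only starts once every mapper has reported that it is finished, which mirrors the description above: F2 is called after all the mapping functions have exited.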

MapReduce is defined in the module phofs (short for parallel higher-order functions):

  

Before we go any further, let's test this mapreduce function so we can understand more about its working mechanism.

We'll write a small program that counts the frequency of every word in the code that accompanies this book. Here is the program:

When this program was run, there were 102 Erlang modules in the code directory, so MapReduce created 102 parallel processes, each sending a stream of key-value pairs to the merge process. This should run well on a 100-core processor (if the disk can keep up).
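As an illustration of the word-frequency idea, the following Python sketch counts words across several texts, running one mapping task per text. It is not the book's program: it works on in-memory strings rather than reading a code directory, uses a thread pool rather than Erlang processes, and has each mapper return its pairs instead of sending them as messages. The names `emit_words` and `count_words` and the sample texts are hypothetical.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def emit_words(text):
    """Mapping step: produce a (word, 1) pair for every word in one text."""
    return [(w.lower(), 1) for w in re.findall(r"[A-Za-z_]+", text)]

def count_words(texts):
    """Run one mapping task per text, then merge the pairs by key."""
    with ThreadPoolExecutor(max_workers=max(1, len(texts))) as pool:
        streams = list(pool.map(emit_words, texts))
    counts = Counter()
    for stream in streams:
        for word, n in stream:
            counts[word] += n    # merging: add up the values for each word
    return counts

# Hypothetical stand-ins for the contents of two source files.
texts = [
    "start() -> spawn(fun loop/0).",
    "loop() -> receive stop -> ok end.",
]
counts = count_words(texts)
print(counts.most_common(3))
```

With one task per input, the number of concurrent mappers grows with the number of files, which is the same pattern as one process per module in the run described above.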

Now that we know what MapReduce is, it's time to go back to the indexing engine.
