MapReduce: The free lunch isn't over yet?


Herb Sutter, Microsoft's famous C++ guru, wrote a heavyweight article in early 2005, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", predicting the next major change in software development after OO: parallel computing.

The era of software development under Moore's Law exhibited a very interesting phenomenon: "Andy giveth, and Bill taketh away." No matter how fast the CPU gets, we always find a way to use it up, and we revel in the performance improvements that each machine upgrade brings.

I remember writing a Gomoku (five-in-a-row) program as a sophomore. The algorithm assigned priorities to a set of chess patterns, scanned the board, and analyzed the position to see which spot mattered most for the current move. Of course, you also have to block your opponent, which requires swapping sides and re-evaluating the patterns. If you think only one move ahead, a cunning opponent can easily trick you, so to look several moves ahead you need recursion and backtracking. On the machine I had back then, a 3-move lookahead took about 3 seconds. Years after graduating, I dug the program out of my old things and tried it again: a 10-move lookahead took no perceptible time at all.

I don't know whether you have had the same experience; we have all been enjoying this free lunch without noticing it. But as Moore's Law runs up against its limits, the free lunch will eventually end. Hardware designers are still doing their best: Hyper-Threading (an extra set of registers, presenting a second logical CPU) keeps the pipeline as full as possible so that operations from multiple threads can run in parallel, giving multithreaded programs a 5%-15% performance boost; growing cache capacity benefits single-threaded and multithreaded programs alike. These tricks may help for a while, but the point is that we will have to make a change. Are you ready for the change that is coming?

Concurrent programming != multithreaded programming. Many people will say, who can't write multithreaded code? The question is why, and how, you use multiple threads. I once wrote an ACDSee-like image viewing/processing program, which I still use to process my digital photos. It used plenty of multithreading, but the main purpose was to keep the UI from blocking while an image was being processed, so the CPU-intensive computation ran in a background thread. The operations on the image matrix themselves were never split up to run in parallel.

I think the real challenge of concurrent programming is the change in the programming model: the programmer must have a clear picture of how the program is to be parallelized and, more importantly, of how to implement that parallelization (including the architecture, fault tolerance, real-time monitoring, and so on), how to debug it, and how to test it.

At Google, a huge amount of data must be processed within a limited time every day (in fact, every internet company faces this problem), so every programmer would need to master distributed development: how to distribute work, schedule it, monitor it, and tolerate failures. Google's MapReduce abstracts the distributed business logic away from these complex details, letting programmers with little or no parallel development experience write parallel applications.

The two most important words in MapReduce are map and reduce. At first glance, anyone familiar with functional languages will find map/reduce very familiar. FP calls such functions "higher-order functions" (one of the basic tools of functional programming): functions written to be combined with other functions, i.e., taking other functions as arguments or being called back by them. If you insist on a comparison, think of callback functions in C, or functors in the STL. For example, to sort an STL container you supply a functor (a comparator) that compares two elements; the comparator is called as the container is traversed.
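As a minimal sketch of the idea (in Python rather than C++/STL, purely for brevity), a higher-order function is just a function that takes another function as an argument:

words = ["blogger", "I", "am", "you"]

# sorted() is a higher-order function: it takes another function as an
# argument and calls it back for every element, like an STL comparator.
print(sorted(words, key=len))        # ['I', 'am', 'you', 'blogger']
print(sorted(words, key=str.lower))  # ['am', 'blogger', 'I', 'you']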

Take the image processing program mentioned earlier to illustrate. Most image processing operations apply some operation over the image matrix, and there are usually two kinds: mappings and reductions. Consider two effects. The "old photo" effect typically adjusts the G/B values of the photo and then adds a small random offset to each pixel; it treats every element of the two-dimensional matrix independently, so it is a map operation. The "carving" (emboss) effect needs to extract the edges of the image, which requires operations between elements; that is a reduce operation. As a simple example, the one-dimensional matrix (array) [0, 1, 2, 3, 4] can be mapped to [0, 2, 4, 6, 8] (multiply by 2) or to [1, 2, 3, 4, 5] (add 1); it can be reduced to 0 (product of the elements) or to 10 (sum of the elements).
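A minimal sketch of that one-dimensional example in Python, using the built-in map and functools.reduce:

from functools import reduce
from operator import add, mul

a = [0, 1, 2, 3, 4]

# map is element-wise and independent, so it parallelizes trivially
print(list(map(lambda x: 2 * x, a)))  # [0, 2, 4, 6, 8]  (multiply by 2)
print(list(map(lambda x: x + 1, a)))  # [1, 2, 3, 4, 5]  (add 1)

# reduce folds the elements together into a single result
print(reduce(mul, a))  # 0   (product of the elements)
print(reduce(add, a))  # 10  (sum of the elements)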

In the face of complex problems, the ancients taught us to divide and conquer. Map/reduce is exactly such a divide/conquer process: by dividing the problem, the per-piece map operations become highly parallel; the map results are then reduced (grouped by a key) to produce the final result.

Googlers realized that this division is the core of the problem, and everything else is common infrastructure, so they factored the common parts out into the MapReduce abstraction. Google programmers then only need to care about the application logic: by which key to decompose the problem, which operations are map operations, and which are reduce operations. The other hard parts of parallel computing, such as distribution, job scheduling, fault tolerance, and communication between machines, are left to the map/reduce framework, which simplifies the whole programming model to a large extent.

Another feature of MapReduce is that the data flowing between map and reduce takes the form of intermediate temporary files (MapReduce uses the Google File System to manage and access these files), rather than some other communication mechanism between processes or machines. I think this is Google's consistent style: simplify, and get back to basics.

Next, let us set the rest aside and study the map/reduce operations themselves. (The other parts, such as fault tolerance and backup tasks, have very classic treatments and implementations; the paper covers them in detail.)

Definition of Map:

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

Definition of Reduce:

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

The MapReduce paper illustrates this with an example: counting how many times each word appears in a collection of documents.

The input to the map operation is a document; map emits each word in the document (paired with the count "1") into the intermediate file.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

For example, suppose we have two documents:

A-"I Love Programming"

B-"I am a blogger, you are also a blogger".

The intermediate file emitted after the map operation on document B will be:

i,1 am,1 a,1 blogger,1 you,1 are,1 also,1 a,1 blogger,1

The input of the reduce operation is a word together with the sequence of its occurrence counts. In the example above, that is ("i", [1, 1]), ("love", [1]), ("programming", [1]), ("am", [1]), ("a", [1, 1]), and so on. Reduce then computes the total number of occurrences of each word.

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

The final output will be: ("i", 2), ("a", 2), ...
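To make the whole flow concrete, here is a minimal, self-contained sketch of the word-count example in Python; map_fn, reduce_fn, and the in-memory "shuffle" below are stand-ins I made up for what the library does, not Google's actual API:

from collections import defaultdict

def map_fn(doc_name, contents):
    # emit (word, "1") for every word in the document
    for word in contents.lower().replace(",", " ").split():
        yield (word, "1")

def reduce_fn(word, counts):
    # sum the occurrence counts for one word
    return str(sum(int(c) for c in counts))

docs = {
    "A": "I Love Programming",
    "B": "I am a blogger, you are also a blogger",
}

# the "shuffle": group intermediate values by key, as the library would
groups = defaultdict(list)
for name, text in docs.items():
    for word, count in map_fn(name, text):
        groups[word].append(count)

for word in sorted(groups):
    print(word, reduce_fn(word, groups[word]))
# prints: a 2, also 1, am 1, are 1, blogger 2, i 2, love 1, ...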

The actual order of execution is:

1. The MapReduce library splits the input into M pieces; the splitting itself can be done in parallel across multiple machines.

2. The master assigns the M map jobs to workers in the idle state.

3. Each map worker performs the map operation on its input key/value pairs and buffers the intermediate results in memory.

4. Periodically (or depending on memory pressure), the buffered intermediate results are dumped to local disk, partitioned into R regions, and the file locations are passed back to the master, which forwards them to the reduce workers. The most important point when writing to disk is partitioning the intermediate files. If everything were saved into a single file, the lone reduce worker reading it would become a bottleneck; we only have to guarantee that the same key always appears in the same partition to break that bottleneck.

5. Each of the R reduce workers starts work by fetching its partition's data from every map worker's local disk (reading the buffered data the map step produced) and sorting it by key in memory, falling back to an external sort if the data does not fit. Sorting (or grouping) is a necessary step before the reduce function: since each reduce worker collects one partition from many map workers, all the values belonging to a given key end up on that one worker.

6. The reduce worker traverses the sorted intermediate data and, for each unique key, executes the reduce function, whose parameters are the key and the corresponding series of values.

7. When execution finishes, the master wakes up the user program and returns the results; in the end there are R output files, one per reduce worker.

Thus the divide step appears twice: the input is divided into M splits, and the intermediate results of map are divided into R partitions. Splitting the input is usually very simple; the intermediate results of map are usually partitioned by "hash(key) mod R", which guarantees that the same key appears in the same partition. Of course, users can also specify their own partition function; for example, if the key is a URL and you want all URLs of the same host to land in the same partition, you can use "hash(Hostname(urlkey)) mod R" as the partition function.
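A sketch of these two partition functions in Python (default_partition and host_partition are hypothetical names, for illustration only; note that Python's built-in hash is salted per process, so a real system would use a stable hash):

from urllib.parse import urlparse

R = 4  # number of reduce tasks / partitions

def default_partition(key):
    # "hash(key) mod R": the same key always lands in the same partition
    return hash(key) % R

def host_partition(url_key):
    # user-defined: URLs with the same host land in the same partition
    return hash(urlparse(url_key).hostname) % R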

For the example above, every document may yield thousands of intermediate results like ("the", 1), and such trivial intermediate files inevitably waste transfer bandwidth. MapReduce therefore also lets the user supply a combiner function. It usually has the same implementation as the reduce function; the difference is that the reduce function's output is the final result, while the combiner function's output is an intermediate file that serves as input to a reduce task.
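A sketch of what a combiner does, again with hypothetical names: it pre-aggregates the map output locally on the map worker, so thousands of ("the", 1) pairs leave the machine as a single ("the", N):

from collections import Counter

def combine(pairs):
    # same logic as the reduce function, but its output feeds a reduce
    # task instead of becoming the final result
    counts = Counter()
    for word, n in pairs:
        counts[word] += int(n)
    return list(counts.items())

print(combine([("the", "1")] * 3 + [("free", "1")]))
# [('the', 3), ('free', 1)]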

Tom White gives another very intuitive example from Nutch [2]: distributed grep. I have always felt that many operations in Unix pipes are similar: more, grep, and cat resemble map operations, while sort, uniq, wc, and so on amount to some kind of reduce operation.

Together with the BigTable paper published a couple of days ago, Google now has its own cluster (the Google Cluster), distributed file system (GFS), distributed computing environment (MapReduce), and distributed structured storage (BigTable), plus a lock service (Chubby). Beyond the famous free meals Google serves its employees, I can really feel the other free lunch it serves its programmers: giant clusters built from huge numbers of commodity PCs. I think these are the real core value of Google.

Oh, and as Microsoft veteran Joel Spolsky (you have read his "Joel on Software", haven't you?) once said, what should frighten Microsoft most [1] is that while it struggles to catch up with Google on basic search features, Google is already deploying the next generation of supercomputers:

The very fact that Google invented MapReduce, and Microsoft didn't, says something about why Microsoft is still playing catch up trying to get basic search features to work, while Google has moved on to the next problem: building Skynet^H^H^H^H^H^H the world's largest massively parallel supercomputer. I don't think Microsoft completely understands just how far behind they are on that wave.

Note 1: In fact, Microsoft also has its own solution, Dryad. The problem is how hard it is to deploy such low-level infrastructure inside a big company, for technical reasons as well as political ones.

Note 2: Another masterpiece by Doug Cutting, the father of Lucene, is project Hadoop: made up of the Hadoop Distributed File System and a map/reduce implementation, it completes the Lucene/Nutch production line.
