MapReduce: Google's Human Cannon
The most authoritative introduction to MapReduce available online is Jeffrey Dean and Sanjay Ghemawat's paper "MapReduce: Simplified Data Processing on Large Clusters," which you can download from labs.google.com.
For a company like Google that needs to analyze and process massive amounts of data, ordinary programming methods are not enough, so Google developed MapReduce. In simple terms, MapReduce's syntax resembles Lisp. Using the MapReduce model, you specify a map function that processes key/value data and generates intermediate key/value pairs, then a reduce function that merges all intermediate pairs sharing the same key to produce the final result. Google's MapReduce is a programming tool that runs on thousands of machines to process terabytes of data.
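To make the model concrete, here is a minimal single-machine sketch of the map/group/reduce pipeline described above. All names (`run_mapreduce`, `map_fn`, `reduce_fn`) and the toy data are my own illustration, not anything from the paper:

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Single-machine sketch of the MapReduce model: map each
    (key, value) input to intermediate key/value pairs, group the
    intermediate pairs by key, then reduce each group."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Hypothetical toy job: sum word lengths, grouped by first letter.
def map_fn(key, value):
    for word in value.split():
        yield word[0], len(word)

def reduce_fn(key, values):
    return sum(values)

result = run_mapreduce(map_fn, reduce_fn, [("doc1", "apple avocado banana")])
# result: {'a': 12, 'b': 6}
```

The real system does the same three phases, but the grouping ("shuffle") and both function applications happen in parallel across thousands of machines.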
With a programming model like MapReduce, a program can be executed in parallel, distributed across a cluster of machines. Just as Java programmers can largely ignore memory management, MapReduce programmers do not need to worry about how the massive data set is distributed across machines, what to do when a machine involved in the computation fails, or how the machines coordinate with one another.
For example, when I was working on Beta 1 of my Bayesian forum-spam-blocking demo system, I needed to calculate the frequency of each word in the sample data. My approach was to first split the text into words and then tally them with a hash table. But if I ran into terabytes of data, my Celeron CPU could never cope. What would this look like in MapReduce?
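The single-machine approach described above (split, then tally in a hash table) can be sketched in a few lines; the function name and sample text here are my own:

```python
from collections import Counter

def word_frequencies(text):
    """Naive single-machine word count: split the text into words
    and tally them in a hash table. Fine for a small sample of
    forum posts, hopeless for terabytes on one CPU."""
    return Counter(text.lower().split())

freqs = word_frequencies("spam ham spam")
# freqs["spam"] == 2, freqs["ham"] == 1
```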
The following is a pseudocode implementation:
Step 1:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

Step 2:

    reduce(String key, Iterator values):
      // key: a word
      // values: frequency counts for that word
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
If you have read about the vector space model, you will recognize this as the kind of computation that underlies TF and IDF.
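For readers who have not seen the vector space model: one common TF-IDF weighting multiplies a term's in-document frequency by the log of its inverse document frequency. The word counts produced by the job above supply the TF side. This is a hedged sketch of one standard formulation, not the only one:

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """One common TF-IDF weighting: raw term frequency times
    log inverse document frequency."""
    tf = term_count / doc_length          # how often the term appears in this document
    idf = math.log(num_docs / docs_with_term)  # rarer terms across the corpus weigh more
    return tf * idf

# A term appearing in every document gets weight 0 (log 1 == 0),
# which is why stop words like "the" carry no signal.
```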
Google's MapReduce package is implemented in C++. The paper "MapReduce: Simplified Data Processing on Large Clusters" also contains a piece of real MapReduce code.