MapReduce: Google's Human Cannon


The most authoritative introduction to MapReduce is the paper by Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," which can be downloaded from labs.google.com.

For a company like Google that needs to analyze and process massive amounts of data, ordinary programming methods are not enough, so Google developed MapReduce. In simple terms, MapReduce's style resembles Lisp. Using the MapReduce model, you specify a map function that processes a key/value pair and generates intermediate key/value pairs, then a reduce function that merges all intermediate values associated with the same intermediate key to produce the final result. Google's MapReduce is a programming tool that runs on thousands of machines to process terabytes of data.
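The Lisp flavor of the model comes from the classic map and reduce higher-order functions. A minimal sketch in Python (the values and functions here are illustrative, not part of Google's library):

```python
from functools import reduce

# Lisp-style map/reduce: square each number (map), then fold the results
# into a single sum (reduce).
nums = [1, 2, 3, 4]
squared = list(map(lambda x: x * x, nums))       # map step: 1 -> 1, 2 -> 4, ...
total = reduce(lambda acc, x: acc + x, squared)  # reduce step: 1 + 4 + 9 + 16
print(total)  # 30
```

MapReduce scales this same two-phase idea out to a cluster: the map step runs on many machines at once, and the reduce step merges their intermediate results.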

In a programming model like MapReduce, programs are executed in parallel across a cluster of machines. Just as Java programmers can largely ignore memory management, MapReduce programmers do not need to worry about how massive data is distributed across machines, what happens if a machine involved in the computation fails, or how the machines coordinate with one another.

For example, when I was building the Bayesian forum spam-filtering demo system (Beta 1), I needed to count the frequency of each word in the sample data. My approach was to split the text into words and then tally them with a hash table. But with terabytes of data, my Celeron CPU could never keep up. What would this look like in MapReduce?
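The single-machine approach described above — split into words, count with a hash table — can be sketched in a few lines of Python (the sample text is made up for illustration):

```python
from collections import Counter

# Single-machine word frequency: split the text into words,
# then tally them with a hash table (Counter is a dict subclass).
text = "the quick brown fox jumps over the lazy dog the fox"
counts = Counter(text.split())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

This works fine until the data no longer fits on one machine, which is exactly the gap MapReduce fills.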

The following is a pseudocode implementation:

Step 1:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

Step 2:

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts for that word
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
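To make the two phases concrete, here is a runnable single-process sketch of the same word count in Python. The function names and the in-memory "shuffle" step are my own illustration, not Google's implementation:

```python
from collections import defaultdict

def map_phase(key, value):
    # key: document name; value: document contents.
    # Emit an intermediate (word, 1) pair for each word.
    return [(w, 1) for w in value.split()]

def reduce_phase(key, values):
    # key: a word; values: the list of counts emitted for that word.
    return key, sum(values)

def word_count(documents):
    # Shuffle: group intermediate pairs by key, then reduce each group.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_phase(name, contents):
            intermediate[word].append(count)
    return dict(reduce_phase(w, vs) for w, vs in intermediate.items())

docs = {"a.txt": "hello world hello", "b.txt": "world of mapreduce"}
print(word_count(docs))  # {'hello': 2, 'world': 2, 'of': 1, 'mapreduce': 1}
```

In the real system, the map calls run in parallel on many machines and the grouping is done by the framework; the programmer only writes the two functions.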

 

If you have read about the vector space model, you will recognize that this is essentially the computation behind TF and IDF.
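For reference, a minimal TF-IDF computation over a toy corpus — the documents and helper names below are invented for illustration, and this uses the common definitions tf(t, d) = count(t, d) / len(d) and idf(t) = log(N / df(t)):

```python
import math

# Toy corpus: each document is a list of words.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    # Term frequency: occurrences of term, normalized by document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log of (corpus size / docs containing term).
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "cat" appears once in a 6-word document and in 1 of the 3 documents.
print(round(tf_idf("cat", docs[0], docs), 4))  # (1/6) * ln(3) ≈ 0.1831
```

Counting word frequencies with MapReduce, as in the pseudocode above, gives you exactly the raw counts these formulas need at corpus scale.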

Google's MapReduce library is implemented in C++. The paper "MapReduce: Simplified Data Processing on Large Clusters" also contains a piece of real MapReduce code.
