MapReduce: Google's Human Cannon
The most authoritative introduction to MapReduce available online is Jeffrey Dean and Sanjay Ghemawat's paper "MapReduce: Simplified Data Processing on Large Clusters," which you can download from labs.google.com.
For a company like Google that needs to analyze and process massive amounts of data, ordinary programming methods are not enough, so Google developed MapReduce. In simple terms, MapReduce's syntax resembles Lisp. Using the MapReduce model, you specify a map function that processes key/value data and generates intermediate key/value pairs, then a reduce function that merges all intermediate pairs sharing the same key to produce the final result. Google's MapReduce is a programming tool that runs on thousands of machines to process terabytes of data.
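To make the model concrete, here is a minimal single-machine sketch of the map/group/reduce pipeline described above. All names (`run_mapreduce`, `map_fn`, `reduce_fn`) and the toy data are my own illustration, not anything from the paper:

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Single-machine sketch of the MapReduce model: map each
    (key, value) input to intermediate key/value pairs, group the
    intermediate pairs by key, then reduce each group."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Hypothetical toy job: sum word lengths, grouped by first letter.
def map_fn(key, value):
    for word in value.split():
        yield word[0], len(word)

def reduce_fn(key, values):
    return sum(values)

result = run_mapreduce(map_fn, reduce_fn, [("doc1", "apple avocado banana")])
# result: {'a': 12, 'b': 6}
```

The real system does the same three phases, but the grouping ("shuffle") and both function applications happen in parallel across thousands of machines.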
With a programming model like MapReduce, a program can be executed in parallel, distributed across a cluster of machines. Just as Java programmers can largely ignore memory management, MapReduce programmers do not need to worry about how the massive data set is distributed across machines, what to do when a machine involved in the computation fails, or how the machines coordinate with one another.
For example, when I was working on Beta 1 of my Bayesian forum-spam-blocking demo system, I needed to calculate the frequency of each word in the sample data. My approach was to first split the text into words and then tally them with a hash table. But if I ran into terabytes of data, my Celeron CPU could never cope. What would this look like in MapReduce?
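The single-machine approach described above (split, then tally in a hash table) can be sketched in a few lines; the function name and sample text here are my own:

```python
from collections import Counter

def word_frequencies(text):
    """Naive single-machine word count: split the text into words
    and tally them in a hash table. Fine for a small sample of
    forum posts, hopeless for terabytes on one CPU."""
    return Counter(text.lower().split())

freqs = word_frequencies("spam ham spam")
# freqs["spam"] == 2, freqs["ham"] == 1
```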
The following is a pseudocode implementation:
Step 1:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

Step 2:

    reduce(String key, Iterator values):
      // key: a word
      // values: frequency counts for that word
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
If you have read about the vector space model, you will recognize this as the kind of computation that underlies TF and IDF.
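For readers who have not seen the vector space model: one common TF-IDF weighting multiplies a term's in-document frequency by the log of its inverse document frequency. The word counts produced by the job above supply the TF side. This is a hedged sketch of one standard formulation, not the only one:

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """One common TF-IDF weighting: raw term frequency times
    log inverse document frequency."""
    tf = term_count / doc_length          # how often the term appears in this document
    idf = math.log(num_docs / docs_with_term)  # rarer terms across the corpus weigh more
    return tf * idf

# A term appearing in every document gets weight 0 (log 1 == 0),
# which is why stop words like "the" carry no signal.
```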
Google's MapReduce package is implemented in C++. The paper "MapReduce: Simplified Data Processing on Large Clusters" also contains a piece of real MapReduce code.