While mentoring at Hackbright, I was asked to explain the MapReduce algorithm to students with limited technical backgrounds, so I came up with an example to illustrate how it works.
Example
You want to count the number of spades in a stack of cards. The intuitive way is to check each card yourself, one by one, and count how many are spades.
The MapReduce method is:
Deal the stack out to all the players present.
Have each player count how many of the cards in their hand are spades and report that number to you.
Add up all the numbers the players report to get the final answer.
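The three steps above can be sketched in Python. This is a toy sketch, not real MapReduce; the deck representation and the choice of four players are my own assumptions:

```python
import random

# Build a toy deck: 52 cards, each tagged with its suit.
deck = [(suit, rank) for suit in ["spades", "hearts", "diamonds", "clubs"]
        for rank in range(1, 14)]
random.shuffle(deck)

# Step 1: deal the stack out to 4 "players" (chunks of the deck).
players = [deck[i::4] for i in range(4)]

# Step 2 (map): each player counts the spades in their own hand.
counts = [sum(1 for suit, _ in hand if suit == "spades") for hand in players]

# Step 3 (reduce): add up the numbers the players report.
total_spades = sum(counts)
print(total_spades)  # → 13
```

However the deck is shuffled and dealt, the reduce step always recovers the same answer: 13 spades.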
Background
In 2004, Google published a paper on MapReduce, an algorithm for analyzing enormous amounts of data. Whenever you hear the term "big data," it refers to a problem that is too big to be stored or analyzed on a single machine. MapReduce can solve most large-scale analysis problems by spreading the computation across a cluster of computers, and Hadoop provides the most popular way to manage large data sets with the MapReduce algorithm.
So whenever you hear "big data," it usually means Hadoop is being used to store the data, and MapReduce is being used to extract and analyze it.
Split
MapReduce combines two classic functions:
Mapping applies the same operation to every item in a collection. For example, if you want to multiply every cell in a spreadsheet column by two, applying that function to each cell individually is a mapping.
Reducing traverses the items in a collection to produce a single combined result. For example, summing a column of numbers is a reducing task.
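In Python, the two halves correspond directly to the built-in `map` and `functools.reduce`. A minimal sketch using the text's own examples (the sample column of numbers is made up):

```python
from functools import reduce

column = [3, 1, 4, 1, 5]

# Mapping: apply the same operation to every item (double each cell).
doubled = list(map(lambda x: x * 2, column))
print(doubled)  # → [6, 2, 8, 2, 10]

# Reducing: traverse the collection to produce one combined result (a sum).
total = reduce(lambda acc, x: acc + x, column, 0)
print(total)  # → 14
```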
Re-examining the example above
Re-examine our original example of distributed card-counting, and we have a basic approach to MapReduce data analysis. A friendly caveat: this is not a rigorous example. In this example, the people represent computers; because they work at the same time, they form a cluster. In most practical applications, we assume the data is already on each computer, that is, dealing out the cards is not a MapReduce step. (In fact, how to store files across a cluster of computers is the real core of Hadoop.)
By giving the cards to multiple players and having them count, you perform the operation in parallel, because every player counts at the same time. It also makes the work distributed, because many different people can work on the same problem without needing to know what their neighbors are doing.
By telling everyone to count, you map a task onto each card. You don't have the players pass the spades back to you; instead, they reduce what you want down to a single number each.
Another interesting point is how evenly the cards are distributed. MapReduce assumes the data has been shuffled; if all the spades end up in one person's hand, that person may count much more slowly than everyone else.
With enough people, it's fairly easy to ask more interesting questions, such as "What is the average value (by blackjack scoring) of this stack of cards?" You can get the answer by merging two questions: "What is the total value of all the cards?" and "How many cards do we have?" Divide the total by the number of cards to get the average.
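The blackjack-average question can be answered by merging the two sub-questions exactly as described. A small sketch, assuming one common blackjack convention (face cards count as 10, aces as 1):

```python
deck = [(suit, rank) for suit in ["spades", "hearts", "diamonds", "clubs"]
        for rank in range(1, 14)]

# Map each card to its blackjack value (face cards 10, aces 1 — an assumption).
values = [min(rank, 10) for _, rank in deck]

# Question 1: what is the total value of all the cards? (one reduce)
total_value = sum(values)

# Question 2: how many cards do we have? (another reduce)
num_cards = len(values)

# Merge the two answers to get the average.
average = total_value / num_cards
print(average)
```

Both sub-questions are independent reduces, so in a real cluster each could be computed in parallel across the same partitioned data before the final division.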
Conclusion
The real MapReduce machinery is much more complex than this, but the main idea holds: a large amount of data is analyzed by distributed computation. Whether at Facebook, NASA, or a small startup, MapReduce is now the mainstream way to analyze Internet-scale data. Interestingly, MapReduce tends to slow down beyond roughly 10 PB of data, so at this year's I/O conference Google reported that MapReduce is no longer enough for them; see "Google abandons MapReduce, introduces its replacement Cloud Dataflow."