How to Explain the MapReduce Algorithm Simply


While mentoring at Hackbright, I was asked to explain the MapReduce algorithm to students with limited technical backgrounds, so I came up with an example to illustrate how it works.

Example

You want to count the number of spades in a deck of cards. The intuitive way is to go through the cards one by one yourself and count how many are spades.

The MapReduce method is:

Divide the deck among all the players present.

Have each player count how many of the cards in their hand are spades and report that number to you.

Add up the numbers the players report to get the final total.
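The three steps above can be sketched in a few lines of Python. This is only an illustration of the card analogy, not how Hadoop is implemented; the helper names (build_deck, deal, count_spades) and the choice of four players are assumptions made for the example.

```python
SUITS = ("spades", "hearts", "diamonds", "clubs")
RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")

def build_deck():
    """Return a standard 52-card deck as (rank, suit) tuples."""
    return [(rank, suit) for suit in SUITS for rank in RANKS]

def deal(deck, num_players):
    """Step 1: divide the deck among the players, round-robin."""
    return [deck[i::num_players] for i in range(num_players)]

def count_spades(hand):
    """Step 2: one player counts the spades in their own hand."""
    return sum(1 for _rank, suit in hand if suit == "spades")

hands = deal(build_deck(), num_players=4)            # hypothetical: 4 players
reported = [count_spades(hand) for hand in hands]    # each player reports a number
total_spades = sum(reported)                         # Step 3: add up the reports

print(total_spades)  # 13
```

Here the players are simulated one after another in a loop; in a real cluster each count would run on a different machine at the same time.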

Background

In 2004, Google published the MapReduce algorithm, which can analyze very large amounts of data. Whenever you hear the term "big data," it refers to a problem too large to be stored or analyzed on a single machine. MapReduce can solve most analysis problems involving large data by distributing the computation across a cluster of computers. Hadoop provides the most popular way to use MapReduce to handle large data, and MapReduce is now mainstream.

So whenever you hear "big data," it probably means that Hadoop is being used to store the data, and it usually means that MapReduce is being used to extract and retrieve it.

Split

MapReduce combines two classic functions:

Mapping applies the same operation to each item in a collection. For example, if you want to multiply every cell in a spreadsheet by two, applying that function to each cell individually is mapping.

Reducing traverses the items in a collection and returns a single combined result. For example, computing the sum of a column of numbers in a spreadsheet is reducing.
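As a small illustration of the two functions, the sketch below uses Python's built-in map and functools.reduce on a made-up column of spreadsheet cells. The variable names and values are assumptions chosen for the example.

```python
from functools import reduce
from operator import add

cells = [1, 2, 3, 4]  # hypothetical spreadsheet column

# Mapping: apply the same operation (multiply by two) to every cell.
doubled = list(map(lambda x: x * 2, cells))

# Reducing: walk the collection and fold it into one combined result (the sum).
column_sum = reduce(add, cells, 0)

print(doubled)     # [2, 4, 6, 8]
print(column_sum)  # 10
```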

Revisiting the example above

Looking again at our original example of counting cards in a distributed way, we have the basic approach of MapReduce data analysis. A caveat: this is not a rigorous example. Here the people represent computers, and because they work at the same time, they form a cluster. In most real applications, we assume the data is already on each computer, so distributing the cards is not a MapReduce step. (In fact, how files are stored across a cluster of computers is the real core of Hadoop.)

By giving the cards to multiple players and having them count, you perform the operation in parallel, because every player counts at the same time. This also makes the work distributed, because the many people working on the same problem do not need to know what their neighbors are doing.
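To make the idea of independent workers concrete, here is a minimal sketch using Python's concurrent.futures.ThreadPoolExecutor, with one worker per player. The thread pool is a stand-in chosen for the example; a real MapReduce cluster spreads the work across separate machines, and the names below are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

SUITS = ("spades", "hearts", "diamonds", "clubs")
RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")

def count_spades(hand):
    """One player's job: look only at their own hand."""
    return sum(1 for _rank, suit in hand if suit == "spades")

deck = [(rank, suit) for suit in SUITS for rank in RANKS]
num_players = 4  # hypothetical number of players
hands = [deck[i::num_players] for i in range(num_players)]

# Each worker receives one hand and never sees another worker's cards.
with ThreadPoolExecutor(max_workers=num_players) as pool:
    reported = list(pool.map(count_spades, hands))

total_spades = sum(reported)
print(total_spades)  # 13
```

No worker needs any information from another worker, which is what lets the counting be spread out.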

By telling everyone to count, you are mapping the task of examining each card. You don't have the players hand the spades to you; instead, you have them reduce what you want to a single number.

Another interesting point is how evenly the cards are distributed. MapReduce assumes that the data is shuffled: if all the spades ended up in one hand, that player might count much more slowly than the others.

With enough people, it's fairly easy to ask more interesting questions, such as "What is the average value of the deck (using blackjack scoring)?" You can get the answer by combining two questions: "What is the total value of all the cards?" and "How many cards do we have?" Dividing the total value by the number of cards gives the average.
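The effect of an unshuffled deck can be seen in a small sketch. It deals a sorted deck and a shuffled deck in contiguous chunks and compares how many spades land in each hand. The chunking helper, the four players, and the fixed random seed are all assumptions for the illustration.

```python
import random

SUITS = ("spades", "hearts", "diamonds", "clubs")
RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")

def chunk(deck, num_players):
    """Hand out contiguous blocks of cards (assumes an even split)."""
    size = len(deck) // num_players
    return [deck[i * size:(i + 1) * size] for i in range(num_players)]

def count_spades(hand):
    return sum(1 for _rank, suit in hand if suit == "spades")

sorted_deck = [(rank, suit) for suit in SUITS for rank in RANKS]  # all spades first
shuffled_deck = sorted_deck[:]
random.Random(0).shuffle(shuffled_deck)  # fixed seed so runs are repeatable

sorted_counts = [count_spades(h) for h in chunk(sorted_deck, 4)]
shuffled_counts = [count_spades(h) for h in chunk(shuffled_deck, 4)]

print(sorted_counts)    # [13, 0, 0, 0]: one player holds every spade
print(shuffled_counts)  # spades spread across the hands
```

Both deals still add up to 13 spades; only the balance between players changes.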
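The average can be sketched by having each player report two numbers, the total value of their hand and how many cards they hold, and then combining those pairs. The scoring used here (ace as 11, face cards as 10) is an assumption, as are the function names and the choice of five players.

```python
from functools import reduce

SUITS = ("spades", "hearts", "diamonds", "clubs")
RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")

def card_value(rank):
    """Assumed scoring: ace counts 11, face cards count 10."""
    if rank == "A":
        return 11
    if rank in ("J", "Q", "K"):
        return 10
    return int(rank)

def summarize_hand(hand):
    """Map step: one player reports (total value, number of cards)."""
    return (sum(card_value(rank) for rank, _suit in hand), len(hand))

def combine(a, b):
    """Reduce step: merge two (total value, count) reports."""
    return (a[0] + b[0], a[1] + b[1])

deck = [(rank, suit) for suit in SUITS for rank in RANKS]
hands = [deck[i::5] for i in range(5)]  # 5 players, so hands are uneven

partials = [summarize_hand(hand) for hand in hands]
total_value, card_count = reduce(combine, partials, (0, 0))
average = total_value / card_count

print(total_value, card_count)  # 380 52
```

Players report a sum and a count rather than their own average because averaging the per-player averages would give the wrong answer when the hands differ in size.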

Conclusion

The mechanics of the MapReduce algorithm are much more complex than this, but the main idea is the same: a large amount of data is analyzed through distributed computation. Whether at Facebook, NASA, or a small startup, MapReduce is now the mainstream way to analyze Internet-scale data. Interestingly, MapReduce tends to slow down on more than 10 PB of data, so Google reported at its I/O conference this year that MapReduce is no longer sufficient for them; see the article "Google abandoned MapReduce, the introduction of alternative Cloud Dataflow."
