While mentoring at Hackbright, I was asked to explain the MapReduce algorithm to students with limited technical backgrounds, so I came up with an example to illustrate how it works.
Example
You want to count the number of spades in a stack of cards. The intuitive way is to check each card yourself, one by one, and count how many are spades.
The MapReduce method is:
Deal the stack out to all the players present.
Have each player count how many of the cards in their hand are spades and report that number to you.
Add up all the numbers the players report to get the final answer.
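The three steps above can be sketched in Python. This is a toy sketch, not real MapReduce; the deck representation and the choice of four players are my own assumptions:

```python
import random

# Build a toy deck: 52 cards, each tagged with its suit.
deck = [(suit, rank) for suit in ["spades", "hearts", "diamonds", "clubs"]
        for rank in range(1, 14)]
random.shuffle(deck)

# Step 1: deal the stack out to 4 "players" (chunks of the deck).
players = [deck[i::4] for i in range(4)]

# Step 2 (map): each player counts the spades in their own hand.
counts = [sum(1 for suit, _ in hand if suit == "spades") for hand in players]

# Step 3 (reduce): add up the numbers the players report.
total_spades = sum(counts)
print(total_spades)  # → 13
```

However the deck is shuffled and dealt, the reduce step always recovers the same answer: 13 spades.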
Background
In 2004, Google published a paper on MapReduce, an algorithm for analyzing enormous amounts of data. Whenever you hear the term "big data," it refers to a problem that is too big to be stored or analyzed on a single machine. MapReduce can solve most large-scale analysis problems by spreading the computation across a cluster of computers, and Hadoop provides the most popular way to manage large data sets with the MapReduce algorithm.
So whenever you hear "big data," it usually means Hadoop is being used to store the data, and MapReduce is being used to extract and analyze it.
Split
MapReduce combines two classic functions:
Mapping applies the same operation to every item in a collection. For example, if you want to multiply every cell in a spreadsheet column by two, applying that function to each cell individually is a mapping.
Reducing traverses the items in a collection to produce a single combined result. For example, summing a column of numbers is a reducing task.
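In Python, the two halves correspond directly to the built-in `map` and `functools.reduce`. A minimal sketch using the text's own examples (the sample column of numbers is made up):

```python
from functools import reduce

column = [3, 1, 4, 1, 5]

# Mapping: apply the same operation to every item (double each cell).
doubled = list(map(lambda x: x * 2, column))
print(doubled)  # → [6, 2, 8, 2, 10]

# Reducing: traverse the collection to produce one combined result (a sum).
total = reduce(lambda acc, x: acc + x, column, 0)
print(total)  # → 14
```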
Re-examining the example above
Re-examine our original example of distributed card-counting, and we have a basic approach to MapReduce data analysis. A friendly caveat: this is not a rigorous example. In this example, the people represent computers; because they work at the same time, they form a cluster. In most practical applications, we assume the data is already on each computer, that is, dealing out the cards is not a MapReduce step. (In fact, how to store files across a cluster of computers is the real core of Hadoop.)
By giving the cards to multiple players and having them count, you perform the operation in parallel, because every player counts at the same time. It also makes the work distributed, because many different people can work on the same problem without needing to know what their neighbors are doing.
By telling everyone to count, you map a task onto each card. You don't have the players pass the spades back to you; instead, they reduce what you want down to a single number each.
Another interesting point is how evenly the cards are distributed. MapReduce assumes the data has been shuffled; if all the spades end up in one person's hand, that person may count much more slowly than everyone else.
With enough people, it's fairly easy to ask more interesting questions, such as "What is the average value (by blackjack scoring) of this stack of cards?" You can get the answer by merging two questions: "What is the total value of all the cards?" and "How many cards do we have?" Divide the total by the number of cards to get the average.
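The blackjack-average question can be answered by merging the two sub-questions exactly as described. A small sketch, assuming one common blackjack convention (face cards count as 10, aces as 1):

```python
deck = [(suit, rank) for suit in ["spades", "hearts", "diamonds", "clubs"]
        for rank in range(1, 14)]

# Map each card to its blackjack value (face cards 10, aces 1 — an assumption).
values = [min(rank, 10) for _, rank in deck]

# Question 1: what is the total value of all the cards? (one reduce)
total_value = sum(values)

# Question 2: how many cards do we have? (another reduce)
num_cards = len(values)

# Merge the two answers to get the average.
average = total_value / num_cards
print(average)
```

Both sub-questions are independent reduces, so in a real cluster each could be computed in parallel across the same partitioned data before the final division.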
Conclusion
The real MapReduce machinery is much more complex than this, but the main idea holds: a large amount of data is analyzed by distributed computation. Whether at Facebook, NASA, or a small startup, MapReduce is now the mainstream way to analyze Internet-scale data. Interestingly, MapReduce tends to slow down beyond roughly 10 PB of data, so at this year's I/O conference Google reported that MapReduce is no longer enough for them; see "Google abandons MapReduce, introduces its replacement Cloud Dataflow."