MapReduce is inspired by functional programming, where map and reduce are two common functions. The map function applies an operation or function to each element in a list. For example, applying a multiply-by-two function to the list [1, 2, 3, 4] generates another list, [2, 4, 6, 8]. The original list is not modified when the function is applied. Functional programming holds that data should be kept immutable, which avoids the problems of sharing mutable data among multiple processes or threads. This means that, although the map function demonstrated here is very simple, two or more threads can execute it on the same list simultaneously, because the list itself never changes.
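As a minimal sketch of the map behavior described above, the following Python snippet applies a multiply-by-two function to the list and checks that the original list is untouched. The function name `multiply_by_two` and the use of a two-worker thread pool are illustrative assumptions, not part of the original text.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_by_two(x):
    """Return twice the input value (hypothetical helper for this sketch)."""
    return x * 2


numbers = [1, 2, 3, 4]

# map produces a new list; the input list is never modified.
doubled = list(map(multiply_by_two, numbers))
print(doubled)  # [2, 4, 6, 8]
print(numbers)  # [1, 2, 3, 4]

# Two threads can map over the same list at the same time,
# because neither of them writes to it.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(lambda: list(map(multiply_by_two, numbers)))
        for _ in range(2)
    ]
    results = [future.result() for future in futures]

print(results)  # [[2, 4, 6, 8], [2, 4, 6, 8]]
```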
Alongside map, functional programming also has the concept of a reduce function. The more general name for reduce in functional programming is fold; it is also called accumulate, compress, or inject. A reduce or fold function combines all elements in a data structure (such as a list) by repeatedly applying a function, and returns a single result. For example, when a sum reduce is executed on the map output list [2, 4, 6, 8], a single output value of 20 is obtained.
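The reduce step can be sketched in Python with `functools.reduce`, which folds a list into one value. The starting value of 0 and the variable names are assumptions made for illustration.

```python
from functools import reduce
import operator

doubled = [2, 4, 6, 8]

# Fold the list from left to right: ((((0 + 2) + 4) + 6) + 8)
total = reduce(operator.add, doubled, 0)
print(total)  # 20
```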
Map and reduce functions can be used together to process list data. First, one function is applied to each member of the list, and then an aggregate function is applied to the list produced by that transformation.
The same simple idea can be applied to large data sets. It only needs a slight modification to work on a collection of tuples or key-value pairs. The map function applies a function to each key-value pair in the collection and generates a new collection. The reduce function then aggregates the newly generated collection to compute the final result. A simple example illustrates the whole process. Suppose there is a collection of key-value pairs:
[{"94303":"Tom"},{"94303":"Jane"},{"94301":"Arun"},{"94302":"Chen"}]
The key is the zip code, and the value is the name of a resident in that zip code. If a map function is executed on the collection to gather the names of all residents under each zip code, it outputs the following:
[{"94303":["Tom","Jane"]},{"94301":["Arun"]},{"94302":["Chen"]}]
Then, a reduce function is executed on the preceding output to count the residents in each zip code. The final output is as follows:
[{"94303":2},{"94301":1},{"94302":1}]
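The zip code example can be reproduced with a short Python sketch. It follows the two steps as the text describes them: a map step that groups names by zip code, and a reduce step that counts the names in each group. The function names `map_group_by_zip` and `reduce_count` are hypothetical, and the sketch runs in a single process rather than on a cluster.

```python
from collections import defaultdict

residents = [
    {"94303": "Tom"},
    {"94303": "Jane"},
    {"94301": "Arun"},
    {"94302": "Chen"},
]


def map_group_by_zip(pairs):
    """Group resident names under their zip code (hypothetical map step)."""
    grouped = defaultdict(list)
    for pair in pairs:
        for zip_code, name in pair.items():
            grouped[zip_code].append(name)
    # Emit one single-key dict per zip code, in first-seen order.
    return [{zip_code: names} for zip_code, names in grouped.items()]


def reduce_count(grouped_pairs):
    """Replace each list of names with its length (hypothetical reduce step)."""
    return [
        {zip_code: len(names)}
        for pair in grouped_pairs
        for zip_code, names in pair.items()
    ]


mapped = map_group_by_zip(residents)
print(mapped)
# [{'94303': ['Tom', 'Jane']}, {'94301': ['Arun']}, {'94302': ['Chen']}]

counts = reduce_count(mapped)
print(counts)
# [{'94303': 2}, {'94301': 1}, {'94302': 1}]
```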
1. The model is similar to multithreading: several independent tasks are executed in parallel. The difference is that multithreading is vertical scaling (more work on one machine), while MapReduce is horizontal scaling (work spread across many machines).
2. Serialization is required to transmit data between nodes. For more efficient serialization, the standard Java data types are not used; for example, Text replaces String.
Suitable scenarios
Problems that can be split into several tasks with no dependencies between the tasks, for example:
1. Finding the most popular words
- Each task counts the frequency of the words in its own portion of the data.
- The counts from all tasks are merged, and the top K words are taken from the merged result.
2. Bayesian Classification
Bayesian classification is a statistical classification method that uses probability and statistics to classify data. The method consists of two steps: training on samples, and classification.
Three MapReduce jobs can be used for implementation.
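The most-popular-words scenario (item 1 above) can be sketched as follows. Each chunk of text stands in for the input of one independent task; the sample chunks, the helper names `count_words` and `merge_counts`, and K = 2 are all assumptions chosen for illustration.

```python
from collections import Counter
from functools import reduce


def count_words(chunk):
    """Map step: count word frequencies within one chunk of text."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


# Each string plays the role of one task's share of the input.
chunks = [
    "apple banana apple",
    "banana apple cherry",
    "apple banana",
]

# The per-chunk counts do not depend on each other,
# so they could be computed on different nodes.
partial_counts = [count_words(chunk) for chunk in chunks]

total_counts = reduce(merge_counts, partial_counts, Counter())
top_k = total_counts.most_common(2)
print(top_k)  # [('apple', 4), ('banana', 3)]
```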
Unsuitable scenarios
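Problems in which each step depends on the results of earlier steps. The most typical example is the Fibonacci series: each term depends on the two terms before it, so the computation cannot be split into independent tasks.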
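A small Python sketch makes the dependency visible. Every loop iteration needs the two values produced by the previous iterations, so the iterations cannot be handed to separate tasks the way the word counts could. The function name `fibonacci` and the choice of starting from 0 are assumptions for this illustration.

```python
def fibonacci(n):
    """Return the first n Fibonacci terms, starting from 0."""
    terms = []
    previous, current = 0, 1
    for _ in range(n):
        terms.append(previous)
        # The next term needs both earlier values, so this step
        # cannot start before the previous step has finished.
        previous, current = current, previous + current
    return terms


print(fibonacci(8))  # [0, 1, 1, 2, 3, 5, 8, 13]
```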