3.1 Local Aggregation
In a data-intensive distributed processing environment, moving intermediate results from the processes that produce them to the processes that ultimately consume them is an important aspect of synchronization. In a cluster environment, with the exception of embarrassingly parallel problems, intermediate data must be transmitted over the network. Furthermore, in Hadoop intermediate results are first written to local disk before being sent over the network. Since network and disk latencies are expensive relative to other operations, reducing the amount of intermediate data translates into gains in algorithmic efficiency. In MapReduce, local aggregation of intermediate results is one way to achieve this. Through the use of combiners and the ability to preserve state across multiple inputs, it is often possible to substantially reduce the number of key-value pairs that need to be shipped from the mappers to the reducers.
3.1.1 Combiners and In-Mapper Combining
In Section 2.2 we used the simple word count example to illustrate various techniques for local aggregation. For convenience, Figure 3.1 repeats the pseudocode of the basic algorithm: the mapper emits a key-value pair for each term in the document, with the term itself as the key and the integer one as the value; the reducer sums up all the partial counts for each term.
1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term t ∈ doc d do
4:       Emit(term t, count 1)
1: class Reducer
2:   method Reduce(term t, counts [c1, c2, ...])
3:     sum ← 0
4:     for all count c ∈ counts [c1, c2, ...] do
5:       sum ← sum + c
6:     Emit(term t, count sum)
Figure 3.1: Pseudocode of the basic word count algorithm in MapReduce.
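To make the dataflow concrete, the basic algorithm can be simulated in plain Python (a minimal sketch, not Hadoop code; the runner, document set, and function names below are illustrative inventions, with the shuffle phase modeled by grouping values by key):

```python
from collections import defaultdict

def map_basic(docid, doc):
    """Mapper from Figure 3.1: emit (term, 1) for every term occurrence."""
    for term in doc.split():
        yield term, 1

def reduce_counts(term, counts):
    """Reducer from Figure 3.1: sum all partial counts for a term."""
    yield term, sum(counts)

def run_job(documents, mapper, reducer):
    """Toy MapReduce runner: map, shuffle (group values by key), reduce."""
    groups = defaultdict(list)
    for docid, doc in documents:
        for key, value in mapper(docid, doc):
            groups[key].append(value)
    return dict(kv for key in groups for kv in reducer(key, groups[key]))

docs = [(1, "a dog and a cat"), (2, "the dog barks")]
print(run_job(docs, map_basic, reduce_counts))
# {'a': 2, 'dog': 2, 'and': 1, 'cat': 1, 'the': 1, 'barks': 1}
```

Note that the mapper emits one pair per term *occurrence*; the volume of intermediate data here scales with the total length of the collection, which is what the techniques below aim to reduce.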
The first technique for local aggregation is the combiner, already discussed in Section 2.4. Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers; they can be thought of as "mini-reducers" that process the output of the mappers. In this example, the combiner sums up the term counts emitted by each map task. This substantially reduces the number of key-value pairs that need to be shipped across the network, from the number of term occurrences in the collection to (at most) the number of distinct terms in the collection.
1: class Mapper
2:   method Map(docid a, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1            // tally counts for entire document
6:     for all term t ∈ H do
7:       Emit(term t, count H{t})
Figure 3.2: Improved MapReduce word count algorithm that uses an associative array to aggregate term counts on a per-document basis. The reducer is the same as in Figure 3.1.
Figure 3.2 shows an improvement over this basic algorithm (the mapper is modified, but the reducer is unchanged and therefore not repeated). An associative array (e.g., a Map in Java) is introduced inside the mapper to tally the counts of terms within a single document: instead of emitting a key-value pair for every occurrence of every term, this version of the algorithm emits one pair per distinct term in the document. Since some terms tend to appear repeatedly within the same document (for example, a document about dogs is likely to contain the word "dog" many times), this can substantially reduce the number of key-value pairs emitted, especially for long documents.
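The per-document aggregation of Figure 3.2 maps naturally onto Python's `collections.Counter`; this sketch (only the modified mapper, with an invented example document) shows the reduction in emitted pairs:

```python
from collections import Counter

def map_per_document(docid, doc):
    """Mapper from Figure 3.2: tally term counts within one document,
    then emit a single (term, count) pair per distinct term."""
    h = Counter(doc.split())          # local tally for this document only
    for term, count in h.items():
        yield term, count

# A document mentioning "dog" three times yields one pair for "dog",
# not three separate (dog, 1) pairs.
pairs = dict(map_per_document(1, "dog bites dog and dog runs"))
print(pairs)  # {'dog': 3, 'bites': 1, 'and': 1, 'runs': 1}
```

The reducer is unchanged: it still sums whatever partial counts arrive, so correctness does not depend on how much aggregation the mapper performed.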
1: class Mapper
2:   method Initialize
3:     H ← new AssociativeArray
4:   method Map(docid a, doc d)
5:     for all term t ∈ doc d do
6:       H{t} ← H{t} + 1
7:   method Close
8:     for all term t ∈ H do
9:       Emit(term t, count H{t})
Figure 3.3: Pseudocode of the MapReduce word count algorithm that exploits the "in-mapper combining" pattern. The reducer is the same as in Figure 3.1.
This basic idea can be taken one step further, as illustrated in the variant of the word count algorithm shown in Figure 3.3 (once again, only the mapper is modified). The workings of this algorithm depend critically on the details of how map and reduce tasks are executed in Hadoop. Recall that each map task instantiates a single Java mapper object that processes many key-value pairs. Before any key-value pairs are processed, the mapper's user-defined Initialize method is called; here we use it to create an associative array for holding term counts. Since this array persists across calls to the Map method (i.e., across input key-value pairs), we can continue to accumulate partial term counts in it over multiple documents, and emit key-value pairs only after the mapper has processed all documents. That is, emission of intermediate data is deferred until the Close method in the pseudocode runs. Recall that this API hook provides the opportunity to execute user-defined code after the Map method has been applied to every key-value pair in the map task's input.
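The Initialize/Map/Close lifecycle just described can be sketched as a small Python class (a hypothetical stand-in for the Hadoop mapper object, not actual Hadoop API):

```python
from collections import defaultdict

class InMapperCombiningMapper:
    """Lifecycle of the Figure 3.3 mapper: initialize() is called once,
    map() once per input pair, and close() after all input is consumed."""

    def initialize(self):
        self.h = defaultdict(int)     # partial counts across *all* documents

    def map(self, docid, doc):
        for term in doc.split():
            self.h[term] += 1         # no pairs emitted yet

    def close(self):
        for term, count in self.h.items():
            yield term, count         # emission deferred until the very end

mapper = InMapperCombiningMapper()
mapper.initialize()
for docid, doc in [(1, "dog cat"), (2, "dog dog")]:
    mapper.map(docid, doc)
print(dict(mapper.close()))           # {'dog': 3, 'cat': 1}
```

Note that nothing at all is emitted while documents are being processed; a single pair per distinct term across the whole input split comes out at close().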
With this technique, we are in effect incorporating the functionality of the combiner directly inside the mapper. There is no need to run a separate combiner, since all opportunities for local aggregation have already been exploited. This is a sufficiently common design pattern in MapReduce that it is worth giving it a name, "in-mapper combining", so that we can refer to the pattern conveniently throughout this book. We will see later how this pattern can be applied to a variety of problems. The pattern has two main advantages:
First, it gives the programmer control over when local aggregation occurs and exactly how it takes place. In contrast, the semantics of the combiner are underspecified in MapReduce. For example, Hadoop makes no guarantees about how many times a combiner will be invoked, or whether it will be invoked at all. The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it zero, one, or multiple times (possibly in the reduce phase as well). In some cases (although not in this particular example), such unpredictability is unacceptable, which is why programmers often choose to perform partial aggregation in the mappers themselves.
Second, in-mapper combining is usually more efficient than using actual combiners. One reason is the overhead associated with materializing the key-value pairs in the first place. Combiners reduce the amount of intermediate data shipped across the network, but they do not reduce the number of key-value pairs that the mappers emit to begin with. With the algorithm in Figure 3.2, intermediate key-value pairs are still generated on a per-document basis, only to be "compacted" later by the combiners. This process involves the creation and destruction of objects that are ultimately unnecessary (although the garbage collector can handle this quite efficiently), and possibly object serialization and deserialization as well (when the mappers' intermediate output fills the in-memory buffer and must be spilled to disk). In contrast, with in-mapper combining the mappers generate only those key-value pairs that actually need to be shipped over the network to the reducers.
The in-mapper combining pattern also has drawbacks. First, it breaks the functional programming underpinnings of MapReduce, since state is preserved across input key-value pairs. Ultimately, this is not a big deal, since pragmatic concerns for efficiency often trump theoretical purity, but the practical consequence matters: preserving state across inputs means that the algorithm's behavior may depend on the order in which input key-value pairs are encountered. This creates the potential for ordering-dependent bugs, which are difficult to debug on large datasets in the general case (although the correctness of in-mapper combining is easy to demonstrate for the word count example). Second, there is an important scalability bottleneck associated with the pattern: it critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in its input. In the word count example, the memory footprint is bounded by the vocabulary size, since a mapper may in theory encounter every term in the collection. Heaps' Law, a well-known result in information retrieval, accurately models the growth of vocabulary size as a function of collection size; the somewhat surprising fact is that the vocabulary never stops growing. Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the associative array holding the partial term counts no longer fits in memory.
One common solution to limiting memory usage when using the in-mapper combining technique is to "block" input key-value pairs and "flush" in-memory data structures periodically. The idea is simple: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs. This can be implemented straightforwardly with a counter variable that keeps track of the number of input key-value pairs processed. As an alternative, the mapper could monitor its own memory footprint and flush the intermediate key-value pairs once memory becomes scarce. In both approaches, either the block size or the memory usage threshold must be determined empirically: with too large a value the mapper may run out of memory, but with too small a value opportunities for local aggregation may be lost. Furthermore, in Hadoop physical memory is split among multiple tasks running on the same node; these tasks compete for this limited resource, but since they have no knowledge of one another, coordinating resource consumption effectively is difficult. In practice, one often encounters diminishing returns as the buffer size grows, so it is not worth the effort to search for an optimal buffer size (Jeff Dean, personal communication).
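The counter-based "flush every n pairs" variant can be sketched as follows (an illustrative Python sketch; the class name, the `flush_every` parameter, and its default of 100 are invented for this example, not recommended values):

```python
from collections import defaultdict

class FlushingMapper:
    """In-mapper combining with bounded memory: emit and clear the
    partial counts after every `flush_every` input key-value pairs."""

    def __init__(self, flush_every=100):
        self.flush_every = flush_every
        self.seen = 0                      # input pairs processed so far
        self.h = defaultdict(int)          # partial term counts
        self.emitted = []                  # stands in for Emit()

    def map(self, docid, doc):
        for term in doc.split():
            self.h[term] += 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self._flush()

    def _flush(self):
        self.emitted.extend(self.h.items())
        self.h.clear()                     # free the memory

    def close(self):
        self._flush()                      # emit whatever remains
        return self.emitted
```

With a small threshold such as `flush_every=2`, counts are emitted mid-stream and the remainder at close; the reducer remains correct because it simply receives more partial counts per term to sum.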
How much efficiency is gained through local aggregation in a MapReduce algorithm depends on the size of the intermediate key space, the distribution of keys, and the number of key-value pairs emitted by each individual map task. Opportunities for aggregation arise from multiple values sharing the same key (whether one uses combiners or the in-mapper combining pattern). In the word count example, local aggregation is effective because many words appear multiple times within a map task. Local aggregation is also an effective technique for dealing with reduce stragglers (see Section 2.3), i.e., reducers whose running time is much longer than the others because the intermediate value distribution is highly skewed. In the word count example we do not filter out frequently occurring words: therefore, without local aggregation, the reducer responsible for computing the count of "the" would have far more work to do than the typical reducer and would likely become a straggler. With local aggregation (whether via combiners or in-mapper combining), we substantially reduce the number of values associated with high-frequency terms, thereby alleviating the straggler problem.
3.1.2 Correctness of Algorithms with Local Aggregation
Although the use of combiners can yield dramatic reductions in algorithm running time, care must be taken in applying them. Since combiners in Hadoop are viewed as optional optimizations, the correctness of the algorithm must not depend on computations performed by the combiner or on the combiner running at all. In any MapReduce program, the reducer input key-value pair type must match the mapper output key-value pair type: this implies that the combiner's input and output key-value pair types must both match the mapper's output key-value pair type (which is also the reducer's input key-value pair type). In cases where the reduce computation is both commutative and associative, the reducer can be used (unmodified) as the combiner, as in the word count example. In the general case, however, reducers and combiners are not interchangeable.
Consider a simple example: we have a large dataset where the input keys are strings and the input values are integers, and we wish to compute the mean of all integers associated with the same key (rounded as appropriate). A real-world scenario might be the user logs of a popular website, where the keys are user IDs and the values capture some measure of user activity, such as the time spent in a session; the task would then correspond to computing the mean session length per user, which would be useful for understanding the demographic characteristics of the user base. Figure 3.4 shows the pseudocode of a simple algorithm for this task that does not use combiners. We use an identity mapper, which simply passes all input key-value pairs through to the reducers (appropriately grouped and sorted). The reducer keeps a running sum and a count of the integers encountered; this information is used to compute the mean once all values have been processed, and the mean is emitted as the output value of the reducer (with the input string as the key).
1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, integer r)

1: class Reducer
2:   method Reduce(string t, integers [r1, r2, ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all integer r ∈ integers [r1, r2, ...] do
6:       sum ← sum + r
7:       cnt ← cnt + 1
8:     ravg ← sum / cnt
9:     Emit(string t, integer ravg)
Figure 3.4: MapReduce pseudocode for calculating the mean of values associated with the same key.
This algorithm does indeed work, but it suffers from the same drawbacks as the basic word count algorithm in Figure 3.1: it requires shipping all key-value pairs from the mappers to the reducers across the network, which is highly inefficient. Unlike in the word count example, however, the reducer cannot be used as a combiner in this case. Consider what would happen if we tried: the combiner would compute the mean of an arbitrary subset of the values associated with a key, and the reducer would then compute the mean of those partial means. To see why this is incorrect, note that:
Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))
In general, the mean of the means of arbitrary subsets of a set of numbers is not the same as the mean of the whole set, so this approach does not yield the correct result. (Of course, the result would happen to be correct if the combiner computed the mean over subsets of equal size, but that does not hold in general.)
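The inequality above is easy to verify numerically, as in this one-off sketch:

```python
def mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(xs) / len(xs)

full = mean([1, 2, 3, 4, 5])                    # 3.0
nested = mean([mean([1, 2]), mean([3, 4, 5])])  # mean(1.5, 4.0) = 2.75
print(full, nested)                             # 3.0 2.75
assert full != nested  # mean is not associative over unequal-size subsets
```

The subset means 1.5 and 4.0 carry no record of how many values they summarize, which is exactly the information the corrected algorithm in Figure 3.6 preserves by passing (sum, count) pairs instead.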
1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, integer r)

1: class Combiner
2:   method Combine(string t, integers [r1, r2, ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all integer r ∈ integers [r1, r2, ...] do
6:       sum ← sum + r
7:       cnt ← cnt + 1
8:     Emit(string t, pair (sum, cnt))

1: class Reducer
2:   method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     ravg ← sum / cnt
9:     Emit(string t, integer ravg)
Figure 3.5: An incorrect first attempt at using combiners to compute the mean of values associated with each key. The mismatch between the combiner's input and output key-value pair types violates the MapReduce programming model.
So how might we properly take advantage of combiners? Figure 3.5 shows an attempt. The mapper remains unchanged, but we have added a combiner that partially aggregates the values needed to compute the mean. The combiner receives each key along with its associated list of integer values, from which it computes the sum of those values and their count. The sum and count are packaged into a pair and emitted as the combiner's output, with the same string as the key. In the reducer, these partial sums and counts can then be combined to arrive at the mean. Up until now, all the keys and values in our algorithms have been primitives (strings, integers). However, there is nothing in MapReduce that prohibits the use of more complex types, including user-defined types; in fact, this represents a key technique in MapReduce algorithm design that we introduced at the beginning of this chapter. We will frequently encounter complex key-value pairs throughout the rest of this book.
Unfortunately, this algorithm will not work. Recall that combiners must have the same input and output key-value pair types, which must in turn match the mapper output type and the reducer input type. This is clearly not the case here. To understand why this restriction is necessary in the programming model, remember that combiners are optimizations that must not change the correctness of the algorithm. So let us remove the combiner and see what happens: the mapper's output value type is integer, so the reducer expects to receive a list of integers as values. But the reducer in Figure 3.5 actually expects a list of pairs! The correctness of the algorithm is therefore contingent on the combiner running on the output of the mappers, and more specifically, on the combiner running exactly once. Recall from our earlier discussion that Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times. This algorithm therefore violates the MapReduce programming model.
1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, pair (r, 1))

1: class Combiner
2:   method Combine(string t, pairs [(s1, c1), (s2, c2), ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     Emit(string t, pair (sum, cnt))

1: class Reducer
2:   method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     ravg ← sum / cnt
9:     Emit(string t, integer ravg)
Figure 3.6: MapReduce pseudocode for calculating the mean of values associated with each key. This implementation uses combiners correctly.
Figure 3.6 shows another implementation of the algorithm, and this time it is correct. In the mapper, we emit as the value a pair consisting of the integer and one; this corresponds to a partial sum over one instance together with a partial count of one. The combiner separately aggregates the partial sums and partial counts, and emits pairs containing the updated sums and counts. The reducer is similar to the combiner, except that it also computes the final mean. In essence, this algorithm transforms a non-associative operation (the mean of numbers) into an associative operation (the element-wise sum of a pair of numbers, with one additional division at the very end).
Let us verify the correctness of this algorithm by repeating the previous exercise: what would happen if no combiners were run? Without combiners, the mappers would send pairs (as values) directly to the reducers. There would be as many intermediate pairs as input key-value pairs, and each would consist of an integer and one. The reducers would still arrive at the correct sums and counts, and hence the means would be correct. Now add in the combiners: the algorithm remains correct no matter how many times they run, since the combiners merely aggregate partial sums and counts and pass them along to the reducers. Note that although the combiner's output key-value pair type must match the reducer's input key-value pair type, the reducer is free to emit key-value pairs of a different type.
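The verification above can be sketched concretely: because the element-wise sum of (sum, count) pairs is associative, grouping the mapper output in any way (i.e., running the combiner zero, one, or several times over arbitrary subsets) leaves the reducer's result unchanged. The `combine` helper below is an illustrative rendering of the shared combiner/reducer core from Figure 3.6:

```python
def combine(pairs):
    """Element-wise sum of (sum, count) pairs -- an associative operation
    shared by the combiner and the reducer in Figure 3.6."""
    total, cnt = 0, 0
    for s, c in pairs:
        total += s
        cnt += c
    return total, cnt

values = [1, 2, 3, 4, 5]
mapped = [(r, 1) for r in values]        # mapper output: pair (r, 1)

# No combiner vs. a combiner run over two arbitrary subsets:
direct = combine(mapped)
partial = combine([combine(mapped[:2]), combine(mapped[2:])])
assert direct == partial == (15, 5)      # same pair either way
print(direct[0] / direct[1])             # 3.0, the correct mean
```

Contrast this with the mean-of-means sketch earlier: here the counts travel with the sums, so no information is lost when partial results are merged.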
1: class Mapper
2:   method Initialize
3:     S ← new AssociativeArray
4:     C ← new AssociativeArray
5:   method Map(string t, integer r)
6:     S{t} ← S{t} + r
7:     C{t} ← C{t} + 1
8:   method Close
9:     for all term t ∈ S do
10:      Emit(term t, pair (S{t}, C{t}))
Figure 3.7: Pseudocode for computing the mean of values associated with each key, exploiting the in-mapper combining pattern. The reducer is the same as in Figure 3.6.
Finally, Figure 3.7 presents an even more efficient algorithm that exploits the in-mapper combining pattern. Inside the mapper, the partial sums and counts associated with each key are held in memory across input key-value pairs. Intermediate key-value pairs are emitted only after all input has been processed; as before, each value is a pair consisting of a sum and a count. The reducer is exactly the same as in Figure 3.6. Moving partial aggregation from the combiner into the mapper is subject to the tradeoffs discussed earlier in this section; in this case, however, the memory footprint of the data structures holding the intermediate data is likely to be modest, which makes this variant of the algorithm quite attractive.
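The Figure 3.7 mapper can be sketched as a Python class (a hypothetical stand-in for the mapper object, with invented user-ID/session-length records as sample input):

```python
from collections import defaultdict

class MeanInMapperCombiner:
    """Sketch of the Figure 3.7 mapper: per-key running sums (S) and
    counts (C) are held in memory; pairs are emitted only at close()."""

    def initialize(self):
        self.s = defaultdict(int)     # S{t}: running sum per key
        self.c = defaultdict(int)     # C{t}: running count per key

    def map(self, t, r):
        self.s[t] += r
        self.c[t] += 1                # nothing emitted yet

    def close(self):
        for t in self.s:
            yield t, (self.s[t], self.c[t])

mapper = MeanInMapperCombiner()
mapper.initialize()
for user, session_len in [("alice", 10), ("bob", 4), ("alice", 20)]:
    mapper.map(user, session_len)
print(dict(mapper.close()))           # {'alice': (30, 2), 'bob': (4, 1)}
```

The unmodified reducer of Figure 3.6 then sums these (sum, count) pairs per key and performs the final division, e.g. yielding a mean of 15 for "alice".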