Data-Intensive Text Processing with MapReduce, Chapter 3 (2): MapReduce Algorithm Design, 3.1 Local Aggregation

Source: Internet
Author: User
Tags: object serialization

3.1 Local Aggregation

 

In a data-intensive distributed processing environment, the most important aspect of synchronization is the exchange of intermediate results, from the processes that produce them to the processes that will ultimately consume them. In a cluster environment, with the exception of embarrassingly parallel problems, data must be transmitted over the network. Furthermore, in Hadoop, intermediate results are first written to local disk before being sent over the network. Since network and disk latencies are expensive relative to other operations, reducing the amount of intermediate data improves algorithmic efficiency. In MapReduce, local aggregation of intermediate results is one way to improve efficiency: through the use of combiners, and by taking advantage of the ability to preserve state across multiple inputs, it is often possible to substantially reduce the number of key-value pairs that need to be shuffled from the mappers to the reducers.

 

3.1.1 Combiners and In-Mapper Combining

 

In Section 2.2, we illustrated various techniques for local aggregation using the simple word count example. For convenience, Figure 3.1 reproduces the pseudocode of the basic algorithm: the mapper emits a key-value pair for each term in the document, with the term itself as the key and a value of 1; reducers sum up these partial counts to arrive at the total count for each term.

 

1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term t ∈ doc d do
4:       Emit(term t, count 1)

 

1: class Reducer
2:   method Reduce(term t, counts [c1, c2, ...])
3:     sum ← 0
4:     for all count c ∈ counts [c1, c2, ...] do
5:       sum ← sum + c
6:     Emit(term t, count sum)

Figure 3.1: Pseudocode of the basic MapReduce word count algorithm (a reproduction of Figure 2.3).

 

 

The first technique for local aggregation is the combiner, which was discussed in Section 2.4. Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers; they can be understood as "mini-reducers" that process the output of mappers. In this example, the combiner aggregates term counts within each map task. This reduces the number of key-value pairs that need to be transmitted across the network, from the total number of term occurrences in the collection down to the number of distinct terms seen by each map task.
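The flow just described can be sketched in plain, single-process Python (a toy illustration, not Hadoop code; all function names here are invented for the example):

```python
from collections import Counter

# Toy sketch of map -> combine -> reduce for word count. The combiner
# collapses duplicate keys within each map task's output, shrinking
# the number of pairs that would cross the network.

def map_phase(doc):
    # Emit (term, 1) for every term occurrence in the document.
    return [(term, 1) for term in doc.split()]

def combine(pairs):
    # Runs per map task: sum counts for duplicate keys locally.
    counts = Counter()
    for term, c in pairs:
        counts[term] += c
    return list(counts.items())

def reduce_phase(all_pairs):
    # Final aggregation across all map tasks.
    counts = Counter()
    for term, c in all_pairs:
        counts[term] += c
    return dict(counts)

docs = ["a b a", "b b c"]                  # two "map tasks"
raw = [p for d in docs for p in map_phase(d)]
combined = [p for d in docs for p in combine(map_phase(d))]
```

Running the reducer over either `raw` or `combined` yields identical totals, but `combined` contains fewer pairs, which is exactly why the combiner is a semantics-preserving optimization.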

 

1: class Mapper
2:   method Map(docid a, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1              // tally counts for entire document
6:     for all term t ∈ H do
7:       Emit(term t, count H{t})

Figure 3.2: Pseudocode of the improved MapReduce word count algorithm, which uses an associative array to aggregate term counts on a per-document basis. The reducer is the same as in Figure 3.1.

 

Figure 3.2 shows an improvement of the basic algorithm (the mapper is modified; the reducer is the same as in Figure 3.1). An associative array (for example, a Map in Java) is introduced inside the mapper to tally term counts within a single document: instead of emitting a key-value pair for every occurrence of every term, this version emits one key-value pair per distinct term in the document. Given that some words appear many times in the same document (for example, a document about dogs is likely to contain the word "dog" repeatedly), this can greatly reduce the number of key-value pairs emitted, especially for long documents.
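The Figure 3.2 mapper can be rendered as a short Python sketch (illustrative only, not the Hadoop Mapper API; the function name is invented):

```python
from collections import defaultdict

# Per-document aggregation: tally counts inside one document before
# emitting, so "dog ... dog ... dog" yields a single ("dog", 3) pair
# rather than three ("dog", 1) pairs.

def map_with_document_aggregation(docid, doc):
    h = defaultdict(int)             # H <- new AssociativeArray
    for term in doc.split():
        h[term] += 1                 # tally counts for entire document
    return list(h.items())           # emit once per distinct term

pairs = map_with_document_aggregation("d1", "dog cat dog dog")
```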

 

This basic idea can be taken one step further, as illustrated by the variant of the word count algorithm in Figure 3.3. The workings of this algorithm depend critically on how map and reduce tasks are executed in Hadoop, as described in Section 2.6. Recall that each map task generates a single Java mapper object that processes a block of key-value pairs. Before any key-value pairs are processed, the mapper API's Initialize method is called; this is a hook for user-specified code. In this case, we initialize an associative array to hold term counts. Since this array persists across calls of the Map method (one call per input key-value pair), we can continue to accumulate partial term counts across multiple documents, and emit key-value pairs only after the mapper has processed all documents. That is, emission of intermediate data is deferred until the Close method in the pseudocode runs. Recall that this API hook provides an opportunity to execute user-specified code after the Map method has been applied to every key-value pair of the input split assigned to the map task.

 

With this technique, we are in effect incorporating combiner functionality directly inside the mapper. There is no need to run a separate combiner, since all opportunities for local aggregation have already been exploited. This is a common enough design pattern in MapReduce that it deserves a name, "in-mapper combining", so that we can refer to it more conveniently throughout this book. We will see later how this pattern applies to a variety of problems. It has two main advantages:

 

First, it provides control over when local aggregation occurs and how exactly it takes place. In contrast, the semantics of the combiner are underspecified: combiners execute under the control of the MapReduce framework.

 

 

For example, Hadoop makes no guarantees about how many times the combiner will be applied, or that it will be applied at all. The combiner is provided to the execution framework as a semantics-preserving optimization; it may be used many times, once, or not at all (and possibly in the reduce phase as well). In some cases (although not in this particular example), such unpredictability is unacceptable, which is why programmers often choose to perform partial aggregation in the mappers themselves.

 

 

Second, in-mapper combining is typically more efficient than using actual combiners. One reason is the overhead of materializing the key-value pairs in the first place. Combiners reduce the intermediate data shuffled across the network, but they do not reduce the number of key-value pairs initially emitted by the mappers. With the algorithm in Figure 3.2, intermediate key-value pairs are still generated on a per-document basis, only to be "compacted" by the combiners. This involves unnecessary object creation and destruction (garbage collection takes time), as well as object serialization and deserialization (when intermediate key-value pairs fill the in-memory buffer holding map outputs and must be temporarily spilled to disk). With in-mapper combining, the mappers generate only those key-value pairs that actually need to be shipped over the network to the reducers.

 

 

There are, however, drawbacks to the in-mapper combining pattern. First, it breaks the functional programming underpinnings of MapReduce, since state is preserved across multiple input key-value pairs. Ultimately, this is not a big deal, since pragmatic concerns for efficiency often trump theoretical purity, but the practical consequences matter: preserving state across inputs means that the algorithm's behavior may depend on the order in which key-value pairs are encountered. This creates the potential for ordering-dependent bugs, which are difficult to debug on large datasets in the general case (although the correctness of in-mapper combining is easy to demonstrate for the word count example). Second, the in-mapper combining pattern has a fundamental scalability bottleneck: it critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in its input. In the word count example, the memory footprint is bounded by the vocabulary size, since a mapper could in principle encounter every term in the collection. Heaps' law, a well-known result in information retrieval, accurately models the growth of vocabulary size as a function of collection size; the somewhat surprising fact is that the vocabulary never stops growing. Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the associative array holding the partial term counts will no longer fit in memory.

 

1: class Mapper
2:   method Initialize
3:     H ← new AssociativeArray
4:   method Map(docid a, doc d)
5:     for all term t ∈ doc d do
6:       H{t} ← H{t} + 1              // tally counts across documents
7:   method Close
8:     for all term t ∈ H do
9:       Emit(term t, count H{t})

Figure 3.3: Pseudocode of the MapReduce word count algorithm using the "in-mapper combining" pattern. The reducer is the same as in Figure 3.1.
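The Initialize/Map/Close lifecycle of Figure 3.3 can be mimicked with a small Python class (an illustrative stand-in, not the Hadoop Mapper API; the class name is invented):

```python
from collections import defaultdict

# In-mapper combining: state persists across map() calls, and all
# key-value pairs are emitted only once, in close().

class InMapperCombiningMapper:
    def initialize(self):
        self.h = defaultdict(int)    # H <- new AssociativeArray

    def map(self, docid, doc):
        for term in doc.split():
            self.h[term] += 1        # tally counts across documents

    def close(self):
        # Deferred emission: one pair per distinct term in the split.
        return list(self.h.items())

m = InMapperCombiningMapper()
m.initialize()
m.map("d1", "a b a")
m.map("d2", "b c")
pairs = m.close()
```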

 

4. More precisely, the mapper needs to hold partial counts only for the terms that occur in its input split. However, as collection sizes increase, one will want to increase the input split size to limit the number of map tasks, which increases the memory footprint accordingly.

 

A common solution for limiting memory usage when using the in-mapper combining technique is to "block" input key-value pairs and "flush" the in-memory data structures periodically. The idea is simple: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs. This is straightforward to implement with a counter variable that tracks how many key-value pairs have been processed. As an alternative, the mapper could monitor its own memory footprint and flush intermediate key-value pairs once memory usage crosses a certain threshold. In both approaches, the value, whether block size or memory usage threshold, must be determined empirically: too large, and the mapper may run out of memory; too small, and opportunities for local aggregation may be lost. Furthermore, in Hadoop, physical memory is split among the multiple tasks running concurrently on the same node. These tasks compete for that finite resource, and since they are independent of each other, it is difficult to coordinate resource consumption effectively. In practice, growing buffers tend to yield diminishing returns during execution, so it is often not worth the effort to search for an optimal buffer size (per Jeff Dean, personal opinion).

 

 

In MapReduce algorithms, the extent to which local aggregation increases efficiency depends on the size of the intermediate key space, the distribution of the keys, and the number of key-value pairs emitted by each individual task. Opportunities for aggregation ultimately come from multiple values being associated with the same key (whether combiners or the in-mapper combining pattern is used). In the word count example, local aggregation is effective because many words appear multiple times within a map task. Local aggregation is also an effective technique for dealing with reduce stragglers (see Section 2.3) that result from a highly skewed distribution of values across intermediate keys. In the word count example, we do not filter frequently occurring words: therefore, without local aggregation, the reducer responsible for a very frequent word would have far more work to do than the typical reducer and would become a straggler. With local aggregation (either combiners or in-mapper combining), we substantially reduce the number of values associated with frequently occurring terms, which alleviates this straggler problem.

 

3.1.2 Algorithmic Correctness with Local Aggregation

 

Although the use of combiners can yield dramatic reductions in algorithm running time, care must be taken in applying them. Since combiners in Hadoop are optional optimizations, the correctness of the algorithm must not depend on the combiner's computations. In any MapReduce program, the reducer input key-value type must match the mapper output key-value type: this implies that the combiner's input and output key-value types must both match the mapper's output key-value type (which is the same as the reducer's input key-value type). In cases where the reduce computation is both commutative and associative, the reducer can be used (unmodified) as the combiner, as in the word count example. In the general case, however, combiners and reducers are not interchangeable.

 

Consider an example: we have a large dataset where the input keys are strings and the input values are integers, and we wish to compute the mean of all integers associated with the same key (rounded to the nearest integer). A real-world instance might be the user activity log of a popular website, where the key represents a user id and the value represents some measure of that user's activity, such as the time spent in a session. The task would then correspond to computing the mean session length on a per-user basis, which helps characterize the users. Figure 3.4 shows the pseudocode of a simple algorithm that accomplishes this task without combiners. We use an identity mapper, which simply passes all input key-value pairs through to the reducers (appropriately grouped and sorted). The reducer keeps a running sum and a count of the integers encountered; once all values have been processed, this information is used to compute the mean, which is emitted as the reducer's output value (with the input string as the key).

 

1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, integer r)

 

1: class Reducer
2:   method Reduce(string t, integers [r1, r2, ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all integer r ∈ integers [r1, r2, ...] do
6:       sum ← sum + r
7:       cnt ← cnt + 1
8:     ravg ← sum / cnt
9:     Emit(string t, integer ravg)

Figure 3.4: Pseudocode of the MapReduce algorithm for computing the mean of values associated with the same key.

 

 

This algorithm does work, but it has the same drawback as the word count algorithm in Figure 3.1: it requires shipping all key-value pairs from mappers to reducers across the network, which is highly inefficient. Unlike in the word count example, the reducer cannot be used as a combiner here. Imagine what would happen if we tried: the combiner would compute the mean of an arbitrary subset of values associated with each key, and the reducer would then compute the mean of those results. To see why this fails, note that:

Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))

In general, the mean of the means of arbitrary subsets of a dataset is not the same as the mean of the dataset. Therefore, this approach would not produce the correct result.
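A quick numeric check of the inequality above, in Python: averaging partial averages does not reproduce the true mean when the subsets differ in size.

```python
# Mean is not associative: Mean(Mean(1,2), Mean(3,4,5)) != Mean(1..5).

def mean(xs):
    return sum(xs) / len(xs)

true_mean = mean([1, 2, 3, 4, 5])                        # 3.0
mean_of_means = mean([mean([1, 2]), mean([3, 4, 5])])    # (1.5 + 4.0) / 2
assert true_mean != mean_of_means
```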

 

1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, integer r)

 

1: class Combiner
2:   method Combine(string t, integers [r1, r2, ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all integer r ∈ integers [r1, r2, ...] do
6:       sum ← sum + r
7:       cnt ← cnt + 1
8:     Emit(string t, pair (sum, cnt))    // separate sum and count

 

1: class Reducer
2:   method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     ravg ← sum / cnt
9:     Emit(string t, integer ravg)

Figure 3.5: Pseudocode of an incorrect first attempt at introducing combiners to compute the mean of values associated with each key. The mismatch between the combiner's input and output key-value types violates the MapReduce programming model.

 

So how might we properly take advantage of combiners? Figure 3.5 shows an attempt. The mapper remains unchanged, but we have added a combiner that partially aggregates the numeric components needed to arrive at the mean. The combiner receives each string key and the associated list of integer values, from which it computes the sum of those values and the number of integers encountered (i.e., the count). The sum and count are packaged into a pair and emitted as the combiner's output, with the same string as the key. Note that until now, all the keys and values in our algorithms have been primitives (strings, integers); however, MapReduce does not prohibit more complex types. In fact, using complex key-value pairs represents a key technique in MapReduce algorithm design, one we introduced at the beginning of this chapter and will encounter frequently throughout the rest of this book.

 

Unfortunately, this algorithm will not work. Recall that combiners must have the same input and output key-value types, which must also match the mapper's output type and the reducer's input type. This is clearly not the case here. To understand why this restriction is necessary in the programming model, remember that combiners are optimizations that must not change the correctness of the algorithm. So let us remove the combiner and see what happens: the mapper's output value type is integer, so the reducer expects to receive a list of integers as values. But the reducer as written actually expects a list of pairs! The correctness of the algorithm is therefore contingent on the combiner running on the mappers' output, and, more precisely, on the combiner running exactly once. Recall from our earlier discussion that Hadoop makes no guarantees about how many times combiners are called: it could be zero, one, or multiple times. This algorithm therefore violates the MapReduce programming model.

1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, pair (r, 1))

 

1: class Combiner
2:   method Combine(string t, pairs [(s1, c1), (s2, c2), ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     Emit(string t, pair (sum, cnt))

 

1: class Reducer
2:   method Reduce(string t, pairs [(s1, c1), (s2, c2), ...])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     ravg ← sum / cnt
9:     Emit(string t, integer ravg)

Figure 3.6: Pseudocode of the MapReduce algorithm for computing the mean of values associated with each key. This algorithm correctly takes advantage of combiners.

 

A fixed version of the algorithm is shown in Figure 3.6, and this time it is correct. In the mapper, we emit as the value a pair consisting of the integer and one; this corresponds to a partial count over a single instance. The combiner separately aggregates the partial sums and the partial counts (as before), and emits pairs with updated sums and counts. The reducer is similar to the combiner, except that the mean is computed at the very end. In essence, this algorithm transforms a non-associative operation (computing the mean of a list of numbers) into an associative operation (element-wise sum of a list of pairs, with an additional division at the end).

 

Let us verify the correctness of this algorithm by repeating the previous exercise: what would happen if no combiners were run? Without combiners, the mappers would send pairs (as values) directly to the reducers. There would be as many intermediate pairs as input key-value pairs, each consisting of an integer and one. The reducer would still arrive at the correct sum and count, and hence the mean would be correct. Now add in the combiners: the algorithm remains correct no matter how many times they run, since the combiners merely aggregate partial sums and counts and pass them along to the reducer. Note that although the combiner's output key-value type must match the reducer's input key-value type, the reducer itself may emit final key-value pairs of a different type.
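The argument above can be checked with a small Python sketch of the Figure 3.6 scheme (illustrative names, not Hadoop code): mappers emit (sum, count) pairs, combiners merge them associatively, and the reducer divides once at the end, so the result is identical with or without the combiner.

```python
from collections import defaultdict

# Mean via associative (sum, count) pairs: combine() and the merging
# step of reduce_means() perform the same element-wise pair addition,
# so running the combiner any number of times cannot change the result.

def map_pairs(records):
    # Each (key, value) record becomes (key, (value, 1)).
    return [(t, (r, 1)) for t, r in records]

def combine(pairs):
    acc = defaultdict(lambda: [0, 0])
    for t, (s, c) in pairs:
        acc[t][0] += s
        acc[t][1] += c
    return [(t, (s, c)) for t, (s, c) in acc.items()]

def reduce_means(pairs):
    acc = defaultdict(lambda: [0, 0])
    for t, (s, c) in pairs:
        acc[t][0] += s
        acc[t][1] += c
    return {t: s / c for t, (s, c) in acc.items()}

records = [("u1", 2), ("u1", 4), ("u2", 10)]
without_combiner = reduce_means(map_pairs(records))
with_combiner = reduce_means(combine(map_pairs(records)))
```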

 

1: class Mapper
2:   method Initialize
3:     S ← new AssociativeArray
4:     C ← new AssociativeArray
5:   method Map(string t, integer r)
6:     S{t} ← S{t} + r
7:     C{t} ← C{t} + 1
8:   method Close
9:     for all term t ∈ S do
10:      Emit(term t, pair (S{t}, C{t}))

Figure 3.7: Pseudocode of the mapper for computing the mean of values associated with each key, using the in-mapper combining pattern.

 

Finally, in Figure 3.7, we present an even more efficient algorithm that exploits the in-mapper combining pattern. Inside the mapper, the partial sums and counts associated with each key are held in memory across input key-value pairs. Intermediate key-value pairs are emitted only after all input has been processed; as before, each value is a pair consisting of the sum and the count. The reducer is exactly the same as in Figure 3.6. Moving partial aggregation from the combiner directly into the mapper is subject to the caveats discussed earlier in this section; in this case, however, the memory footprint of the data structures holding the intermediate data is likely to be modest, which makes this variant of the algorithm quite attractive.
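A Python sketch of the Figure 3.7 mapper (illustrative class, not the Hadoop API): the partial sums S and counts C live in mapper state and are emitted as (sum, count) pairs only in close(), ready for the Figure 3.6 reducer.

```python
from collections import defaultdict

# In-mapper combining for the mean: accumulate per-key sums and counts
# across all inputs, then emit one (sum, count) pair per key at close.

class MeanMapper:
    def initialize(self):
        self.s = defaultdict(int)    # S <- new AssociativeArray
        self.c = defaultdict(int)    # C <- new AssociativeArray

    def map(self, t, r):
        self.s[t] += r
        self.c[t] += 1

    def close(self):
        return [(t, (self.s[t], self.c[t])) for t in self.s]

m = MeanMapper()
m.initialize()
m.map("u1", 2)
m.map("u1", 4)
m.map("u2", 10)
pairs = m.close()
```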
