Data-Intensive Text Processing with MapReduce, Chapter 3: MapReduce Algorithm Design (1)


My apologies: I was supposed to post this update yesterday, but I was so excited about receiving my new phone that I forgot all about it. Sorry!

Table of contents for these book notes: http://www.cnblogs.com/mdyang/archive/2011/06/29/data-intensive-text-prcessing-with-mapreduce-contents.html

Introduction

MapReduce owes much of its power to its simplicity. Programmers only need to prepare the following elements:

  1. Input data
  2. A mapper and a reducer
  3. A partitioner and a combiner (both optional)

On the other hand, this also means that any algorithm you want to implement with MapReduce must be expressible in terms of the MapReduce model (map and reduce).

This chapter uses some simple examples to show how to handle common situations in MapReduce (that is, MapReduce "design patterns"). Several concepts and techniques from this chapter will be used in later chapters.

Synchronization may be the most difficult aspect of MapReduce to understand and master. In fact, there is only one point of synchronization in MapReduce processing: after the mappers finish, the intermediate key-value pairs are grouped, sorted, and sent to the corresponding reducers (shuffle and sort). At every other stage, mappers and reducers run independently of one another. As a result, many factors are outside the programmer's control:

  1. Which node a particular mapper or reducer runs on
  2. When each mapper or reducer starts and finishes
  3. Which mapper processes a particular input key-value pair
  4. Which reducer processes a particular intermediate key-value pair

Even so, programmers still have several means of controlling MapReduce execution and data flow:

  1. Using custom data structures to store and communicate extra information
  2. Running custom preprocessing and post-processing code
  3. Preserving state across inputs inside a Mapper or Reducer object
  4. Controlling how intermediate results are grouped and sorted, and hence the order in which they reach the reducer
  5. Controlling the behavior of the partitioner, and hence which reducer receives which intermediate results (a minimal partitioner sketch follows this list)
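
To make item 5 concrete, here is a minimal sketch assuming Hadoop's Partitioner API; the class name and the first-character keying scheme are my own illustration, not from the book.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route intermediate pairs by the first character of the key, so a given
// reducer always receives a contiguous slice of the key space.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) return 0;
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Such a partitioner would be enabled with job.setPartitionerClass(FirstCharPartitioner.class).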

In fact, many algorithms cannot be expressed as a single map-reduce step. In those cases you need to design the algorithm carefully and split it into multiple map-reduce steps, where the output of one step's reduce stage becomes the input of the next step's map stage.

Many algorithms are iterative: they repeat an operation until a termination condition is met. Iteration conditions are not easy to express in MapReduce, so an external program must be introduced as a "driver" to coordinate MapReduce-based iterations.

This chapter focuses on two points:

  1. Scalability: the algorithm should keep working on ever-larger datasets without hitting bottlenecks
  2. Efficiency: the algorithm should not spend excessive resources (such as network transfer) on the parallelization itself. The ideal is zero parallelization overhead and linear scalability: with the cluster size fixed, doubling the data size doubles the processing time; with the data size fixed, doubling the cluster size halves the computing time.

This chapter is arranged as follows:

Section 3.1 explains the importance of local aggregation and discusses the combiner in detail. Local aggregation merges mapper output to reduce the amount of data that must be transferred over the network. This section also introduces the "in-mapper combining" design pattern.

Section 3.2 uses the example of building word co-occurrence matrices to illustrate two common design patterns: pairs and stripes. These are two common ways of organizing data into key-value structures, and they are broadly useful for extracting co-occurring events from large numbers of observations.

Section 3.3 extends the problem of Section 3.2 to relative frequency matrices in order to introduce the order inversion pattern. Order inversion improves efficiency by turning a sequencing problem in the algorithm into a sorting problem over intermediate results in MapReduce. Algorithms involving aggregate statistics usually need to traverse the dataset more than once (for example, normalization requires one scan to compute the sum and a second scan to do the division); by cleverly defining the sort order (the order inversion pattern), you can avoid both the time cost of repeated traversals and the space cost of holding the statistics.

Section 3.4 introduces a general technique for secondary sorting, which controls the order in which values sharing the same key arrive at the reducer. We call this pattern "value-to-key conversion".

Section 3.5 describes how to perform joins over relational data, covering reduce-side, map-side, and memory-backed joins.

3.1 Local Aggregation

The most important aspect of large-scale data processing is moving the data (in MapReduce, from the mappers that produce intermediate results to the reducers that consume them). Transfer usually involves both disk I/O and network transmission: in Hadoop, intermediate results are written to local disk before being sent to the reducers over the network. Both disk latency and network latency are large overheads relative to CPU speed. Local aggregation merges intermediate results, reducing the amount of data that must be written and transferred and thereby improving algorithm efficiency. Good local aggregation is crucial to an efficient MapReduce algorithm.

Local aggregation can be achieved in two ways:

  1. Designing a suitable combiner to merge the intermediate results produced by each mapper
  2. Preserving state across inputs within a single mapper

Local aggregation can sharply reduce both the number and the size of intermediate key-value pairs, improving algorithm efficiency.

3.1.1 Combiners and In-Mapper Combining

Let's use the word count program of Section 2.2 to illustrate the principle of local aggregation. For convenience, its pseudocode is repeated in Figure 3.1. The mapper emits a (term, 1) pair for each word; the reducer receives these pairs, sums the 1s for each term, and outputs the count of every term.

Figure 3.1 Pseudocode of the basic word count algorithm
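
For readers who prefer runnable code to pseudocode, here is a hedged Java sketch of Figure 3.1 using Hadoop's mapreduce API; the class names and the whitespace tokenizer are illustrative choices, not the book's.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BasicWordCount {
    // Mapper: emit (term, 1) for every word occurrence in the document.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text doc, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(doc.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // one pair per occurrence
            }
        }
    }

    // Reducer: sum the 1s for each term.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text term, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable one : ones) sum += one.get();
            context.write(term, new IntWritable(sum));
        }
    }
}
```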

One way to perform local aggregation is with a combiner (mentioned in Section 2.4). The combiner is an optimization mechanism for reducing the number of intermediate key-value pairs produced by the mappers. A combiner can be understood as a "mini-reducer" that processes the output of a single mapper. In this example, after the combiner runs, the number of key-value pairs emitted per mapper drops from the total number of words to the number of distinct words.

Figure 3.2 shows an improved version of the basic word count algorithm. Only the mapper changes, so only the modified mapper pseudocode is listed (the reducer is omitted):

Figure 3.2 Improved word count algorithm using an associative array as the word counter; counting is scoped to a single document

In the mapper, the algorithm uses an associative array (a Map in Java) to count words, with the counting scoped to one document. After this improvement, the mapper no longer emits a key-value pair for every word occurrence; it emits one pair per distinct word. Since almost every document contains many repeated words (such as "the"), this substantially shrinks the intermediate results, especially for long documents.
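
As a hedged Java sketch (my own, standing in for the book's Figure 3.2 pseudocode), the per-document aggregation looks like this:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerDocumentAggregatingMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text doc, Context context)
            throws IOException, InterruptedException {
        // Count words locally; the array lives only for this one document.
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer itr = new StringTokenizer(doc.toString());
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        // Emit one (word, count) pair per distinct word in this document.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}
```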

We can go further by exploiting a feature of Hadoop programming. As noted in Section 2.6, each map task creates a single Mapper object (a Java object), and a custom initialization hook (initialize) can run before that object processes any input. In this example, the initialization step is "create an empty associative array". We can then widen the scope of the associative array so that it is no longer limited to a single document but spans all the documents the Mapper object processes. After the mapper has finished all of its documents, a user-defined termination hook (close) runs; here, close is defined as "iterate over the associative array and emit the intermediate results". The improved algorithm is shown in Figure 3.3.

Figure 3.3 Further improved word count algorithm

Note that this algorithm has no separate combiner at all: by folding the combining work into the mapper itself, we achieve everything local aggregation can achieve for this algorithm. A hedged Java sketch of this pattern follows.
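
This sketch assumes Hadoop's setup()/cleanup() hooks as the "initialize"/"close" operations described above; the class name is illustrative.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        counts = new HashMap<>(); // "initialize": an empty associative array
    }

    @Override
    protected void map(LongWritable key, Text doc, Context context) {
        // Aggregate across all documents this map task sees; emit nothing yet.
        StringTokenizer itr = new StringTokenizer(doc.toString());
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // "close": emit the accumulated counts once, at the end of the task.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}
```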

Advantages of in-mapper combining

We call this pattern "in-mapper combining". It has two advantages:

  1. Precise control over when (and whether) combining happens and how it is done. By contrast, the execution semantics of a separate combiner are not strictly defined in MapReduce: the execution framework treats the combiner as an optional optimization step and may run it zero, one, or several times, outside the programmer's control. This is why many programmers implement their own combining inside the mapper rather than relying on a separate combiner.
  2. In-mapper combining is more efficient than a separate combiner. Although a separate combiner reduces the number of intermediate key-value pairs transferred over the network, it cannot reduce the number of pairs the mapper creates in the first place (because it runs after the map step completes). It therefore cannot reduce the number of objects created and destroyed (object creation and garbage collection are real overheads, and with many objects they add up). Moreover, since the mapper's intermediate results are first written to disk, a flood of objects also increases object serialization/deserialization work and disk I/O. In-mapper combining addresses all of these problems.

Disadvantages of in-mapper combining

However, in-mapper combining also has drawbacks:

  1. In essence, in-mapper combining works by keeping state inside the Mapper object, which breaks the functional-programming underpinnings of MapReduce (see the definition of a pure function: http://en.wikipedia.org/wiki/Pure_function). In practice, trading a little theoretical purity for efficiency is very common, and for in-mapper combining it is entirely worthwhile.
  2. Preserving state across multiple inputs (key-value pairs, in MapReduce) means the algorithm's behavior may depend on the order of the input. This can introduce bugs, and such bugs are hard to debug on large datasets (though for word count the input order obviously does not affect correctness).
  3. Limited scalability: in-mapper combining is bounded by available memory. For word count, if a mapper's input contains a great many words (distinct words, that is, unique words), the associative array holding the counts inside the Mapper object grows large. The algorithm of Figure 3.3 therefore has a ceiling: the maximum memory available to a single mapper node.

There are ways around this, such as periodically writing in-memory data to disk. For word count, if there is not enough memory for a mapper to aggregate its entire input, the input can be processed in blocks of n key-value pairs: after every n pairs, the associative array holding the current counts is written out and cleared. The memory occupied by the associative array effectively acts as a buffer, and the buffer size has a large effect on performance; since memory is limited, the buffer cannot be made arbitrarily large, so it must be chosen carefully for the available computing resources and the algorithm. A hedged sketch of this variant appears below.
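
A minimal sketch of the block-based variant, flushing by the number of distinct keys held in memory rather than a strict count of input pairs; FLUSH_EVERY is an assumed tuning parameter, not a value from the book.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BlockedInMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int FLUSH_EVERY = 100_000; // assumed buffer bound
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text doc, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(doc.toString());
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
            if (counts.size() >= FLUSH_EVERY) flush(context); // bound memory
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        flush(context); // emit whatever remains at end of task
    }

    private void flush(Context context)
            throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
        counts.clear();
    }
}
```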

Summary of local aggregation

How much of a performance gain does local aggregation bring? That depends on many factors:

  1. The size of the intermediate key space
  2. The distribution of keys across that space
  3. The number of key-value pairs emitted by each mapper

......

The core of the performance gain from local aggregation is combining key-value pairs that share the same key.

Local aggregation can also mitigate skew in MapReduce in some cases. For example, without local aggregation in word count, the number of key-value pairs assigned to different reducers can differ dramatically (imagine reducer 1 receiving every ("the", 1) pair while reducer 2 receives every ("dog", 1) pair; reducer 1 then gets vastly more pairs than reducer 2), which can leave some reduce tasks lagging far behind the others.

3.1.2 Algorithm Correctness with Local Aggregation

Two constraints on combiners

For a combiner:

  1. Because combiner execution is managed by the MapReduce execution framework (which may run it any number of times, including zero), correctness requires that the algorithm's result be independent of the combiner: whether or not the combiner runs, the intermediate results must lead to the same final output.
  2. In MapReduce, the mapper's output is the reducer's input, so the two formats must agree. The combiner must preserve this premise: its input and its output must both be in the mapper's output format, which is the same as the reducer's input format.

An averaging algorithm

First, consider a simple example: given a pile of (string, integer) key-value pairs, output the mean of the integer values for each distinct key. To make this easier to follow, here is a realistic application: analyzing the user access logs of a large website. Each log entry is a (user id, session duration) key-value pair recording one visit to the site, where the user id identifies the user and the duration is the length of the session. The task is to compute each user's average session duration from the logs.

Figure 3.4 shows the basic algorithm for solving this problem.

Figure 3.4 A simple MapReduce algorithm for computing the mean of the values for each key

This algorithm has no combiner. The mapper is a simple identity function that passes each (t, r) pair straight through. The reducer receives (t, [r1, r2, ...]) and computes the mean r value for each t. A hedged sketch of this baseline follows.
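
A minimal Java sketch of Figure 3.4, assuming the log has already been parsed into (Text userId, LongWritable duration) pairs; the class names and input format are my own assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveMean {
    // Identity mapper: pass every (user, duration) pair straight through.
    public static class IdentityMapper
            extends Mapper<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void map(Text user, LongWritable duration, Context context)
                throws IOException, InterruptedException {
            context.write(user, duration);
        }
    }

    // Reducer: average the durations for each user.
    public static class MeanReducer
            extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text user, Iterable<LongWritable> durations, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (LongWritable d : durations) { sum += d.get(); count++; }
            context.write(user, new DoubleWritable((double) sum / count));
        }
    }
}
```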

The algorithm is obviously correct and usable, but it has the same weakness as the basic word count algorithm of Section 3.1: transferring every key-value pair the mappers generate across the network to the reducers is very inefficient. Moreover, unlike word count, the reducer cannot simply be reused as the combiner. If you don't believe it, try this: let the combiner compute the mean of its local subset and pass that on as an intermediate result, and let the reducer compute the final mean from those partial means. One counterexample built on this rule is enough to show that it is wrong:

mean(1, 2, 3, 4, 5) ≠ mean(mean(1, 2), mean(3, 4, 5))
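
Working the numbers through: mean(1, 2, 3, 4, 5) = 15/5 = 3, whereas mean(mean(1, 2), mean(3, 4, 5)) = mean(1.5, 4) = 2.75.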

Averaging algorithm: an incorrect use of the combiner

So how should a combiner be applied to this algorithm? Let's make an attempt. Figure 3.5 shows a version whose results are correct but which runs contrary to MapReduce design principles.

Figure 3.5 An algorithm whose results are correct but which violates MapReduce design principles

By keeping the sum and the count separate, the algorithm of Figure 3.5 can compute the mean r value for each t. This is our first use of a custom data type: so far, our keys and values have all been primitive types. Custom data types let us build more complex values and bundle together the data a computation needs.

But the algorithm is still incorrect, because it violates the combiner design principle above (the mapper's output format must match the reducer's input format, and the combiner's input and output must both be in that format).

Once again, combiner execution is controlled by the MapReduce execution framework. In the example of Figure 3.5, the reducer receives completely different data types depending on whether the combiner runs: (t, [r1, r2, ...]) if it does not, and (t, [(s1, c1), (s2, c2), ...]) if it does. Worse, if the framework runs the combiner more than once, a later invocation is handed (t, [(s1, c1), (s2, c2), ...]) when it expects (t, [r1, r2, ...]). In short, the combiner's input and output types are inconsistent, and that is this algorithm's error.

Averaging algorithm: correcting the combiner

Figure 3.6 shows an improved averaging algorithm that addresses the problems in the algorithm of Figure 3.5.

Figure 3.6 The averaging algorithm corrected for the problems in Figure 3.5

By changing the mapper's output type, the mapper output, combiner input/output, and reducer input types are made consistent: the mapper now emits (t, (r, 1)) pairs. The algorithm of Figure 3.6 is therefore a correct, usable MapReduce algorithm. A hedged Java sketch follows.
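
This sketch (mine, not the book's) uses a small custom Writable as the (sum, count) pair the text describes; since the combiner's input and output types now match, the framework may safely run it zero, one, or many times.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CorrectMean {
    // Custom value type holding a partial (sum, count).
    public static class SumCountWritable implements Writable {
        public long sum;
        public long count;
        public SumCountWritable() {}
        public SumCountWritable(long sum, long count) { this.sum = sum; this.count = count; }
        @Override public void write(DataOutput out) throws IOException {
            out.writeLong(sum); out.writeLong(count);
        }
        @Override public void readFields(DataInput in) throws IOException {
            sum = in.readLong(); count = in.readLong();
        }
    }

    // Mapper: emit (user, (duration, 1)) -- already in the combiner's format.
    public static class MeanMapper
            extends Mapper<Text, LongWritable, Text, SumCountWritable> {
        @Override protected void map(Text user, LongWritable duration, Context context)
                throws IOException, InterruptedException {
            context.write(user, new SumCountWritable(duration.get(), 1));
        }
    }

    // Combiner: fold partial (sum, count) pairs; input and output types match.
    public static class MeanCombiner
            extends Reducer<Text, SumCountWritable, Text, SumCountWritable> {
        @Override protected void reduce(Text user, Iterable<SumCountWritable> parts, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (SumCountWritable p : parts) { sum += p.sum; count += p.count; }
            context.write(user, new SumCountWritable(sum, count));
        }
    }

    // Reducer: the division happens only here, once per key.
    public static class MeanReducer
            extends Reducer<Text, SumCountWritable, Text, DoubleWritable> {
        @Override protected void reduce(Text user, Iterable<SumCountWritable> parts, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (SumCountWritable p : parts) { sum += p.sum; count += p.count; }
            context.write(user, new DoubleWritable((double) sum / count));
        }
    }
}
```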

In-mapper combining for the mean

Figure 3.7 shows yet another locally aggregated averaging algorithm, this time using in-mapper combining.

Figure 3.7 Computing the mean with in-mapper combining

Concretely, the Mapper object maintains two associative arrays: S, which accumulates the sum of the values seen for each key t, and C, which counts how many values have been seen for each t. After the mapper has processed all of its data, it emits (t, (S{t}, C{t})) for every observed key t. A hedged sketch appears below.
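
A minimal sketch of the Figure 3.7 mapper, assuming Hadoop's setup()/cleanup() hooks and reusing the illustrative SumCountWritable pair type from the previous sketch.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMeanMapper
        extends Mapper<Text, LongWritable, Text, CorrectMean.SumCountWritable> {
    private Map<String, Long> sums;   // S: running sum per key t
    private Map<String, Long> counts; // C: number of values per key t

    @Override protected void setup(Context context) {
        sums = new HashMap<>();
        counts = new HashMap<>();
    }

    @Override protected void map(Text user, LongWritable duration, Context context) {
        // Accumulate state across inputs; nothing is emitted here.
        String t = user.toString();
        sums.merge(t, duration.get(), Long::sum);
        counts.merge(t, 1L, Long::sum);
    }

    @Override protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit (t, (S{t}, C{t})) once per observed key.
        for (String t : sums.keySet()) {
            context.write(new Text(t),
                    new CorrectMean.SumCountWritable(sums.get(t), counts.get(t)));
        }
    }
}
```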
