Data-Intensive Text Processing with MapReduce, Chapter 3: MapReduce Algorithm Design, Section 3.2: Pairs and Stripes


3.2 Pairs and Stripes

 

One common approach for synchronization in MapReduce is to construct complex keys and values in such a way that the data needed for a computation is naturally brought together by the execution framework. We covered this technique in the previous chapter, where the sum and count were "packaged" into a composite value (i.e., a pair) that is passed from mapper to combiner to reducer. Building on previously published work [54, 94], this section introduces two common design patterns we have dubbed "pairs" and "stripes" that exemplify this strategy.

 

As a running example, we focus on the problem of building word co-occurrence matrices from large corpora, a common task in corpus linguistics and statistical natural language processing. Formally, the co-occurrence matrix of a corpus is a square n × n matrix, where n is the number of distinct words in the corpus (i.e., the vocabulary size). A cell mij contains the number of times word wi co-occurs with word wj within a specific context: a sentence, a paragraph, a document, or a window of m words, where m is an application-dependent parameter. Note that the upper and lower triangles of the matrix are identical if co-occurrence is a symmetric relation, although in general relations between words need not be symmetric. For example, a co-occurrence matrix M where mij counts the number of times word wi was immediately followed by word wj would not be symmetric.

 

This task is common in text processing and often serves as a starting point for other algorithms, e.g., computing pointwise mutual information or performing unsupervised clustering. Indeed, much of the work in lexical semantics is based on distributional models of word meaning, tracing back to Firth [55] and Harris [69] in the 1950s and 1960s. The task also has applications in information retrieval (e.g., thesaurus construction and query expansion) and in related fields such as text mining. More importantly, this problem represents a specific instance of a more general task: estimating distributions of discrete joint events from a large number of observations, a very common task in statistical natural language processing for which there are good MapReduce solutions. Indeed, the concepts presented here are also used in Chapter 6 when we discuss expectation-maximization algorithms.

 

Beyond text processing, problems in many application domains share similar characteristics. For example, a large retailer might analyze point-of-sale transaction records to identify correlations between products purchased together (e.g., customers who buy this product also tend to buy that one), which can inform inventory management and product placement on shelves. Similarly, an intelligence analyst might wish to identify associations between recurring financial transactions, which could provide clues to illicit activity. The algorithms discussed in this section can be adapted to tackle these related problems.

 

It is obvious that the space requirement for the word co-occurrence problem is O(n²), where n is the vocabulary size, which can be very large: real-world English corpora can easily exceed a hundred thousand distinct words, and web-scale collections contain far more. Computing the matrix is straightforward if it fits entirely in memory; however, when the matrix is too large for memory, the counts must be spilled to disk. Although compressed representations can increase the size of matrices that a single machine can handle, a single-machine approach is clearly limited in scalability. We present two MapReduce algorithms for this task that scale to large datasets.

 

class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            for all term u ∈ Neighbors(w) do
                Emit(pair (w, u), count 1)    // emit count for each observed co-occurrence

class Reducer
    method Reduce(pair p, counts [c1, c2, ...])
        s ← 0
        for all count c ∈ counts [c1, c2, ...] do
            s ← s + c                         // sum the partial counts
        Emit(pair p, count s)

Figure 3.8: Pseudocode for the "pairs" approach for computing word co-occurrence matrices from large corpora.

 

Figure 3.8 shows pseudocode for the first algorithm, known as "pairs". As usual, document IDs and the corresponding contents make up the input key-value pairs. The mapper processes each input document and emits intermediate key-value pairs with each co-occurring word pair as the key and the integer one (the count) as the value. This is accomplished with two nested loops: the outer loop iterates over all words (the left element of the pair), and the inner loop iterates over all neighbors of the first word (the right element of the pair). The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Thus, in this case, the reducer simply sums all the values associated with the same co-occurring word pair to arrive at the absolute count of that joint event in the corpus, which is then emitted as the final key-value pair. Each final key-value pair corresponds to one cell of the word co-occurrence matrix. This algorithm illustrates the use of complex keys to coordinate distributed computation.
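To make the dataflow concrete, the pairs approach can be sketched as plain Python, simulating the map, shuffle, and reduce phases in memory. The whitespace tokenization, the symmetric window definition of Neighbors, and all function names here are illustrative assumptions, not part of the original specification.

```python
from collections import defaultdict

def pairs_map(doc, window=2):
    """Emit ((w, u), 1) for every pair of words co-occurring in doc.
    Neighbors of w are the words within `window` positions of it."""
    terms = doc.split()
    for i, w in enumerate(terms):
        lo, hi = max(0, i - window), min(len(terms), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (w, terms[j]), 1

def pairs_reduce(pair, counts):
    """Sum all partial counts for one co-occurring word pair."""
    return pair, sum(counts)

def run_pairs(docs, window=2):
    """Simulate the shuffle: group intermediate values by key,
    then run the reducer on each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in pairs_map(doc, window):
            groups[key].append(value)
    return dict(pairs_reduce(k, v) for k, v in groups.items())
```

For example, `run_pairs(["a b a"], window=1)` yields counts of 2 for both ("a", "b") and ("b", "a"), since "b" is observed next to an "a" on each side.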

 

class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            H ← new AssociativeArray
            for all term u ∈ Neighbors(w) do
                H{u} ← H{u} + 1               // tally words co-occurring with w
            Emit(term w, stripe H)

class Reducer
    method Reduce(term w, stripes [H1, H2, H3, ...])
        Hf ← new AssociativeArray
        for all stripe H ∈ stripes [H1, H2, H3, ...] do
            Sum(Hf, H)                        // element-wise sum of associative arrays
        Emit(term w, stripe Hf)

Figure 3.9: Pseudocode for the "stripes" approach for computing word co-occurrence matrices from large corpora. Note: the element-wise sum Sum(Hf, H) adds the counts of corresponding keys in the two associative arrays.

An alternative approach, dubbed "stripes", is presented in Figure 3.9. Like pairs, co-occurring word pairs are generated by two nested loops. The major difference, however, is that instead of emitting an intermediate key-value pair for each co-occurring word pair, co-occurrence information is first stored in an associative array, denoted H. The mapper emits key-value pairs with words as keys and the corresponding associative arrays as values, where each associative array records the counts of a word's neighbors (i.e., the words appearing in its context). The MapReduce execution framework guarantees that all associative arrays with the same key are brought together in the reduce phase, where the reducer performs an element-wise sum of all associative arrays with the same key, accumulating counts that correspond to the same cell of the co-occurrence matrix. The final associative array is emitted with the same word as the key. In contrast to the pairs approach, each final key-value pair under stripes encodes an entire row of the co-occurrence matrix.
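The stripes dataflow can likewise be sketched in Python; each mapper output value is a dictionary standing in for the associative array H, and the reducer performs the element-wise sum. As before, the tokenization, window definition, and function names are illustrative assumptions.

```python
from collections import defaultdict

def stripes_map(doc, window=2):
    """For each word w, emit (w, H) where H maps each neighbor of w
    in this document to the number of times it co-occurs with w."""
    terms = doc.split()
    for i, w in enumerate(terms):
        stripe = defaultdict(int)
        lo, hi = max(0, i - window), min(len(terms), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                stripe[terms[j]] += 1
        yield w, dict(stripe)

def stripes_reduce(term, stripes):
    """Element-wise sum of all stripes emitted for one word."""
    total = defaultdict(int)
    for stripe in stripes:
        for u, count in stripe.items():
            total[u] += count
    return term, dict(total)

def run_stripes(docs, window=2):
    """Simulate the shuffle: group stripes by word, then reduce."""
    groups = defaultdict(list)
    for doc in docs:
        for w, stripe in stripes_map(doc, window):
            groups[w].append(stripe)
    return dict(stripes_reduce(w, s) for w, s in groups.items())
```

Note that each key-value pair out of the reducer is one full row of the matrix, e.g., the row for "a" is a dictionary of all of its co-occurring words and counts.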

 

It is immediately obvious that the pairs algorithm generates an enormous number of key-value pairs compared to the stripes approach. The stripes representation is more compact, since with pairs the left element is repeated for every co-occurring word pair. The stripes approach also generates fewer and shorter intermediate keys, and therefore the execution framework has less sorting to perform. However, values in the stripes approach are more complex and incur more serialization and deserialization overhead than with pairs.

 

Both algorithms can benefit from the use of combiners, since the respective operations in their reducers (addition and element-wise summation of associative arrays) are both commutative and associative. However, combiners with the stripes approach have more opportunities to perform local aggregation because the key space is the vocabulary: associative arrays can be merged whenever a word is encountered multiple times by a mapper. In contrast, the key space in the pairs approach is the cross product of the vocabulary with itself, so a combiner can aggregate only when the same co-occurring word pair is observed multiple times by an individual mapper (which is less likely than observing the same word, as in the stripes case).

 

For both algorithms, the in-mapper combining optimization discussed in the previous chapter can also be applied; since the modifications are relatively straightforward, we leave them as an exercise for the reader. However, the caveats above still apply: with pairs there will be far fewer opportunities for partial aggregation because the space of intermediate keys is so large. The sparsity of that key space also limits the effectiveness of in-mapper combining, since the mapper may run out of memory before all documents have been processed, forcing it to periodically emit key-value pairs (which further reduces opportunities for partial aggregation). Similarly, for the stripes approach, memory management is more complex than in the simple word-count example: for common words the associative array may grow very large, necessitating some mechanism to periodically flush in-memory structures.
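One way the in-mapper combining idea for pairs might look, with the periodic flushing described above, is sketched below. The class name, the flush threshold, and the local-buffer design are all illustrative assumptions; a real Hadoop implementation would emit through the framework's context object rather than append to a list.

```python
from collections import defaultdict

class PairsInMapperCombiner:
    """Pairs mapper with in-mapper combining: counts are accumulated in a
    local dictionary and flushed once it exceeds `max_entries`, bounding
    memory at the cost of fewer aggregation opportunities."""

    def __init__(self, window=2, max_entries=1_000_000):
        self.window = window
        self.max_entries = max_entries
        self.buffer = defaultdict(int)
        self.emitted = []                  # stand-in for the framework's Emit()

    def map(self, doc):
        terms = doc.split()
        for i, w in enumerate(terms):
            lo = max(0, i - self.window)
            hi = min(len(terms), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    self.buffer[(w, terms[j])] += 1    # aggregate locally
        if len(self.buffer) > self.max_entries:
            self.flush()                   # bound memory: emit partial counts

    def flush(self):
        for pair, count in self.buffer.items():
            self.emitted.append((pair, count))
        self.buffer.clear()

    def close(self):
        self.flush()                       # called once after the last document
```

The reducer is unchanged from Figure 3.8, since the emitted partial counts are still summed per pair downstream.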

 

It is important to consider the potential scalability bottlenecks of each algorithm. The stripes approach makes the assumption that, at any point in time, each associative array is small enough to fit into memory; otherwise, memory paging will significantly impact performance. The size of an associative array is bounded by the vocabulary size, which is itself unbounded with respect to corpus size (recall the earlier discussion). Therefore, as the corpus grows, this can become a pressing issue: perhaps not for gigabytes of data, but certainly at the terabyte and petabyte scales that will increasingly be common. The pairs approach, on the other hand, does not suffer from this limitation, since it does not need to hold intermediate data in memory.

 

Given this discussion, which approach is faster? Here we cite previously published results [94] that answer this question. Both algorithms were implemented in Hadoop and applied to a corpus of 2.27 million documents from the Associated Press Worldstream (APW). Prior to working with Hadoop, the corpus was preprocessed as follows: all XML markup was removed, followed by tokenization and stopword removal using standard tools from the Lucene search engine; all tokens were then replaced with unique integers. Figure 3.10 compares the running time of the pairs and stripes approaches on different fractions of the corpus. The experiments were performed on a Hadoop cluster with 19 slave nodes, each with two processors and two disks.

 

The results demonstrate that the stripes approach is far faster than the pairs approach: 666 seconds (about 11 minutes) compared to 3758 seconds (about 62 minutes) for the full corpus. The mappers in the pairs approach generated 2.6 billion intermediate key-value pairs totaling 31.2 GB. After the combiners, this was reduced to 1.1 billion key-value pairs, which quantifies the amount of intermediate data transferred across the network. In the end, the reducers emitted a total of 142 million final key-value pairs (the number of non-zero cells in the co-occurrence matrix). With the stripes approach, the mappers generated 653 million intermediate key-value pairs totaling 48.1 GB. After the combiners, only 28.8 million key-value pairs remained; the reducers emitted a total of 1.69 million final key-value pairs (the number of rows in the co-occurrence matrix). As expected, the stripes approach provides more opportunities for combiners to aggregate intermediate results, thus greatly reducing network traffic in the shuffle and sort phase. Figure 3.10 also shows that both algorithms exhibit highly desirable scaling characteristics: linear in the size of the input data. This is confirmed by a linear regression on the running times, which yields an R² value close to one.

 

Figure 3.10: Running times of the pairs and stripes algorithms for computing word co-occurrence matrices on different percentages of the APW corpus. The experiments were performed on a Hadoop cluster with 19 slave nodes, each with two processors and two disks.

 

Figure 3.11: (left) Running time of the stripes algorithm on the APW corpus on Hadoop clusters of different sizes, built from EC2 instances. (right) Scaling characteristics (relative speedup) as a function of cluster size.

 

An additional series of experiments explored the scalability of the stripes approach along another dimension: the size of the cluster. These experiments were made possible by Amazon EC2, which allows users to rapidly provision clusters of varying sizes. Virtualized computational units in EC2 are called instances, and users are charged based on instance usage time. Figure 3.11 (left) shows the running time of the stripes algorithm (on the same corpus as above) on clusters ranging from 20 "small" instances to 80 "small" instances (along the x-axis). Running times are shown with solid squares. Figure 3.11 (right) recasts the same results to illustrate scaling characteristics: the circles plot the relative speedup in the EC2 experiments, relative to the 20-instance cluster. These results show highly desirable linear scaling characteristics: doubling the cluster size halves the job running time. This is confirmed by a linear regression with an R² value close to one.

 

Viewed abstractly, the pairs and stripes algorithms represent two different approaches to counting co-occurring events from a large number of observations. These two approaches capture the essence of many algorithms in a variety of fields, including text processing, data mining, and the analysis of complex biological data. For this reason, these two design patterns are broadly useful across many different applications.

 

In the most general terms, the pairs approach individually records each co-occurring event, while the stripes approach records all events that co-occur with a common event. There is actually a middle ground between the two: we could divide the vocabulary into b buckets (e.g., by hashing), so that the words co-occurring with wi are split across b smaller "sub-stripes", associated with b separate keys (wi, 1), (wi, 2), ..., (wi, b). This is a reasonable way to address the memory limitations of the stripes approach, since each sub-stripe is smaller. With b = |V|, where |V| is the vocabulary size, this is equivalent to the pairs approach; with b = 1, it is equivalent to the standard stripes approach.
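The sub-stripes middle ground described above can be sketched as a variant of the stripes mapper whose key is a (word, bucket) pair. The bucket count, the use of Python's built-in hash for bucketing, and the function name are illustrative assumptions.

```python
from collections import defaultdict

def substripes_map(doc, num_buckets=4, window=2):
    """Middle ground between pairs and stripes: the neighbors of each
    word w are split across `num_buckets` sub-stripes, each emitted
    under the composite key (w, bucket id)."""
    terms = doc.split()
    for i, w in enumerate(terms):
        buckets = defaultdict(lambda: defaultdict(int))
        lo, hi = max(0, i - window), min(len(terms), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                u = terms[j]
                b = hash(u) % num_buckets     # assign neighbor to a bucket
                buckets[b][u] += 1
        for b, sub in buckets.items():
            yield (w, b), dict(sub)           # each sub-stripe is smaller than the full stripe
```

The reducer is the same element-wise sum as in Figure 3.9, applied per (word, bucket) key; with num_buckets=1 this degenerates to standard stripes, and with one bucket per vocabulary word it degenerates to pairs.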
