Data-Intensive Text Processing with MapReduce, Chapter 3: MapReduce Algorithm Design (2)

Notes directory for this book: http://www.cnblogs.com/mdyang/archive/2011/06/29/data-intensive-text-prcessing-with-mapreduce-contents.html

3.2 Pairs and Stripes

This section introduces the pairs and stripes data-organization patterns through a common example from natural language processing (NLP): computing a word co-occurrence matrix.

Co-occurrence matrix

In NLP, the co-occurrence matrix is an N×N square matrix, where N is the number of distinct words (the vocabulary) in the corpus to be processed. The matrix element M_ij records the number of times words w_i and w_j co-occur, where co-occurrence means that w_i and w_j appear together within a specified context. The context can be defined in various ways: the same sentence, the same paragraph, the same document, or a window of K consecutive words. Because the co-occurrence relation is symmetric, the upper triangle of M mirrors the lower triangle (split along the main diagonal).
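As a small single-machine illustration (function and variable names are our own, and a symmetric window of K words on each side is assumed as the context definition), the counting can be sketched as:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, k=1):
    """Count (w, u) co-occurrences within a window of k words on either side."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        # Inner loop visits only positions inside the window around i.
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

corpus = "the quick fox saw the lazy fox".split()
m = cooccurrence_counts(corpus, k=1)
```

Note that symmetry falls out automatically: every time `(w, u)` is counted, the loop later counts `(u, w)` as well, so `m[("the", "lazy")] == m[("lazy", "the")]`.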

Computing co-occurrence matrices is a common task in NLP, and the same pattern appears in many other applications:

  1. Text mining
  2. Association mining of combined events (for example, supermarket shopping: if customers who buy A usually also buy B, then goods A and B can be placed close to each other)
  3. Suspicious-behavior detection (discovering associations between combinations of events and abnormal events)

......

The space cost of the co-occurrence matrix is clearly O(N^2), where N is the size of the vocabulary (the number of distinct words). Vocabulary size varies widely across datasets: an English corpus may contain tens of thousands of distinct words, while a web-scale corpus may contain billions. If the vocabulary is small, the co-occurrence matrix fits in the memory of a single machine, which is ideal. But large corpora often have vocabularies so large that the matrix cannot be held in memory; falling back on virtual memory would severely degrade performance, and even though compression can reduce the matrix's footprint, any approach based on a single machine's main memory eventually hits a ceiling. The next two MapReduce-based algorithms are highly scalable and can handle large corpora.

MapReduce-based co-occurrence matrix computation

Figure 3.8 shows the first MapReduce algorithm for computing the co-occurrence matrix, which is based on pairs (hereafter the pairs algorithm).

Figure 3.8 The pairs-based MapReduce co-occurrence matrix algorithm

The mapper takes (docid, doc) pairs as input and emits one ((w, u), 1) key-value pair for every co-occurring word pair (w, u) (see line 5 of the mapper in Figure 3.8). This uses a very intuitive two-level loop: the outer loop iterates over every word in the document, and the inner loop iterates over every word co-occurring with the current word. If the co-occurrence matrix is viewed as a graph, each emitted pair corresponds to counting one edge.

The reducer sums the ((w, u), 1) pairs to obtain the final set of ((w, u), s) pairs. Each element of this set corresponds to one cell of the co-occurrence matrix, so the set as a whole is equivalent to the matrix.
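The pairs algorithm can be sketched as a local simulation (names are our own; a real version would run as Hadoop map and reduce tasks, with the shuffle doing the grouping by key):

```python
from collections import defaultdict

def pairs_map(doc_tokens, k=1):
    """Emit ((w, u), 1) for each co-occurring pair within a k-word window."""
    for i, w in enumerate(doc_tokens):
        for j in range(max(0, i - k), min(len(doc_tokens), i + k + 1)):
            if j != i:
                yield (w, doc_tokens[j]), 1

def pairs_reduce(emitted):
    """Group by key and sum the counts, as the shuffle + reducer would."""
    totals = defaultdict(int)
    for key, count in emitted:
        totals[key] += count
    return dict(totals)

docs = [["a", "b", "a"], ["b", "a"]]
emitted = [kv for doc in docs for kv in pairs_map(doc)]
matrix = pairs_reduce(emitted)
```

Every co-occurrence becomes its own key-value pair, which is exactly why the intermediate output is so large.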

Figure 3.9 shows the second MapReduce algorithm, which is based on stripes (hereafter the stripes algorithm).

Figure 3.9 The stripes-based MapReduce co-occurrence matrix algorithm

The mapper constructs the co-occurrence pairs of the current document much as in Figure 3.8, again using a two-level loop. What changes is how the data is recorded and emitted: for each word (term) w in the current document, the mapper maintains an associative array H such that H{u} records the count of the co-occurrence pair (w, u), and emits (w, H). The reducer merges the (w, H) pairs produced by the mappers through element-wise addition, finally obtaining one (w, H) pair per word, where H stores, for every word u co-occurring with w, the count of (w, u). This set is also equivalent to the co-occurrence matrix.
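A corresponding local sketch of the stripes algorithm (again, names are our own and the distributed machinery is simulated in-process):

```python
from collections import Counter, defaultdict

def stripes_map(doc_tokens, k=1):
    """Emit (w, H), where H is an associative array of co-occurrence counts."""
    stripes = defaultdict(Counter)
    for i, w in enumerate(doc_tokens):
        for j in range(max(0, i - k), min(len(doc_tokens), i + k + 1)):
            if j != i:
                stripes[w][doc_tokens[j]] += 1  # H{u} += 1
    yield from stripes.items()

def stripes_reduce(emitted):
    """Merge all stripes for the same word by element-wise addition."""
    merged = defaultdict(Counter)
    for w, h in emitted:
        merged[w].update(h)  # adds counts key by key
    return merged

docs = [["a", "b", "a"], ["b", "a"]]
emitted = [kv for doc in docs for kv in stripes_map(doc)]
result = stripes_reduce(emitted)
```

One key-value pair per word (rather than per co-occurrence) is what makes the representation more compact, at the cost of each value being a whole associative array.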

Comparison of the pairs algorithm and the stripes algorithm

By comparison, the stripes algorithm's data representation is more compact, while the pairs algorithm generates far more intermediate key-value pairs. The pairs algorithm therefore has many more elements to sort and shuffle during execution.

Both algorithms can use a combiner. For the stripes algorithm the combiner is simple and effective: it just merges associative arrays, and each merged array has at most as many entries as the vocabulary of the corpus. The pairs algorithm benefits far less: a combiner can only merge ((w, u), 1) pairs that share the same key (w, u), and a single mapper's output contains relatively few such repeats.

Both algorithms can also perform aggregation inside the mapper (in-mapper combining); the details are left for the reader to work out. Whichever approach is taken, one property carries over from the combiner case: the key-value pairs produced by the pairs algorithm are sparsely distributed over the key space, so even with partial aggregation few items can actually be merged, since any given (w, u) key rarely repeats within one mapper's output.
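One possible shape of in-mapper combining for the pairs algorithm is sketched below. This is only a sketch of the general technique, not the book's solution: counts are accumulated in mapper-local state and emitted once when the mapper finishes, instead of one ((w, u), 1) pair at a time.

```python
from collections import defaultdict

def pairs_map_with_imc(docs, k=1):
    """Pairs mapper with in-mapper combining over all docs it processes."""
    counts = defaultdict(int)  # mapper-local state, lives across documents
    for doc_tokens in docs:
        for i, w in enumerate(doc_tokens):
            for j in range(max(0, i - k), min(len(doc_tokens), i + k + 1)):
                if j != i:
                    counts[(w, doc_tokens[j])] += 1
    # Equivalent of the mapper's close() hook: emit aggregated pairs once.
    yield from counts.items()

emitted = dict(pairs_map_with_imc([["a", "b", "a"], ["b", "a"]]))
```

With the toy input above, the plain pairs mapper would emit six key-value pairs; in-mapper combining reduces that to two, but only because this tiny input repeats the same pairs heavily, which a sparse real key space does not.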

The scalability of the two algorithms is equally important. The stripes algorithm works only if, for each (docid, doc) input processed by a mapper, the corresponding associative arrays fit in memory; otherwise, falling back on virtual memory would greatly degrade performance. Stripes therefore struggles with very large vocabularies (a corpus of a few GB is generally fine, but TB- or even PB-scale data is doubtful). The pairs algorithm does not have this problem.

Which algorithm performs better? Here are some earlier experimental results. We implemented both algorithms on Hadoop and used them to process 2.27 million documents (5.7 GB in total) from the Associated Press Worldstream (APW) corpus. Before running on Hadoop we performed some preprocessing: XML tags were removed, and the text was tokenized with Lucene, with each word replaced by an integer id so that the documents became purely numeric (for convenience of processing). All experiments ran on a Hadoop cluster with 19 slave nodes, each equipped with two single-core CPUs and two disks. Figure 3.10 shows the running time of the pairs algorithm and the stripes algorithm.

Figure 3.10 Running time of the pairs and stripes algorithms

(The comparison data described in the original text, converted into a table:)

| Algorithm | T | IP | IPV | PCIP | FP |
|-----------|---|----|-----|------|----|
| Stripes | 666 sec (~11 min) | 653 M | 48.1 GB | 28.3 M | 1.69 M |
| Pairs | 3758 sec (~62 min, 5.7× slower) | 2.6 B | 31.2 GB | 1.1 B | 142 M |

Column descriptions:

  1. T = Time consumed by the algorithm
  2. IP = Intermediate key-value Pairs: number of intermediate key-value pairs (mapper output)
  3. IPV = Intermediate key-value Pairs Volume: space occupied by the intermediate key-value pairs
  4. PCIP = Post-Combiner Intermediate key-value Pairs: number of intermediate key-value pairs remaining after the combiner (combiner output)
  5. FP = Final key-value Pairs: number of final key-value pairs (reducer output)

Units:

  1. sec = seconds
  2. min = minutes
  3. B = billion
  4. M = million
  5. GB = gigabytes

The comparison shows that on this experimental dataset the stripes algorithm clearly outperforms the pairs algorithm. Both algorithms exhibit the ideal (linear) complexity: running time scales linearly with input size, and a linear regression yields R² values close to 1.

Another metric is scalability. We measured the performance of the stripes algorithm on clusters of different sizes, using the Amazon EC2 service. The experimental results are shown in Figure 3.11.

Figure 3.11 Scalability of the stripes algorithm as cluster size varies

(The left panel shows the running-time data; the right panel shows the speedup data.)

The left panel shows the running time of the stripes algorithm on clusters of 20 to 80 nodes (in increments of 10), and the right panel shows the speedup computed from those times. The stripes algorithm scales well (linearly): a linear regression on the speedup data gives an R² value close to 1.

Summary

The pairs and stripes algorithms represent two approaches to discovering and counting co-occurring events in an observation set (in this example, building a co-occurrence matrix from a corpus). The ideas behind them can be applied effectively to many classes of problems (such as text processing, data mining, and bioinformatics).

Looking deeper, the pairs algorithm and the stripes algorithm are two ends of a continuum. They differ in the granularity at which co-occurrences are recorded: the pairs algorithm records each co-occurrence individually, while the stripes algorithm groups together all co-occurrences that share the same key. In the stripes algorithm, the vocabulary of the corpus can be partitioned into B buckets (for example, by hashing), so that each original stripe is split into B sub-stripes. This solves the stripes algorithm's memory problem when the vocabulary is large. Note that B = 1 gives the standard stripes algorithm, while B = |V| (where |V| is the number of words in the vocabulary) makes the stripes algorithm equivalent to the pairs algorithm.
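The bucketed middle ground can be sketched as follows, assuming the co-occurring word u is hashed into one of B buckets and the emit key becomes (w, bucket); all names here are our own. B = 1 recovers stripes, and B = |V| behaves like pairs (one word per sub-stripe).

```python
from collections import Counter, defaultdict

def substripes_map(doc_tokens, b, k=1):
    """Emit ((w, bucket), H) sub-stripes, splitting each stripe into b parts."""
    stripes = defaultdict(Counter)
    for i, w in enumerate(doc_tokens):
        for j in range(max(0, i - k), min(len(doc_tokens), i + k + 1)):
            if j != i:
                u = doc_tokens[j]
                bucket = hash(u) % b           # which sub-stripe u falls into
                stripes[(w, bucket)][u] += 1   # key is (word, bucket id)
    yield from stripes.items()

emitted = list(substripes_map(["a", "b", "a", "c"], b=2))
```

Each sub-stripe is at most 1/B of a full stripe on average, so a mapper or reducer only ever needs to hold a fraction of any word's associative array in memory.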
