Data-Intensive Text Processing with MapReduce, Chapter 3 (4): MapReduce Algorithm Design, 3.3 Computing Relative Frequencies


3.3 Computing Relative Frequencies

 

Let us build on the pairs and stripes algorithms and continue with the task of constructing the word co-occurrence matrix M on a large dataset. Recall that in this large n × n matrix, where n = |V| (the vocabulary size), the element mij contains the number of times word wi co-occurs with word wj within a specific context. The drawback of absolute counts is that they do not account for the fact that some words appear more frequently than others: wi may co-occur often with wj simply because one of the two is a very common word. A simple remedy is to convert absolute counts into relative frequencies, f(wj | wi), that is, the proportion of the time that wj appears in the context of wi. This can be computed with the following equation:

 

f(wj | wi) = N(wi, wj) / Σw' N(wi, w')    (3.1)

 

Here, N(·, ·) denotes the number of times a particular co-occurring word pair is observed in the dataset. We need the count of the joint event (word co-occurrence), divided by what is known as the marginal (the sum of the counts of the conditioning variable co-occurring with anything else).
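As a quick worked example, using the counts that appear later in Figure 3.12: if dog and zebra co-occur N(dog, zebra) = 5 times, and the marginal for dog is Σw' N(dog, w') = 42908, then f(zebra | dog) = 5/42908 ≈ 0.00012.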

 

Relative frequencies can be computed directly with the stripes approach. In the reducer, the counts of all words that co-occur with the conditioning variable (wi in the example above) are available in the associative array. It therefore suffices to sum those counts to arrive at the marginal (that is, Σw' N(wi, w')), and then divide every joint count by the marginal to obtain the relative frequencies of all words. This implementation requires only minor modifications to the algorithm shown in Figure 3.9, and it illustrates the use of complex data structures to coordinate distributed computation in MapReduce: by structuring keys and values appropriately, the MapReduce execution framework can be used to bring together all the data needed for a computation. Note that, as before, this algorithm assumes that each associative array fits in memory.
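As a concrete illustration, here is a minimal sketch of such a reducer in Hadoop's Java API, assuming the map phase emits the conditioning word as a Text key and a MapWritable stripe of partial co-occurrence counts as the value; the class name and value types are illustrative, not taken from the book:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

// Stripes reducer sketch: element-wise sum of the partial stripes for wi,
// compute the marginal, then divide every joint count by it to get f(wj | wi).
public class StripesRelativeFrequencyReducer
    extends Reducer<Text, MapWritable, Text, DoubleWritable> {

  @Override
  protected void reduce(Text wi, Iterable<MapWritable> stripes, Context context)
      throws IOException, InterruptedException {
    // Accumulate all partial stripes for wi into one plain map.
    Map<String, Long> total = new HashMap<>();
    for (MapWritable stripe : stripes) {
      for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
        long count = ((IntWritable) e.getValue()).get();
        total.merge(e.getKey().toString(), count, Long::sum);
      }
    }
    // Marginal: sum of the counts of wi co-occurring with anything else.
    long marginal = 0;
    for (long count : total.values()) {
      marginal += count;
    }
    // Emit f(wj | wi) for every wj in the completed stripe.
    for (Map.Entry<String, Long> e : total.entrySet()) {
      context.write(new Text(wi.toString() + "\t" + e.getKey()),
                    new DoubleWritable(e.getValue() / (double) marginal));
    }
  }
}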

 

So how can relative frequencies be computed with the pairs approach? In the pairs approach, the reducer receives (wi, wj) as the key and the count as the value. From this alone, f(wj | wi) cannot be computed, because the marginal is not available. Fortunately, just like the mapper, the reducer can preserve state across multiple keys. Inside the reducer we can buffer in memory all the words that co-occur with wi, together with their counts, in effect building the associative array of the stripes approach. To make this work, we must also define the sort order of the pairs so that keys are sorted first by the left word and then by the right word. Given this ordering, we can easily detect when all pairs associated with the word we are conditioning on (wi) have been encountered. At that point we can compute the relative frequencies from the buffered data and emit the results as final key-value pairs, as in the sketch below.
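A minimal sketch of this buffering reducer, assuming (as an illustrative simplification of a proper pair type) that pair keys are encoded as Text of the form left-word, tab, right-word, with counts arriving as IntWritable values. It also relies on all pairs sharing a left word reaching the same reducer, which the custom partitioner discussed next guarantees:

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Buffers the stripe for the current left word wi; when the left word changes
// (keys arrive sorted by left word, then right word), the buffered counts are
// turned into relative frequencies and flushed.
public class BufferingPairsReducer
    extends Reducer<Text, IntWritable, Text, DoubleWritable> {

  private String currentLeft = null;
  private final Map<String, Long> stripe = new LinkedHashMap<>();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    String[] pair = key.toString().split("\t", 2);
    String left = pair[0];
    String right = pair.length > 1 ? pair[1] : "";
    if (currentLeft != null && !currentLeft.equals(left)) {
      flush(context);                     // a new conditioning word has started
    }
    currentLeft = left;
    long sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    stripe.merge(right, sum, Long::sum);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context);                       // emit the last buffered stripe
  }

  private void flush(Context context) throws IOException, InterruptedException {
    if (stripe.isEmpty()) return;
    long marginal = 0;
    for (long count : stripe.values()) {
      marginal += count;
    }
    for (Map.Entry<String, Long> e : stripe.entrySet()) {
      context.write(new Text(currentLeft + "\t" + e.getKey()),
                    new DoubleWritable(e.getValue() / (double) marginal));
    }
    stripe.clear();
  }
}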

 

One more modification is needed to make this algorithm work. We must ensure that all pairs with the same left word are sent to the same reducer. Unfortunately, this does not happen automatically: recall that the default partitioner is based on the hash value of the intermediate key, modulo the number of reducers. For a complex key, the whole key is used to compute the hash, so there is no guarantee that, for example, (dog, aardvark) and (dog, zebra) are assigned to the same reducer. To produce the desired behavior, we must define a custom partitioner that pays attention only to the left word; that is, it should partition based solely on the hash of the left word.
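Under the same illustrative "left, tab, right" Text encoding of pair keys, such a partitioner might look like this sketch:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes every pair with the same left word to the same reducer by hashing
// only the left element of the pair, modulo the number of reducers.
public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String leftWord = key.toString().split("\t", 2)[0];
    // Mask off the sign bit so the result is always non-negative.
    return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}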

 

This algorithm works, but it suffers from the same drawback as the stripes approach: as the corpus grows, so does the vocabulary, and at some point there may not be enough memory to store all the co-occurring words and their counts for the word we are conditioning on. For computing the co-occurrence matrix, the advantage of the pairs approach is precisely that it does not suffer from memory bottlenecks. Is there a way to modify the basic pairs approach so that this advantage is retained?

 

As it turns out, such an algorithm is indeed possible, although it requires coordinating several mechanisms in MapReduce. The insight lies in properly sequencing the data presented to the reducer. If the marginal could somehow be computed (or otherwise made available) in the reducer before the joint counts are processed, the reducer could simply divide each joint count by the marginal to compute the relative frequency. The notions of "before" and "after" can be captured in the ordering of key-value pairs, which can be explicitly controlled by the programmer: the programmer can define the sort order of keys so that data needed earlier is presented to the reducer before data needed later. However, we still need to compute the marginal counts. Recall that in the basic pairs algorithm, each mapper emits a key-value pair with the co-occurring word pair as the key. To compute relative frequencies, we modify the mapper so that it additionally emits a "special" key of the form (wi, *) with a value of one, representing that pair's contribution to the marginal. Through the use of combiners, these partial marginal counts are aggregated before being sent to the reducers; alternatively, the in-mapper combining pattern can aggregate the marginal counts even more efficiently.
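A sketch of the modified mapper, again using the illustrative Text encoding of pairs; the whitespace tokenization and the co-occurrence window of two words on either side are assumptions made for the example, not prescribed by the text:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For each co-occurring pair (wi, wj) inside a fixed-size window, emit both the
// joint-count key "wi<TAB>wj" and the special marginal key "wi<TAB>*", each
// with a count of one. A sum combiner can pre-aggregate these counts.
public class PairsRelativeFrequencyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int WINDOW = 2;               // assumed context window
  private static final IntWritable ONE = new IntWritable(1);
  private final Text pair = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().toLowerCase().split("\\s+");
    for (int i = 0; i < tokens.length; i++) {
      if (tokens[i].isEmpty()) continue;
      int start = Math.max(0, i - WINDOW);
      int end = Math.min(tokens.length - 1, i + WINDOW);
      for (int j = start; j <= end; j++) {
        if (j == i || tokens[j].isEmpty()) continue;
        pair.set(tokens[i] + "\t" + tokens[j]);
        context.write(pair, ONE);                    // contribution to the joint count
        pair.set(tokens[i] + "\t*");
        context.write(pair, ONE);                    // contribution to the marginal
      }
    }
  }
}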

 

In the reducer, we must make sure that the special key-value pairs representing the partial marginal contributions are processed before the ordinary key-value pairs representing the joint counts. This is accomplished by defining the sort order of the keys so that pairs with the special symbol, of the form (wi, *), are ordered before any other key-value pair whose left word is wi. In addition, as before, we must define a partitioner that pays attention only to the left word of each pair. With the data properly sequenced, the reducer can compute the relative frequencies directly.
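With the Text encoding used in these sketches, the default byte-wise ordering already happens to place "wi<TAB>*" before "wi<TAB>wj" for lowercase alphabetic words, but the required ordering can be made explicit with a sort comparator such as the following sketch (with a proper pair key type, a comparator of this kind would be mandatory):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Orders keys first by left word, then places the special "*" right word ahead
// of every ordinary right word, so the marginal for wi is seen before any of
// wi's joint counts. Set via job.setSortComparatorClass(...).
public class PairSortComparator extends WritableComparator {

  protected PairSortComparator() {
    super(Text.class, true);  // true = create key instances for deserialization
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String[] p1 = ((Text) a).toString().split("\t", 2);
    String[] p2 = ((Text) b).toString().split("\t", 2);
    int byLeft = p1[0].compareTo(p2[0]);
    if (byLeft != 0) return byLeft;
    String r1 = p1.length > 1 ? p1[1] : "";
    String r2 = p2.length > 1 ? p2[1] : "";
    if (r1.equals("*") && !r2.equals("*")) return -1;  // (wi, *) sorts first
    if (!r1.equals("*") && r2.equals("*")) return 1;
    return r1.compareTo(r2);
  }
}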

 

Figure 3.12 shows an example of the sequence of key-value pairs a reducer might encounter. First, the reducer is presented with the special key (dog, *) and a number of values, each representing a partial marginal contribution from the map phase (we assume here that either combiners or in-mapper combining were used, so the values are partially aggregated counts). The reducer accumulates these counts to arrive at the marginal Σw' N(dog, w') and holds on to this value while processing subsequent keys. After (dog, *), the reducer encounters a series of keys representing joint counts; say the first of these is (dog, aardvark). Associated with this key is a list of partial joint counts produced in the map phase (two separate values in this case). Summing these counts yields the final joint count, that is, the number of times dog and aardvark co-occur in the entire collection. At this point, since the reducer already knows the marginal, simple arithmetic suffices to compute the relative frequency. All subsequent joint counts are processed in the same way. When the reducer encounters the next special key-value pair, (doe, *), it resets its internal state and starts accumulating the marginal all over again. Since only the marginal (an integer) needs to be stored, no buffering of individual co-occurring word counts is necessary, and the scalability bottleneck of the previous algorithm is eliminated.
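The corresponding reducer sketch is short, since only the running marginal needs to be remembered; as elsewhere, the Text pair encoding and class name are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Order-inversion reducer: the partitioner sends every pair with the same left
// word to this reducer, and the sort order delivers (wi, *) first, so the
// marginal for wi is known before any joint count for wi arrives.
public class PairsRelativeFrequencyReducer
    extends Reducer<Text, IntWritable, Text, DoubleWritable> {

  private long marginal = 0;

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    String[] pair = key.toString().split("\t", 2);
    if (pair.length > 1 && pair[1].equals("*")) {
      marginal = sum;                                  // (wi, *): reset the marginal
    } else {
      // Joint count (wi, wj): divide by the marginal accumulated for wi.
      context.write(key, new DoubleWritable(sum / (double) marginal));
    }
  }
}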

 

key               values
(dog, *)          [6327, 8514, ...]   compute marginal: Σw' N(dog, w') = 42908
(dog, aardvark)   [2, 1]              f(aardvark | dog) = 3/42908
(dog, aardwolf)   [1]                 f(aardwolf | dog) = 1/42908
...
(dog, zebra)      [2, 1, 1, 1]        f(zebra | dog) = 5/42908
(doe, *)          [682, ...]          compute marginal: Σw' N(doe, w') = 1267
...

Figure 3.12: Example of the sequence of key-value pairs presented to a reducer in the pairs algorithm for computing relative frequencies, illustrating the order inversion design pattern.

 

This design pattern is known as "order inversion" and occurs surprisingly often across many applications. It is so named because, through proper coordination, we can access the result of a computation in the reducer (for example, an aggregate statistic) before processing the data needed for that computation. The key insight is to convert the sequencing of computations into a sorting problem. In most cases an algorithm requires data in some fixed order; by controlling how keys are sorted and how the key space is partitioned, we can present data to the reducer in the order necessary to perform the proper computations. This greatly reduces the amount of partial results the reducer needs to hold in memory.

 

To summarize, the specific application of the order inversion design pattern for computing relative frequencies requires the following (a job configuration sketch tying these requirements together follows the list):

 

1. Emitting a special key-value pair for each co-occurring word pair in the mapper to capture its contribution to the marginal.

2. Controlling the sort order of the intermediate keys so that the special key-value pairs representing the marginal contributions are processed by the reducer before any of the key-value pairs representing the joint word co-occurrence counts.

3. Defining a custom partitioner to ensure that all pairs with the same left word are shuffled to the same reducer.

4. Preserving state across multiple keys in the reducer, first computing the marginal from the special key-value pairs and then dividing the joint counts by the marginal to arrive at the relative frequencies.
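A hypothetical driver showing how these four requirements map onto Hadoop job configuration, using the illustrative class names from the sketches above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Wires together the pairs relative-frequency job: mapper (1), sort order (2),
// partitioner (3), and state-preserving reducer (4).
public class RelativeFrequencyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pairs relative frequency");
    job.setJarByClass(RelativeFrequencyJob.class);

    job.setMapperClass(PairsRelativeFrequencyMapper.class);    // (1) emits (wi, *)
    job.setCombinerClass(IntSumReducer.class);                 // pre-aggregates counts
    job.setSortComparatorClass(PairSortComparator.class);      // (2) (wi, *) first
    job.setPartitionerClass(LeftWordPartitioner.class);        // (3) left word only
    job.setReducerClass(PairsRelativeFrequencyReducer.class);  // (4) order inversion

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}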

As we will see in Chapter 4, this design pattern is also used in building inverted indexes, where it allows the compression parameters for postings lists to be set properly.
