Data-Intensive Text Processing with MapReduce, Chapter 3 (3) -- Computing Relative Frequencies

Source: Internet
Author: User


3.3 Computing Relative Frequencies

Let us continue building the co-occurrence matrix M for a large corpus, using the pairs and stripes algorithms introduced earlier. Recall that in this large n × n matrix, where n = |V| (the vocabulary size), element m_ij contains the number of times word w_i co-occurs with word w_j within a specific context. The drawback of absolute counts is that they ignore the fact that some words appear much more frequently than others. Word w_i may co-occur frequently with w_j simply because one of the two words is very common. A simple remedy is to convert absolute counts into relative frequencies, f(w_j | w_i). In other words, what proportion of the time does w_j appear in the context of w_i? This can be computed with the following equation:

f(w_j | w_i) = N(w_i, w_j) / Σ_w' N(w_i, w')

Here, N(·,·) denotes the number of times a particular co-occurring word pair is observed in the corpus. We need the count of the joint event (the word co-occurrence), divided by what is known as the marginal (the sum of the counts of the conditioning variable co-occurring with anything else).
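As a small worked illustration of this definition, the sketch below computes a relative frequency from a hand-made table of joint counts. The counts and the function name `relative_frequency` are invented for illustration and do not come from the original text.

```python
# Hypothetical joint counts N(w_i, w_j); the numbers are invented for illustration.
joint_counts = {
    ("dog", "aardvark"): 2,
    ("dog", "cat"): 5,
    ("dog", "zebra"): 1,
    ("cat", "dog"): 5,
    ("cat", "mouse"): 15,
}


def relative_frequency(counts, w_i, w_j):
    """Return f(w_j | w_i) = N(w_i, w_j) / sum over w' of N(w_i, w')."""
    # The marginal is the sum of all joint counts whose left word is w_i.
    marginal = sum(n for (left, _), n in counts.items() if left == w_i)
    if marginal == 0:
        raise ValueError(f"{w_i!r} never occurs as a left word")
    return counts.get((w_i, w_j), 0) / marginal


print(relative_frequency(joint_counts, "dog", "cat"))    # 5 / 8
print(relative_frequency(joint_counts, "cat", "mouse"))  # 15 / 20
```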


Computing relative frequencies with the stripes approach is straightforward. In the reducer, the counts of all words that co-occur with the conditioning variable (w_i in the example above) are available in the associative array. It therefore suffices to sum all those counts to arrive at the marginal (that is, Σ_w' N(w_i, w')), and then divide each joint count by the marginal to obtain the relative frequency of every word. This implementation requires only minimal modification to the original stripes algorithm shown in Figure 3.9, and it illustrates how complex data structures can be used to coordinate distributed computations in MapReduce. By structuring keys and values appropriately, one can use the MapReduce execution framework to bring together all the pieces of data required for a computation. Note that, as before, this algorithm assumes that each associative array fits into memory.
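A minimal sketch of such a stripes reducer, written as a plain Python function rather than against any particular MapReduce API. The name `stripes_reduce` and the sample stripes are assumptions made for illustration.

```python
from collections import Counter


def stripes_reduce(w_i, stripes):
    """Merge the partial stripes for w_i, then normalize by the marginal.

    `stripes` is an iterable of dicts, each mapping a co-occurring word to a
    partial count. Returns a dict mapping each word w_j to f(w_j | w_i).
    """
    merged = Counter()
    for stripe in stripes:
        merged.update(stripe)          # element-wise sum of the associative arrays
    marginal = sum(merged.values())    # sum over w' of N(w_i, w')
    return {w_j: n / marginal for w_j, n in merged.items()}


# Two partial stripes for "dog", as two mappers might have emitted them.
result = stripes_reduce("dog", [{"cat": 3, "zebra": 1}, {"cat": 2, "aardvark": 2}])
print(result)
```

The whole merged stripe is held in memory before any output is produced, which is the assumption the text points out.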


How might one compute relative frequencies with the pairs approach? In the pairs approach, the reducer receives (w_i, w_j) as the key and the count as the value. From this alone f(w_j | w_i) cannot be computed, because we do not have the marginal. Fortunately, as in the mapper, the reducer can preserve state across multiple keys. Inside the reducer, we can buffer in memory all the words that co-occur with w_i together with their counts, in essence rebuilding the associative array of the stripes approach. To make this work, we must define the sort order of the pairs so that keys are sorted first by the left word and then by the right word. Given this ordering, we can easily detect when all pairs associated with the word we are conditioning on (w_i) have been encountered. At that point we can go back through the in-memory buffer, compute the relative frequencies, and emit the results as the final key-value pairs.
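The buffering idea can be sketched as a generator over an already sorted stream of pairs. This is a simplified, single-process illustration. The function name and the sample counts are hypothetical, and it assumes that counts for identical keys have already been summed.

```python
def buffering_pairs_reduce(sorted_pairs):
    """Yield ((w_i, w_j), f(w_j | w_i)) from ((w_i, w_j), count) items.

    Assumes the input is sorted by left word, then by right word.
    """
    current_left = None
    buffer = []                              # (w_j, count) for the current w_i

    def flush():
        marginal = sum(n for _, n in buffer)
        for w_j, n in buffer:
            yield (current_left, w_j), n / marginal

    for (w_i, w_j), count in sorted_pairs:
        if w_i != current_left and buffer:   # left word changed: buffer is complete
            yield from flush()
            buffer.clear()
        current_left = w_i
        buffer.append((w_j, count))
    if buffer:                               # flush the final left word
        yield from flush()


pairs = [
    (("cat", "dog"), 5),
    (("cat", "mouse"), 15),
    (("dog", "aardvark"), 2),
    (("dog", "cat"), 5),
    (("dog", "zebra"), 1),
]
results = dict(buffering_pairs_reduce(pairs))
print(results)
```

Note that `buffer` grows with the number of distinct words co-occurring with the current left word, which is exactly the memory cost this approach carries.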


One more modification is necessary to make this algorithm work. We must ensure that all pairs with the same left word are sent to the same reducer. Unfortunately, this does not happen automatically: recall that the default partitioner is based on the hash value of the intermediate key, modulo the number of reducers. For a complex key, the raw byte representation is used to compute the hash value. As a result, there is no guarantee that, for example, (dog, aardvark) and (dog, zebra) are assigned to the same reducer. To produce the desired behavior, we must define a custom partitioner that pays attention only to the left word. That is, the partitioner should partition based on the hash of the left word alone.
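A minimal sketch of such a partitioning function in plain Python. The name `left_word_partition` is hypothetical, and `zlib.crc32` merely stands in for whatever hash function a real framework would use. It is chosen here because Python's built-in `hash()` for strings varies between interpreter runs.

```python
import zlib


def left_word_partition(key, num_reducers):
    """Assign a (left, right) key to a reducer using only the left word."""
    left, _right = key
    # crc32 is deterministic across runs, unlike Python's salted str hash.
    return zlib.crc32(left.encode("utf-8")) % num_reducers


# Both pairs share the left word "dog", so they land on the same reducer.
print(left_word_partition(("dog", "aardvark"), 4))
print(left_word_partition(("dog", "zebra"), 4))
```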


This algorithm will indeed work, but it suffers from the same drawback as the stripes approach: as the corpus grows, so does the vocabulary, and at some point there will not be enough memory to store all the co-occurring words and their counts for the word we are conditioning on. For computing the co-occurrence matrix, the advantage of the pairs approach is that it does not suffer from any memory bottleneck. Is there a way to modify the basic pairs approach so that this advantage is retained?


Figure 3.12: Example of the sequence of key-value pairs presented to the reducer in the pairs algorithm for computing relative frequencies. It illustrates the application of the order inversion design pattern.


As it turns out, such an algorithm is indeed possible, although it requires the coordination of several mechanisms in MapReduce. The insight lies in properly sequencing the data presented to the reducer. If it were possible to somehow compute (or otherwise obtain) the marginal in the reducer before processing the joint counts, the reducer could simply divide the joint counts by the marginal to compute the relative frequencies. The notion of "before" and "after" can be captured in the ordering of key-value pairs, which the programmer can control explicitly. That is, the programmer can define the sort order of keys so that data needed earlier is presented to the reducer before data needed later. However, we still need to compute the marginal counts. Recall that in the basic pairs algorithm, each mapper emits a key-value pair with the co-occurring word pair as the key. To compute relative frequencies, we modify the mapper so that it additionally emits a "special" key of the form (w_i, *) with a value of one, which represents the contribution of the word pair to the marginal. Through the use of combiners, these partial marginal counts are aggregated before being sent to the reducers. Alternatively, the in-mapper combining pattern can be used to aggregate the marginal counts even more efficiently.
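The modified mapper, combined with in-mapper combining, might look like the following sketch. It assumes that the "context" is a fixed window of neighboring tokens and that the string "*" never occurs as a real word. The names `STAR` and `map_with_marginals` are hypothetical.

```python
from collections import Counter

STAR = "*"   # hypothetical marker for the special marginal key


def map_with_marginals(tokens, window=2):
    """In-mapper combining: return partial counts for one document.

    Every co-occurring pair (w_i, w_j) adds 1 to the joint key (w_i, w_j)
    and 1 to the special key (w_i, STAR).
    """
    counts = Counter()
    for i, w_i in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            counts[(w_i, tokens[j])] += 1   # contribution to the joint count
            counts[(w_i, STAR)] += 1        # contribution to the marginal
    return counts


emitted = map_with_marginals(["dog", "cat", "dog", "zebra"], window=1)
for key, value in sorted(emitted.items()):
    print(key, value)
```

By construction, the value accumulated under each (w_i, *) key equals the sum of the joint counts emitted for that w_i.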


In the reducer, we must make sure that the special key-value pairs representing the partial marginal contributions are processed before the normal key-value pairs representing the joint counts. This is accomplished by defining the sort order of the keys so that pairs with the special symbol, of the form (w_i, *), are ordered before any other key-value pair whose left word is w_i. In addition, as before, we must define the partitioner so that it pays attention only to the left word of each pair. With the data properly sequenced, the reducer can compute the relative frequencies directly.
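One way to express this sort order is a key function that compares the left word first and then places the special symbol ahead of every ordinary right word. This is a sketch in plain Python. The names `STAR` and `order_inversion_sort_key` are assumptions made for illustration.

```python
STAR = "*"   # hypothetical marker for the special marginal key


def order_inversion_sort_key(key):
    """Sort by left word; within one left word, the special key comes first."""
    left, right = key
    # False sorts before True, so (left, STAR) precedes every (left, w_j).
    return (left, right != STAR, right)


keys = [
    ("dog", "zebra"),
    ("doge", STAR),
    ("dog", "aardvark"),
    ("dog", STAR),
    ("doge", "bark"),
]
ordered = sorted(keys, key=order_inversion_sort_key)
print(ordered)
```

The explicit boolean avoids relying on where "*" happens to fall in character order: a right word such as "!" would otherwise sort ahead of it.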

Figure 3.12 gives a concrete example, listing the sequence of key-value pairs that a reducer might encounter. First, the reducer is presented with the special key (dog, *) and a number of values, each of which represents a partial marginal contribution from the map phase (we assume here that either combiners or in-mapper combining is used, so the values are partially aggregated counts). The reducer accumulates these counts to arrive at the marginal, Σ_w' N(dog, w'). The reducer holds on to this value while it processes the subsequent keys. After (dog, *), the reducer encounters a series of keys representing joint counts; say the first of these is (dog, aardvark). Associated with this key is a list of partial joint counts from the map phase (two separate values in this case). Summing these counts yields the final joint count, that is, the number of times dog and aardvark co-occur in the entire collection. At this point, since the reducer already knows the marginal, simple arithmetic suffices to compute the relative frequency. All subsequent joint counts are processed in exactly the same way. When the reducer encounters the next special key-value pair, (doge, *), it resets its internal state and starts accumulating the marginal all over again. The memory requirement of this algorithm is minimal, because only the marginal (an integer) needs to be stored. No buffering of individual co-occurring word counts is necessary, and therefore we have eliminated the scalability bottleneck of the previous algorithm.
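The reducer walk-through above can be sketched as follows. The stream and its counts are invented and are not the values shown in Figure 3.12. The names `STAR` and `order_inversion_reduce` are hypothetical.

```python
STAR = "*"   # hypothetical marker for the special marginal key


def order_inversion_reduce(sorted_stream):
    """Yield ((w_i, w_j), f(w_j | w_i)) from (key, partial_counts) items.

    Assumes each (w_i, STAR) key arrives before every (w_i, w_j) key.
    Only the current marginal is kept as state.
    """
    marginal = None
    marginal_word = None
    for (w_i, w_j), partial_counts in sorted_stream:
        total = sum(partial_counts)
        if w_j == STAR:
            # New left word: reset the state to its freshly summed marginal.
            marginal, marginal_word = total, w_i
        else:
            if marginal_word != w_i:
                raise ValueError(f"marginal for {w_i!r} did not arrive first")
            yield (w_i, w_j), total / marginal


stream = [
    (("dog", STAR), [6, 2]),          # marginal for dog: 8
    (("dog", "aardvark"), [1, 1]),
    (("dog", "cat"), [5]),
    (("dog", "zebra"), [1]),
    (("doge", STAR), [4]),            # marginal for doge: 4
    (("doge", "bark"), [3, 1]),
]
results = dict(order_inversion_reduce(stream))
print(results)
```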


This design pattern, called "order inversion", occurs surprisingly often and across applications in many domains. It is so named because, through proper coordination, we can access the result of a computation in the reducer (for example, an aggregate statistic) before processing the data needed for that computation. The key insight is to convert the sequencing of computations into a sorting problem. In most cases an algorithm requires its data in some fixed order: by controlling how keys are sorted and how the key space is partitioned, we can present data to the reducer in the order necessary to perform the computation. This greatly reduces the amount of partial results that the reducer needs to hold in memory.


To summarize, applying the order inversion design pattern to the computation of relative frequencies requires the following:


1. Emit a special key-value pair in the mapper for each co-occurring word pair, to capture its contribution to the marginal.


2. Control the sort order of the intermediate keys, so that the special key-value pairs representing the marginal contributions are processed by the reducer before any of the pairs representing the joint word co-occurrence counts.


3. Define a custom partitioner to ensure that all pairs with the same left word are shuffled to the same reducer.


4. Preserve state across multiple keys in the reducer: first compute the marginal from the special key-value pairs, then divide the joint counts by the marginal to arrive at the relative frequencies.
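The four requirements can be tied together in a small single-process simulation of the map, shuffle and reduce phases. This is only a sketch under simplifying assumptions: co-occurrence here means "immediately followed by", `zlib.crc32` stands in for the framework's hash function, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import groupby

STAR = "*"   # hypothetical marker for the special marginal key


def mapper(tokens):
    # Requirement 1: each co-occurrence emits a joint key and a marginal key.
    for left, right in zip(tokens, tokens[1:]):
        yield (left, right), 1
        yield (left, STAR), 1


def sort_key(key):
    # Requirement 2: within one left word, the special key sorts first.
    return (key[0], key[1] != STAR, key[1])


def partition(key, num_reducers):
    # Requirement 3: only the left word decides which reducer gets the key.
    return zlib.crc32(key[0].encode("utf-8")) % num_reducers


def reducer(sorted_items):
    # Requirement 4: keep the marginal as state across keys, then divide.
    marginal = 0
    for key, group in groupby(sorted_items, key=lambda kv: kv[0]):
        total = sum(count for _, count in group)
        if key[1] == STAR:
            marginal = total
        else:
            yield key, total / marginal


def run(documents, num_reducers=3):
    buckets = defaultdict(list)              # simulated shuffle: one list per reducer
    for doc in documents:
        for key, value in mapper(doc.split()):
            buckets[partition(key, num_reducers)].append((key, value))
    output = {}
    for items in buckets.values():
        items.sort(key=lambda kv: sort_key(kv[0]))
        output.update(reducer(items))
    return output


frequencies = run(["dog cat dog zebra", "dog cat"])
print(frequencies)
```

In this toy corpus "dog" is followed by "cat" twice and by "zebra" once, so f(cat | dog) comes out as 2/3 and f(zebra | dog) as 1/3, whichever simulated reducer the "dog" keys land on.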


As we will see in Chapter 4, this design pattern is also used in inverted index construction, to properly set the compression parameters for postings lists.
