Data-Intensive Text Processing with MapReduce, Chapter 3 (5): MapReduce Algorithm Design, 3.4 Secondary Sorting

Source: Internet
Author: User

3.4 Secondary Sorting

During the shuffle and sort phase, MapReduce sorts intermediate key-value pairs by key. This is very convenient when the computation in the reducer depends on sort order (for example, the order inversion pattern described in the previous section). However, what if we also need to sort by value in addition to sorting by key? Google's MapReduce implementation provides a built-in (optional) secondary sorting mechanism, which guarantees that values arrive in sorted order. Unfortunately, Hadoop does not have this capability built in.

Consider the example of sensor data from a scientific experiment: there are m sensors, each taking readings continuously, where m may be a large number. A dump of the sensor data might look like the following, where the rx after each timestamp represents the actual sensor reading (unimportant for this discussion, but it may be a series of values, one or more complex records, or even the raw bytes of an image file).

(T1, M1, r80521)

(T1, M2, r14209)

(T1, M3, r76042)

...

(T2, M1, r21823)

(T2, M2, r66508)

(T2, M3, r98347)

Suppose we want to reconstruct the activity of each sensor over time. A MapReduce program to accomplish this might map over the raw data and emit the sensor ID as the intermediate key, with the rest of each record as the value:

M1 → (T1, r80521)

In this way, all readings from the same sensor are brought together in the reducer. However, because MapReduce makes no guarantee about the order of values associated with the same key, the sensor readings are unlikely to arrive in temporal order. The most obvious solution is to buffer all the readings in memory and sort them by timestamp before further processing. It is worth noting, however, that any in-memory buffering of data introduces a potential scalability bottleneck. What if we are working with a high-frequency sensor, or with sensors that have been running for a long time? What if the sensor readings themselves are large, complex objects? The approach does not scale in these cases: the reducer may run out of memory because it buffers all values associated with the same key.
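The buffer-and-sort approach can be sketched as follows. This is a minimal single-process illustration, not Hadoop code: the function names `map_record` and `reduce_sensor` are hypothetical, timestamps are assumed to be plain integers (1 for T1, 2 for T2), and a dictionary stands in for the shuffle.

```python
from collections import defaultdict

# Records in the (timestamp, sensor_id, reading) layout of the dump above.
# Integer timestamps are an assumption made for illustration.
records = [
    (2, "m1", "r21823"),
    (1, "m1", "r80521"),
    (1, "m2", "r14209"),
    (2, "m2", "r66508"),
]


def map_record(record):
    """Emit the sensor id as the key and (timestamp, reading) as the value."""
    timestamp, sensor_id, reading = record
    return sensor_id, (timestamp, reading)


def reduce_sensor(sensor_id, values):
    """Buffer every value for one sensor, then sort the buffer by timestamp.

    The list holds all readings for the key at once, which is the
    memory bottleneck described in the text.
    """
    buffered = list(values)
    buffered.sort(key=lambda pair: pair[0])
    return sensor_id, buffered


# Stand-in for the shuffle: values are grouped by key, in no particular order.
groups = defaultdict(list)
for record in records:
    key, value = map_record(record)
    groups[key].append(value)

results = dict(reduce_sensor(k, v) for k, v in groups.items())
```

The sort itself is trivial; the cost is that `buffered` grows with the number of readings per sensor.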

This is a common problem, because in many applications we want to first group data by one condition (for example, by sensor ID) and then sort within each group by another condition (for example, by time). Fortunately, there is a general-purpose solution called "value-to-key conversion". The basic idea is to move part of the value into the intermediate key to form a composite key, and let the MapReduce framework handle the sorting. In the example above, instead of emitting the sensor ID alone as the key, we emit the sensor ID and the timestamp together as a composite key:

(M1, T1) → (r80521)

The sensor reading itself is now the value. We must define the sort order of the intermediate keys so that they are sorted first by sensor ID (the left element of the pair) and then by timestamp (the right element of the pair). We must also implement a custom partitioner so that all pairs belonging to the same sensor are sent to the same reducer.
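A minimal sketch of these two pieces, simulated in plain Python rather than written against the Hadoop API. The names `map_record`, `partition`, `sort_key`, and `NUM_REDUCERS` are hypothetical, timestamps are assumed to be integers, and `zlib.crc32` is used only as a deterministic stand-in for whatever hash a real partitioner would apply.

```python
import zlib

NUM_REDUCERS = 3  # hypothetical number of reduce tasks


def map_record(record):
    """Value-to-key conversion: move the timestamp into the key."""
    timestamp, sensor_id, reading = record
    return (sensor_id, timestamp), reading


def partition(composite_key, num_reducers=NUM_REDUCERS):
    """Route on the sensor id alone, ignoring the timestamp, so that
    every reading of one sensor reaches the same reducer."""
    sensor_id, _timestamp = composite_key
    return zlib.crc32(sensor_id.encode("utf-8")) % num_reducers


def sort_key(pair):
    """Order pairs by sensor id first, then by timestamp."""
    (sensor_id, timestamp), _reading = pair
    return (sensor_id, timestamp)


records = [
    (2, "m1", "r21823"),
    (1, "m3", "r76042"),
    (1, "m1", "r80521"),
    (2, "m2", "r66508"),
    (1, "m2", "r14209"),
    (2, "m3", "r98347"),
]

# Stand-in for shuffle and sort: partition first, then sort each partition.
partitions = {i: [] for i in range(NUM_REDUCERS)}
for record in records:
    key, value = map_record(record)
    partitions[partition(key)].append((key, value))
for pairs in partitions.values():
    pairs.sort(key=sort_key)
```

Partitioning on the full composite key would scatter one sensor's readings across reducers, which is why the partitioner looks only at the left element.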

With the sort order defined properly, the key-value pairs arrive at the reducer in the correct order:

(M1, T1) → [(r80521)]

(M1, T2) → [(r21823)]

(M1, T3) → [(r146925)]

...

However, the readings of a single sensor are now split across multiple keys. The reducer must therefore preserve state across keys and keep track of where the readings for the current sensor end and where those for the next sensor begin.
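One way to picture this bookkeeping is the sketch below. It is an illustration under stated assumptions, not framework code: `reduce_stream` is a hypothetical function that receives the already sorted pairs of one reducer as a stream, timestamps are assumed to be integers, and the per-sensor "processing" is reduced to a count plus the first and last timestamp so that nothing needs to be buffered.

```python
def reduce_stream(sorted_pairs):
    """Consume ((sensor_id, timestamp), reading) pairs in sorted order.

    Only constant state is kept per sensor. A summary
    (sensor_id, count, first_timestamp, last_timestamp) is yielded
    each time the sensor id changes, and once more at the end.
    """
    current_sensor = None
    count = 0
    first_ts = last_ts = None
    for (sensor_id, timestamp), _reading in sorted_pairs:
        if sensor_id != current_sensor:
            # Boundary: the previous sensor's readings have ended.
            if current_sensor is not None:
                yield current_sensor, count, first_ts, last_ts
            current_sensor = sensor_id
            count = 0
            first_ts = timestamp
        count += 1
        last_ts = timestamp
    # Flush the final sensor after the stream is exhausted.
    if current_sensor is not None:
        yield current_sensor, count, first_ts, last_ts


sorted_pairs = [
    (("m1", 1), "r80521"),
    (("m1", 2), "r21823"),
    (("m1", 3), "r146925"),
    (("m2", 1), "r14209"),
    (("m2", 2), "r66508"),
]

summaries = list(reduce_stream(sorted_pairs))
```

The final flush after the loop is easy to forget; without it the last sensor in the stream would never be emitted.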

 

The basic trade-off between the two approaches discussed above (buffering and in-memory sorting vs. the value-to-key conversion pattern) is where the sorting is performed. Secondary sorting can be implemented explicitly in the reducer, which is likely to run faster but suffers from a scalability bottleneck. With value-to-key conversion, the sorting is offloaded to the MapReduce framework. Note that this approach can be extended to tertiary, quaternary, or higher-order sorting. The consequence of using this pattern is that the framework has many more keys to sort. However, distributed sorting is a task that MapReduce excels at, since it lies at the heart of the MapReduce programming model.
