Data-Intensive Text Processing with MapReduce, Chapter 3: MapReduce Algorithm Design (4)

Table of contents for these book notes: http://www.cnblogs.com/mdyang/archive/2011/06/29/data-intensive-text-prcessing-with-mapreduce-contents.html

3.4 Secondary Sorting

Before intermediate results reach the reducers, MapReduce sorts them by key and then distributes them. This mechanism is convenient for reduce operations that depend on the order in which intermediate results arrive (that is, key order); the order inversion pattern is one example that exploits it. But what if the values must be sorted as well as the keys? Secondary sorting ensures that data arriving at a reducer is ordered not only by key but also by value. Google's MapReduce implementation provides optional built-in support for secondary sorting. Hadoop, unfortunately, has no such built-in support, which is why the "value-to-key conversion" pattern is needed.

Consider an example of sensor data: there are M sensors (M may be large), each taking readings continuously, so every time point yields a group of M readings. Each record in the resulting data has the form

(Tx, Mx, Rx)

where Tx is the timestamp, Mx is the sensor ID, and Rx is the sensor reading.

If we want to reconstruct the activity of each sensor over time, we need to partition the data by sensor. Suppose we design a MapReduce algorithm in which the mapper reads the raw data and emits key-value pairs of the following form (using the first raw record as an example):

(M1, (T1, R80521))
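The mapper described above can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the function name map_record and the assumed field order (timestamp, sensor ID, reading) of the raw record are choices made for this sketch.

```python
from typing import Iterator, Tuple

# Assumed raw record layout for this sketch: (timestamp, sensor_id, reading).
RawRecord = Tuple[str, str, str]


def map_record(record: RawRecord) -> Iterator[Tuple[str, Tuple[str, str]]]:
    """Emit (sensor_id, (timestamp, reading)) for one raw record."""
    timestamp, sensor_id, reading = record
    # The sensor ID becomes the key; timestamp and reading travel in the value.
    yield sensor_id, (timestamp, reading)


emitted = list(map_record(("T1", "M1", "R80521")))
print(emitted)  # [('M1', ('T1', 'R80521'))]
```

Because only the sensor ID is in the key, the framework sorts and groups on the sensor ID alone; the timestamp inside the value plays no part in ordering.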

Clearly, key-value pairs from the same sensor all go to the same reducer, and MapReduce guarantees that the pairs arriving at a reducer are sorted by key (that is, by Mx). However, MapReduce does not guarantee that the pairs sharing the same Mx are also ordered by Tx, because nothing sorts on Tx. The most intuitive and simple fix is to read each group (the reducer input for one key, such as (Mi, [(Ti1, Ri1), (Ti2, Ri2), ..., (Tin, Rin)])) into memory and then sort it by Tx. However, this approach is limited by memory capacity and scales poorly.
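A minimal sketch of this buffering approach, written as a plain Python function rather than a real Hadoop reducer. The name reduce_buffered, the integer timestamps, and the sample readings are all hypothetical and serve only to illustrate the idea.

```python
from typing import Iterable, Iterator, List, Tuple

# (timestamp, reading); integer timestamps are an assumption of this sketch.
Reading = Tuple[int, str]


def reduce_buffered(sensor_id: str,
                    values: Iterable[Reading]) -> Iterator[Tuple[str, Reading]]:
    """Naive reducer: buffer every value of one sensor, then sort by timestamp."""
    buffered: List[Reading] = list(values)   # the whole group is held in memory
    buffered.sort(key=lambda tr: tr[0])      # in-memory sort on the timestamp
    for timestamp, reading in buffered:
        yield sensor_id, (timestamp, reading)


# Hypothetical values for one sensor, arriving in arbitrary order.
unordered = [(3, "r_c"), (1, "r_a"), (2, "r_b")]
ordered = list(reduce_buffered("M1", unordered))
print(ordered)
```

The call to list(values) is where the scalability problem shows up: memory use grows with the number of readings per sensor, and the reducer cannot emit anything until the entire group has been read.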

Secondary sorting and value-to-key conversion

What we need here is a secondary sort. Many applications have this requirement: sort all the data by one key (Mx in this example), and then sort each run of records sharing that key by another key (Tx in this example). Although Hadoop does not support secondary sorting natively, it can be achieved with the "value-to-key conversion" pattern. The basic idea is to move the part of the value that must be sorted into the key. For the example above, the mapper's output format becomes:

((M1, T1), R80521)

The rest is simple:

1. Custom sort order: sort by Mx first, then by Tx within each Mx.

2. Custom partitioner: partition on the Mx component of the key only, so that all pairs from the same sensor reach the same reducer.
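The two steps above can be simulated in plain Python to show how they fit together. This is a sketch under stated assumptions, not Hadoop code: the functions map_record, partition, shuffle, and reduce_partition are hypothetical stand-ins for the mapper, the custom partitioner, the framework's sort phase, and the reducer, and the sample records with integer timestamps are made up for illustration.

```python
import zlib
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Tuple

CompositeKey = Tuple[str, int]      # (sensor_id, timestamp)
Pair = Tuple[CompositeKey, str]     # ((sensor_id, timestamp), reading)


def map_record(record: Tuple[int, str, str]) -> Iterator[Pair]:
    """Value-to-key conversion: the timestamp moves from the value into the key."""
    timestamp, sensor_id, reading = record
    yield (sensor_id, timestamp), reading


def partition(key: CompositeKey, num_reducers: int) -> int:
    """Custom partitioner: use only the sensor ID, never the timestamp."""
    sensor_id, _ = key
    return zlib.crc32(sensor_id.encode("utf-8")) % num_reducers


def shuffle(pairs: Iterable[Pair], num_reducers: int) -> Dict[int, List[Pair]]:
    """Stand-in for the framework: partition, then sort each partition by key."""
    partitions: Dict[int, List[Pair]] = {i: [] for i in range(num_reducers)}
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    for bucket in partitions.values():
        # Custom sort order: tuples compare by sensor_id first, then timestamp.
        bucket.sort(key=lambda kv: kv[0])
    return partitions


def reduce_partition(sorted_pairs: Iterable[Pair]) -> Iterator[Tuple[str, List[Tuple[int, str]]]]:
    """Walk the sorted stream; each sensor's readings arrive already ordered by time."""
    for sensor_id, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        yield sensor_id, [(key[1], reading) for key, reading in group]


# Hypothetical raw records (timestamp, sensor_id, reading) in arbitrary order.
records = [(2, "M1", "r_d"), (1, "M2", "r_b"), (1, "M1", "r_a"), (2, "M2", "r_e")]
pairs = [kv for rec in records for kv in map_record(rec)]
result = {}
for bucket in shuffle(pairs, num_reducers=2).values():
    for sensor_id, series in reduce_partition(bucket):
        result[sensor_id] = series
print(result)
```

The reducer never sorts anything itself; the ordering work is done entirely by the sort on the composite key. In a real Hadoop job these roles would be filled by a custom partitioner and a sort comparator on the composite key, and typically also a grouping comparator so that one reduce call sees all pairs for a sensor.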

 
Value-to-key conversion is not limited to two sort fields; the same technique extends to sorting on more.
