Table of contents for this book: http://www.cnblogs.com/mdyang/archive/2011/06/29/data-intensive-text-prcessing-with-mapreduce-contents.html
3.4 Secondary Sorting
Before the intermediate results reach the reducers, MapReduce first sorts them and then distributes them. This mechanism is very convenient for reduce operations that depend on the order in which intermediate results arrive (namely, sorted by key); order inversion is one example of a pattern that exploits it. But what if there is a further sorting requirement on top of this, namely sorting by value? Secondary sorting ensures that the data arriving at each reducer is not only sorted by key but also ordered by value. Google's MapReduce implementation provides built-in support for secondary sorting (optional, i.e. not a required operation); Hadoop, unfortunately (even miserably), has no built-in support for it, which is why the "value-to-key conversion" pattern exists.
Consider sensor data as an example: there are m sensors (m is very large), and each sensor produces readings continuously, so every read yields m records. The raw data consists of records of the form (tx, mx, rx), where tx is the time of the reading, mx is the sensor number, and rx is the sensor reading.
If we want to reconstruct the working history of each sensor, we need to group the data by sensor. We now design a MapReduce algorithm in which the mapper reads the raw data and outputs key-value pairs of the following form (taking the raw record (t1, m1, r80521) as an example):

(m1, (t1, r80521))
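A minimal sketch of such a mapper in Hadoop (the class name SensorMapper and the tab-separated input layout "time TAB sensorId TAB reading" are assumptions for illustration, not taken from the book):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (sensorId, (time, reading)) pairs, with the value packed as "time<TAB>reading".
public class SensorMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text sensorId = new Text();
    private final Text timeAndReading = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // e.g. "t1<TAB>m1<TAB>r80521"  ->  key "m1", value "t1<TAB>r80521"
        String[] fields = line.toString().split("\t");
        sensorId.set(fields[1]);
        timeAndReading.set(fields[0] + "\t" + fields[2]);
        context.write(sensorId, timeAndReading);
    }
}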
Obviously, key-value pairs from the same sensor will reach the same reducer, and MapReduce guarantees that the pairs fed to a reducer are sorted by key (that is, by mx). However, MapReduce cannot guarantee that within each group sharing the same mx the pairs are also ordered by tx (nothing sorts on tx). The most intuitive and simple way to solve this is to read each group of reducer input data, such as (mi, [(ti1, ri1), (ti2, ri2), ..., (tin, rin)]), into memory and then sort it by tx. However, this method is limited by memory capacity and scales poorly.
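For illustration, a sketch of this naive in-memory approach, assuming the mapper above and values of the form "time TAB reading"; the buffering step is exactly where it breaks down for sensors with very many readings:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InMemorySortReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text sensorId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Buffer the whole group in memory -- this is the scalability bottleneck.
        List<String> readings = new ArrayList<>();
        for (Text value : values) {
            readings.add(value.toString());
        }
        // Sort by the time field (the part before the first tab), assuming time
        // strings compare chronologically (e.g. fixed-width timestamps).
        readings.sort((a, b) -> a.substring(0, a.indexOf('\t'))
                .compareTo(b.substring(0, b.indexOf('\t'))));
        for (String reading : readings) {
            context.write(sensorId, new Text(reading));
        }
    }
}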
Secondary sorting and value-to-key conversion
The operation we need is in fact a secondary sort. The requirement arises in many applications: sort all the data by one key (mx in this example), and then sort each segment sharing the same key value by another key (tx in this example). Although Hadoop provides no native support for secondary sorting, it can fortunately be achieved by applying "value-to-key conversion". The basic idea is to move the field to be sorted on from the value into the key. For the example above, the mapper's output key-value pairs become:

((m1, t1), r80521)
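A minimal sketch of the reformatted mapper, again assuming the same tab-separated raw records; for brevity the composite key is encoded here as a single string "sensorId#time" rather than a custom WritableComparable, which would be the more typical (and more efficient) choice:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Value-to-key conversion: the time moves from the value into the key.
public class ValueToKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text compositeKey = new Text();
    private final Text reading = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // e.g. "t1<TAB>m1<TAB>r80521"  ->  key "m1#t1", value "r80521"
        String[] fields = line.toString().split("\t");
        compositeKey.set(fields[1] + "#" + fields[0]);
        reading.set(fields[2]);
        context.write(compositeKey, reading);
    }
}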
The rest is simple (both pieces are sketched below):

1. Custom sorting: sort by mx first, then sort keys within the same mx by tx.
2. Custom partitioner: partition on the mx portion of the composite key only, so that all readings from the same sensor still reach the same reducer.
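Sketches of these two pieces for the string-encoded composite key used above (class names are illustrative, not from the book):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner: route on the sensor id only, so every reading from the
// same sensor still lands on the same reducer despite the composite key.
public class SensorPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String sensorId = key.toString().split("#")[0];
        return (sensorId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Custom sort comparator: order composite keys by sensor id first, then by time
// (again assuming time strings compare chronologically).
class SensorTimeSortComparator extends WritableComparator {
    protected SensorTimeSortComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] ka = a.toString().split("#", 2);
        String[] kb = b.toString().split("#", 2);
        int bySensor = ka[0].compareTo(kb[0]);
        return bySensor != 0 ? bySensor : ka[1].compareTo(kb[1]);
    }
}

These would be wired into the job with Job.setPartitionerClass and Job.setSortComparatorClass; in practice one usually also sets a grouping comparator (Job.setGroupingComparatorClass) that compares only the sensor id, so that all readings of one sensor are delivered to a single reduce() call in time order.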
Value-to-key conversion is not limited to two sort keys: by moving more fields into the composite key, the same technique extends to sorting on any number of keys.