Hadoop 2.4.1 Learning: RawComparator and Its Implementations

Hadoop supports comparing serialized binary streams directly, which is obviously more efficient than first deserializing the streams into objects and then comparing the objects. Binary streams need to be compared because inter-process communication between nodes in a Hadoop cluster is implemented through remote procedure calls (Remote Procedure Call Protocol, RPC): the RPC protocol serializes a message into a binary stream before sending it to a remote node, and the remote node deserializes the binary stream back into the original message. If the serialized binary streams can be compared directly, the comparison becomes more efficient. This article looks at Hadoop's byte-based comparator, RawComparator, and its implementations.

The RawComparator interface and its implementation classes are related as follows. The RawComparator interface extends java.util.Comparator<T>, so in addition to a byte-based comparison method it also provides a method for comparing objects. In fact, in many implementation classes of RawComparator, the byte-based comparison method ultimately calls the object comparison method. The interface and its implementations are briefly introduced below.


RawComparator: an interface for comparing objects in their byte representation. Its only method is int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2), where b1 is the byte array containing the first object, s1 is the starting position of that object in b1, and l1 is its length in b1; b2 is the byte array containing the second object, s2 is its starting position in b2, and l2 is its length in b2.
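For reference, the interface itself is tiny; a minimal sketch of the declaration in org.apache.hadoop.io looks like this:

package org.apache.hadoop.io;

import java.util.Comparator;

// Compares objects directly in their serialized (byte) representation.
public interface RawComparator<T> extends Comparator<T> {
  int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}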

DeserializerComparator<T>: an abstract class that implements compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) from RawComparator but does not implement compare(T o1, T o2). This class uses a Deserializer to deserialize the objects to be compared, then compares the deserialized objects with the compare method of the standard java.util.Comparator interface. JavaSerializationComparator inherits directly from DeserializerComparator and uses JavaSerialization.JavaSerializationDeserializer<T> to deserialize the objects to be compared. These two classes are rarely used in Hadoop, because Java's built-in serialization mechanism is not as efficient as the serialization mechanism implemented by Hadoop.

WritableComparator: this class directly implements the RawComparator interface and is used to compare objects that implement the WritableComparable interface. The basic implementation compares in natural order; if you want to compare in another order, such as reverse order, you can subclass it and override compare(WritableComparable, WritableComparable) to obtain a custom comparator. This works because the byte-based compare method ultimately calls compare(WritableComparable, WritableComparable), so that method determines how objects are compared. You can also optimize the comparison by overriding compare(byte[], int, int, byte[], int, int); the class provides a number of static methods to help implement such optimizations.
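As a hedged illustration (not taken from the Hadoop source), a reverse-order comparator for LongWritable keys might look like the sketch below; the WritableComparator API used here is standard, but the class itself is hypothetical:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical example: sort LongWritable keys in descending order by
// inverting the result of the natural-order comparison.
public class ReverseLongComparator extends WritableComparator {
  public ReverseLongComparator() {
    super(LongWritable.class, true); // true: create key instances for deserialization
  }

  @Override
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b); // reverse the natural order
  }
}

In a MapReduce job such a comparator could then be registered with job.setSortComparatorClass(ReverseLongComparator.class).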

KeyFieldBasedComparator: this class inherits from WritableComparator and implements the Configurable interface. It provides some of the features of Unix/GNU sort, such as -n (numeric sort) and -r (reverse the comparison).
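A hedged sketch of how such a comparator might be wired into a job is shown below; the configuration key mapreduce.partition.keycomparator.options is the one documented for Hadoop 2.x (the old mapred API uses mapred.text.key.comparator.options instead), but the job setup itself is a made-up example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator;

public class KeyFieldSortSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Sort on the second key field, numerically and in reverse,
    // in the style of "sort -k2,2nr".
    conf.set("mapreduce.partition.keycomparator.options", "-k2,2nr");

    Job job = Job.getInstance(conf, "key-field-sort");
    job.setSortComparatorClass(KeyFieldBasedComparator.class);
    // ... mapper, reducer and input/output paths would be configured here ...
  }
}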

Next, let us see how these RawComparator implementation classes perform byte-based comparison, using the source code of DeserializerComparator and WritableComparator. First, the DeserializerComparator class; its implementation is as follows:

private InputBuffer buffer = new InputBuffer();
private Deserializer<T> deserializer;

private T key1;
private T key2;

@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  try {
    // InputBuffer buffer: reset the buffer so the data to read comes from b1
    buffer.reset(b1, s1, l1);
    // Deserializer<T> deserializer: deserializes the next object from the underlying
    // input stream. If the argument passed to deserialize() is not null, the
    // deserializer may read the next object from the stream into that object;
    // if it is null, a new object is created.
    key1 = deserializer.deserialize(key1);

    buffer.reset(b2, s2, l2);
    key2 = deserializer.deserialize(key2);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  // Compare the deserialized objects
  return compare(key1, key2);
}

In the code above, a Deserializer is used to deserialize the bytes in b1 and b2 into objects, and the objects are then compared. The only subclass of this class, JavaSerializationComparator, specifies the Deserializer as JavaSerialization.JavaSerializationDeserializer<T>; the latter implements the Deserializer interface, and its source code is as follows:

private ObjectInputStream ois;

@Override
public void open(InputStream in) throws IOException {
  // Create an ObjectInputStream that deserializes objects from the input stream
  ois = new ObjectInputStream(in) {
    @Override
    protected void readStreamHeader() {
      // no header
    }
  };
}

@Override
@SuppressWarnings("unchecked")
public T deserialize(T object) throws IOException {
  try {
    // Deserialize an object from the input stream using Java's own deserialization mechanism
    return (T) ois.readObject();
  } catch (ClassNotFoundException e) {
    throw new IOException(e.toString());
  }
}

The code above shows that DeserializerComparator and its subclass ultimately call ObjectInputStream to deserialize objects, which requires that the objects handled by the Deserializer were serialized with ObjectOutputStream. As mentioned earlier, Hadoop provides a serialization mechanism that is more efficient than Java's built-in one, so DeserializerComparator and its subclass are not recommended, and they are not used anywhere in the Hadoop source code.

Having looked at the comparators based on Java's deserialization mechanism, let us now examine WritableComparator and its subclasses, which are widely used in Hadoop. The byte-based comparison method in WritableComparator is as follows:

// The key type in Hadoop must implement the WritableComparable interface,
// which extends both Writable and Comparable
private final WritableComparable key1;
private final WritableComparable key2;
// DataInputBuffer reads data from an in-memory buffer; it extends DataInputStream
private final DataInputBuffer buffer;

@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  try {
    buffer.reset(b1, s1, l1);   // parse key1
    // Deserialize the fields of key1 from the buffer by calling readFields
    key1.readFields(buffer);

    buffer.reset(b2, s2, l2);   // parse key2
    key2.readFields(buffer);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  return compare(key1, key2);   // compare them
}

The compare method above essentially ends up calling the readXxx methods of DataInputStream; for example, if key1 is a LongWritable, readLong is called, and for an IntWritable, readInt. However, many WritableComparable implementations define their own subclass of WritableComparator; for example, the WritableComparator subclass defined inside IntWritable is shown below. This subclass overrides the byte-based compare method of its parent class (that is, the previous piece of code):

/** A Comparator optimized for IntWritable. */
public static class Comparator extends WritableComparator {
  public Comparator() {
    super(IntWritable.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // readInt is a static method of WritableComparator
    int thisValue = readInt(b1, s1);
    int thatValue = readInt(b2, s2);
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }
}
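In the Hadoop source, IntWritable also registers this comparator through a static initializer so that WritableComparator.get(IntWritable.class) returns the optimized version; the registration is essentially:

static {
  // Register the optimized comparator for IntWritable
  WritableComparator.define(IntWritable.class, new Comparator());
}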

The code of the readInt method is as follows:

/** Parse an integer from a byte array. */
public static int readInt(byte[] bytes, int start) {
  // bytes[start]*2^24 + bytes[start+1]*2^16 + bytes[start+2]*2^8 + bytes[start+3]
  return (((bytes[start]   & 0xff) << 24) + ((bytes[start+1] & 0xff) << 16) +
          ((bytes[start+2] & 0xff) <<  8) + ((bytes[start+3] & 0xff)));
}

For comparison, the readInt method of Java's DataInputStream is shown below; in DataInputBuffer, the underlying stream (the in field) is DataInputBuffer.Buffer, a subclass of ByteArrayInputStream.

public final int readInt() throws IOException {
    int ch1 = in.read();
    int ch2 = in.read();
    int ch3 = in.read();
    int ch4 = in.read();
    if ((ch1 | ch2 | ch3 | ch4) < 0)
        throw new EOFException();
    return ((ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4 << 0));
}

Comparing the two pieces of code above, it is not difficult to see that the readInt used by IntWritable's comparator is more efficient than Java's original readInt: Java's readInt has to read from the underlying data stream, while Hadoop's readInt only has to index into a byte array in memory. Besides readInt, WritableComparator provides other optimized static methods such as readLong, readFloat, and readDouble, which you can use to implement your own custom WritableComparator, as in the sketch below.
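As a hedged example (the key class MyKey and its layout are made up for illustration; only the WritableComparator helpers and the define() registration are standard Hadoop API), a custom key whose serialized form starts with an 8-byte long could get an optimized byte-level comparator like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical key type: the serialized form starts with an 8-byte long id.
public class MyKey implements WritableComparable<MyKey> {
  private long id;

  public void set(long id) { this.id = id; }

  @Override
  public void write(DataOutput out) throws IOException { out.writeLong(id); }

  @Override
  public void readFields(DataInput in) throws IOException { id = in.readLong(); }

  @Override
  public int compareTo(MyKey other) { return Long.compare(id, other.id); }

  /** Byte-level comparator that never deserializes the keys. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(MyKey.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      // readLong parses a long directly from the in-memory byte array.
      long v1 = readLong(b1, s1);
      long v2 = readLong(b2, s2);
      return (v1 < v2 ? -1 : (v1 == v2 ? 0 : 1));
    }
  }

  static {
    // Register the optimized comparator so WritableComparator.get(MyKey.class) returns it.
    WritableComparator.define(MyKey.class, new Comparator());
  }
}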
