Hadoop Serialization: Source Code

Source: Internet
Author: User

    • Serialization is the conversion of a structured object into a byte stream, either to communicate between processes or to write to disk for persistent storage.
    • Deserialization is the reverse process: turning a byte stream back into a structured object.
    • Note that only byte streams can be transmitted over the network. When the intermediate map output is shuffled between hosts, the structured objects therefore go through both serialization (the map side writes its results to disk) and deserialization (the reduce side reads the map results).

Writable Interface

Instead of using Java's built-in serialization mechanism, Hadoop introduces its own serialization system, which defines a large number of serializable types in the package org.apache.hadoop.io. These types all implement the Writable interface, the common interface for serializable objects. It declares two serialization methods: write(DataOutput out) and readFields(DataInput in).
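As a rough illustration of the write()/readFields() contract, the sketch below uses only the JDK so that it runs without Hadoop on the classpath. SimpleWritable and PointWritable are hypothetical stand-ins invented for this example; they are not Hadoop classes, and a real type would implement org.apache.hadoop.io.Writable instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableRoundTrip {

    // Hypothetical stand-in mirroring the two methods of Hadoop's Writable.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical value type: two ints written back to back (8 bytes in total).
    static class PointWritable implements SimpleWritable {
        int x;
        int y;

        PointWritable() { }

        PointWritable(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in exactly the order they were written.
            x = in.readInt();
            y = in.readInt();
        }
    }

    // Serializes the object into a byte array.
    static byte[] serialize(SimpleWritable w) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Fills an existing (reusable) instance from the byte array.
    static <T extends SimpleWritable> T deserialize(T target, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        byte[] data = serialize(new PointWritable(3, -7));
        PointWritable copy = deserialize(new PointWritable(), data);
        System.out.println(data.length + " bytes, x=" + copy.x + ", y=" + copy.y);
        // prints "8 bytes, x=3, y=-7"
    }
}
```

Note that readFields() populates an existing instance rather than returning a new one, which lets the caller reuse the same object across many records.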

WritableComparable Interface

The WritableComparable interface builds on the Writable interface: it extends both Writable and java.lang.Comparable<T>, adding a compareTo(T o) method for comparing serializable objects. This is needed because MapReduce has a key-based sort phase between map and reduce, so key types must be comparable.

RawComparator Interface

To optimize the shuffle phase, Hadoop provides a raw comparator interface, RawComparator<T>, which compares records at the byte-stream level and so greatly reduces the time spent on comparisons. Most classes do not implement this interface directly. In most cases its direct implementation, WritableComparator, serves as the general-purpose comparator for WritableComparable classes and provides the comparison over serialized bytes.
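To make the idea of byte-level comparison concrete, here is a minimal JDK-only sketch that orders two serialized 4-byte integers directly inside a buffer, without creating any objects. The class and method names are hypothetical, and the big-endian layout is assumed to match what DataOutput.writeInt produces; this is not Hadoop's own comparator code.

```java
import java.nio.ByteBuffer;

public class RawIntCompare {

    // Reads a big-endian 4-byte int starting at offset s.
    static int readInt(byte[] b, int s) {
        return ((b[s] & 0xff) << 24)
             | ((b[s + 1] & 0xff) << 16)
             | ((b[s + 2] & 0xff) << 8)
             |  (b[s + 3] & 0xff);
    }

    // Compares two serialized ints in place; no key objects are allocated.
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        // Two records packed into one buffer, the way a sort buffer might hold them.
        byte[] buf = ByteBuffer.allocate(8).putInt(-5).putInt(42).array();
        System.out.println(compareRaw(buf, 0, buf, 4) < 0); // true: -5 sorts before 42
    }
}
```

A plain unsigned byte-by-byte comparison would order -5 after 42 here, because the sign bit makes the first byte of -5 the larger one. That is why a comparator for a signed numeric type decodes the value instead of comparing bytes lexicographically.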

WritableComparator Class

1). The default implementation of the raw compare() method first deserializes both byte ranges into objects and then compares the objects, which carries an overhead. A comparator for a specific WritableComparable type should therefore override compare() so that it works on the bytes directly.

    // The default compare() takes the two byte ranges to be compared, deserializes
    // them into objects, and then calls the objects' comparison method.
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // Use the buffer as a bridge: point it at the first byte range
            buffer.reset(b1, s1, l1);
            // Call the deserialization method of key1 (a WritableComparable)
            key1.readFields(buffer);

            buffer.reset(b2, s2, l2);
            key2.readFields(buffer);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        // Call the object-level compare() to compare the two keys
        return compare(key1, key2);
    }

2). The define() method registers a WritableComparator for a class in the comparator registry, after which Hadoop uses that comparator automatically.

    public static void define(Class c, WritableComparator comparator) {
        comparators.put(c, comparator);
    }

3). To sort efficiently, a custom WritableComparable class needs both of the above: a comparator that overrides compare() to work on raw bytes, and a call to define() that registers it.
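The registration idea in points 2) and 3) can be sketched with a plain map from key class to comparator. Everything below is hypothetical and JDK-only (RawCmp, SignedByteKey and the registry class are invented for the example). In particular, the fallback used here is an unsigned byte comparison chosen for brevity, whereas the default described in point 1) deserializes the keys first.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ComparatorRegistrySketch {

    // Hypothetical raw comparator over serialized records (start offset s, length l).
    interface RawCmp {
        int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
    }

    // Hypothetical key type whose serialized form is a single signed byte.
    static final class SignedByteKey { }

    private static final Map<Class<?>, RawCmp> COMPARATORS = new ConcurrentHashMap<>();

    // Assumption for this sketch: unregistered types fall back to unsigned byte order.
    private static final RawCmp FALLBACK =
        (b1, s1, l1, b2, s2, l2) -> Arrays.compareUnsigned(b1, s1, s1 + l1, b2, s2, s2 + l2);

    static {
        // Register the specialised comparator once, when the class is initialized.
        define(SignedByteKey.class, (b1, s1, l1, b2, s2, l2) -> Byte.compare(b1[s1], b2[s2]));
    }

    static void define(Class<?> keyClass, RawCmp comparator) {
        COMPARATORS.put(keyClass, comparator);
    }

    static RawCmp get(Class<?> keyClass) {
        return COMPARATORS.getOrDefault(keyClass, FALLBACK);
    }

    public static void main(String[] args) {
        byte[] minusOne = {(byte) 0xff};
        byte[] plusOne = {0x01};
        // Registered comparator treats 0xff as -1, so it sorts first.
        System.out.println(get(SignedByteKey.class).compare(minusOne, 0, 1, plusOne, 0, 1) < 0); // true
        // Unregistered type: fallback treats 0xff as 255, so it sorts last.
        System.out.println(get(String.class).compare(minusOne, 0, 1, plusOne, 0, 1) > 0);        // true
    }
}
```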

Byte Length of Writable Classes

Before writing a custom Writable class, it helps to know how much space the different built-in Writable classes occupy. Reducing the number of bytes in a Writable instance speeds up reads and reduces the amount of data sent over the network. The following table lists the serialized size of the Writable classes that wrap Java primitive types:

    Java primitive type   Size (bytes)   Writable implementation   Serialized size (bytes)
    boolean               1/8            BooleanWritable           1
    byte                  1              ByteWritable              1
    short                 2              ShortWritable             2
    int                   4              IntWritable               4
                                         VIntWritable              1–5
    float                 4              FloatWritable             4
    long                  8              LongWritable              8
                                         VLongWritable             1–9
    double                8              DoubleWritable            8

Different Writable types serialize to different byte lengths, so the type should be chosen to suit the characteristics of the data in the application. For integers there are two options: the fixed-length Writable types, IntWritable and LongWritable, and the variable-length Writable types, VIntWritable and VLongWritable. A variable-length type uses as many bytes as the magnitude of the value requires. A value between -112 and 127 is encoded in a single byte. For a value outside that range, the first byte encodes the sign and the number of bytes that follow, and the following bytes hold the value itself (zero-compressed encoding).
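The encoding rule just described can be sketched in plain Java as follows. This is an illustration written from the description above, using only the JDK; the class and method names are hypothetical, and Hadoop's own implementation (WritableUtils.writeVLong) should be treated as the reference.

```java
import java.io.ByteArrayOutputStream;

public class ZeroCompressedLong {

    // Encodes a long using the zero-compressed scheme described in the text.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value >= -112 && value <= 127) {
            out.write((int) value);          // small values are stored in one byte
            return out.toByteArray();
        }
        int marker = -112;                   // positive values use markers -113..-120
        if (value < 0) {
            value = ~value;                  // one's complement gives a non-negative payload
            marker = -120;                   // negative values use markers -121..-128
        }
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            marker--;                        // one step per payload byte needed
        }
        out.write(marker);                   // first byte: sign and payload length
        int payload = (marker < -120) ? -(marker + 120) : -(marker + 112);
        for (int i = payload - 1; i >= 0; i--) {
            out.write((int) (value >> (8 * i))); // big-endian; write() keeps the low 8 bits
        }
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(127)));            // 7f
        System.out.println(hex(encode(128)));            // 8f80
        System.out.println(hex(encode(1_000_000_000L))); // 8c3b9aca00
    }
}
```

For one billion, the marker byte 0x8c (-116) says "positive, four bytes follow", which is why the variable-length form needs five bytes where the fixed-length int needs four.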

When choosing a Writable type for integers, the recommendations are:

    1. Use a variable-length Writable type unless you are sure the data is uniformly distributed across the value range.
    2. For extensibility, choose VLongWritable unless the values are known to stay within the range of int.

    package cn.itcast.hadoop.mr;

    import java.io.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.util.StringUtils;

    // Tests the length of the byte arrays produced when a decimal number
    // is serialized as different Writable types
    public class WritableBytesLengthDemo {

        public static void main(String[] args) throws IOException {
            // Represent one billion with different Writable types
            IntWritable int_b = new IntWritable(1000000000);
            LongWritable long_b = new LongWritable(1000000000);
            VIntWritable vint_b = new VIntWritable(1000000000);
            VLongWritable vlong_b = new VLongWritable(1000000000);

            // Serialize each Writable into a byte array
            byte[] bs_int_b = serialize(int_b);
            byte[] bs_long_b = serialize(long_b);
            byte[] bs_vint_b = serialize(vint_b);
            byte[] bs_vlong_b = serialize(vlong_b);

            // Print each byte array in hexadecimal, along with its length
            String hex = StringUtils.byteToHexString(bs_int_b);
            formatPrint("IntWritable", "1,000,000,000", hex, bs_int_b.length);
            hex = StringUtils.byteToHexString(bs_long_b);
            formatPrint("LongWritable", "1,000,000,000", hex, bs_long_b.length);
            hex = StringUtils.byteToHexString(bs_vint_b);
            formatPrint("VIntWritable", "1,000,000,000", hex, bs_vint_b.length);
            hex = StringUtils.byteToHexString(bs_vlong_b);
            formatPrint("VLongWritable", "1,000,000,000", hex, bs_vlong_b.length);
        }

        // Defines the output format
        private static void formatPrint(String type, String param, String hex, int length) {
            String format = "%1$-50s%2$-16s with length:%3$2d%n";
            System.out.format(format, "Byte array per " + type + " (" + param + ") is:", hex, length);
        }

        // Serializes an object that implements the Writable interface into a byte array
        public static byte[] serialize(Writable writable) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream dataOut = new DataOutputStream(out);
            writable.write(dataOut);
            dataOut.close();
            return out.toByteArray();
        }

        // Deserializes a byte array into the given Writable
        public static Writable deserialize(Writable writable, byte[] bytes) throws IOException {
            ByteArrayInputStream in = new ByteArrayInputStream(bytes);
            DataInputStream dataIn = new DataInputStream(in);
            writable.readFields(dataIn);
            dataIn.close();
            return writable;
        }
    }

    Byte array per IntWritable (1,000,000,000) is: 3b9aca00 with length: 4
    Byte array per LongWritable (1,000,000,000) is: 000000003b9aca00 with length: 8
    Byte array per VIntWritable (1,000,000,000) is: 8c3b9aca00 with length: 5
    Byte array per VLongWritable (1,000,000,000) is: 8c3b9aca00 with length: 5

From the above output we can see:

    • The same value, 1,000,000,000, occupies a different number of bytes depending on the Writable type.
    • Variable-length types do not always save space compared with fixed-length ones, because outside the single-byte range they need an extra leading byte to hold the sign and the byte count.

Byte Sequence of Text

    1. It is natural to think of the Text class as the Writable counterpart of java.lang.String, but note that Text stores Unicode characters in UTF-8, a variable-length encoding that uses 1 to 4 bytes per character. ASCII characters take only one byte, while all other characters take two to four bytes. This differs from Java's String and char types, which use UTF-16.
    2. If the original data is GBK-encoded and is read in as Text, calling String line = value.toString() directly produces garbled text, because toString() decodes the bytes as UTF-8. The correct approach is to take the byte array of the input Text value and decode it with the String constructor String(byte[] bytes, int offset, int length, String charsetName) (an overload that takes a Charset also exists), which constructs a new string by decoding the specified byte range with the specified charset. That is: String line = new String(value.getBytes(), 0, value.getLength(), "GBK");
    3. The serialized form of the Text class is "a VInt followed by a UTF-8 byte stream". The VInt holds the length, in bytes, of the UTF-8 encoding, and the UTF-8 byte array is the actual text.
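The first two points can be checked with the JDK alone. The sketch below measures the UTF-8 length of a few characters and shows why GBK bytes must be decoded with the right charset and the valid length. The class and method names are hypothetical, the oversized array is an assumption standing in for a buffer whose valid data is shorter than the array, and the GBK part is skipped if the running JVM does not provide that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TextEncodingSketch {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Decodes only the first `length` bytes of the buffer with the given charset.
    static String decode(byte[] buffer, int length, Charset charset) {
        return new String(buffer, 0, length, charset);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // 1 (ASCII)
        System.out.println(utf8Length("\u00e9"));       // 2 (e with acute accent)
        System.out.println(utf8Length("\u4e2d"));       // 3 (a CJK character)
        System.out.println(utf8Length("\ud83d\ude00")); // 4 (U+1F600, a surrogate pair in UTF-16)

        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            String original = "\u4e2d\u6587";
            byte[] raw = original.getBytes(gbk);        // GBK uses 2 bytes per character here
            byte[] buffer = Arrays.copyOf(raw, 16);     // backing array larger than the valid data
            String wrong = decode(buffer, raw.length, StandardCharsets.UTF_8);
            String right = decode(buffer, raw.length, gbk);
            System.out.println(right.equals(original)); // true
            System.out.println(wrong.equals(original)); // false: GBK bytes are not valid UTF-8
        }
    }
}
```

Passing the explicit length matters because the byte array behind a reused buffer can be longer than the data currently stored in it.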

The byte-level comparator inside the Text class is shown below, with explanatory comments:

    /** A WritableComparator optimized for Text keys. */
    public static class Comparator extends WritableComparator {
        public Comparator() {
            super(Text.class);
        }

        @Override
        // b1/b2 are the byte arrays; s1/s2 are the start offsets of the serialized
        // Text values; l1/l2 are their lengths in bytes
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            // Size in bytes of the VInt length header at the start of each value
            int n1 = WritableUtils.decodeVIntSize(b1[s1]);
            int n2 = WritableUtils.decodeVIntSize(b2[s2]);
            // Skip the length header and compare the UTF-8 bytes of the actual string
            // directly. compareBytes() compares byte by byte and returns as soon as it
            // finds a difference, ignoring whatever follows.
            return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
        }
    }
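The same skip-the-header idea can be tried without Hadoop. In the JDK-only sketch below, headerSize() mirrors the role of decodeVIntSize() under the zero-compressed encoding described earlier, and Arrays.compareUnsigned (Java 9+) stands in for compareBytes(). All names are hypothetical, and encodeShort() is a simplification that only handles strings whose UTF-8 form is at most 127 bytes, so that the length header is a single byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TextRawCompareSketch {

    // Total size in bytes of a zero-compressed integer, judged from its first byte.
    static int headerSize(byte first) {
        if (first >= -112) {
            return 1;                 // the value is stored in this byte itself
        }
        if (first < -120) {
            return -119 - first;      // negative value: marker plus payload bytes
        }
        return -111 - first;          // positive value: marker plus payload bytes
    }

    // Serializes a string as [length byte][UTF-8 bytes]; sketch-only limit of 127 bytes.
    static byte[] encodeShort(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch supports at most 127 UTF-8 bytes");
        }
        byte[] out = new byte[utf8.length + 1];
        out[0] = (byte) utf8.length;
        System.arraycopy(utf8, 0, out, 1, utf8.length);
        return out;
    }

    // Skips each length header, then compares the remaining bytes as unsigned values.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n1 = headerSize(b1[s1]);
        int n2 = headerSize(b2[s2]);
        return Arrays.compareUnsigned(b1, s1 + n1, s1 + l1, b2, s2 + n2, s2 + l2);
    }

    public static void main(String[] args) {
        byte[] apple = encodeShort("apple");
        byte[] banana = encodeShort("banana");
        System.out.println(compare(apple, 0, apple.length, banana, 0, banana.length) < 0); // true
        System.out.println(headerSize((byte) 0x8c)); // 5: one marker byte plus four payload bytes
    }
}
```

Comparing the headers as well would sort shorter strings before longer ones regardless of content, which is why the comparator skips them.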
