Hadoop Serialization and the Writable Interface (II)


Hadoop Serialization and the Writable Interface (I) introduced Hadoop serialization, the Hadoop Writable interface, and how to write a custom Writable class. In this article we continue to look at Hadoop's Writable classes, this time focusing on how many bytes a Writable instance occupies once serialized, and on how the resulting byte sequence is composed.

Why consider the byte length of a Writable class

Does a big-data program really need to worry about how much disk space a serialized object occupies? You might think that big data simply means a large volume of data, that the disks must be large enough anyway, and that a serialized object taking a few to a few dozen bytes is negligible compared with the available disk space; if your disks are not big enough, perhaps you should not be doing big data at all.

There is nothing wrong with that view: big-data applications naturally require sufficient disk space. Even so, weighing the sizes of the different Writable classes and using disk space efficiently is not necessarily pointless. Choosing the appropriate Writable class also has another benefit: by shrinking the number of bytes a Writable instance occupies, it can speed up reading data and reduce network data transfer.

Byte lengths occupied by the Writable classes

The following table shows the serialized byte length of the Writable classes with which Hadoop wraps the Java primitive types:

Java primitive | Writable class  | Serialized size (bytes)
-------------- | --------------- | -----------------------
boolean        | BooleanWritable | 1
byte           | ByteWritable    | 1
short          | ShortWritable   | 2
int            | IntWritable     | 4
int            | VIntWritable    | 1~5
float          | FloatWritable   | 4
long           | LongWritable    | 8
long           | VLongWritable   | 1~9
double         | DoubleWritable  | 8

Different Writable classes occupy different numbers of bytes, so it is necessary to pick the types appropriate to your application's data. For integers there are two kinds of Writable: the fixed-length Writable types, IntWritable and LongWritable, and the variable-length Writable types, VIntWritable and VLongWritable. A fixed-length type, as the name suggests, always uses the same number of bytes; an IntWritable, for example, represents an int with 4 bytes. A variable-length type uses a byte length that depends on the magnitude of the value: a value between -112 and 127 is represented by a single byte, while values outside that range use the first byte to record the sign and the number of value bytes that follow (a zero-compressed encoded integer).

The fixed-length Writable types suit values that are uniformly distributed over their whole range, while the variable-length types suit unevenly distributed values. In general the variable-length types save more space, because in most cases values are not uniformly distributed. For choosing an integer Writable type, I suggest (see the sketch after this list):

1. Use a variable-length Writable type unless you are sure the data is uniformly distributed

2. Prefer the VLongWritable type for program extensibility, unless you are certain the data stays within the range of an int
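
To get a concrete feel for how many bytes the variable-length encoding uses as values grow, here is a minimal sketch (the class name VIntSizeDemo is ours) that calls Hadoop's WritableUtils.getVIntSize, which reports the encoded length of a value without actually serializing it:

import org.apache.hadoop.io.WritableUtils;

public class VIntSizeDemo {

    public static void main(String[] args) {
        // boundary values of the zero-compressed encoding: -112~127
        // fit in one byte, larger magnitudes need 2~9 bytes
        long[] samples = {127L, 128L, -112L, -113L, 16383L, 1000000000L, Long.MAX_VALUE};
        for (long value : samples) {
            System.out.printf("%,20d -> %d byte(s)%n",
                    value, WritableUtils.getVIntSize(value));
        }
    }
}

Values inside the -112~127 window cost a single byte; each additional byte of magnitude adds one more data byte, up to 9 bytes for the largest longs.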

Byte sequences of the integer Writable types

The following example demonstrates the byte length occupied by Hadoop's integer Writable objects and the structure of the byte sequence each produces when serialized, in particular for the variable-length integer Writables. See the code and the program output below:

package com.yoyzhou.example;

import java.io.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.util.StringUtils;

/**
 * Demonstrates how many bytes each built-in Writable type takes and what
 * the serialized byte sequences look like.
 *
 * @author yoyzhou
 */
public class WritableBytesLengthDemo {

    public static void main(String[] args) throws IOException {

        // one billion represented by different Writable objects
        IntWritable int_b = new IntWritable(1000000000);
        LongWritable long_b = new LongWritable(1000000000);
        VIntWritable vint_b = new VIntWritable(1000000000);
        VLongWritable vlong_b = new VLongWritable(1000000000);

        // serialize each Writable object to a byte array
        byte[] bs_int_b = serialize(int_b);
        byte[] bs_long_b = serialize(long_b);
        byte[] bs_vint_b = serialize(vint_b);
        byte[] bs_vlong_b = serialize(vlong_b);

        // print each byte array as a hex string, together with its length
        String hex = StringUtils.byteToHexString(bs_int_b);
        formatPrint("IntWritable", "1,000,000,000", hex, bs_int_b.length);

        hex = StringUtils.byteToHexString(bs_long_b);
        formatPrint("LongWritable", "1,000,000,000", hex, bs_long_b.length);

        hex = StringUtils.byteToHexString(bs_vint_b);
        formatPrint("VIntWritable", "1,000,000,000", hex, bs_vint_b.length);

        hex = StringUtils.byteToHexString(bs_vlong_b);
        formatPrint("VLongWritable", "1,000,000,000", hex, bs_vlong_b.length);
    }

    private static void formatPrint(String type, String param, String hex, int length) {
        String format = "%1$-50s%2$-16s with length: %3$2d%n";
        System.out.format(format, "Byte array per " + type
                + "(" + param + ") is:", hex, length);
    }

    /**
     * Utility to serialize a Writable object; returns a byte array
     * representing the Writable.
     */
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    /**
     * Utility to deserialize an input byte array into the given Writable
     * object; returns the populated Writable.
     */
    public static Writable deserialize(Writable writable, byte[] bytes)
            throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);
        dataIn.close();
        return writable;
    }
}

Program output:

Byte array per IntWritable(1,000,000,000) is: \
3b9aca00 with length: 4
Byte array per LongWritable(1,000,000,000) is: \
000000003b9aca00 with length: 8
Byte array per VIntWritable(1,000,000,000) is: \
8c3b9aca00 with length: 5
Byte array per VLongWritable(1,000,000,000) is: \
8c3b9aca00 with length: 5

From the above output we can see:

+ The same value, 1,000,000,000, occupies different byte lengths in different Writable types

+ A variable-length Writable type is not always more space-saving than its fixed-length counterpart: here IntWritable occupies 4 bytes and LongWritable 8, while the corresponding variable-length Writables need 5 bytes each, since an extra byte holds the sign and the byte count. So, returning to the question of integer type selection raised earlier: to pick the most suitable integer Writable type, we should have some understanding of how the values are distributed overall.

Byte sequence of the Text class

It is natural to think of the Text class as the Writable counterpart of java.lang.String, but note that Text encodes Unicode characters with UTF-8 rather than the UTF-16 used by Java's char type.

Java's char type uses UTF-16 encoding [1], following Unicode Standard version 4: each code point in the Basic Multilingual Plane (BMP, code points U+0000~U+FFFF) is encoded as a single fixed-length 16-bit unit (two bytes), while supplementary characters above the BMP are represented by a pair of surrogate code units (four bytes).

The Text class uses UTF-8, which encodes characters in a variable-length scheme of 1~4 bytes: ASCII characters need only 1 byte, while other characters need 2~4 bytes. I believe this is why Hadoop chose UTF-8 rather than String's UTF-16 when Text was designed: to save bytes and thus space.
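
As a quick plain-Java illustration (this snippet is our own, not from the original article), the byte counts of an ASCII string and a Chinese string under both encodings can be compared directly:

import java.nio.charset.StandardCharsets;

public class EncodingLengthDemo {

    public static void main(String[] args) {
        String ascii = "my text";    // 7 ASCII characters
        String chinese = "我的文本"; // 4 CJK characters

        // UTF-8: 1 byte per ASCII character, 3 bytes per CJK character
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);      // 7
        System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length);    // 12

        // UTF-16 (big-endian, no byte-order mark): 2 bytes per BMP character
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length);   // 14
        System.out.println(chinese.getBytes(StandardCharsets.UTF_16BE).length); // 8
    }
}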

Because Text uses UTF-8, it does not provide as many operations as String, and you must pay attention to the differences when manipulating Text objects; indexing and iteration, for example, work in terms of byte offsets rather than characters. When you need non-trivial text manipulation, it is usually best to convert the Text object to a String, operate on the String, and convert back if necessary.
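
For instance, in this small sketch of ours, Text indexes and searches by byte offset and returns Unicode code points, while the String obtained from toString() restores the familiar character-based API:

import org.apache.hadoop.io.Text;

public class TextOperationsDemo {

    public static void main(String[] args) {
        Text text = new Text("hadoop");

        // Text works in byte offsets and Unicode code points
        System.out.println(text.getLength()); // 6, the UTF-8 byte length
        System.out.println(text.charAt(2));   // 100, the code point of 'd'
        System.out.println(text.find("do"));  // 2, a byte offset

        // converting to String restores the character-based API
        String s = text.toString();
        System.out.println(s.charAt(2));      // d
        System.out.println(s.indexOf("do"));  // 2

        // with multi-byte characters, byte length and character count diverge
        Text chinese = new Text("我的文本");
        System.out.println(chinese.getLength());         // 12 bytes
        System.out.println(chinese.toString().length()); // 4 characters
    }
}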

The byte sequence of the Text class is a VIntWritable followed by a UTF-8 stream: the VIntWritable holds the byte length of the encoded text, and the UTF-8 byte array holds the text itself. See the following code fragment:

... // omitted for conciseness

Text myText = new Text("my text");

byte[] text_bs = serialize(myText);
hex = StringUtils.byteToHexString(text_bs);
formatPrint("Text", "\"my text\"", hex, text_bs.length);

Text myText2 = new Text("我的文本");

byte[] text2_bs = serialize(myText2);
hex = StringUtils.byteToHexString(text2_bs);
formatPrint("Text", "\"我的文本\"", hex, text2_bs.length);

...

Byte array per Text("my text") is: \
076d792074657874 with length: 8

Byte array per Text("我的文本") is: \
0ce68891e79a84e69687e69cac with length: 13

In the output above, the first byte records the byte length of the text that follows: the UTF-8 encoding of "my text" occupies 7 bytes (07), while the Chinese "我的文本" occupies 12 bytes (0c).

Byte sequence of a custom Writable class

In this section we use the MyWritable class from the previous article as an illustration: MyWritable is a custom Writable type composed of two VLongWritable fields.
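
Since part (I) is not reproduced here, below is a minimal sketch of what such a class looks like, reconstructed from the description above; the field names field1 and field2 match the write method quoted later, while the constructors are our own assumption:

import java.io.*;

import org.apache.hadoop.io.*;

/** A custom Writable composed of two VLongWritable fields (sketch). */
public class MyWritable implements Writable {

    private VLongWritable field1 = new VLongWritable();
    private VLongWritable field2 = new VLongWritable();

    // a no-argument constructor is required so the framework can
    // instantiate the class reflectively before calling readFields
    public MyWritable() {
    }

    public MyWritable(VLongWritable field1, VLongWritable field2) {
        this.field1 = field1;
        this.field2 = field2;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // the byte sequence is simply the two VLongWritable
        // sequences written back to back
        field1.write(out);
        field2.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // fields must be read in the same order they were written
        field1.readFields(in);
        field2.readFields(in);
    }
}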

Program output:

Byte array per MyWritable(1000, 1000000000) is: \
8e03e88c3b9aca00 with length: 8

From the output we can see clearly that the byte sequence of a custom Writable class is simply the concatenation of its member Writable types: in "8e03e88c3b9aca00", the first three bytes "8e03e8" are the VLongWritable byte sequence for 1000, and "8c3b9aca00" is the VLongWritable byte sequence for 1000000000. This follows directly from the write method of the MyWritable class we wrote:

... // omitted for conciseness

@Override
public void write(DataOutput out) throws IOException {
    field1.write(out);
    field2.write(out);
}

...
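
To check the round trip, this fragment (our own, assuming the two-argument constructor from the sketch above) combines MyWritable with the serialize and deserialize utilities defined in the demo class earlier:

... // omitted for conciseness

MyWritable original = new MyWritable(new VLongWritable(1000L),
        new VLongWritable(1000000000L));

byte[] bytes = serialize(original);
System.out.println(StringUtils.byteToHexString(bytes)); // prints 8e03e88c3b9aca00

MyWritable copy = new MyWritable();
deserialize(copy, bytes); // copy now holds 1000 and 1000000000

...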

Summary

This article used examples to show the byte lengths that Hadoop Writable classes occupy when serialized, and analyzed the structure of the resulting byte sequences. Note that the Text class uses UTF-8 encoding, rather than the UTF-16 of Java's char, to save space, and that the byte sequence of a custom Writable is determined by its write() method.

Finally, Writable is the core of Hadoop serialization; understanding the byte lengths and byte sequences of Hadoop Writables is essential both for choosing the right Writable type and for manipulating Writable objects at the byte level.
