The previous article, Hadoop Serialization and the Writable Interface (I), introduced Hadoop serialization, the Writable interface, and how to write a custom Writable class. This article continues with the Writable classes. This time we look at how many bytes a Writable instance occupies once it has been serialized, and at how the resulting byte sequence is composed.
Why consider the byte length of Writable classes
Does a big data program really need to care how much disk space a serialized object occupies? You might argue that big data implies plenty of disk space, that a serialized object takes only a few to a few dozen bytes, which is negligible next to the size of the disks, and that if your disks are not large enough you should not be working with big data in the first place.
There is nothing wrong with that view: big data applications naturally need plenty of disk space. Still, taking the size of the different Writable classes into account and using disk space efficiently is not wasted effort. Choosing an appropriate Writable class has a further benefit: fewer bytes per Writable instance means faster reads and less data sent over the network.
Byte length of the Writable classes
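The following table shows the serialized byte length of the Writable classes that Hadoop provides as wrappers for the Java primitive types:
Java primitive    Writable class     Serialized size (bytes)
boolean           BooleanWritable    1
byte              ByteWritable       1
short             ShortWritable      2
int               IntWritable        4
int               VIntWritable       1 to 5
float             FloatWritable      4
long              LongWritable       8
long              VLongWritable      1 to 9
double            DoubleWritable     8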
Different Writable classes occupy different numbers of bytes, so choose the type that fits the characteristics of the data in your application. For integers there are two kinds of Writable: the fixed-length types IntWritable and LongWritable, and the variable-length types VIntWritable and VLongWritable. A fixed-length type, as the name suggests, always uses the same number of bytes; IntWritable, for example, uses 4 bytes to represent an int. A variable-length type uses a number of bytes that depends on the magnitude of the value: a value between -112 and 127 is stored in a single byte, while for a value outside that range the first byte encodes the sign and the number of bytes that follow (a zero-compressed encoded integer).
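To make the zero-compressed encoding concrete, here is a minimal plain-Java sketch of it. It does not use Hadoop; the class and method names (VLongEncodingSketch, encode, toHex) are hypothetical, and the logic is my reading of the scheme described above (it is intended to mirror what Hadoop's WritableUtils does, but treat that as an assumption and check the Hadoop source before relying on it).

```java
import java.io.ByteArrayOutputStream;

// Hypothetical, Hadoop-free sketch of zero-compressed long encoding.
public class VLongEncodingSketch {

    /** Encodes a long as 1 to 9 bytes: a marker byte plus a big-endian payload. */
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Small values fit in a single byte and need no marker.
        if (value >= -112 && value <= 127) {
            out.write((int) value);
            return out.toByteArray();
        }
        // Markers -113..-120 mean "positive", -121..-128 mean "negative".
        int marker = -112;
        if (value < 0) {
            value ^= -1L;   // store the one's complement of negative values
            marker = -120;
        }
        // Each significant payload byte moves the marker one step further down.
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            marker--;
        }
        out.write(marker);
        int payloadLength = (marker < -120) ? -(marker + 120) : -(marker + 112);
        // Write the payload, most significant byte first.
        for (int idx = payloadLength; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.write((int) ((value >> shift) & 0xFF));
        }
        return out.toByteArray();
    }

    /** Renders bytes as a lowercase hex string. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long[] samples = {0L, 127L, -112L, 128L, -113L, 1000L, 1_000_000_000L};
        for (long v : samples) {
            byte[] bytes = encode(v);
            System.out.printf("%d -> %s (%d bytes)%n", v, toHex(bytes), bytes.length);
        }
    }
}
```

Under these assumptions 127 encodes to the single byte 7f, while 128 already needs two bytes (8f80): one marker byte and one payload byte.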
Fixed-length Writable types suit values that are spread uniformly across the whole range, while variable-length types suit skewed distributions. In general the variable-length types save more space, because in most cases the values are not uniformly distributed. For choosing an integer Writable, I suggest:
1. Use a variable-length Writable type unless you are sure the data is uniformly distributed.
2. Choose VLongWritable for scalability unless you are sure the data stays within the range of int.
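The trade-off behind these two suggestions can be estimated before writing any Hadoop code. The sketch below is plain Java with hypothetical names (WritableSizeComparison, vlongSize) and two made-up sample data sets; the size formula is my understanding of the variable-length encoding (1 byte for -112 to 127, otherwise one marker byte plus the significant bytes), so treat it as an assumption rather than the library's definition.

```java
// Hypothetical, Hadoop-free estimate of fixed-length vs variable-length storage.
public class WritableSizeComparison {

    /** Number of bytes a zero-compressed long is assumed to occupy. */
    static int vlongSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1;
        }
        if (value < 0) {
            value ^= -1L; // negative values are stored as their one's complement
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(value);
        return (dataBits + 7) / 8 + 1; // payload bytes plus one marker byte
    }

    /** Total bytes if every value is stored in the variable-length form. */
    static long totalVariableBytes(long[] values) {
        long total = 0;
        for (long v : values) {
            total += vlongSize(v);
        }
        return total;
    }

    /** Total bytes if every value is stored with a fixed width (4 for int, 8 for long). */
    static long totalFixedBytes(long[] values, int width) {
        return (long) values.length * width;
    }

    public static void main(String[] args) {
        // Made-up sample data: mostly small values with one outlier.
        long[] skewed = {1, 3, 7, 42, 100, 5, 2, 90_000};
        // Made-up sample data: every value needs all four bytes of an int.
        long[] large = {2_000_000_000L, 1_500_000_000L, 1_000_000_000L, 1_800_000_000L};

        System.out.printf("skewed: variable=%d bytes, fixed int=%d bytes%n",
                totalVariableBytes(skewed), totalFixedBytes(skewed, 4));
        System.out.printf("large:  variable=%d bytes, fixed int=%d bytes%n",
                totalVariableBytes(large), totalFixedBytes(large, 4));
    }
}
```

For the skewed sample the variable-length form needs 11 bytes against 32 for fixed-length ints; for the large sample it needs 20 bytes against 16, which is the case where the fixed-length type wins.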
Byte sequence of integer Writables
The following example shows how many bytes Hadoop's integer Writable objects occupy and how the byte sequence is structured after serialization, in particular for the variable-length integer Writables. See the code and program output below:
package com.yoyzhou.example;

import java.io.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.util.StringUtils;

/**
 * Demos how many bytes each built-in Writable type takes and what
 * their byte sequences look like.
 *
 * @author yoyzhou
 */
public class WritableBytesLengthDemo {

    public static void main(String[] args) throws IOException {

        // One billion represented by different Writable objects
        IntWritable int_b = new IntWritable(1000000000);
        LongWritable long_b = new LongWritable(1000000000);
        VIntWritable vint_b = new VIntWritable(1000000000);
        VLongWritable vlong_b = new VLongWritable(1000000000);

        // Serialize the Writable objects to byte arrays
        byte[] bs_int_b = serialize(int_b);
        byte[] bs_long_b = serialize(long_b);
        byte[] bs_vint_b = serialize(vint_b);
        byte[] bs_vlong_b = serialize(vlong_b);

        // Print each byte array as a hex string together with its length
        String hex = StringUtils.byteToHexString(bs_int_b);
        formatPrint("IntWritable", "1,000,000,000", hex, bs_int_b.length);

        hex = StringUtils.byteToHexString(bs_long_b);
        formatPrint("LongWritable", "1,000,000,000", hex, bs_long_b.length);

        hex = StringUtils.byteToHexString(bs_vint_b);
        formatPrint("VIntWritable", "1,000,000,000", hex, bs_vint_b.length);

        hex = StringUtils.byteToHexString(bs_vlong_b);
        formatPrint("VLongWritable", "1,000,000,000", hex, bs_vlong_b.length);
    }

    private static void formatPrint(String type, String param, String hex, int length) {
        String format = "%1$-50s%2$-16s with length:%3$2d%n";
        System.out.format(format, "Byte array per " + type
                + "(" + param + ") is:", hex, length);
    }

    /**
     * Utility to serialize a Writable object; returns the byte array
     * representing the Writable object.
     */
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    /**
     * Utility to deserialize the input byte array into the given
     * Writable object, which is then returned.
     */
    public static Writable deserialize(Writable writable, byte[] bytes)
            throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);
        dataIn.close();
        return writable;
    }
}
Program output:
Byte array per IntWritable(1,000,000,000) is: \
3b9aca00 with length: 4
Byte array per LongWritable(1,000,000,000) is: \
000000003b9aca00 with length: 8
Byte array per VIntWritable(1,000,000,000) is: \
8c3b9aca00 with length: 5
Byte array per VLongWritable(1,000,000,000) is: \
8c3b9aca00 with length: 5
From the output above we can see:
+ Different Writable types use different numbers of bytes to represent 1,000,000,000.
+ A variable-length Writable type is not always more space-efficient than a fixed-length one. When a value needs all 4 bytes of an IntWritable, or all 8 bytes of a LongWritable, the corresponding variable-length Writable needs one extra byte to hold the sign and the byte length. So, coming back to the earlier question of which integer type to choose: to pick the most suitable integer Writable you need some understanding of the overall distribution of the values.
Byte sequence of Text
It is easy to think of the Text class as the Writable equivalent of java.lang.String, but note that Text encodes Unicode characters as UTF-8, not as the UTF-16 used by Java's String and char types.
Java's char type uses UTF-16 [1] and follows version 4 of the Unicode Standard. Each character in the Basic Multilingual Plane (BMP, code points U+0000~U+FFFF) is encoded as a single 16-bit (two-byte) unit; supplementary characters, whose code points lie above the BMP, are represented by a pair of surrogate characters.
The Text class uses UTF-8, a variable-length encoding that takes 1~4 bytes per character: ASCII characters need only 1 byte, while high-ASCII and other multibyte characters need 2~4 bytes. I believe Hadoop's designers chose UTF-8 over String's UTF-16 for this reason, to save bytes.
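The size difference between the two encodings is easy to measure with the standard library alone. The following sketch uses a hypothetical class name (EncodingLengthSketch) and a few sample strings of my own choosing; UTF_16BE is used so that no byte-order mark is counted.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch comparing UTF-8 and UTF-16 byte lengths of the same string.
public class EncodingLengthSketch {

    /** Bytes needed to store the string as UTF-8 (the encoding Text uses). */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed to store the string as UTF-16, without a byte-order mark. */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "my text",                   // ASCII only
            "\u00e9",                    // U+00E9, a high-ASCII character
            "\u6211\u7684\u6587\u672c",  // four CJK characters in the BMP
            "\uD834\uDD1E"               // U+1D11E, a supplementary character
        };
        for (String s : samples) {
            System.out.printf("%d chars: UTF-8=%d bytes, UTF-16=%d bytes%n",
                    s.length(), utf8Length(s), utf16Length(s));
        }
    }
}
```

Note that the saving applies to ASCII-heavy data: the ASCII sample needs 7 bytes in UTF-8 against 14 in UTF-16, whereas the CJK sample needs 12 bytes in UTF-8 against 8 in UTF-16.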
Because Text uses UTF-8, it does not offer as many operations as String, and you must keep the differences in mind when manipulating Text objects, for example when indexing or iterating. Where possible, we recommend converting the Text object to a String first and then doing the operation on the String.
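The indexing difference comes from the unit of measurement: String positions count UTF-16 chars, whereas positions in UTF-8 data count bytes (as far as I understand the Hadoop API documentation, Text positions are such byte offsets, but treat that as an assumption). The plain-Java sketch below, with hypothetical names (ByteOffsetSketch, charIndex, utf8ByteOffset), shows how the two numbers diverge for the same search.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the same match has different positions in chars and in UTF-8 bytes.
public class ByteOffsetSketch {

    /** Position of needle in haystack, counted in UTF-16 chars (String semantics). */
    static int charIndex(String haystack, String needle) {
        return haystack.indexOf(needle);
    }

    /** Position of needle in haystack, counted in UTF-8 bytes; -1 if absent. */
    static int utf8ByteOffset(String haystack, String needle) {
        int idx = haystack.indexOf(needle);
        if (idx < 0) {
            return -1;
        }
        // The byte offset equals the UTF-8 length of everything before the match.
        return haystack.substring(0, idx).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u6211\u7684 text"; // two CJK characters, a space, then "text"
        System.out.println("char index:  " + charIndex(s, "text"));
        System.out.println("byte offset: " + utf8ByteOffset(s, "text"));
    }
}
```

Here the match starts at char index 3 but at byte offset 7, because each of the two CJK characters takes 3 bytes in UTF-8.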
The byte sequence of a Text object is a VIntWritable followed by a UTF-8 byte stream: the VIntWritable holds the length in bytes of the UTF-8 encoded text, and the UTF-8 byte array holds the text itself. See the following code fragment:
... // omitted for conciseness
Text myText = new Text("my text");
byte[] text_bs = serialize(myText);
hex = StringUtils.byteToHexString(text_bs);
formatPrint("Text", "\"my text\"", hex, text_bs.length);

Text myText2 = new Text("我的文本");
byte[] text2_bs = serialize(myText2);
hex = StringUtils.byteToHexString(text2_bs);
formatPrint("Text", "\"我的文本\"", hex, text2_bs.length);
...
Program output:
Byte array per Text("my text") is: \
076d792074657874 with length: 8
Byte array per Text("我的文本") is: \
0ce68891e79a84e69687e69cac with length:13
In the output above, the first byte gives the length in bytes of the text that follows: the UTF-8 encoding of "my text" occupies 7 bytes (07), while the Chinese text "我的文本" ("my text") occupies 12 bytes (0c).
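The layout described above can be reproduced without Hadoop for short strings. The sketch below uses hypothetical names (TextLayoutSketch, layout, toHex) and deliberately covers only texts of at most 127 UTF-8 bytes, where the variable-length prefix is a single byte; longer texts would need a multi-byte prefix, which this sketch does not attempt.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, Hadoop-free sketch of the "length prefix + UTF-8 bytes" layout.
public class TextLayoutSketch {

    /** Builds a one-byte length prefix followed by the UTF-8 bytes of the string. */
    static byte[] layout(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            // Assumption of this sketch: only the single-byte prefix case is handled.
            throw new IllegalArgumentException("sketch handles at most 127 UTF-8 bytes");
        }
        byte[] out = new byte[utf8.length + 1];
        out[0] = (byte) utf8.length;                     // length in bytes, not in characters
        System.arraycopy(utf8, 0, out, 1, utf8.length);  // the text itself
        return out;
    }

    /** Renders bytes as a lowercase hex string. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(layout("my text")));
        System.out.println(toHex(layout("\u6211\u7684\u6587\u672c")));
    }
}
```

Under that assumption the two lines printed are 076d792074657874 and 0ce68891e79a84e69687e69cac, the same hex strings as in the program output above.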
Byte sequence of a custom Writable class
In this section we use the MyWritable class from the previous article as an illustration. MyWritable is a custom Writable type composed of two VLongWritable fields.
Program output:
Byte array per MyWritable(1000, 1000000000) is: \
8e03e88c3b9aca00 with length: 8
The output shows clearly that the byte sequence of a custom Writable class is simply the concatenation of the byte sequences of its basic Writable fields. In the output "8e03e88c3b9aca00", the first three bytes, "8e03e8", are the VLongWritable byte sequence for 1000, and "8c3b9aca00" is the VLongWritable byte sequence for 1000000000. This follows from the write method of the MyWritable class we wrote:
... // omitted for conciseness
@Override
public void write(DataOutput out) throws IOException {
    // Each field writes its own VLongWritable byte sequence, in order
    field1.write(out);
    field2.write(out);
}
...
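Reading the sequence back works the same way in reverse: each marker byte says how many payload bytes belong to the current field, so the fields can be split apart without any separator. Here is a minimal plain-Java sketch of that decoding; the names (VLongDecodingSketch, decodeAll, fromHex) are hypothetical and the logic is my reading of the zero-compressed format, not code taken from Hadoop or from MyWritable.

```java
import java.util.Arrays;

// Hypothetical, Hadoop-free sketch that splits a byte sequence into zero-compressed longs.
public class VLongDecodingSketch {

    /** Decodes every zero-compressed long in the array, in order. */
    static long[] decodeAll(byte[] bytes) {
        long[] values = new long[bytes.length]; // upper bound: one value per byte
        int count = 0;
        int pos = 0;
        while (pos < bytes.length) {
            byte first = bytes[pos++];
            if (first >= -112) {
                // Single-byte case: the byte is the value itself.
                values[count++] = first;
                continue;
            }
            // Markers below -120 flag a negative value stored as one's complement.
            boolean negative = first < -120;
            int payloadLength = negative ? -(first + 120) : -(first + 112);
            long value = 0;
            // Truncated input would fail here with an index error; the sketch does not guard it.
            for (int i = 0; i < payloadLength; i++) {
                value = (value << 8) | (bytes[pos++] & 0xFF);
            }
            values[count++] = negative ? ~value : value;
        }
        return Arrays.copyOf(values, count);
    }

    /** Parses a hex string such as "8e03e8" into bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] fields = decodeAll(fromHex("8e03e88c3b9aca00"));
        System.out.println(Arrays.toString(fields));
    }
}
```

Applied to the hex string from the program output, the sketch yields the two field values 1000 and 1000000000.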
Summary
This article used examples to show how many bytes the Hadoop Writable classes occupy when serialized, and analyzed the structure of the resulting byte sequences. Note that the Text class uses UTF-8, chosen to save space, rather than the UTF-16 of Java's char type, and that the byte sequence of a custom Writable is determined by the write() method of that class.
Finally, Writable is the core of Hadoop serialization. Understanding the byte length and byte sequence of the Hadoop Writables is essential for choosing the right Writable type and for manipulating Writable objects at the byte level.