Serialization principles: from Hadoop's Writable serialization framework to Java serialization
Following the previous module, the analysis moves on to the IO-related module of Hadoop. The IO system is a fairly large module. In Hadoop Common's io package it consists of two major sub-modules: a serialization module based on the Writable interface, and a compression/decompression module. I therefore plan to analyze them in two parts. Today's topic is serialization and deserialization, and it goes beyond the simple mechanics of calling write and read. Before the analysis, look at the class diagram of the io package:
In Hadoop you can use Java's built-in serialization, but it is not recommended: given the characteristics of Hadoop's distributed environment, it is better to use the serialization mechanism that Hadoop designed for itself. In Java, a class becomes serializable by implementing the Serializable interface, while in Hadoop it implements the Writable interface:

public interface Writable {
  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOutput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.
   *
   * @param in <code>DataInput</code> to deserialize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}

It defines two simple methods. Some basic data types implement this interface, such as the following:
The subtypes above are only part of the list. Essentially every basic type has a corresponding Writable form, and these default subtypes already implement the interface methods, so you can ignore the details when using them. Another notable design in this serialization framework is the WritableFactories factory class.
/** Factories for non-public writables.  Defining a factory permits {@link
 * ObjectWritable} to be able to construct instances of non-public classes. */
public class WritableFactories {
  /** Map that stores the registered Writable factories, keyed by class. */
  private static final HashMap<Class, WritableFactory> CLASS_TO_FACTORY =
    new HashMap<Class, WritableFactory>();

  private WritableFactories() {}                  // singleton
  // ...
}
The class maintains a map from classes to their factories. So what is a WritableFactory?
/** A factory for a class of Writable.
 * @see WritableFactories
 */
public interface WritableFactory {
  /** Return a new instance. */
  Writable newInstance();
}
It is simply an interface, but through such a factory you can obtain the Writable object you need. Objects created this way are used, for example, when ObjectWritable reads data back in its readFields() method. Hadoop can also integrate with other well-known serialization frameworks, such as Apache Avro, Apache Thrift, or Google Protocol Buffers.
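As a rough illustration of how a custom type fits into this framework, here is a minimal sketch. It declares local stand-ins for Writable and WritableFactory so that it compiles without Hadoop on the classpath; in a real job you would implement org.apache.hadoop.io.Writable instead. PersonWritable, toBytes and fromBytes are hypothetical names chosen for this example and are not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WritableSketch {

    /** Local stand-in with the same two methods as Hadoop's Writable (assumption: not the real interface). */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Local stand-in for Hadoop's WritableFactory. */
    interface WritableFactory {
        Writable newInstance();
    }

    /** Hypothetical type: only the field values are written, no class metadata. */
    static final class PersonWritable implements Writable {
        int age;
        float height;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(age);      // 4 bytes
            out.writeFloat(height); // 4 bytes
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read back in the order they were written.
            age = in.readInt();
            height = in.readFloat();
        }
    }

    /** Serializes a Writable into an in-memory byte array. */
    static byte[] toBytes(Writable w) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            w.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates an empty instance through the factory, then fills it from the bytes. */
    static Writable fromBytes(WritableFactory factory, byte[] bytes) {
        try {
            Writable w = factory.newInstance();
            w.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return w;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PersonWritable p = new PersonWritable();
        p.age = 22;
        p.height = 170f;
        byte[] bytes = toBytes(p);
        PersonWritable copy = (PersonWritable) fromBytes(PersonWritable::new, bytes);
        System.out.println(bytes.length + " bytes, age=" + copy.age + ", height=" + copy.height);
    }
}
```

The reader has to know the concrete type in advance, which is why a factory (here just a constructor reference) is needed to create the empty instance before readFields() is called.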
Everything above is Hadoop's own serialization mechanism. Why doesn't Hadoop use the serialization mechanism built into Java? The reason lies in Hadoop itself: as a distributed platform, Hadoop transfers data between nodes via RPC, which puts heavy demands on the network and its bandwidth. The transmitted data therefore needs to be compact, and serialization and deserialization need to be as fast as possible. Java's built-in mechanism does not meet all of these requirements. The following is an example of Java's built-in serialization:
Declare a class A and a class B (all classes in this example live in the package SerializeTest):

import java.io.Serializable;

public class A implements Serializable {
    protected int age;
}

public class B extends A {
    private static final long serialVersionUID = -5079514899384299792L;
    private float height;
    private ContainClass contain = new ContainClass();

    public void setHeight(float height) {
        this.height = height;
    }
}

There is also a ContainClass, which B holds as a field:

import java.io.Serializable;

public class ContainClass implements Serializable {
    private int value = 11;
}
Then a client class serializes an instance of B to a file:

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

/**
 * Serialization test class
 *
 * @author lyq
 */
public class Client {
    public static void main(String[] args) {
        B b = new B();
        b.age = 22;
        b.setHeight(170);

        String filePath = "D:\\test.txt";
        serialization(filePath, b);
    }

    /**
     * Serialize an object to a file.
     *
     * @param filePath path of the output file
     * @param obj the object to serialize
     * @throws RuntimeException if an error occurs
     */
    public static void serialization(String filePath, Object obj) {
        ObjectOutputStream out = null;
        try {
            out = new ObjectOutputStream(new FileOutputStream(filePath));
            out.writeObject(obj);
        } catch (FileNotFoundException e) {
            throw new RuntimeException("FileNotFoundException occurred.", e);
        } catch (IOException e) {
            throw new RuntimeException("IOException occurred.", e);
        } finally {
            // Close the stream exactly once, whether or not writing succeeded.
            if (out != null) {
                try {
                    out.close();
                } catch (IOException e) {
                    throw new RuntimeException("IOException occurred.", e);
                }
            }
        }
    }
}
The object ends up serialized to a file on my D: drive. If you open it in a plain text editor you will see little that is readable, so open it in an editor with a hexadecimal view instead. The content looks like this:
AC ED 00 05 73 72 00 0F 53 65 72 69 61 6C 69 7A 65 54 65 73 74 2E 42 B9 81 F0 1C 86 E5 0E F0 02 00 02 46 00 06 68 65 69 67 68 74 4C 00 07 63 6F 6E 74 61 69 6E 74 00 1C 4C 53 65 72 69 61 6C 69 7A 65 54 65 73 74 2F 43 6F 6E 74 61 69 6E 43 6C 61 73 73 3B 78 72 00 0F 53 65 72 69 61 6C 69 7A 65 54 65 73 74 2E 41 D7 91 1A 56 65 43 36 5D 02 00 01 49 00 03 61 67 65 78 70 00 00 00 16 43 2A 00 00 73 72 00 1A 53 65 72 69 61 6C 69 7A 65 54 65 73 74 2E 43 6F 6E 74 61 69 6E 43 6C 61 73 73 72 41 F3 7A E0 C6 43 DF 02 00 01 49 00 05 76 61 6C 75 65 78 70 00 00 00 0B
The raw bytes above may give you a headache. Next, let's look at them laid out as a two-dimensional table.
We can clearly see that the stream retains the class name of B, the information about its parent class A, and the fields of each class, so it carries quite a lot of information. I defined only a few very simple classes whose fields are just one or two numeric values, and the output is already 185 bytes; if a complex class is serialized, the output will be much larger. This is exactly why Hadoop does not adopt Java's built-in serialization mechanism. Since the analysis has come this far, though, let's understand the principles of Java's serialization mechanism thoroughly. Many people use serialization frequently without knowing what the serialized output actually contains. Continuing with the previous example, I will explain the bytes above one by one. Before interpreting them, it helps to know the process Java follows when serializing a simple object:
1. Output the class metadata of the object instance's class.
2. Recursively output the description of the parent class, until there is no parent class left.
3. Once the class metadata is complete, output the field values of the instance, starting from the topmost parent class.
4. Continue recursively from top to bottom until all values of the instance have been output.
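The four steps above can be observed without writing a file. The following is a minimal sketch that serializes an object into memory and checks where things land in the stream. Parent, Child and the helper methods are hypothetical names for this illustration, so the class names and IDs in its output differ from the A and B example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamOrderSketch {

    static class Parent implements Serializable {
        private static final long serialVersionUID = 1L;
        protected int age = 22;
    }

    static class Child extends Parent {
        private static final long serialVersionUID = 2L;
        private float height = 170f;
    }

    /** Serializes an object with Java's built-in mechanism into a byte array. */
    static byte[] serialize(Object obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Position of the first occurrence of an ASCII string in the stream, or -1. */
    static int indexOf(byte[] stream, String text) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes are preserved.
        return new String(stream, StandardCharsets.ISO_8859_1).indexOf(text);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] stream = serialize(new Child());
        System.out.println(toHex(stream));
        // Steps 1 and 2: the child's description comes first, then the parent's.
        System.out.println("child name at  " + indexOf(stream, Child.class.getName()));
        System.out.println("parent name at " + indexOf(stream, Parent.class.getName()));
        // Steps 3 and 4: the values follow all metadata, parent fields first.
        System.out.println("total bytes:   " + stream.length);
    }
}
```

In this sketch the stream ends with the parent's age followed by the child's height, which matches the top-down order of steps 3 and 4.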
With that four-step process in mind, let's analyze the output. First, here are the serialized bytes again:
AC ED 00 05 73 72 00 0F 53 65 72 69 61 6C 69 7A
65 54 65 73 74 2E 42 B9 81 F0 1C 86 E5 0E F0 02
00 02 46 00 06 68 65 69 67 68 74 4C 00 07 63 6F
6E 74 61 69 6E 74 00 1C 4C 53 65 72 69 61 6C 69
7A 65 54 65 73 74 2F 43 6F 6E 74 61 69 6E 43 6C
61 73 73 3B 78 72 00 0F 53 65 72 69 61 6C 69 7A
65 54 65 73 74 2E 41 D7 91 1A 56 65 43 36 5D 02
00 01 49 00 03 61 67 65 78 70 00 00 00 16 43 2A
00 00 73 72 00 1A 53 65 72 69 61 6C 69 7A 65 54
65 73 74 2E 43 6F 6E 74 61 69 6E 43 6C 61 73 73
72 41 F3 7A E0 C6 43 DF 02 00 01 49 00 05 76 61
6C 75 65 78 70 00 00 00 0B
The formatting of the dump can make the bytes a little hard to match up, but you can check them against the breakdown below:
AC ED: the magic number, indicating that the Java serialization protocol is used.
00 05: the version of the serialization protocol.
0x73: declares that a new object follows.
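These first five bytes can be read back with a plain DataInputStream. The sketch below is only an illustration under that assumption; StreamHeaderSketch and describeHeader are hypothetical names, and it serializes an Integer just to have some object in the stream. The JDK itself exposes the same values as constants in java.io.ObjectStreamConstants.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class StreamHeaderSketch {

    /** Serializes an object with Java's built-in mechanism into a byte array. */
    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the 2-byte magic number, the 2-byte version and the first tag byte. */
    static String describeHeader(byte[] stream) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream))) {
            short magic = in.readShort();
            short version = in.readShort();
            byte tag = in.readByte();
            return String.format("magic=%04X version=%d tag=%02X", magic, version, tag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeHeader(serialize(Integer.valueOf(22))));
        // The same values as named constants in the JDK:
        System.out.println(ObjectStreamConstants.STREAM_MAGIC == (short) 0xACED);
        System.out.println(ObjectStreamConstants.STREAM_VERSION == 5);
        System.out.println(ObjectStreamConstants.TC_OBJECT == 0x73);
    }
}
```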
The first step of the algorithm is to output the class metadata of the object's class, starting with B, the subclass of A:
0x72: marks the start of a new class description.
00 0F: the length of the class name. The value is 15 because the name SerializeTest.B has 15 characters.
53 65 72 69 61 6C 69 7A 65 54 65 73 74 2E 42: the class name, SerializeTest.B, encoded as ASCII. For example, the code of A is 65, and that of a is 65 + 32 = 97.
B9 81 F0 1C 86 E5 0E F0: the 8-byte serialVersionUID of the class. If you do not declare one, an 8-byte ID is computed automatically from the structure of the class. To tell A and B apart, I declared a serialVersionUID for B explicitly:
public class B extends A {
    private static final long serialVersionUID = -5079514899384299792L;
    // ...
0x02: a flag byte. This value indicates that the class supports serialization (it implements Serializable).
The number of fields is output next:
00 02: the number of fields declared in the class. There are two here: B has the field height and the field contain.
Next, the field information is output:
0x46: the field type, F for float (an int would be 0x49, I).
00 06: the length of the field name. Here it is 6, for height.
68 65 69 67 68 74: the field name, the characters height.
Then the other field is output, the ContainClass field contain:
0x4C: the field type, L for an object reference.
00 07: the length of the field name is 7, for contain.
63 6F 6E 74 61 69 6E: the field name, contain.
0x74: a string follows; it gives the type of the referenced object.
00 1C: the length of the string is 28.
4C 53 65 72 69 61 6C 69 7A 65 54 65 73 74 2F 43 6F 6E 74 61 69 6E 43 6C 61 73 73 3B: the standard JVM type signature, LSerializeTest/ContainClass;
0x78: the end marker of the class description of B.
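The serialVersionUID bytes in the description of B are simply the declared long value written in big-endian order, which is easy to confirm. A small sketch, assuming hypothetical classes WithExplicitId and WithDefaultId in place of the article's classes:

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

final class SerialVersionUidSketch {

    /** Declares the same ID that class B uses in the example. */
    static class WithExplicitId implements Serializable {
        private static final long serialVersionUID = -5079514899384299792L;
        float height;
    }

    /** Declares no ID, so the runtime computes one from the class structure. */
    static class WithDefaultId implements Serializable {
        int age;
    }

    /** Returns the serialVersionUID of a serializable class as 16 hex digits. */
    static String suidHex(Class<?> type) {
        long suid = ObjectStreamClass.lookup(type).getSerialVersionUID();
        return String.format("%016X", suid);
    }

    public static void main(String[] args) {
        // Matches the bytes B9 81 F0 1C 86 E5 0E F0 from the dump.
        System.out.println(suidHex(WithExplicitId.class));
        // Computed value: the same on every run for an unchanged class, not random.
        System.out.println(suidHex(WithDefaultId.class));
    }
}
```

The computed ID depends on the class name, fields and methods, which is why the value for class A in the dump would change if A were modified without declaring an explicit ID.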
The information of the parent class A is output next:
0x72: marks the start of a new class description.
00 0F: the length of the class name, 15.
53 65 72 69 61 6C 69 7A 65 54 65 73 74 2E 41: the class name, SerializeTest.A.
D7 91 1A 56 65 43 36 5D: the 8-byte serialVersionUID. A does not declare one, so it was generated automatically.
0x02: a flag byte. This value indicates that the class supports serialization.
00 01: the number of fields. There is one here, age.
0x49: the field type, I for int.
00 03: the length of the field name. Here it is 3, for age.
61 67 65: the field name, the characters age.
0x78: the end marker of the class description of A.
0x70: a null reference, indicating that A has no serializable parent class.
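The layout of a class description can be checked by walking a stream by hand. The following is a minimal sketch that handles only the simplest case, a class with primitive fields and no serializable parent; Person, walk and expect are hypothetical names, and object-typed fields such as contain (which carry an extra type string) are deliberately not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ClassDescSketch {

    static class Person implements Serializable {
        private static final long serialVersionUID = 42L;
        int age = 22;
    }

    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    private static void expect(byte actual, byte expected) throws IOException {
        if (actual != expected) {
            throw new IOException(String.format("expected tag %02X but found %02X", expected, actual));
        }
    }

    /** Walks a stream holding one object whose class has only primitive int fields. */
    static List<String> walk(byte[] stream) {
        List<String> log = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream))) {
            log.add("magic " + Integer.toHexString(in.readUnsignedShort()));
            log.add("version " + in.readUnsignedShort());
            expect(in.readByte(), ObjectStreamConstants.TC_OBJECT);       // 0x73
            expect(in.readByte(), ObjectStreamConstants.TC_CLASSDESC);    // 0x72
            log.add("class " + in.readUTF());          // 2-byte length + name
            log.add("suid " + in.readLong());          // 8-byte serialVersionUID
            log.add("flags " + in.readUnsignedByte()); // 0x02 = serializable
            int fieldCount = in.readUnsignedShort();
            for (int i = 0; i < fieldCount; i++) {
                char typeCode = (char) in.readUnsignedByte(); // e.g. 'I' for int
                log.add("field " + typeCode + " " + in.readUTF());
            }
            expect(in.readByte(), ObjectStreamConstants.TC_ENDBLOCKDATA); // 0x78
            expect(in.readByte(), ObjectStreamConstants.TC_NULL);         // 0x70
            log.add("value " + in.readInt());          // the single int field
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        walk(serialize(new Person())).forEach(System.out::println);
    }
}
```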
At this point all the class metadata has been output. The data values follow, from the top of the hierarchy down:
00 00 00 16: the value is 16*1 + 6 = 22, which is exactly the age I set.
43 2A 00 00: the value of the float field height, 170, in IEEE 754 single-precision format: sign bit 0, exponent 134 (134 - 127 = 7), and fraction bits 0x2A0000, which gives 1.328125 * 2^7 = 170.
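The correspondence between 170 and the bytes 43 2A 00 00 can be verified with the standard Float conversion methods. A short sketch, with FloatBitsSketch and its helpers being hypothetical names for this illustration:

```java
final class FloatBitsSketch {

    /** The IEEE 754 single-precision bit pattern of a float, as 8 hex digits. */
    static String floatHex(float value) {
        return String.format("%08X", Float.floatToIntBits(value));
    }

    /** The unbiased exponent stored in bits 23..30 (the bias is 127). */
    static int exponent(float value) {
        return ((Float.floatToIntBits(value) >>> 23) & 0xFF) - 127;
    }

    /** The significand 1.f reconstructed from the 23 fraction bits (normal numbers only). */
    static double significand(float value) {
        int fraction = Float.floatToIntBits(value) & 0x7FFFFF;
        return 1.0 + fraction / (double) (1 << 23);
    }

    public static void main(String[] args) {
        float height = 170f;
        System.out.println(floatHex(height));                  // the bytes in the stream
        System.out.println(exponent(height));                  // power of two
        System.out.println(significand(height));               // value in [1, 2)
        System.out.println(Float.intBitsToFloat(0x432A0000));  // decoding the bytes again
    }
}
```

Multiplying the significand by two raised to the exponent gives back the original value, which is how a reader of the stream recovers 170 from these four bytes.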
Next comes the value of the contain field. The metadata of ContainClass has not been output yet, so the same kind of description as for A and B has to be written first:
0x73: declares that a new object follows.
0x72: marks the start of a new class description.
00 1A: the length of the class name is 26.
53 65 72 69 61 6C 69 7A 65 54 65 73 74 2E 43 6F 6E 74 61 69 6E 43 6C 61 73 73: the class name, 26 bytes long, which is SerializeTest.ContainClass.
72 41 F3 7A E0 C6 43 DF: the 8-byte serialVersionUID.
0x02: a flag byte. This value indicates that the class supports serialization.
00 01: the number of fields is one, the int field value.
0x49: the field type, I for int.
00 05: the length of the field name is 5.
76 61 6C 75 65: the field name, value.
0x78: the end marker of the class description of ContainClass.
0x70: a null reference, indicating that ContainClass has no serializable parent class.
With the metadata of ContainClass fully described, the stream returns to outputting values. Only contain.value is left:
00 00 00 0B: 11 in decimal.
To summarize, the serialization process goes roughly like this:
Output the class description of B -----> output the class description of the parent class A -----> output the field values of A -----> output the field values of B -----> on reaching the contain field, find that the class description of ContainClass has not been output yet -----> output the class description of ContainClass -----> finally, output the field value of ContainClass.
Overall, the process is fairly complex, and it resembles the way the JVM's bytecode (class file) format describes a class.
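To put a number on the overhead discussed in this article, the sketch below serializes the same three values twice: once with Java's built-in mechanism and once by writing only the field values through a DataOutputStream, which is essentially what a Writable's write() method does. Sample and Contain are hypothetical stand-ins for B and ContainClass, so the exact size of the Java stream differs from the dump above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SizeComparisonSketch {

    static class Contain implements Serializable {
        private static final long serialVersionUID = 1L;
        int value = 11;
    }

    static class Sample implements Serializable {
        private static final long serialVersionUID = 2L;
        int age = 22;
        float height = 170f;
        Contain contain = new Contain();
    }

    /** Size of the stream produced by Java's built-in serialization. */
    static int javaSerializedSize(Sample s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** Size when only the field values are written, in the style of Writable.write(). */
    static int fieldsOnlySize(Sample s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(s.age);
            out.writeFloat(s.height);
            out.writeInt(s.contain.value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        Sample s = new Sample();
        System.out.println("Java serialization: " + javaSerializedSize(s) + " bytes");
        System.out.println("Field values only:  " + fieldsOnlySize(s) + " bytes");
    }
}
```

The compact form carries no class names, IDs or field names, so both sides must agree on the type and field order in advance. That trade-off is the one the Writable design accepts in exchange for smaller payloads.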