Serialization is the conversion of an in-memory object's state into a sequence of bytes for storage (persistence) or network transport.
Deserialization is the reverse: converting a received sequence of bytes, or persistent data read from disk, back into an in-memory object.
1. JDK serialization
Any class that implements the Serializable interface can be serialized and deserialized. It is important to declare the serialization version ID, serialVersionUID, which identifies the version of the class that a serialized stream was written with. For example, if different versions of a class are to remain compatible under serialization, they must all declare the same serialVersionUID.
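A minimal round-trip sketch of the point above, using only the JDK. The User class, its fields, and the helper names toBytes and fromBytes are hypothetical and chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical example class; the name and fields are illustrative only.
class User implements Serializable {
    // Keeping this value unchanged across class versions keeps old streams readable.
    private static final long serialVersionUID = 1L;

    final String name;
    final int age;

    User(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

class JdkSerializationDemo {
    // Serialize any Serializable object into a byte array.
    static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuild an object from bytes produced by toBytes.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        User copy = (User) fromBytes(toBytes(new User("alice", 30)));
        System.out.println(copy.name + " " + copy.age);
    }
}
```

If serialVersionUID were changed between writing and reading, readObject would be expected to fail with an InvalidClassException, which is why compatible versions of a class keep the same value.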
The Java serialization algorithm performs the following steps:
It outputs the class metadata for the object instance.
It recursively outputs the description of each superclass until there are no more superclasses.
Once the class metadata is complete, it outputs the actual data values of the object instance, starting from the topmost superclass.
It then recursively outputs the instance data from the top of the hierarchy to the bottom.
So Java serialization is very powerful and the serialized information is very detailed, but the serialized form is large and consumes a lot of memory.
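The cost of the class metadata can be seen in a small experiment. The sketch below writes the same integer once with JDK object serialization and once as a raw 4-byte value, then compares the sizes; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class SizeComparisonDemo {
    // Size of an Integer written with JDK object serialization (stream header and class metadata included).
    static int jdkSerializedSize(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(Integer.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Size of the same value written as a raw big-endian int, with no metadata.
    static int rawSize(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        System.out.println("JDK serialization: " + jdkSerializedSize(42) + " bytes");
        System.out.println("Raw int:           " + rawSize(42) + " bytes");
    }
}
```

The raw form is always 4 bytes; the JDK form is many times larger because it also carries the description of Integer and its superclass Number.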
2. Hadoop serialization
Compared with JDK serialization, Hadoop serialization is relatively concise. In a cluster, large volumes of information are transmitted mainly as these serialized bytes, so Hadoop's format is designed to be faster and smaller.
Features of Hadoop serialization:
1. Compact: Bandwidth is the most valuable resource for information transmission in a cluster, so message size must be kept as small as possible.
Java serialization is not flexible enough; Hadoop uses Writable to gain better control over the entire serialization process.
Java serialization preserves all of a class's dependency information, whereas Hadoop serialization does not need to.
2. Object reuse: JDK deserialization creates a new object every time, which incurs overhead. In Hadoop deserialization, the readFields method of an existing object can be called repeatedly, so a single object is reused to hold different values.
Java serialization recreates the object each time it is deserialized, and memory consumption is high. Writable objects can be reused.
3. Extensibility
Hadoop makes it easy to write your own serialization by implementing the Writable interface. Hadoop can also compare serialized byte streams directly to determine the ordering of two Writable objects.
Java cannot do this. Java's serialization mechanism saves each class's information, such as the class name, the first time an object of that class appears; later objects of the same class carry a reference to that class information, which wastes space.
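The idea of comparing serialized bytes directly, without deserializing, can be sketched with the JDK alone. This is not Hadoop's own comparator code; it only illustrates the technique under the assumption that values are non-negative ints written in big-endian order, and the names encode and compareBytes are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RawCompareDemo {
    // Encode an int as 4 big-endian bytes, the layout DataOutputStream.writeInt produces.
    static byte[] encode(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Lexicographic comparison of unsigned bytes. For non-negative big-endian ints
    // (an assumption of this sketch) the result matches numeric order.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Negative result: 5 sorts before 300, decided without rebuilding any object.
        System.out.println(compareBytes(encode(5), encode(300)));
    }
}
```

Sorting on the serialized form avoids creating an object per record, which matters when very large numbers of keys are being ordered.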
Open-source serialization frameworks such as Protocol Buffers and Avro can also be used.
Hadoop's native serialization classes need to implement an interface called Writable, which is similar to the Serializable interface.
Implementing the Writable interface requires two methods: write(DataOutput out) and readFields(DataInput in).
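A sketch of what such an implementation looks like. To keep it runnable without Hadoop on the classpath, it declares a local stand-in interface, SimpleWritable, with the same two method shapes; in a real job the class would implement org.apache.hadoop.io.Writable instead. PointWritable and the helper methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in (an assumption of this sketch) mirroring the two Writable methods.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record type holding two ints.
class PointWritable implements SimpleWritable {
    int x;
    int y;

    PointWritable() { }

    PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Only field values are written: no class name, no superclass description.
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    // Overwrites the fields of this existing instance, so it can be reused.
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }
}

class WritableDemo {
    // Write all records back to back into one byte array.
    static byte[] serialize(SimpleWritable... records) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (SimpleWritable record : records) {
                record.write(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read `count` points through a single reused instance and sum their x values.
    static long sumX(byte[] bytes, int count) {
        PointWritable reused = new PointWritable();
        long sum = 0;
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            for (int i = 0; i < count; i++) {
                reused.readFields(in);
                sum += reused.x;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new PointWritable(1, 2), new PointWritable(3, 4));
        System.out.println(bytes.length + " bytes, sum of x = " + sumX(bytes, 2));
    }
}
```

Each record occupies exactly 8 bytes, and reading any number of records allocates only one PointWritable, which is the object reuse that readFields makes possible.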
YARN serialization is built on Google Protocol Buffers. Protocol Buffers currently supports three languages, C++, Java, and Python, so at the RPC layer other languages can be used as well.
Apache Thrift and Google Protocol Buffers are also popular serialization frameworks, but their use in Hadoop is limited: they are used only for RPC and data exchange.
Hadoop serialization vs. Java serialization