Brief introduction
Serialization and deserialization are the transformations between structured objects and byte streams, used mainly for interprocess communication and persistent storage.
Communication Format Requirements
Hadoop nodes communicate with each other over RPC. The RPC protocol serializes a message into a binary byte stream and sends it to the remote node, which deserializes the stream back into the original message. RPC serialization therefore needs the following properties:
1. Compact: the wire format should be as small as possible, so that it uses as little network bandwidth as possible.
2. Fast: distributed systems build high-speed links between internal processes, so serialization and deserialization must be fast and must not become the transmission bottleneck.
3. Extensible: the protocol should be able to evolve; for example, a new server can add a parameter to a call and old clients continue to work.
4. Interoperable: clients written in multiple languages should be supported.
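The extensibility point above can be made concrete with a small sketch. One common trick (illustrative here, not a Hadoop API) is to length-prefix each record: an old client parses the fields it knows and ignores any trailing fields a newer server appended.

```java
import java.io.*;

// Sketch of forward-compatible records via a length prefix.
// All names here are illustrative, not Hadoop APIs.
public class VersionedRecordDemo {

    // Newer writer: serializes an extra trailing field the old reader ignores.
    static byte[] writeNew(int id, String name, long addedField) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        DataOutputStream d = new DataOutputStream(body);
        d.writeInt(id);
        d.writeUTF(name);
        d.writeLong(addedField); // field unknown to old clients

        ByteArrayOutputStream record = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(record);
        out.writeInt(body.size()); // length prefix
        body.writeTo(out);
        return record.toByteArray();
    }

    // Old reader: parses only the fields it knows about.
    static String readOld(byte[] bytes) throws IOException {
        DataInputStream outer = new DataInputStream(new ByteArrayInputStream(bytes));
        byte[] payload = new byte[outer.readInt()];
        outer.readFully(payload); // consume the whole record
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        int id = in.readInt();
        String name = in.readUTF();
        return id + ":" + name; // trailing bytes (the new field) are simply ignored
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readOld(writeNew(7, "node-1", 42L))); // prints "7:node-1"
    }
}
```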
Storage Format Requirements
On the surface it looks as if the serialization framework might need different features for persistent storage, but in fact the same four points apply:
1. Compact: takes up less storage space
2. Fast: data can be read and written quickly
3. Extensible: data written in an old format can still be read
4. Interoperable: data can be read and written from multiple languages
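The "compact" point is easy to demonstrate: writing an int through a binary DataOutput always costs 4 bytes, while its decimal text form can cost more than twice that. A minimal sketch (names are illustrative):

```java
import java.io.*;

// Sketch: binary serialization via DataOutput keeps records compact,
// which is the "compact, less space occupied" point above.
public class CompactDemo {

    // Size of an int serialized in binary form: always 4 bytes.
    static int binarySize(int value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeInt(value);
        return buf.size();
    }

    public static void main(String[] args) throws IOException {
        int v = 123456789;
        System.out.println(binarySize(v));                // 4 bytes in binary
        System.out.println(Integer.toString(v).length()); // 9 bytes as text
    }
}
```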
Serialization Format for Hadoop
Hadoop's own serialized storage format is any class that implements the Writable interface. It satisfies only the first two points, compactness and speed; it is not easy to extend and does not work across languages.
Let's take a look at the Writable interface. It defines two methods:
1. write: writes the object's fields to a binary stream
2. readFields: reads the object's fields from a binary stream
package org.apache.hadoop.io;

public interface Writable {
    void write(java.io.DataOutput out) throws java.io.IOException;
    void readFields(java.io.DataInput in) throws java.io.IOException;
}
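Here is a minimal round-trip sketch of that contract. The interface is redeclared locally so the example runs without Hadoop on the classpath; in real code you would implement org.apache.hadoop.io.Writable, and the class IntPairWritable is a made-up example type.

```java
import java.io.*;

// Local stand-in for org.apache.hadoop.io.Writable, same two methods.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Example type: a pair of ints serialized as two 4-byte values.
class IntPairWritable implements Writable {
    int first, second;

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }
}

public class WritableDemo {
    // Serialize one object to bytes, then deserialize into a fresh object.
    static String roundTrip(int a, int b) throws IOException {
        IntPairWritable src = new IntPairWritable();
        src.first = a;
        src.second = b;

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        src.write(new DataOutputStream(buf));

        IntPairWritable dst = new IntPairWritable();
        dst.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return dst.first + "," + dst.second;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(3, 5)); // prints "3,5"
    }
}
```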