A brief talk on the Java and Hadoop serialization mechanisms

Source: Internet
Author: User

1. Serialization

Serialization is the process of converting the state of an object into a form (a byte stream) that can be stored or transmitted. During serialization, an object writes its current state to temporary or persistent storage. Later, the object can be recreated by reading, or deserializing, that state from the store.

Generally, there are three uses:

    • Persistence: objects can be stored on disk
    • Communication: objects can be transmitted over the network
    • Copy, clone: you can serialize an object to an in-memory buffer and then deserialize it to obtain a deep copy (this is also one way to break the singleton pattern); see the sketch after this list
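As an illustration of the copy use case, here is a minimal sketch of a deep copy done through in-memory serialization. The DeepCopy class and its copy method are hypothetical names introduced only for this example; the object being copied (and everything it references) must implement Serializable.

import java.io.*;

public class DeepCopy {
    // Serialize the object into an in-memory buffer, then deserialize it again;
    // the result is an independent copy of the whole object graph.
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T copy(T original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (T) in.readObject();
        }
    }
}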

2. Java serialization mechanism

To implement serialization in Java, a class only needs to implement the Serializable interface, which declares no methods:
public interface Serializable {}
To serialize an object, you simply create an ObjectOutputStream on top of an OutputStream and call its writeObject() method. During serialization, the object's class, the class signature, all non-transient and non-static member variables, and the corresponding data of all of its parent classes are written out.
Date d = new Date();
OutputStream out = new ByteArrayOutputStream();
ObjectOutputStream objOut = new ObjectOutputStream(out);
objOut.writeObject(d);
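For completeness, here is a minimal sketch of the full round trip: the Date is serialized into an in-memory byte array and then read back with ObjectInputStream.readObject(), which returns Object and therefore needs a downcast. The class name JavaSerializationDemo is made up for this example.

import java.io.*;
import java.util.Date;

public class JavaSerializationDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Date d = new Date();

        // Serialize the Date into an in-memory buffer
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream objOut = new ObjectOutputStream(buffer);
        objOut.writeObject(d);
        objOut.close();

        // Deserialize it again; readObject() returns Object, so a downcast is needed
        ObjectInputStream objIn = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        Date restored = (Date) objIn.readObject();
        objIn.close();

        System.out.println(d.equals(restored));  // true
    }
}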
If you want to serialize a primitive type, ObjectOutputStream also provides methods such as writeBoolean() and writeByte(). The reverse process is similar: call ObjectInputStream's readObject() and downcast the result to obtain the original object.

Pros: easy to implement; handles circular references and repeated references; tolerates a certain degree of change in the class definition; supports encryption and authentication.

Cons: the serialized form takes up a lot of space and the data expands; deserialization keeps creating new objects; when many objects of the same class are serialized, the class metadata (describing the class structure) is written only once, which means the resulting file cannot be split.

3. Hadoop serialization mechanism

For Hadoop, which has to store and process large-scale data, the serialization mechanism needs to achieve the following:
    • Compact: Minimize bandwidth and speed up data exchange
    • Fast processing: interprocess communication involves a great deal of data exchange and relies heavily on serialization, so the cost of serialization and deserialization must be kept low
    • Cross-language: can support data interaction between different languages, such as C++
    • Extensible: When the system protocol is upgraded and the class definition changes, the serialization mechanism needs to support these upgrades and changes
To support these features, Hadoop introduces the Writable interface. Unlike the marker interface Serializable, it requires two methods to be implemented:
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
For example, if we need a class that represents a time period, we can write:
public class StartEndDate implements Writable {
    private Date startDate;
    private Date endDate;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(startDate.getTime());
        out.writeLong(endDate.getTime());
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        startDate = new Date(in.readLong());
        endDate = new Date(in.readLong());
    }

    public Date getStartDate() {
        return startDate;
    }

    public void setStartDate(Date startDate) {
        this.startDate = startDate;
    }
}
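A quick way to exercise this class outside of a MapReduce job is to push it through a DataOutputStream and read it back from a DataInputStream. The following is a minimal sketch; the class name StartEndDateDemo is made up, and it assumes a setEndDate() setter analogous to setStartDate(), which the snippet above omits.

import java.io.*;
import java.util.Date;

public class StartEndDateDemo {
    public static void main(String[] args) throws IOException {
        StartEndDate period = new StartEndDate();
        period.setStartDate(new Date(0L));
        period.setEndDate(new Date(1000L));  // assumes a setEndDate() setter analogous to setStartDate()

        // Serialize with the Writable contract: write() onto a DataOutput
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(buffer);
        period.write(dataOut);
        dataOut.close();

        // Deserialize: create an empty instance and let readFields() fill it in
        StartEndDate restored = new StartEndDate();
        DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        restored.readFields(dataIn);

        System.out.println(restored.getStartDate());  // the epoch date written above
    }
}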

Hadoop also provides several other important interfaces.

WritableComparable: in addition to serialization, it provides comparison. The comparison is based on the values of the object's members after deserialization, which makes it relatively slow.

RawComparator: because MapReduce relies heavily on sorting by key (a custom key also needs to override the hashCode and equals methods), an optimized interface, RawComparator, is provided. It allows records to be compared directly in the data stream, without deserializing them into objects, which avoids the extra overhead of creating new objects. RawComparator is defined as follows; the compare method reads records directly from the byte arrays b1 and b2, starting at the given positions s1 and s2 and with lengths l1 and l2.
public interface RawComparator<T> extends Comparator<T> {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

WritableComparator is a general-purpose implementation of RawComparator and offers two things. First, it provides a default implementation of the raw compare() method; this default simply deserializes the keys and then compares them, so it brings no performance benefit. Second, it acts as a factory for RawComparator instances. When you want to sort (or group) by a custom key, you need to supply your own collation. For example, if StartEndDate is used as the key and records should be grouped by start time, you can define a custom grouping comparator:
class MyGrouper implements RawComparator<StartEndDate> {
    @Override
    public int compare(StartEndDate o1, StartEndDate o2) {
        return Long.compare(o1.getStartDate().getTime(), o2.getStartDate().getTime());
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // The first 8 bytes of each record hold the serialized start time,
        // so they can be compared directly without deserializing the objects.
        return WritableComparator.compareBytes(b1, s1, 8, b2, s2, 8);
    }
}
Then register it on the job:

job.setGroupingComparatorClass(MyGrouper.class);
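To illustrate the factory role mentioned above, WritableComparator.get() returns the comparator registered for a WritableComparable type. The following short sketch, assuming the standard org.apache.hadoop.io classes, looks up the built-in comparator for IntWritable and uses it both on serialized bytes and on objects; the class name FactoryDemo and the serialize helper are made up for this example.

import java.io.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

public class FactoryDemo {
    public static void main(String[] args) throws IOException {
        // WritableComparator acts as a factory: get() returns the comparator
        // registered for the given WritableComparable class (here IntWritable).
        WritableComparator comparator = WritableComparator.get(IntWritable.class);

        byte[] a = serialize(new IntWritable(3));
        byte[] b = serialize(new IntWritable(7));

        // Compare the two keys directly on their serialized bytes.
        System.out.println(comparator.compare(a, 0, a.length, b, 0, b.length) < 0);  // true

        // The same comparator can also compare deserialized objects.
        System.out.println(comparator.compare(new IntWritable(3), new IntWritable(7)) < 0);  // true
    }

    private static byte[] serialize(IntWritable value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        value.write(new DataOutputStream(buffer));
        return buffer.toByteArray();
    }
}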
It is best to override equals and hashCode as well:
@Override
public boolean equals(Object obj) {
    if (!(obj instanceof StartEndDate)) return false;
    StartEndDate s = (StartEndDate) obj;
    return startDate.getTime() == s.startDate.getTime()
            && endDate.getTime() == s.endDate.getTime();
}

@Override
public int hashCode() {
    int result = 17;  // start from any prime number
    result = 31 * result + startDate.hashCode();
    result = 31 * result + endDate.hashCode();
    return result;
}

PS: the equals and hashCode methods above should also handle the case where the member variables are null; that check still needs to be added.

References:
    • Hadoop: The Definitive Guide
    • The Insider of Hadoop Technology
    • The correct way to override hashCode: http://blog.sina.com.cn/s/blog_700aa8830101jtlf.html
    • MapReduce custom grouping: http://www.luoliang.me/index.php/archives/programminglanguage/56.html
