Hadoop serialization and Writable APIs (1)

Serialization

Serialization refers to converting a structured object into a byte stream, either for transmission over the network or for writing to disk for permanent storage. Deserialization is the reverse process: converting a byte stream back into a structured object.

In a distributed system, one process serializes an object into a byte stream and transmits it over the network to another process; the receiving process deserializes the byte stream back into a structured object, and in this way inter-process communication is achieved. In Hadoop, serialization and deserialization are used for communication between the Mapper, Combiner, and Reducer stages. For example, the intermediate results of the Mapper stage need to be written to the local disk, which is a serialization process (a structured object is converted into a byte stream and written to disk); in the Reducer stage, reading the Mapper's intermediate results is a deserialization process (the byte-stream files stored on disk are read and converted back into structured objects). Note that only byte streams can be transmitted over the network, so when the Mapper's intermediate results are shuffled between hosts, the objects also undergo serialization and deserialization.

Serialization is a core part of Hadoop. In Hadoop, the Writable interface in the org.apache.hadoop.io package is the implementation of Hadoop's serialization format.

Writable Interface

The Hadoop Writable interface is a serialization protocol based on DataInput and DataOutput. It is compact (it uses storage space efficiently) and fast (reading, writing, serialization, and deserialization carry little overhead). In Hadoop, keys and values must be objects that implement the Writable interface (keys must additionally implement WritableComparable so that they can be sorted).

The following is the definition of the Writable interface in Hadoop (taken from Hadoop 1.1.2):

package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {

  /**
   * Serialize the fields of this object to out.
   *
   * @param out DataOuput to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from in.
   *
   * For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.
   *
   * @param in DataInput to deseriablize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}

Writable Classes

Hadoop itself provides a variety of concrete Writable classes, including wrappers for the basic Java types (boolean, byte, short, int, float, long, and double) and collection-like types (BytesWritable, ArrayWritable, and MapWritable). These classes live in the org.apache.hadoop.io package.

(Figure not shown; image source: safaribooksonline.com)
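As a quick illustration, here is a minimal round-trip sketch of my own (the class name WritableRoundTrip is hypothetical; it only assumes the Hadoop libraries are on the classpath). It serializes one of the built-in types, IntWritable, to a byte buffer with write() and restores it with readFields():

package com.yoyzhou.weibo;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

/** Illustrative sketch only, not part of the original article. */
public class WritableRoundTrip {

    public static void main(String[] args) throws IOException {
        IntWritable original = new IntWritable(163);

        // Serialization: write the object's fields to a DataOutput stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialization: read the fields back into a (re-usable) object
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(restored.get());              // 163
        System.out.println(bytes.toByteArray().length);  // 4 -- IntWritable writes a fixed 4-byte int
    }
}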

Custom Writable Classes

Although Hadoop has a variety of built-in Writable classes to choose from, it also provides RawComparator implementations for the Writable wrappers of the Java basic types, so that these objects can be sorted directly on their byte streams without being deserialized first, which greatly reduces the time overhead. When we need more complex objects, however, the built-in Writable classes are no longer sufficient (note that the Writable collection types provided by Hadoop come with no such raw comparators, so they do not meet this need). In that case we need to define our own custom Writable class, especially when it is to be used as a key, in order to achieve compact storage and fast comparison.
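To make the raw-comparison point concrete, here is a small sketch of my own (the class name RawCompareSketch is hypothetical): it compares two IntWritable keys directly on their serialized bytes using the WritableComparator registered for IntWritable, which is essentially what the shuffle does when sorting map output without deserializing the keys.

package com.yoyzhou.weibo;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

/** Illustrative sketch only, not part of the original article. */
public class RawCompareSketch {

    // Helper: serialize a Writable into a byte array
    private static byte[] toBytes(IntWritable w) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        w.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] a = toBytes(new IntWritable(5));
        byte[] b = toBytes(new IntWritable(42));

        // The comparator registered for IntWritable works on the raw byte representation
        WritableComparator comparator = WritableComparator.get(IntWritable.class);
        int result = comparator.compare(a, 0, a.length, b, 0, b.length);

        System.out.println(result < 0);  // true: 5 sorts before 42, with no deserialization
    }
}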

The following example shows how to write a custom Writable class. A custom Writable must first implement the Writable (or WritableComparable) interface, and then implement the write(DataOutput out) and readFields(DataInput in) methods, which control how the custom Writable is serialized into a byte stream (write) and how it is deserialized from a byte stream back into a Writable object (readFields).

package com.yoyzhou.weibo;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.io.Writable;

/**
 * This MyWritable class demonstrates how to write a custom Writable class.
 */
public class MyWritable implements Writable {

    private VLongWritable field1;
    private VLongWritable field2;

    public MyWritable() {
        this.set(new VLongWritable(), new VLongWritable());
    }

    public MyWritable(VLongWritable fld1, VLongWritable fld2) {
        this.set(fld1, fld2);
    }

    public void set(VLongWritable fld1, VLongWritable fld2) {
        // make sure the smaller field is always put as field1
        if (fld1.get() <= fld2.get()) {
            this.field1 = fld1;
            this.field2 = fld2;
        } else {
            this.field1 = fld2;
            this.field2 = fld1;
        }
    }

    // How to write and read MyWritable fields to/from DataOutput and DataInput streams
    @Override
    public void write(DataOutput out) throws IOException {
        field1.write(out);
        field2.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        field1.readFields(in);
        field2.readFields(in);
    }

    /** Returns true if o is a MyWritable with the same values. */
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MyWritable))
            return false;
        MyWritable other = (MyWritable) o;
        return field1.equals(other.field1) && field2.equals(other.field2);
    }

    @Override
    public int hashCode() {
        return field1.hashCode() * 163 + field2.hashCode();
    }

    @Override
    public String toString() {
        return field1.toString() + "\t" + field2.toString();
    }
}
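For completeness, here is a short usage sketch of my own (the class name MyWritableDemo is hypothetical): it serializes a MyWritable to a byte buffer and reads it back, the same write/readFields round trip that Hadoop performs when spilling intermediate results to disk and reading them again.

package com.yoyzhou.weibo;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.VLongWritable;

/** Illustrative sketch only, not part of the original article. */
public class MyWritableDemo {

    public static void main(String[] args) throws IOException {
        MyWritable before = new MyWritable(new VLongWritable(100), new VLongWritable(9));

        // Serialize the object into an in-memory byte stream
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        before.write(new DataOutputStream(buffer));

        // Deserialize the bytes back into a fresh MyWritable
        MyWritable after = new MyWritable();
        after.readFields(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));

        System.out.println(after);  // prints "9\t100" -- set() stores the smaller value first
    }
}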

The next article will look at how many bytes a Writable object occupies when it is serialized into a byte stream, and how that byte sequence is composed.

References

Tom White, Hadoop: The Definitive Guide, 3rd Edition

---To Be Continued---

Original article: Hadoop serialization and Writable interface (1). Thanks to the original author for sharing.
