Serialization of Hadoop files


Contents

1. Why serialize?

2. What is serialization?

3. Why not use Java serialization?

4. Why is serialization important for Hadoop?

5. What serialization-related interfaces are defined in Hadoop?

6. Hadoop Custom Writable interface

1. Why serialize?

In general, "live" objects exist only in memory and are gone once the machine is powered off. Moreover, a "live" object can only be used by the local process; it cannot be sent over the network to another computer. Serialization, however, lets us store "live" objects to disk and send "live" objects to a remote computer.

2. What is serialization?

Serialization is the conversion of an object (instance) into a byte stream (an array of bytes); deserialization is the inverse process of turning a byte stream back into an object. So, if you want to store a "live" object in a file, you store this string of bytes; if you want to send a "live" object to a remote host, you send this string of bytes; when the object is needed again, you deserialize the bytes and "revive" the object.

Serializing an object to a file is referred to as "persistence"; serializing an object and sending it to a remote computer is referred to as "data communication".
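As a concrete illustration, the following minimal sketch uses plain Java serialization (java.io.ObjectOutputStream) to persist an object to a file and then "revive" it. The Person class and the file name person.ser are only illustrative.

import java.io.*;

// A small demo type; Serializable marks it as eligible for Java serialization.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int age;
    Person(String name, int age) { this.name = name; this.age = age; }
}

public class SerializeDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Person p = new Person("Tom", 20);

        // Serialize: object -> byte stream -> file ("persistence")
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("person.ser"))) {
            out.writeObject(p);
        }

        // Deserialize: file -> byte stream -> object ("revive" the object)
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("person.ser"))) {
            Person revived = (Person) in.readObject();
            System.out.println(revived.name + ", " + revived.age);
        }
    }
}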

3. Why not use Java serialization?

The drawback of Java's built-in serialization mechanism is that it is computationally expensive and its output is bulky, sometimes several times or even ten times the size of the object itself. Its reference mechanism also causes problems where large serialized files cannot be split. These drawbacks make Java serialization unsuitable for Hadoop, so Hadoop designed its own serialization mechanism.

4. Why is serialization important for Hadoop?

Hadoop needs serialization whenever nodes in a cluster communicate or make RPC calls, and that serialization has to be fast, compact, and light on bandwidth. That is why it is worth understanding Hadoop's serialization mechanism.

Serialization and deserialization appear in two main places in distributed data processing: interprocess communication and persistent storage. In Hadoop, communication between nodes is implemented with remote procedure calls (RPC), and RPC serialization needs the following characteristics:

    • Compact: a compact format makes the best use of network bandwidth, the scarcest resource in a data center.
    • Fast: interprocess communication forms the backbone of a distributed system, so it is essential to minimize the performance overhead of serialization and deserialization.
    • Extensible: protocols change over time to meet new requirements, so it must be possible to evolve them in a controlled way; clients and servers should be able to introduce new protocol messages while the existing serialization format continues to work.
    • Interoperable: clients and servers written in different languages should be able to interact.

5. What serialization-related interfaces are defined in Hadoop?

Hadoop defines two serialization-related interfaces: the Writable interface and the Comparable interface, which are combined into a single interface, WritableComparable.
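For reference, WritableComparable simply extends both interfaces; its declaration in org.apache.hadoop.io looks roughly like this (annotations omitted):

package org.apache.hadoop.io;

// A type that can both be serialized by Hadoop (Writable)
// and compared for sorting (Comparable)
public interface WritableComparable<T> extends Writable, Comparable<T> {
}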

Let's look at each of these interfaces:

    • Writable interface

Every class that implements the Writable interface can be serialized and deserialized. The Writable interface defines two methods: write(DataOutput out) and readFields(DataInput in). write() writes the object's state to the DataOutput stream in binary form, and readFields() reads the object's state back from the DataInput stream in binary form.

package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

public interface Writable {
    /**
     * Convert the object to a byte stream and write it to the output stream out
     */
    void write(DataOutput out) throws IOException;

    /**
     * Read a byte stream from the input stream in to deserialize the object
     */
    void readFields(DataInput in) throws IOException;
}

What can we do with a concrete Writable?

There are two common operations, assignment and retrieval. We will use IntWritable as an example (IntWritable is Hadoop's wrapper for Java's int type).

1) Set the value of an IntWritable with the set() method:

IntWritable value = new IntWritable();

value.set(588);

Similarly, you can assign the value through the constructor:

IntWritable value = new IntWritable(588);

2) Get the value of an IntWritable with the get() method:

int result = value.get();   // the value obtained here is 588
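Putting set() and get() together, here is a minimal round-trip sketch (assuming the Hadoop client libraries are on the classpath) that serializes an IntWritable into a byte array and deserializes it back:

import java.io.*;
import org.apache.hadoop.io.IntWritable;

public class IntWritableRoundTrip {
    public static void main(String[] args) throws IOException {
        IntWritable value = new IntWritable(588);

        // Serialize: write the object state into a byte array via DataOutput
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            value.write(out);
        }

        // Deserialize: read the object state back from the byte array via DataInput
        IntWritable restored = new IntWritable();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            restored.readFields(in);
        }

        System.out.println(restored.get());   // prints 588
    }
}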

    • Comparable interface

Any class that implements Comparable can be compared with objects of the same type. The interface is defined as:

package java.lang;

import java.util.*;

public interface Comparable<T> {
    /**
     * Compare this object with the object o. Convention: a negative value means
     * less than, zero means equal to, and a positive value means greater than.
     */
    public int compareTo(T o);
}

6. Hadoop Custom Writable interface

Hadoop comes with a series of Writable implementations, such as IntWritable and LongWritable, which cover simple data types. Sometimes, however, complex data types require a custom implementation. By defining a custom Writable, you gain full control over the binary representation and the sort order.

The Writable implementations that come with Hadoop are already well optimized, but when dealing with more complex structures it is usually better to create a new Writable type rather than stretch the existing ones. Let's see how to define a custom Writable type, taking a type called TextPair as an example, shown below.

import java.io.*;

import org.apache.hadoop.io.*;

/**
 * @ProjectName Serialize
 * @ClassName TextPair
 * @Description Custom Writable type TextPair
 * @Author Liu Jishu
 * @Date 2016-04-16 23:59:19
 */
public class TextPair implements WritableComparable<TextPair> {
    // instance variable of type Text
    private Text first;
    // instance variable of type Text
    private Text second;

    public TextPair() {
        set(new Text(), new Text());
    }

    public TextPair(String first, String second) {
        set(new Text(first), new Text(second));
    }

    public TextPair(Text first, Text second) {
        set(first, second);
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    @Override
    // convert the object to a byte stream and write it to the output stream out
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    // read a byte stream from the input stream in to deserialize the object
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof TextPair) {
            TextPair tp = (TextPair) o;
            return first.equals(tp.first) && second.equals(tp.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }

    // sort
    @Override
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }
}

The TextPair object has two Text instance variables (first and second), the related constructors, get methods, and a set method. All Writable implementations must have a default constructor so that the MapReduce framework can instantiate them and then call readFields() to populate their fields. Writable instances are mutable and are often reused, so you should avoid allocating objects inside the write() or readFields() methods.

TextPair's write() method serializes each Text object into the output stream in turn by delegating to the Text objects themselves. Likewise, readFields() deserializes the bytes from the input stream by delegating to each Text object. The DataOutput and DataInput interfaces offer a rich set of methods for serializing and deserializing Java primitives, so in general you have full control over the wire format of a Writable object.
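A minimal round-trip sketch for the TextPair class above (again assuming the Hadoop client libraries are available) shows that readFields() reconstructs exactly what write() produced:

import java.io.*;

public class TextPairRoundTrip {
    public static void main(String[] args) throws IOException {
        TextPair pair = new TextPair("hadoop", "serialization");

        // write() delegates to each Text field, producing the binary representation
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            pair.write(out);
        }

        // readFields() rebuilds the object from exactly the same byte sequence
        TextPair restored = new TextPair();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            restored.readFields(in);
        }

        System.out.println(restored);   // hadoop<TAB>serialization
    }
}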

As with any value object you write in Java, the java.lang.Object methods hashCode(), equals(), and toString() are overridden. HashPartitioner, the default partitioner in MapReduce, uses hashCode() to choose a reduce partition, so be sure to write a good hash function so that the reduce partitions end up roughly equal in size.

TextPair is an implementation of WritableComparable, so it provides a compareTo() method that imposes the ordering we want: first by the first string and then, if those are equal, by the second.
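Because compareTo() is defined, TextPair objects can be sorted directly, for example with Collections.sort(); a small illustrative sketch:

import java.util.*;

public class TextPairSortDemo {
    public static void main(String[] args) {
        List<TextPair> pairs = new ArrayList<>(Arrays.asList(
                new TextPair("b", "2"),
                new TextPair("a", "2"),
                new TextPair("a", "1")));

        // compareTo() orders by the first field, then by the second
        Collections.sort(pairs);

        for (TextPair p : pairs) {
            System.out.println(p);   // prints: a 1, then a 2, then b 2 (tab-separated)
        }
    }
}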

