Serialization and Writable classes in Hadoop


Original article: http://www.cnblogs.com/archimedes/p/hadoop-writable-class.html. Please cite this address when reprinting.

The org.apache.hadoop.io package in Hadoop comes with a wide range of Writable classes to choose from, organized in a class hierarchy (figure not reproduced here).

Writable wrappers for Java primitive types

The Writable classes wrap the Java primitive types. Every primitive type except char has a wrapper (early Hadoop releases also lacked one for short; a char or short value can be stored in an IntWritable). Each wrapper has get() and set() methods for reading and setting the wrapped value.

Table: Writable classes for Java primitive types (table not reproduced here)

Java primitive types

Every primitive type except char has a corresponding Writable class, and the wrapped value is read with get() and changed with set(). IntWritable and LongWritable also have variable-length counterparts, VIntWritable and VLongWritable. Choosing between fixed-length and variable-length encoding is similar to choosing between CHAR and VARCHAR in a database, so it is not discussed further here.
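To make the fixed-versus-variable trade-off concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that estimates how many bytes a variable-length encoding needs for a given value. The names VarLenSizeSketch and encodedSize() are hypothetical, and the size rule (one byte for values from -112 to 127, otherwise one marker byte plus the fewest payload bytes that fit) is an assumption modelled on the 1 to 5 byte and 1 to 9 byte ranges usually quoted for VIntWritable and VLongWritable. Check it against your Hadoop version before relying on it.

```java
/** Sketch: bytes needed by an assumed variable-length integer encoding. */
final class VarLenSizeSketch {

    /**
     * Size in bytes under the assumed scheme: one byte for small values,
     * otherwise one marker byte plus the fewest payload bytes that fit.
     */
    static int encodedSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1; // small values fit in a single byte
        }
        if (value < 0) {
            value = ~value; // ones' complement, so only magnitude bits are counted
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(value);
        return (dataBits + 7) / 8 + 1; // payload bytes plus the marker byte
    }

    public static void main(String[] args) {
        long[] samples = {1L, 127L, 128L, 163L, Integer.MAX_VALUE, Long.MAX_VALUE};
        for (long v : samples) {
            System.out.println(v + " -> " + encodedSize(v) + " byte(s)");
        }
    }
}
```

A fixed-length IntWritable always takes 4 bytes, so the variable-length form only pays off when most values are small.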

Text type

Text stores the length of its byte content as a variable-length int, so a Text value can hold at most 2 GB.

Text uses standard UTF-8 encoding, so it interoperates well with other tools that understand UTF-8. Note, however, that this makes it behave quite differently from the Java String type.

Differences in indexing

Text.charAt() returns an int holding a Unicode code point, and its argument is a byte offset into the UTF-8 encoding. String.charAt(), by contrast, returns a char (a UTF-16 code unit).

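    public void testTextIndex() {
        Text text = new Text("hadoop");
        // Length in bytes and size of the backing array (both 6 for this ASCII string)
        Assert.assertEquals(text.getLength(), 6);
        Assert.assertEquals(text.getBytes().length, 6);
        // charAt() takes a byte offset and returns an int code point
        Assert.assertEquals(text.charAt(2), (int) 'd');
        // An offset past the end yields -1 (100 is an arbitrary out-of-range offset)
        Assert.assertEquals("Out of bounds", text.charAt(100), -1);
    }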

Text also has a find() method, similar to String's indexOf():

    public void testTextFind() {
        Text text = new Text("hadoop");
        // find() returns the byte offset of the first match, or -1
        Assert.assertEquals("Find a substring", text.find("do"), 2);
        Assert.assertEquals("Find first 'o'", text.find("o"), 3);
        Assert.assertEquals("Find 'o' from position 4 or later", text.find("o", 4), 4);
        Assert.assertEquals("No match", text.find("pig"), -1);
    }

Differences with Unicode

When a character's UTF-8 encoding takes more than one byte, the difference between Text and String becomes clearer, because String is indexed in Java chars while Text is indexed in bytes. Consider four Unicode characters whose UTF-8 encodings take 1 to 4 bytes.

The four characters U+0041, U+00DF, U+6771 and U+10400 occupy 1, 2, 3 and 4 bytes in UTF-8 respectively. In Java, U+10400 takes two chars (a surrogate pair), while each of the first three takes one char.
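These widths can be checked with the JDK alone. The following is a small sketch; the class Utf8WidthSketch and its two helper methods are hypothetical names introduced only for this illustration.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: how many UTF-8 bytes and Java chars a single code point needs. */
final class Utf8WidthSketch {

    /** Number of bytes in the UTF-8 encoding of one code point. */
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Java chars (UTF-16 code units) for one code point. */
    static int javaChars(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d Java char(s)%n",
                    cp, utf8Bytes(cp), javaChars(cp));
        }
    }
}
```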

The following code shows the difference between String and Text.

    import java.io.*;
    import org.apache.hadoop.io.*;
    import junit.framework.Assert;

    public class TextAndString {

        // String: lengths and offsets are counted in Java chars
        public static void string() throws UnsupportedEncodingException {
            String str = "\u0041\u00DF\u6771\uD801\uDC00";
            Assert.assertEquals(str.length(), 5);
            Assert.assertEquals(str.getBytes("UTF-8").length, 10);

            Assert.assertEquals(str.indexOf("\u0041"), 0);
            Assert.assertEquals(str.indexOf("\u00DF"), 1);
            Assert.assertEquals(str.indexOf("\u6771"), 2);
            Assert.assertEquals(str.indexOf("\uD801\uDC00"), 3);

            Assert.assertEquals(str.charAt(0), '\u0041');
            Assert.assertEquals(str.charAt(1), '\u00DF');
            Assert.assertEquals(str.charAt(2), '\u6771');
            Assert.assertEquals(str.charAt(3), '\uD801');
            Assert.assertEquals(str.charAt(4), '\uDC00');

            Assert.assertEquals(str.codePointAt(0), 0x0041);
            Assert.assertEquals(str.codePointAt(1), 0x00DF);
            Assert.assertEquals(str.codePointAt(2), 0x6771);
            Assert.assertEquals(str.codePointAt(3), 0x10400);
        }

        // Text: lengths and offsets are counted in UTF-8 bytes
        public static void text() {
            Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
            Assert.assertEquals(text.getLength(), 10);

            Assert.assertEquals(text.find("\u0041"), 0);
            Assert.assertEquals(text.find("\u00DF"), 1);
            Assert.assertEquals(text.find("\u6771"), 3);
            Assert.assertEquals(text.find("\uD801\uDC00"), 6);

            Assert.assertEquals(text.charAt(0), 0x0041);
            Assert.assertEquals(text.charAt(1), 0x00DF);
            Assert.assertEquals(text.charAt(3), 0x6771);
            Assert.assertEquals(text.charAt(6), 0x10400);
        }

        public static void main(String[] args) {
            text();
            try {
                string();
            } catch (UnsupportedEncodingException ex) {
                // not expected: every Java platform supports UTF-8
            }
        }
    }

The comparison makes the differences clear:

1. String.length() returns the number of chars, whereas Text.getLength() returns the number of bytes in the UTF-8 encoding.

2. String.indexOf() returns an offset in chars, whereas Text.find() returns an offset in bytes.

3. String.charAt() returns the Java char at the given index, which is not necessarily a whole Unicode character.

4. String.codePointAt() is similar to Text.charAt() in that both return a code point as an int, but the former is indexed by char offset and the latter by byte offset.
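The char-versus-byte distinction in the points above can be reproduced with the JDK alone, since a Text is essentially the UTF-8 bytes of a string. The following is a minimal sketch; ByteOffsetSketch and byteOffsetOf() are hypothetical names, and the helper only imitates the byte offsets that the article reports for Text.find(). It does not call Hadoop.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: char offsets (String) versus UTF-8 byte offsets (Text-like). */
final class ByteOffsetSketch {

    /**
     * Byte offset of the first occurrence of needle within the UTF-8
     * encoding of haystack, or -1 when there is no match.
     */
    static int byteOffsetOf(String haystack, String needle) {
        int charIndex = haystack.indexOf(needle);
        if (charIndex < 0) {
            return -1;
        }
        // The byte offset equals the UTF-8 length of everything before the match.
        String prefix = haystack.substring(0, charIndex);
        return prefix.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        System.out.println("chars: " + s.length());
        System.out.println("bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("char offset of U+6771: " + s.indexOf("\u6771"));
        System.out.println("byte offset of U+6771: " + byteOffsetOf(s, "\u6771"));
    }
}
```

The helper assumes well-formed input whose search string starts on a character boundary; it is meant to show where the two kinds of offsets diverge, not to replace Text.find().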

Iterating over Text

Iterating over the Unicode characters in a Text is more involved, because you cannot simply increment an index: the number of bytes per character varies. First wrap the bytes of the Text object in a java.nio.ByteBuffer, then repeatedly call the static Text.bytesToCodePoint() method on the buffer. Each call reads the next code point, returns it as an int, and advances the position in the buffer. The end of the string is reached when bytesToCodePoint() returns -1. Take a look at the sample code:

    import java.nio.ByteBuffer;
    import org.apache.hadoop.io.Text;

    public class TextAndString {

        public static void main(String[] args) {
            Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
            // Wrap only the valid bytes of the Text
            ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
            int cp;
            // bytesToCodePoint() advances the buffer and returns -1 at the end
            while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
                System.out.println(Integer.toHexString(cp));
            }
        }
    }

Output:

41
df
6771
10400
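To see what such a loop has to do at each step, here is a minimal hand-written decoder in plain Java. It is only a sketch of the idea, not Hadoop's implementation: the names CodePointIterationSketch, nextCodePoint() and hexCodePoints() are hypothetical, and the decoder assumes well-formed UTF-8 and performs no validation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch: walking a buffer of UTF-8 bytes one code point at a time. */
final class CodePointIterationSketch {

    /** Reads one code point and advances the buffer; returns -1 at the end. */
    static int nextCodePoint(ByteBuffer buf) {
        if (!buf.hasRemaining()) {
            return -1;
        }
        int lead = buf.get() & 0xFF;
        int extra;     // number of continuation bytes that follow the lead byte
        int codePoint; // payload bits taken from the lead byte
        if (lead < 0x80) {
            extra = 0;
            codePoint = lead;
        } else if (lead < 0xE0) {
            extra = 1;
            codePoint = lead & 0x1F;
        } else if (lead < 0xF0) {
            extra = 2;
            codePoint = lead & 0x0F;
        } else {
            extra = 3;
            codePoint = lead & 0x07;
        }
        for (int i = 0; i < extra; i++) {
            // each continuation byte contributes its low six bits
            codePoint = (codePoint << 6) | (buf.get() & 0x3F);
        }
        return codePoint;
    }

    /** Decodes every code point of s from its UTF-8 bytes, as hex strings. */
    static List<String> hexCodePoints(String s) {
        ByteBuffer buf = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
        List<String> result = new ArrayList<>();
        int cp;
        while ((cp = nextCodePoint(buf)) != -1) {
            result.add(Integer.toHexString(cp));
        }
        return result;
    }

    public static void main(String[] args) {
        hexCodePoints("\u0041\u00DF\u6771\uD801\uDC00").forEach(System.out::println);
    }
}
```

The lead byte tells the decoder how many bytes belong to the character, which is why the position advances by a variable amount on every step.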
