Original article: http://www.cnblogs.com/archimedes/p/hadoop-writable-class.html. Please cite the source address when reprinting.
The org.apache.hadoop.io package in Hadoop comes with a wide range of Writable classes to choose from, which form the class hierarchy shown:
Writable wrappers for Java primitive types
The Writable classes provide wrappers for the Java primitive types, except short and char. Every wrapper has a get() and a set() method for reading and setting the wrapped value.
Writable classes for Java primitive types
Java primitive types
Except for char, every primitive type has a corresponding Writable class whose value can be read and changed through the get and set methods. IntWritable and LongWritable also have variable-length counterparts, VIntWritable and VLongWritable. Choosing between the fixed-length and variable-length encodings is similar to choosing between CHAR and VARCHAR in a database, so it is not discussed further here.
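To make the fixed-length versus variable-length trade-off concrete, here is a minimal sketch that uses only the JDK. The class name VarIntSketch and both methods are made up for illustration, and the variable-length scheme is a generic base-128 encoding, not the wire format that VIntWritable actually uses. It only shows why small values take fewer bytes under a variable-length encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class VarIntSketch {

    // Fixed-length encoding: an int always occupies 4 bytes.
    public static byte[] fixed(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Variable-length encoding (generic base-128 scheme, an assumption for
    // illustration, NOT Hadoop's format): 7 payload bits per byte, with the
    // high bit set while more bytes follow.
    public static byte[] variable(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int rest = value;
        while ((rest & ~0x7F) != 0) {
            bytes.write((rest & 0x7F) | 0x80);
            rest >>>= 7;
        }
        bytes.write(rest);
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (int v : new int[] {1, 163, 70_000, Integer.MAX_VALUE}) {
            System.out.println(v + ": fixed=" + fixed(v).length
                    + " bytes, variable=" + variable(v).length + " bytes");
        }
    }
}
```

Small values shrink to one or two bytes, while the largest values need one byte more than the fixed encoding. The same consideration applies when choosing between the fixed and variable-length Writable classes.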
Text type
Text stores the number of bytes in its encoded string as a variable-length int, so a Text value can hold at most 2 GB.
Text uses standard UTF-8 encoding, so it interoperates well with other tools that understand UTF-8, but note that this makes it quite different from the Java String type.
Indexing differences
Text's charAt method returns an int holding a Unicode code point, rather than a char as String's charAt does.
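public void testTextIndex() {
    Text text = new Text("hadoop");
    Assert.assertEquals(6, text.getLength());
    Assert.assertEquals(6, text.getBytes().length);
    Assert.assertEquals((int) 'd', text.charAt(2));
    // Any position past the end of the data returns -1.
    Assert.assertEquals("Out of bounds", -1, text.charAt(100));
}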
Text also has a find method, similar to String's indexOf method:
public void testTextFind() {
    Text text = new Text("hadoop");
    Assert.assertEquals("Find a substring", 2, text.find("do"));
    Assert.assertEquals("Find first 'o'", 3, text.find("o"));
    Assert.assertEquals("Find 'o' from position 4 or later", 4, text.find("o", 4));
    Assert.assertEquals("No match", -1, text.find("pig"));
}
Differences in Unicode handling
Once a character needs more than one byte in UTF-8, the difference between Text and String becomes clearer, because String counts in chars while Text counts in bytes. Let's take a look at four Unicode characters that occupy 1 to 4 bytes.
The four characters U+0041, U+00DF, U+6771 and U+10400 occupy 1, 2, 3 and 4 bytes in UTF-8 respectively. In Java, U+10400 takes up two chars (a surrogate pair), while each of the first three characters takes up one char.
The code below shows the difference between String and Text.
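The byte and char counts for these four code points can be checked with the JDK alone. The following is a small sketch (the class name Utf8Widths and its helper methods are made up for illustration) that prints, for each code point, how many bytes it needs in UTF-8 and how many Java chars it needs.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {

    // Number of bytes the code point occupies when encoded as UTF-8.
    public static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Java chars (UTF-16 code units) the code point occupies.
    public static int charCount(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d Java char(s)%n",
                    cp, utf8Length(cp), charCount(cp));
        }
    }
}
```

Only U+10400 lies outside the Basic Multilingual Plane, which is why it is the only one of the four that needs two chars.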
import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;
import junit.framework.Assert;

public class TextAndString {

    // String measures length and offsets in chars (UTF-16 code units).
    public static void string() throws UnsupportedEncodingException {
        String str = "\u0041\u00DF\u6771\uD801\uDC00";
        Assert.assertEquals(5, str.length());
        Assert.assertEquals(10, str.getBytes("UTF-8").length);

        Assert.assertEquals(0, str.indexOf("\u0041"));
        Assert.assertEquals(1, str.indexOf("\u00DF"));
        Assert.assertEquals(2, str.indexOf("\u6771"));
        Assert.assertEquals(3, str.indexOf("\uD801\uDC00"));

        Assert.assertEquals('\u0041', str.charAt(0));
        Assert.assertEquals('\u00DF', str.charAt(1));
        Assert.assertEquals('\u6771', str.charAt(2));
        Assert.assertEquals('\uD801', str.charAt(3));
        Assert.assertEquals('\uDC00', str.charAt(4));

        Assert.assertEquals(0x0041, str.codePointAt(0));
        Assert.assertEquals(0x00DF, str.codePointAt(1));
        Assert.assertEquals(0x6771, str.codePointAt(2));
        Assert.assertEquals(0x10400, str.codePointAt(3));
    }

    // Text measures length and offsets in UTF-8 bytes.
    public static void text() {
        Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        Assert.assertEquals(10, text.getLength());

        Assert.assertEquals(0, text.find("\u0041"));
        Assert.assertEquals(1, text.find("\u00DF"));
        Assert.assertEquals(3, text.find("\u6771"));
        Assert.assertEquals(6, text.find("\uD801\uDC00"));

        Assert.assertEquals(0x0041, text.charAt(0));
        Assert.assertEquals(0x00DF, text.charAt(1));
        Assert.assertEquals(0x6771, text.charAt(3));
        Assert.assertEquals(0x10400, text.charAt(6));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        text();
        string();
    }
}
The comparison makes the differences clear:
1. String's length() method returns the number of chars, while Text's getLength() method returns the number of bytes.
2. String's indexOf() method returns an offset in chars, while Text's find() method returns an offset in bytes.
3. String's charAt() method does not return a whole Unicode character. It returns a single Java char, which for U+10400 is only one half of the surrogate pair.
4. String's codePointAt() is similar to Text's charAt(), but note that the former takes an offset in chars and the latter an offset in bytes.
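The relationship between the two kinds of offset can be reproduced without Hadoop. The sketch below uses only the JDK. The class name OffsetSketch and the helper byteOffset are hypothetical names introduced for illustration. It converts a char offset in a String into the corresponding UTF-8 byte offset, which is the unit that Text's find() and charAt() work in.

```java
import java.nio.charset.StandardCharsets;

public class OffsetSketch {

    // The four sample characters: U+0041, U+00DF, U+6771 and U+10400.
    public static final String SAMPLE = "\u0041\u00DF\u6771\uD801\uDC00";

    // Byte offset in the UTF-8 encoding that corresponds to a char offset.
    // charIndex is assumed to fall on a code point boundary, that is, not
    // between the two halves of a surrogate pair.
    public static int byteOffset(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Step through the string one code point at a time.
        int i = 0;
        while (i < SAMPLE.length()) {
            int cp = SAMPLE.codePointAt(i);
            System.out.printf("U+%04X: char offset %d, byte offset %d%n",
                    cp, i, byteOffset(SAMPLE, i));
            i += Character.charCount(cp);
        }
    }
}
```

For the sample string the char offsets are 0, 1, 2 and 3, while the byte offsets are 0, 1, 3 and 6, matching the values asserted for indexOf() and find() in the code above.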
Iterating over Text
Iterating over the Unicode characters in a Text is fairly complicated, because you cannot simply increment an index: the number of bytes per character varies. First wrap the bytes of the Text object in a java.nio.ByteBuffer, then repeatedly call the static bytesToCodePoint() method on Text with that buffer. Each call extracts the next code point as an int and advances the position in the buffer. The end of the string is detected when bytesToCodePoint() returns -1. Take a look at the sample code:
import java.nio.ByteBuffer;
import org.apache.hadoop.io.Text;

public class TextAndString {
    public static void main(String[] args) {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        // Wrap only the valid bytes; the backing array may be longer than getLength().
        ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
        int cp;
        // Each call decodes one code point and advances the buffer position.
        while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
Output:
41
df
6771
10400
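To see why a plain index increment is not enough, the following JDK-only sketch decodes code points from a ByteBuffer by hand. It is not Hadoop's implementation of bytesToCodePoint(). The class Utf8CodePoints and its methods are hypothetical, and the decoder assumes well-formed UTF-8 input with no validation. The point is that the lead byte of each sequence determines how many bytes must be consumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf8CodePoints {

    public static final String SAMPLE = "\u0041\u00DF\u6771\uD801\uDC00";

    // Decodes the next code point and advances the buffer position.
    // Returns -1 when the buffer is exhausted. Assumes well-formed UTF-8.
    public static int nextCodePoint(ByteBuffer buf) {
        if (!buf.hasRemaining()) {
            return -1;
        }
        int lead = buf.get() & 0xFF;
        int extra; // number of continuation bytes that follow the lead byte
        int cp;
        if (lead < 0x80) {
            extra = 0; cp = lead;          // 1-byte sequence
        } else if (lead < 0xE0) {
            extra = 1; cp = lead & 0x1F;   // 2-byte sequence
        } else if (lead < 0xF0) {
            extra = 2; cp = lead & 0x0F;   // 3-byte sequence
        } else {
            extra = 3; cp = lead & 0x07;   // 4-byte sequence
        }
        for (int i = 0; i < extra; i++) {
            cp = (cp << 6) | (buf.get() & 0x3F);
        }
        return cp;
    }

    // Collects the code points of a string as lowercase hex strings.
    public static List<String> hexCodePoints(String s) {
        ByteBuffer buf = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
        List<String> out = new ArrayList<>();
        int cp;
        while ((cp = nextCodePoint(buf)) != -1) {
            out.add(Integer.toHexString(cp));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String hex : hexCodePoints(SAMPLE)) {
            System.out.println(hex);
        }
    }
}
```

Run on the same five-char string, this sketch consumes 1, 2, 3 and then 4 bytes per step and prints the same four values as the Text example.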
Serialization and writable classes in Hadoop