Reprinted please indicate Source Address: http://blog.csdn.net/lastsweetop/article/details/9249411
All source code on GitHub, https://github.com/lastsweetop/styhadoop
Introduction In hadoop, writable implementation classes are a huge family. Here we will briefly introduce some of the commonly used implementations for serialization.
Except for char, all Java Native types have corresponding writable classes, and their values can be obtained through get and set methods. Intwritable and longwritable also have the corresponding variable-length vintwritable and vlongwritable classes. The fixed length or variable length options are similar to Char or vchar in the database. We will not go into details here.
The text type uses the extended int storage length, so the maximum storage of the text type is 2 GB. the text type adopts the standard UTF-8 encoding, so it can interact well with other text tools, but note that there is a lot of difference between the string type and Java. The chatat of different texts to be retrieved returns an integer and UTF-8 encoded number, rather than a Unicode-encoded char type like a string.
@Test public void testTextIndex(){ Text text=new Text("hadoop"); Assert.assertEquals(text.getLength(), 6); Assert.assertEquals(text.getBytes().length, 6); Assert.assertEquals(text.charAt(2),(int)'d'); Assert.assertEquals("Out of bounds",text.charAt(100),-1); }
Text also has a find method, similar to the indexof method in string
@Test public void testTextFind() { Text text = new Text("hadoop"); Assert.assertEquals("find a substring",text.find("do"),2); Assert.assertEquals("Find first 'o'",text.find("o"),3); Assert.assertEquals("Find 'o' from position 4 or later",text.find("o",4),4); Assert.assertEquals("No match",text.find("pig"),-1); }
Unicode differencesWhen the bytes encoded by the uft-8 are greater than two, the difference between the text and the string is clearer, because the string is calculated according to the Unicode char, and the text is calculated according to the bytes. We can see that four Unicode characters of 1 to four bytes occupy 1 to four bytes respectively, and U + 10400 occupies two char again in the Unicode Character of Java, the first three characters occupy 1 char respectively. We can see the differences between string and text in the code.
@Test public void string() throws UnsupportedEncodingException { String str = "\u0041\u00DF\u6771\uD801\uDC00"; Assert.assertEquals(str.length(), 5); Assert.assertEquals(str.getBytes("UTF-8").length, 10); Assert.assertEquals(str.indexOf("\u0041"), 0); Assert.assertEquals(str.indexOf("\u00DF"), 1); Assert.assertEquals(str.indexOf("\u6771"), 2); Assert.assertEquals(str.indexOf("\uD801\uDC00"), 3); Assert.assertEquals(str.charAt(0), '\u0041'); Assert.assertEquals(str.charAt(1), '\u00DF'); Assert.assertEquals(str.charAt(2), '\u6771'); Assert.assertEquals(str.charAt(3), '\uD801'); Assert.assertEquals(str.charAt(4), '\uDC00'); Assert.assertEquals(str.codePointAt(0), 0x0041); Assert.assertEquals(str.codePointAt(1), 0x00DF); Assert.assertEquals(str.codePointAt(2), 0x6771); Assert.assertEquals(str.codePointAt(3), 0x10400); } @Test public void text() { Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00"); Assert.assertEquals(text.getLength(), 10); Assert.assertEquals(text.find("\u0041"), 0); Assert.assertEquals(text.find("\u00DF"), 1); Assert.assertEquals(text.find("\u6771"), 3); Assert.assertEquals(text.find("\uD801\uDC00"), 6); Assert.assertEquals(text.charAt(0), 0x0041); Assert.assertEquals(text.charAt(1), 0x00DF); Assert.assertEquals(text.charAt(3), 0x6771); Assert.assertEquals(text.charAt(6), 0x10400); }
This comparison is obvious.
1. The length () method of string returns the number of char, And the getlength () method of text returns the number of bytes. 2. The indexof () method of string returns the offset of char as the unit, and the find () method of text returns the offset in bytes. 3. the string charat () method does not return the entire UNICODE character, but returns the char character 4 in Java. the codepointat () method of string is similar to the charat method of text, but note that the former is the offset of char and the latter is the offset of byte.
In text, the iteration of Unicode characters is quite complex. Because it is related to the number of bytes occupied by Unicode, it cannot be determined simply by the growth of index. First, we need to encapsulate the Text object using bytebuffer, and then call the static method bytestocodepoint of text to poll bytebuffer to return the code point of Unicode characters. Let's take a look at the sample code:
Package COM. sweetop. styhadoop; import Org. apache. hadoop. io. text; import Java. NIO. bytebuffer;/*** created with intellij idea. * User: lastsweetop * Date: 13-7-9 * Time: * to change this template use file | Settings | file templates. */public class textiterator {public static void main (string [] ARGs) {text = new text ("\ u0041 \ u00df \ u6771 \ ud801 \ udc00 "); bytebuffer buffer = bytebuffer. wrap ( Text. getbytes (), 0, text. getlength (); int CP; while (buffer. hasremaining () & (Cp = text. bytestocodepoint (buffer ))! =-1) {system. Out. println (integer. tohexstring (CP ));}}}
Except nullwritable, the modification of text can be modified for other types of writable. You can use the set method of text to modify and reuse the instance.
@Test public void testTextMutability() { Text text = new Text("hadoop"); text.set("pig"); Assert.assertEquals(text.getLength(), 3); Assert.assertEquals(text.getBytes().length, 3); }
Note that, in some cases, the length of the byte array returned by the getbytes method of text is different from that returned by the getlength method of text. Therefore, it is best to call the getlength method while calling the getbytes () method, so that you can know the number of valid characters in the byte array.
@Test public void testTextMutability2() { Text text = new Text("hadoop"); text.set(new Text("pig")); Assert.assertEquals(text.getLength(),3); Assert.assertEquals(text.getBytes().length,6); }
Byteswritable bytewritable is the encapsulation type of a binary array. the serialization format is a 4-byte INTEGER (different from text, text starts with a variable-length INT) to indicate the length of the byte array, then the array itself is used. See the following example:
@Test public void testByteWritableSerilizedFromat() throws IOException { BytesWritable bytesWritable=new BytesWritable(new byte[]{3,5}); byte[] bytes=SerializeUtils.serialize(bytesWritable); Assert.assertEquals(StringUtils.byteToHexString(bytes),"000000020305"); }
Like text, bytewritable can also be modified using the set method. The size returned by getlength is the actual size, while that returned by getbytes is indeed not.
bytesWritable.setCapacity(11); bytesWritable.setSize(4); Assert.assertEquals(4,bytesWritable.getLength()); Assert.assertEquals(11,bytesWritable.getBytes().length);
The nullwritable type nullwritable is a special writable type. serialization does not contain any characters and is only equivalent to a placeholder. When you use mapreduce, the key or value can be defined as nullwritable when it is not needed.
Package COM. sweetop. styhadoop; import Org. apache. hadoop. io. nullwritable; import Org. apache. hadoop. util. stringutils; import Java. io. ioexception;/*** created with intellij idea. * User: lastsweetop * Date: 13-7-16 * Time: * to change this template use file | Settings | file templates. */public class testnullwritable {public static void main (string [] ARGs) throws ioexception {nullwritable = nullwritable. get (); system. out. println (stringutils. bytetohexstring (serializeutils. serialize (nullwritable )));}}
Objectwritable type
Objectwritable is another type of encapsulation class, including the Java Native type, String, Enum, writable, null, etc., or an array composed of these types. When a field has multiple types, the objectwritable type can be used. However, a bad thing is that it occupies too much space, even if you save a letter, because it needs to save the type before encapsulation, let's look at the blind example:
Package COM. sweetop. styhadoop; import Org. apache. hadoop. io. objectwritable; import Org. apache. hadoop. io. text; import Org. apache. hadoop. util. stringutils; import Java. io. ioexception;/*** created with intellij idea. * User: lastsweetop * Date: 13-7-17 * Time: am * to change this template use file | Settings | file templates. */public class testobjectwritable {public static void main (string [] ARGs) throws ioexception {text = new text ("\ u0041"); objectwritable = new objectwritable (text ); system. out. println (stringutils. bytetohexstring (serializeutils. serialize (objectwritable )));}}
Just save a letter, let's take a look at the serialized result:
00196f72672e6170616368652e6861646f6f702e696f2e5465787400196f72672e6170616368652e6861646f6f702e696f2e546578740141
It is a waste of space, and the type is generally known, that is, a few, so its replacement method appears, look at the next section
When the genericwritable type uses genericwritable, you only need to inherit it and specify which types need to be supported by rewriting the gettypes method. Let's take a look at the usage:
package com.sweetop.styhadoop;import org.apache.hadoop.io.GenericWritable;import org.apache.hadoop.io.Text;import org.apache.hadoop.io.Writable;class MyWritable extends GenericWritable { MyWritable(Writable writable) { set(writable); } public static Class<? extends Writable>[] CLASSES=null; static { CLASSES= (Class<? extends Writable>[])new Class[]{ Text.class }; } @Override protected Class<? extends Writable>[] getTypes() { return CLASSES; //To change body of implemented methods use File | Settings | File Templates. }}
Then output the serialized result.
Package COM. sweetop. styhadoop; import Org. apache. hadoop. io. intwritable; import Org. apache. hadoop. io. text; import Org. apache. hadoop. io. vintwritable; import Org. apache. hadoop. util. stringutils; import Java. io. ioexception;/*** created with intellij idea. * User: lastsweetop * Date: 13-7-17 * Time: * to change this template use file | Settings | file templates. */public class testgenericwritable {public static void main (string [] ARGs) throws ioexception {text = new text ("\ u0041 \ u0071 "); mywritable = new mywritable (text); system. out. println (stringutils. bytetohexstring (serializeutils. serialize (text); system. out. println (stringutils. bytetohexstring (serializeutils. serialize (mywritable )));}}
The result is:
02417100024171
The serialization of genericwritable only places the index of the type in the type array in front, which saves a lot of space than objectwritable. Therefore, we recommend that you use genericwritable.
Writablearraywritable and twodarraywritablearraywritable of the set type indicate the writable type of the array and the two-dimensional array respectively. The specified array type can be set in the constructor or inherited from arraywritable, the same is true for twodarraywritable.
Package COM. sweetop. styhadoop; import Org. apache. hadoop. io. arraywritable; import Org. apache. hadoop. io. text; import Org. apache. hadoop. io. writable; import Org. apache. hadoop. util. stringutils; import Java. io. ioexception;/*** created with intellij idea. * User: lastsweetop * Date: 13-7-17 * Time: 11 am * to change this template use file | Settings | file templates. */public class testarraywritable {public static void main (string [] ARGs) throws ioexception {arraywritable = new arraywritable (text. class); arraywritable. set (New writable [] {new text ("\ u0071"), new text ("\ u0041")}); system. out. println (stringutils. bytetohexstring (serializeutils. serialize (arraywritable )));}}
Check the output:
0000000201710141
It can be seen that arraywritable represents the array length starting with an integer, and then the elements in the array are exclusive one by one.
Arrayprimitivewritable is similar to the preceding one, but does not need to inherit arraywritable by subclass. Mapwritable and sortedmapwritablemapwritable correspond to map and sortedmapwritable correspond to sortedmap, which starts with four bytes and stores the set size. Then each element starts with one byte and stores indexes of the type (similar to genericwritable, so the total number of types can only be inverted 127), followed by the element itself, first key and then value, so that the pair is exclusive. The two writable instances will be used in the future and run through hadoop. No example will be written here. We didn't see the set and list sets. This can be used in place of the implementation. Replace set with mapwritable and sortedmapwritable with sortedmap. You only need to set their values to nullwritable. nullwritable does not occupy space. A list composed of the same types can be replaced by arraywritable. Different types of lists can be implemented using genericwritable, and then encapsulated using arraywritable. Of course, mapwritable can implement the list, set the key as the index, and values is used as the element in the list.