Encoding in Java

Source: https://www.ibm.com/developerworks/cn/java/j-lo-chinesecoding/#icomments

Computers store data as 0s and 1s. The basic storage unit is the byte (8 bits), and one byte can represent at most 256 distinct values. That is enough for English text, but supporting Chinese characters requires an extended encoding.

ASCII encoding

ASCII defines 128 characters in total, using the low 7 bits of a byte. Codes 0-31 (plus 127, delete) are control characters such as line feed and carriage return; codes 32-126 are printable characters.

ISO-8859-1

Building on ASCII, the ISO organization developed a series of standards that extend it: ISO-8859-1 through ISO-8859-15. Of these, ISO-8859-1 covers most Western European characters. ISO-8859-1 is still a single-byte encoding and can represent 256 characters in total.
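The single-byte property can be checked with a minimal sketch. The class and method names here (`Latin1Demo`, `latin1Hex`) are hypothetical and chosen only for illustration; the accented letter is written as a Unicode escape so the result does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    // Encode a string as ISO-8859-1 and return its bytes as lowercase hex.
    static String latin1Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.ISO_8859_1)) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "café": four characters, four bytes; é (U+00E9) is the single byte e9
        System.out.println(latin1Hex("caf\u00e9")); // 63 61 66 e9
    }
}
```

Each Western European character maps to exactly one byte, and the ASCII letters keep their ASCII values.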

GB2312

Its full name is "Chinese Coded Character Set for Information Interchange, Basic Set". It is a double-byte encoding whose overall range is A1~F7. A1~A9 is the symbol area, containing 682 symbols in total. B0~F7 is the Chinese character area, containing 6,763 Chinese characters.

GBK

Its full name is "Chinese Internal Code Extension Specification". It extends GB2312, can represent 21,003 characters, and is compatible with GB2312.

GB18030

Its full name is "Chinese Coded Character Set for Information Interchange", and it is compatible with GB2312. It is a national standard, but it is not widely used in practice.

UTF-16

UTF-16 defines how Unicode characters are stored in a computer. It uses two bytes (16 bits) to represent a character, hence the name UTF-16; characters outside the Basic Multilingual Plane are the exception and take two such 16-bit units. Java uses UTF-16 as its in-memory character storage format.
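A small sketch of what this means for Java strings, where each `char` is one 16-bit UTF-16 unit. The class and helper names (`Utf16Demo`, `codeUnits`, `codePoints`) are hypothetical. It also shows that a character outside the Basic Multilingual Plane occupies two 16-bit units rather than one.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Number of 16-bit code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bei = "\u5317";          // 北, U+5317, inside the BMP
        String emoji = "\uD83D\uDE00";  // U+1F600, outside the BMP
        System.out.println(codeUnits(bei) + " " + codePoints(bei));     // 1 1
        System.out.println(codeUnits(emoji) + " " + codePoints(emoji)); // 2 1
        // Encoded size in bytes, big endian and without a byte order mark
        System.out.println(bei.getBytes(StandardCharsets.UTF_16BE).length);   // 2
        System.out.println(emoji.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```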

UTF-8

UTF-16 uniformly uses two bytes per character. This is convenient, but a large share of characters that could be represented in one byte now take two, which doubles the storage space.
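The size difference is easy to measure. This is a minimal sketch with hypothetical names (`EncodedSizeDemo`, `utf8Size`, `utf16Size`); it uses `UTF_16BE` so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Size in bytes of the string encoded as UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size in bytes of the string encoded as UTF-16 (big endian, no BOM).
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Size("hello") + " vs " + utf16Size("hello")); // 5 vs 10
        String beijing = "\u5317\u4eac"; // 北京
        System.out.println(utf8Size(beijing) + " vs " + utf16Size(beijing)); // 6 vs 4
    }
}
```

For ASCII text UTF-16 takes twice the space of UTF-8, while for Chinese text the relationship is reversed, since each of these characters needs three bytes in UTF-8.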

UTF-8 uses a variable-length technique with the following rules:

1. If the highest bit (8th bit) of a byte is 0, the byte is an ASCII character (00~7F). In other words, all ASCII text is already valid UTF-8.

2. If a byte starts with 11, the number of consecutive leading 1s gives the number of bytes in the character. For example, 110xxxxx is the first byte of a two-byte UTF-8 character.

3. If a byte starts with 10, it is not a first byte; search backward through the preceding bytes to find the first byte of the current character.
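The three rules above can be sketched as a small byte classifier. The names `Utf8ByteKind` and `sequenceLength` are hypothetical, and the sketch only inspects the leading bits of one byte; it does not validate complete sequences.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteKind {
    // Returns the sequence length announced by a first byte,
    // 0 for a continuation byte (10xxxxxx), or -1 for an invalid first byte.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;                 // treat the byte as unsigned
        if ((v & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((v & 0xC0) == 0x80) return 0; // 10xxxxxx: continuation
        if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx: first of two bytes
        if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx: first of three bytes
        if ((v & 0xF8) == 0xF0) return 4; // 11110xxx: first of four bytes
        return -1;
    }

    public static void main(String[] args) {
        // "a北" encodes to 61 e5 8c 97 in UTF-8
        for (byte b : "a\u5317".getBytes(StandardCharsets.UTF_8)) {
            System.out.print(sequenceLength(b) + " ");
        }
        System.out.println(); // 1 3 0 0
    }
}
```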

The following code prints the encoded bytes in hexadecimal:

    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    public class EncodingDemo {
        public static void main(String[] args) {
            String test = "a 北京";
            // Bytes in the platform default encoding (GBK on the author's machine)
            System.out.println(Arrays.toString(test.getBytes()));
            printHex(test.getBytes());
            try {
                byte[] iso8859 = test.getBytes("ISO-8859-1");
                printHex(iso8859);
                byte[] gb2312 = test.getBytes("GB2312");
                printHex(gb2312);
                byte[] gbk = test.getBytes("GBK");
                printHex(gbk);
                byte[] utf16 = test.getBytes("UTF-16");
                printHex(utf16);
                byte[] utf8 = test.getBytes("UTF-8");
                printHex(utf8);
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }

        // Print each byte as an unsigned hexadecimal value, separated by spaces
        public static void printHex(byte[] array) {
            for (byte aByte : array) {
                System.out.print(Integer.toHexString(aByte & 0xFF) + " ");
            }
            System.out.println();
        }
    }

Output Result:

1. System.out.println(Arrays.toString(test.getBytes())); prints the bytes in the system default encoding (GBK here):

[97, 32, -79, -79, -66, -87] (97 is "a" in ASCII and 32 is the space)

2. printHex(test.getBytes()); prints the same bytes in hexadecimal: 61 20 b1 b1 be a9

3. With "ISO-8859-1", 北 (previously [b1 b1]) becomes 3f and 京 (previously [be a9]) becomes 3f. ISO-8859-1 is a single-byte encoding, so each Chinese character is converted to the byte 3f, that is, "?". Chinese characters encoded as ISO-8859-1 therefore lose information: they are replaced by a character that says nothing about the original.

4. With "GB2312", English letters are stored as one byte and Chinese characters as two bytes: 北 -> [b1 b1], 京 -> [be a9]

5. "GBK" produces the same bytes as GB2312.

6. With "UTF-16", each letter or Chinese character is stored as two bytes.

Result: [fe ff 0 61 0 20 53 17 4e ac]. The leading [fe ff] denotes big endian.

Byte order: suppose a character is represented by the two-byte value 0xABCD. It can be stored as [AB CD] or as [CD AB]. Storing it as [AB CD] is called big endian; storing it as [CD AB] is called little endian.

[00 61] is the character "a", [00 20] is the space, [53 17] is 北 in UTF-16, and [4e ac] is 京.

7. "UTF-8" result: 61 20 e5 8c 97 e4 ba ac

Here 61 in binary is (0110 0001). By the UTF-8 rules, the first bit is 0, so this is an ASCII code represented by a single byte, giving "a". Similarly, [20] gives the space.

The third byte, e5, is (1110 0101) in binary. It starts with 111, which marks the start of a three-byte character. The fourth byte, 8c, is (1000 1100), a continuation byte, and the fifth byte, 97, is (1001 0111), also a continuation byte.

e5 -> 1110 0101: remove the first four bits that mark the start of the sequence, leaving 0101,

8c -> 1000 1100: remove the first two bits that mark a continuation byte, leaving 001100,

97 -> 1001 0111: remove the first two bits that mark a continuation byte, leaving 010111.

Concatenating 0101, 001100, and 010111 gives [0101, 0011, 0001, 0111], which is hexadecimal [53 17], the Unicode code point of 北 (U+5317).

Similarly, [e4 ba ac] stands for 京.
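The bit manipulation in this walkthrough can be sketched in a few lines. The names `Utf8Decode3` and `decode3` are hypothetical, and the sketch assumes a well-formed three-byte sequence without validating it.

```java
class Utf8Decode3 {
    // Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
    // into a Unicode code point. Inputs are unsigned byte values (0..255).
    static int decode3(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12)  // low 4 bits of the first byte
             | ((b1 & 0x3F) << 6)   // low 6 bits of the second byte
             | (b2 & 0x3F);         // low 6 bits of the third byte
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decode3(0xE5, 0x8C, 0x97))); // 5317
        System.out.println(Integer.toHexString(decode3(0xE4, 0xBA, 0xAC))); // 4eac
    }
}
```

The two results match the UTF-16 bytes [53 17] and [4e ac] printed earlier, because for these characters the UTF-16 code unit is the code point itself.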

The method that prints the bytes in hexadecimal contains this line:

    System.out.print(Integer.toHexString(aByte & 0xFF) + " ");

It ANDs the byte value with 0xFF before calling Integer.toHexString. The reason is that a byte value may be negative. Take -79: a Java byte occupies one byte, and its two's complement representation is 1011 0001 (0xb1 in hexadecimal). When it is widened to an int, which occupies four bytes, the sign is extended and the value becomes 11111111 11111111 11111111 10110001 (0xFFFFFFB1 in hexadecimal).

ANDing with 0xFF sets the 1s in the upper three bytes of the int to 0, so the result is 0xb1.
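The effect of the mask can be seen side by side in a minimal sketch. The names `ByteMaskDemo`, `hexNoMask`, and `hexMasked` are hypothetical.

```java
class ByteMaskDemo {
    // The byte is widened to int with sign extension before formatting.
    static String hexNoMask(byte b) {
        return Integer.toHexString(b);
    }

    // The mask clears the upper 24 bits, keeping only the original 8 bits.
    static String hexMasked(byte b) {
        return Integer.toHexString(b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xB1;
        System.out.println(b);            // -79
        System.out.println(hexNoMask(b)); // ffffffb1
        System.out.println(hexMasked(b)); // b1
    }
}
```

On Java 8 and later, `Byte.toUnsignedInt(b)` performs the same conversion as `b & 0xFF`.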

toHexString() in Integer

In Integer, the methods toHexString, toBinaryString, and toOctalString all call toUnsignedString(int i, int shift). The value of the shift parameter is 1 for toBinaryString, 3 for toOctalString, and 4 for toHexString.

    private static String toUnsignedString(int i, int shift) {
        char[] buf = new char[32];
        int charPos = 32;
        int radix = 1 << shift;   // with shift = 4, radix = 2^4 = 16
        int mask = radix - 1;     // mask = 15 (0000 1111)
        do {
            // i & mask extracts the lowest four bits of i as an index,
            // e.g. i = 25 (0001 1001), i & mask = 9 (1001).
            // digits is Integer's lookup table of digit characters, so index 9 yields '9'.
            buf[--charPos] = digits[i & mask];
            // >>> is the unsigned right shift: the high bits are filled with 0,
            // e.g. 25 (0001 1001) shifted right by 4 bits gives 1 (0000 0001).
            // >> is the signed right shift: the high bits are filled with 0 for a
            // positive number and with 1 for a negative number.
            i >>>= shift;
        } while (i != 0);
        return new String(buf, charPos, (32 - charPos));
    }
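To try the loop outside the JDK, here is a self-contained re-creation that can be compared against the public Integer methods. It is a sketch, not the JDK source: the class name `UnsignedStringSketch` and the 16-entry `DIGITS` table are assumptions made for this example, so it only supports shift values from 1 to 4.

```java
class UnsignedStringSketch {
    // Hypothetical digit table; enough for binary, octal, and hexadecimal.
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    static String toUnsignedString(int i, int shift) {
        char[] buf = new char[32];
        int charPos = 32;
        int mask = (1 << shift) - 1;            // e.g. shift = 4 gives mask = 15
        do {
            buf[--charPos] = DIGITS[i & mask];  // lowest `shift` bits select a digit
            i >>>= shift;                       // unsigned shift, so the loop ends for negatives
        } while (i != 0);
        return new String(buf, charPos, 32 - charPos);
    }

    public static void main(String[] args) {
        System.out.println(toUnsignedString(25, 4));  // 19
        System.out.println(toUnsignedString(25, 3));  // 31
        System.out.println(toUnsignedString(25, 1));  // 11001
        System.out.println(toUnsignedString(-79, 4)); // ffffffb1
    }
}
```

Because the shift is unsigned, a negative input such as -79 is consumed in at most eight hexadecimal digits instead of looping forever on sign-extended 1 bits.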

  
