Chinese to Unicode, Chinese to Bytes, Unicode to Bytes: Java Implementation

Source: Internet
Author: User

UTF-8

A Chinese character in UTF-8 is usually encoded as three bytes.

The encoding rules for UTF-8 are simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For the English alphabet, therefore, the UTF-8 encoding and the ASCII code are identical.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every subsequent byte are set to 10. All remaining bits are filled with the symbol's Unicode code point.
The following table summarizes the encoding rules, and the letter x represents the bits that are available for encoding.
Unicode Symbol Range | UTF-8 Encoding method
(hex) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
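
The table can be turned into code almost row by row. The following is a minimal sketch, not a replacement for the JDK's own encoder; the class and method names (`Utf8EncodeSketch`, `encode`, `hex`) are made up for illustration, and surrogate code points (D800-DFFF) are not rejected here.

```java
public class Utf8EncodeSketch {
    /** Encodes one Unicode code point into UTF-8 bytes, one branch per table row. */
    public static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    /** Formats bytes as space-separated two-digit hex. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode('A')));     // 41
        System.out.println(hex(encode(0x4E2D)));  // e4 b8 ad
    }
}
```

U+4E2D falls in the third row of the table, which is why it comes out as three bytes.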

1. Chinese to Unicode

    public static String toUnicode(String s) {
        // Hex value of each UTF-16 code unit in s
        String[] as = new String[s.length()];
        String s1 = "";
        for (int i = 0; i < s.length(); i++) {
            as[i] = Integer.toHexString(s.charAt(i) & 0xFFFF);
            s1 = s1 + "\\u" + as[i];
        }
        return s1;
    }
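
One thing to be aware of: `Integer.toHexString` does not zero-pad, so a character such as `A` would be written with only two hex digits. The variant below is a sketch of one way to always emit four digits; the class name `ToUnicodeSketch` and the choice of `String.format` with a `StringBuilder` are illustrative assumptions, not part of the original method.

```java
public class ToUnicodeSketch {
    /** Returns one backslash-u escape per UTF-16 code unit, zero-padded to four hex digits. */
    public static String toUnicode(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input is the two-character string U+4E2D U+6587
        System.out.println(toUnicode("\u4e2d\u6587")); // prints \u4e2d\u6587
        System.out.println(toUnicode("A"));            // prints \u0041
    }
}
```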

2. Chinese to bytes

    byte[] b = s.getBytes("UTF-8");
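
As a small, self-contained illustration, the sketch below uses `StandardCharsets.UTF_8` instead of the charset name string, which avoids the checked `UnsupportedEncodingException`. The class name `Utf8BytesSketch` and the helper `toHex` are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BytesSketch {
    /** Formats bytes as comma-separated two-digit hex. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(',');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two Chinese characters, U+4E2D and U+6587
        byte[] b = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println(b.length);  // 6
        System.out.println(toHex(b));  // e4,b8,ad,e6,96,87
    }
}
```

Each of the two characters takes three bytes, so the array has six elements.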

3. Unicode to bytes

    /**
     * Converts a Unicode character to UTF-8.
     * @param input the Chinese character to convert
     * @return the UTF-8 encoded byte sequence, in hexadecimal
     */
    public static String unicode2Utf8(char input) {
        // 1 byte = 8 bits; in hexadecimal one byte ranges over 00~FF
        // input is two bytes (16 bits); the Chinese character range is 4E00~9FA5
        int lowByte = input & 0x00ff;
        int highByte = (input & 0xff00) >>> 8;

        // Byte 1 of UTF-8 is 1110 + the high 4 bits of highByte
        int high4InHighByte = (highByte & 0xf0) >>> 4;
        int utf8Byte1 = (7 << 5) + high4InHighByte;

        // Byte 2 of UTF-8 is 10 + the low 4 bits of highByte + the high 2 bits of lowByte
        int low4InHighByte = highByte & 0x0f;
        int high2InLowByte = (lowByte & 0xc0) >>> 6;
        int utf8Byte2 = (1 << 7) + (low4InHighByte << 2) + high2InLowByte;

        // Byte 3 of UTF-8 is 10 + the low 6 bits of lowByte
        int utf8Byte3 = (1 << 7) + (lowByte & 0x3f);

        String result = Integer.toHexString(utf8Byte1) + ","
                + Integer.toHexString(utf8Byte2) + ","
                + Integer.toHexString(utf8Byte3);
        return result;
    }
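
The manual bit arithmetic can be cross-checked against the JDK's encoder. The sketch below condenses the same three-byte computation and compares it with `String.getBytes` for every character in the 4E00~9FA5 range mentioned in the comments; the class name `Unicode2Utf8Check` and the helpers `viaLibrary` and `countMismatches` are hypothetical names introduced only for this check.

```java
import java.nio.charset.StandardCharsets;

public class Unicode2Utf8Check {
    /** Three-byte UTF-8 encoding of a char, same bit layout as the method above, condensed. */
    public static String unicode2Utf8(char input) {
        int b1 = 0xE0 | (input >>> 12);          // 1110 + top 4 bits
        int b2 = 0x80 | ((input >>> 6) & 0x3F);  // 10 + middle 6 bits
        int b3 = 0x80 | (input & 0x3F);          // 10 + low 6 bits
        return Integer.toHexString(b1) + "," + Integer.toHexString(b2) + "," + Integer.toHexString(b3);
    }

    /** The same output format, produced by the JDK's UTF-8 encoder. */
    public static String viaLibrary(char input) {
        byte[] bytes = String.valueOf(input).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(',');
            sb.append(Integer.toHexString(b & 0xFF));
        }
        return sb.toString();
    }

    /** Counts characters in 4E00~9FA5 where the two results differ. */
    public static int countMismatches() {
        int mismatches = 0;
        for (int c = 0x4E00; c <= 0x9FA5; c++) {
            if (!unicode2Utf8((char) c).equals(viaLibrary((char) c))) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println(unicode2Utf8('\u4e2d'));  // e4,b8,ad
        System.out.println(countMismatches());       // 0
    }
}
```

Note that this three-byte formula only applies to code points from 0800 to FFFF; ASCII and two-byte characters need the other rows of the encoding table.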

GBK

GBK is an extension of GB 2312 and is fully compatible with the GB 2312-80 standard. It still uses a double-byte scheme with the code range 8140-FEFE, excluding the xx7F code points, for a total of 23,940 code points. It contains 21,886 Chinese characters and graphic symbols: 21,003 Chinese characters (including radicals and components) and 883 graphic symbols. GBK supports all CJK Chinese characters in the international standard ISO/IEC 10646-1 and the national standard GB 13000.1, and includes all Chinese characters in the Big5 encoding. The GBK specification was officially released on December 15, 1995; that edition is version 1.0.

Within the overall range 8140-FEFE, the lead byte lies between 81 and FE and the trail byte between 40 and FE, with the xx7F column removed. That gives the 23,940 code points, of which 21,886 are assigned: 21,003 Chinese characters (including radicals and components) and 883 graphic symbols.
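
The figure of 23,940 follows directly from the byte ranges: 126 possible lead bytes times 190 possible trail bytes. A brute-force count makes this easy to confirm; the class name `GbkCodeSpaceCount` is made up for this sketch.

```java
public class GbkCodeSpaceCount {
    /** Counts double-byte codes with lead byte 81-FE and trail byte 40-FE, skipping trail 7F. */
    public static int count() {
        int n = 0;
        for (int lead = 0x81; lead <= 0xFE; lead++) {
            for (int trail = 0x40; trail <= 0xFE; trail++) {
                if (trail != 0x7F) {
                    n++;
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count()); // 23940
    }
}
```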

All encodings are divided into three parts:

1. Chinese character area. Including:
A. GB 2312 Chinese character area, namely GBK/2: B0A1-F7FE. Contains the 6,763 Chinese characters of GB 2312, in their original order.
B. GB 13000.1 extended Chinese character area. Including:
(1) GBK/3: 8140-A0FE. Contains 6,080 CJK Chinese characters from GB 13000.1.
(2) GBK/4: AA40-FEA0. Contains 8,160 CJK Chinese characters and supplementary Chinese characters. The CJK characters come first, ordered by UCS code value; the supplementary characters (including radicals and components) follow, ordered by page and character position in the Kangxi Dictionary.
(3) The Chinese character "〇" is placed in the graphic symbol area GBK/5, at A996.

2. Graphic symbol area. Including:
A. GB 2312 non-Chinese-character symbol area, namely GBK/1: A1A1-A9FE. In addition to the symbols of GB 2312, it contains 10 lowercase Roman numerals and the supplementary symbols of GB 12345, for 717 symbols in total.
B. GB 13000.1 extended non-Chinese-character area, namely GBK/5: A840-A9A0. Big5 non-Chinese-character symbols, structure characters, and "〇" are placed in this area, for 166 symbols in total.

3. User-defined area, divided into three areas (1), (2), and (3):
(1) AAA1-AFFE, 564 code points.
(2) F8A1-FEFE, 658 code points.
(3) A140-A7A0, 672 code points.
Area (3), although open to users, is restricted in use, because new characters may still be added to this area in the future.
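
The area sizes can be sanity-checked by reading each range as a rectangle of lead bytes by trail bytes and counting its cells, skipping trail byte 7F. That rectangular reading is an assumption about how the ranges are meant to be interpreted, and the class name `GbkAreaCount` and its methods are made up for this sketch.

```java
public class GbkAreaCount {
    /** Counts codes with lead in [leadLo, leadHi] and trail in [trailLo, trailHi], skipping trail 7F. */
    public static int count(int leadLo, int leadHi, int trailLo, int trailHi) {
        int n = 0;
        for (int lead = leadLo; lead <= leadHi; lead++) {
            for (int trail = trailLo; trail <= trailHi; trail++) {
                if (trail != 0x7F) {
                    n++;
                }
            }
        }
        return n;
    }

    /** Sum over all eight areas listed in the text. */
    public static int total() {
        return count(0xA1, 0xA9, 0xA1, 0xFE)   // GBK/1: A1A1-A9FE
             + count(0xB0, 0xF7, 0xA1, 0xFE)   // GBK/2: B0A1-F7FE
             + count(0x81, 0xA0, 0x40, 0xFE)   // GBK/3: 8140-A0FE
             + count(0xAA, 0xFE, 0x40, 0xA0)   // GBK/4: AA40-FEA0
             + count(0xA8, 0xA9, 0x40, 0xA0)   // GBK/5: A840-A9A0
             + count(0xAA, 0xAF, 0xA1, 0xFE)   // user area (1): AAA1-AFFE
             + count(0xF8, 0xFE, 0xA1, 0xFE)   // user area (2): F8A1-FEFE
             + count(0xA1, 0xA7, 0x40, 0xA0);  // user area (3): A140-A7A0
    }

    public static void main(String[] args) {
        System.out.println(count(0xAA, 0xAF, 0xA1, 0xFE)); // 564
        System.out.println(count(0xF8, 0xFE, 0xA1, 0xFE)); // 658
        System.out.println(count(0xA1, 0xA7, 0x40, 0xA0)); // 672
        System.out.println(count(0x81, 0xA0, 0x40, 0xFE)); // 6080
        System.out.println(count(0xAA, 0xFE, 0x40, 0xA0)); // 8160
        System.out.println(total());                       // 23940
    }
}
```

Under this reading the eight areas add up to exactly the 23,940 code points of the whole GBK space. GBK/1, GBK/2, and GBK/5 have more cells than the character counts quoted in the text, since some positions in those areas are unassigned.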

Example:

        String s = "中文";
        byte[] b = s.getBytes("GBK");

The resulting bytes are:

[-42, -48, -50, -60]

Read as unsigned values (add 256 to each negative byte), they are:

"214,208,206,196"

Converted to hexadecimal:

"D6,D0,CE,C4"

Looking these up in the GBK table gives D6D0 for 中 and CEC4 for 文.

The encoding matches exactly.
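
The steps of this example can be reproduced in one small program. This is a sketch that assumes the running JDK ships the GBK charset (it is an extended charset and may be missing from stripped-down runtimes); the class name `GbkBytesSketch` and its helper methods are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class GbkBytesSketch {
    // The two-character string U+4E2D U+6587, written with escapes to avoid source-encoding issues
    static final String TEXT = "\u4e2d\u6587";

    /** Encodes TEXT in GBK; throws UnsupportedCharsetException if GBK is unavailable. */
    public static byte[] gbkBytes() {
        return TEXT.getBytes(Charset.forName("GBK"));
    }

    /** Reinterprets signed Java bytes as unsigned values 0-255. */
    public static int[] unsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    /** Formats bytes as comma-separated uppercase hex. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(',');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = gbkBytes();
        System.out.println(Arrays.toString(b));            // [-42, -48, -50, -60]
        System.out.println(Arrays.toString(unsigned(b)));  // [214, 208, 206, 196]
        System.out.println(hex(b));                        // D6,D0,CE,C4
        // Decoding the same bytes as GBK gives back the original string
        System.out.println(new String(b, Charset.forName("GBK")).equals(TEXT)); // true
    }
}
```

Java's `byte` is signed, which is why the raw array prints negative numbers; masking with `0xFF` recovers the unsigned values used in code tables.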
