ISO-8859-1, ASCII, GBK, GB 2312 Character Set Analysis

Source: Internet
Author: User

Character-encoding problems come up often in programming. Without a systematic understanding of character sets, garbled text is a constant source of confusion, so this post collects my notes on character encoding for later review. I analyze each character set from two aspects: (a) how it encodes characters and (b) how many bytes each character occupies.

ISO-8859-1/ASCII

Reference: ISO-8859-1

ISO-8859-1 (Latin-1) is a single-byte encoding that is backward compatible with ASCII. Its encoding range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F holds control characters, and 0xA0-0xFF holds text symbols. Because ISO-8859-1 uses every value a single byte can take, a byte stream in any other encoding can be transmitted and stored in a system that supports ISO-8859-1 without any byte being discarded. In other words, it is safe to treat a byte stream in any other encoding as if it were ISO-8859-1.

Figure: encoding table for the ISO-8859-1 character set (including the ASCII character set; image from an encyclopedia).
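The "no byte is discarded" property can be checked with a short round trip. This is only a minimal sketch, not code from the original post: the class name Latin1RoundTrip and the method survivesRoundTrip are made up for illustration. It decodes a byte array as ISO-8859-1, encodes the resulting string again, and compares the result with the input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, introduced only for this illustration.
class Latin1RoundTrip {
    // Decodes the bytes as ISO-8859-1, re-encodes them, and reports
    // whether the result is byte-for-byte identical to the input.
    static boolean survivesRoundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.ISO_8859_1);
        byte[] encoded = decoded.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(input, encoded);
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println(survivesRoundTrip(all)); // true

        // GB2312 bytes of a Chinese character followed by two ASCII letters.
        byte[] gb = {(byte) 0xB0, (byte) 0xA1, 0x61, 0x41};
        System.out.println(survivesRoundTrip(gb)); // true
    }
}
```

The decoded string is not meaningful text when the bytes were really GB2312 or some other encoding. The point is only that the original bytes can be recovered unchanged.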

In the code below, the first three characters of the string str, "úù§ABD", are outside the ASCII range, so the variable asc cannot be restored to the source string. The length of the byte array shows that ISO-8859-1 and ASCII are both single-byte encodings.

    import java.io.UnsupportedEncodingException;

    public static void iso() {
        String str = "úù§ABD";
        try {
            // Encode as ISO-8859-1: one byte per character.
            byte[] ch = str.getBytes("ISO-8859-1");
            // Decode the same bytes as ASCII and as ISO-8859-1.
            String asc = new String(ch, "ASCII");
            String iso = new String(ch, "ISO-8859-1");
            // byte2hex: helper (not shown in this post) that formats bytes as hex.
            System.out.println(str + " length:" + ch.length + " bytecode:" + byte2hex(ch)
                    + "\nascii:" + asc + "\niso-8859-1:" + iso);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }

Output:

    úù§ABD length:6 bytecode:FA F9 A7 41 42 44
    ascii:???ABD
    iso-8859-1:úù§ABD
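The example calls a helper named byte2hex whose source does not appear in this post. The following is a guess at what such a helper might look like, assuming it prints each byte as two uppercase hex digits separated by spaces, which matches the output format shown. The wrapper class HexDump is hypothetical.

```java
// Hypothetical stand-in for the post's byte2hex helper; the real one is not shown.
class HexDump {
    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String byte2hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so bytes above 0x7F are not sign-extended.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFA, (byte) 0xF9, (byte) 0xA7, 0x41, 0x42, 0x44};
        System.out.println(byte2hex(sample)); // FA F9 A7 41 42 44
    }
}
```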

GBK/GB2312

Reference: GB 2312, GBK

The GB 2312 standard contains 6,763 Chinese characters, of which 3,755 are level-1 characters and 3,008 are level-2 characters. It also contains 682 full-width non-Chinese characters, including the Latin alphabet, the Greek alphabet, Japanese hiragana and katakana, and the Russian Cyrillic alphabet. GB 2312 basically met the needs of computer processing of Chinese: the characters it contains cover 99.75% of usage frequency in mainland China. It cannot handle the rarely used characters found in personal names, classical Chinese, and similar text, which led to the later GBK and GB 18030 character sets.

Location code: GB 2312 divides the characters it contains into zones, with 94 Chinese characters or symbols per zone. Each character can be written as a 4-digit decimal number called its location code. The first two digits are the zone number and the last two digits are the position number within the zone. Zones 01-09 hold special symbols. Zones 16-55 hold level-1 Chinese characters, sorted by pinyin. Zones 56-87 hold level-2 Chinese characters, sorted by radical and stroke count. Zones 10-15 and 88-94 are unassigned. For example, "啊" is the first Chinese character in GB2312, and its location code is 1601.

Byte encoding: programs that use GB2312 usually store it in EUC form, adding 0xA0 to both the zone number and the position number, so that the encoding stays compatible with ASCII. Each Chinese character or symbol is represented by two bytes. The first byte is called the high byte (or zone byte) and the second is called the low byte (or position byte). The high byte uses 0xA1-0xF7 (zone numbers 01-87 plus 0xA0) and the low byte uses 0xA1-0xFE (position numbers 01-94 plus 0xA0). Because level-1 Chinese characters start at zone 16, the Chinese character area has a high-byte range of 0xB0-0xF7 and a low-byte range of 0xA1-0xFE, which gives 72*94=6768 code points (72 zones of Chinese characters). Five of these, D7FA-D7FE, are unassigned, leaving the 6,763 characters. For example, most programs store "啊" as the two bytes 0xB0 (first byte) and 0xA1 (second byte).
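The relationship between a location code and its stored bytes is plain arithmetic, so it can be sketched in a few lines. The class LocationCode and its two methods are names invented for this illustration. They simply add or subtract 0xA0 as described above and do not consult any charset table.

```java
// Hypothetical helper class, introduced only for this illustration.
class LocationCode {
    // Converts a 4-digit location code (zone * 100 + position) to its
    // two EUC bytes by adding 0xA0 to the zone and to the position.
    static byte[] toEucBytes(int locationCode) {
        int zone = locationCode / 100;
        int position = locationCode % 100;
        if (zone < 1 || zone > 94 || position < 1 || position > 94) {
            throw new IllegalArgumentException("zone and position must be 01-94");
        }
        return new byte[] {(byte) (zone + 0xA0), (byte) (position + 0xA0)};
    }

    // Inverse direction: recovers the location code from the two bytes.
    static int toLocationCode(byte high, byte low) {
        int zone = (high & 0xFF) - 0xA0;
        int position = (low & 0xFF) - 0xA0;
        return zone * 100 + position;
    }

    public static void main(String[] args) {
        byte[] b = toEucBytes(1601);
        System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF); // B0 A1
        System.out.println(toLocationCode((byte) 0xB0, (byte) 0xA1)); // 1601
        // Zones 16-87 are 72 zones of 94 positions; 5 positions are unassigned.
        System.out.println(72 * 94 - 5); // 6763
    }
}
```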

GBK, whose full name is the "Chinese Internal Code Extension Specification", is an internal-code extension built on the GB2312-80 standard. It uses a double-byte encoding scheme with an encoding range of 8140 to FEFE (excluding xx7F), for a total of 23,940 code points, and contains 21,003 Chinese characters. It is fully compatible with the GB2312-80 standard, supports all CJK Chinese characters in the international standard ISO/IEC 10646-1 and the national standard GB 13000-1, and includes all Chinese characters in the Big5 encoding.
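The figure of 23,940 code points follows directly from the stated range. A small sketch (the class name GbkCodeSpace is invented here) counts every two-byte value with a first byte of 0x81-0xFE and a second byte of 0x40-0xFE, skipping a second byte of 0x7F.

```java
// Hypothetical helper class, introduced only for this illustration.
class GbkCodeSpace {
    // Counts two-byte code points in the range 8140-FEFE, excluding xx7F.
    static int countCodePoints() {
        int count = 0;
        for (int lead = 0x81; lead <= 0xFE; lead++) {
            for (int trail = 0x40; trail <= 0xFE; trail++) {
                if (trail != 0x7F) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 126 first-byte values times 190 second-byte values.
        System.out.println(countCodePoints()); // 23940
    }
}
```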

1  Public Static voidGB () {2String str = "Ah AA";3     byte[] ch;4     Try {5ch = str.getbytes ("GB2312");6SYSTEM.OUT.PRINTLN ("ch Length:" +ch.length+ "bytecode:" +Byte2hex (CH));7ch = str.getbytes ("GBK");8SYSTEM.OUT.PRINTLN ("ch Length:" +ch.length+ "bytecode:" +Byte2hex (CH));9}Catch(unsupportedencodingexception e) {Ten         //TODO auto-generated Catch block One e.printstacktrace (); A     } -}

Output:

    ch length:4 bytecode:B0 A1 61 41
    ch length:4 bytecode:B0 A1 61 41

The program output shows that GB2312 and GBK are variable-length encodings: a Chinese character takes 2 bytes and an English letter takes 1 byte. Because the most significant bit of the high byte of a Chinese character or graphic symbol is 1, while the most significant bit of an ASCII byte is 0, both character sets remain compatible with ASCII.
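The high-bit rule is enough to split a byte stream into characters. The sketch below uses invented names (GbScanner, countChars) and assumes the input is well-formed GB2312 or GBK. It does no validation of byte ranges and is not a full decoder: it walks the bytes, treating any byte with the high bit set as the start of a two-byte character.

```java
// Hypothetical helper class, introduced only for this illustration.
class GbScanner {
    // Counts characters in a well-formed GB2312/GBK byte stream: a byte with
    // the high bit set starts a two-byte character, any other byte is ASCII.
    static int countChars(byte[] bytes) {
        int count = 0;
        int i = 0;
        while (i < bytes.length) {
            i += (bytes[i] & 0x80) != 0 ? 2 : 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The 4 bytes printed by the program above: one Chinese character, then "a" and "A".
        byte[] bytes = {(byte) 0xB0, (byte) 0xA1, 0x61, 0x41};
        System.out.println(countChars(bytes)); // 3
    }
}
```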

UTF-8/Unicode
