Chinese character encoding

Source: Internet
Author: User

General information

GB2312 contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. The Chinese character range is 0xB0A1-0xF7FE, in which five code points, 0xD7FA-0xD7FE, are unassigned.
GBK contains 21886 symbols: 21003 Chinese characters and 883 other symbols. The Chinese character range is 0x8140-0xFEFE.
The Unicode ranges for Chinese characters in Windows are U+4E00-U+9FA5 and U+F900-U+FA2D.
Character set: http://okuc.net/SoftWare/UniFonts6.0.exe
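
The ranges above can be checked mechanically. The following is a minimal Python sketch, not taken from the original article; the function names is_gb2312_hanzi and is_windows_cjk are made up for illustration. It counts the code points in 0xB0A1-0xF7FE (high byte 0xB0-0xF7, low byte 0xA1-0xFE) after excluding the five unassigned positions 0xD7FA-0xD7FE.

```python
def is_gb2312_hanzi(pair: bytes) -> bool:
    """Return True if a 2-byte GB2312 code lies in the Chinese character area."""
    if len(pair) != 2:
        return False
    hi, lo = pair
    code = (hi << 8) | lo
    in_area = 0xB0 <= hi <= 0xF7 and 0xA1 <= lo <= 0xFE
    unassigned = 0xD7FA <= code <= 0xD7FE  # the five empty positions
    return in_area and not unassigned


def is_windows_cjk(ch: str) -> bool:
    """Return True if ch falls in the two Unicode ranges quoted in the text."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FA5 or 0xF900 <= cp <= 0xFA2D


# 72 high bytes x 94 low bytes = 6768 positions, minus 5 unassigned ones.
hanzi_count = sum(
    is_gb2312_hanzi(bytes([hi, lo]))
    for hi in range(0xB0, 0xF8)
    for lo in range(0xA1, 0xFF)
)
print(hanzi_count)  # 6763
```

The count matches the 6763 Chinese characters quoted for GB2312.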

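GB2312 is a national standard issued in 1980; GBK is an industry specification from 1995.
Unicode defines the character encoding itself (which number each character gets); UTF-8, UTF-16LE, UTF-16BE, and so on are ways of implementing it (how those numbers are stored as bytes).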
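
To make the difference concrete, here is a small Python sketch (an illustration, not part of the original article) that encodes one character, U+4E2D, with several of the encodings named above, using Python's built-in codecs.

```python
ch = "\u4e2d"  # the Chinese character "zhong" (middle), code point U+4E2D

encoded = {
    name: ch.encode(name)
    for name in ("gb2312", "gbk", "utf-8", "utf-16-le", "utf-16-be")
}

for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")
```

The Unicode code point is the same in every case; only the byte sequence differs. GB2312 and GBK agree on this character because GBK extends GB2312.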

Details

Chinese characters in Unicode, GB2312, GBK, and GB18030
http://blog.csdn.net/fmddlmyy/archive/2007/11/05/1868313.aspx

Text encoding and Unicode (I)
http://blog.csdn.net/fmddlmyy/archive/2007/02/14/1510189.aspx

Text encoding and Unicode (II)
http://blog.csdn.net/fmddlmyy/archive/2007/02/14/1510193.aspx

http://www.fmddlmyy.cn

------------------ Split line ------------------

How can multi-byte characters (for example, Chinese characters) be told apart from single-byte ones in text that mixes Chinese and English?

For the ISO 8859 series of single-byte encodings (SBCS), every character is exactly 1 byte.
For UTF-16LE/UTF-16BE (double-byte encoding), every character in the Basic Multilingual Plane, which includes the Chinese ranges listed above, is 2 bytes. Characters outside that plane take 4 bytes (a surrogate pair).
For GBK (a multi-byte encoding, MBCS, 1-2 bytes per character), single-byte characters lie in 0x00-0x80 (0x00-0x7F is plain ASCII), and the first byte of a double-byte character lies in 0x81-0xFE. Read the bytes in sequence: if a byte is between 0x81 and 0xFE, it is the first byte of a double-byte character and the next byte belongs to the same character. GBK defines the second byte of a double-byte character as 0x40-0x7E or 0x80-0xFE, which can be used to check for illegal sequences.
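
The GBK scanning procedure just described can be sketched in Python as follows. This is an illustrative sketch rather than the article's own code; the name split_gbk is hypothetical, and the byte ranges are the ones quoted in the text.

```python
def split_gbk(data: bytes) -> list[bytes]:
    """Split GBK bytes into per-character chunks of 1 or 2 bytes."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x80:
            # Single-byte character (range as given in the text).
            chars.append(data[i:i + 1])
            i += 1
        elif lead <= 0xFE:
            # Lead byte of a double-byte character: validate the trail byte.
            if i + 1 >= len(data):
                raise ValueError(f"truncated character at offset {i}")
            trail = data[i + 1]
            if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
                raise ValueError(f"illegal trail byte 0x{trail:02x} at offset {i + 1}")
            chars.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"illegal lead byte 0x{lead:02x} at offset {i}")
    return chars


sample = "GBK\u7f16\u7801abc".encode("gbk")  # ASCII letters mixed with two Chinese characters
print([chunk.decode("gbk") for chunk in split_gbk(sample)])
```

Each chunk decodes to exactly one character, so the list has one entry per character of the mixed string.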
For UTF-8 (a multi-byte encoding, MBCS, 1-4 bytes per character):
Because
{
The UTF-8 encoding rules are simple; there are only two:
1) For a single-byte symbol, the first bit is set to 0 and the remaining 7 bits hold the Unicode code point of the symbol. For English letters, therefore, the UTF-8 encoding is identical to ASCII.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of each following byte are set to 10. All the remaining bits hold the Unicode code point of the symbol.
}
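
The two rules can be turned into a small encoder. The Python sketch below is only an illustration (the name utf8_encode is hypothetical, and in practice str.encode("utf-8") should be used); the code point thresholds for 2, 3, and 4 bytes follow from how many payload bits each layout leaves free.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point by applying the two UTF-8 rules.

    Simplified sketch: it does not reject surrogate code points.
    """
    if cp < 0x80:
        return bytes([cp])  # rule 1: leading 0 bit plus 7 payload bits
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    elif cp <= 0x10FFFF:
        n = 4
    else:
        raise ValueError("code point out of range")

    # Rule 2: each following byte is 10xxxxxx, filled from the low bits up.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6

    # First byte: n one-bits, then a zero bit, then the remaining payload.
    marker = (0xFF << (8 - n)) & 0xFF
    return bytes([marker | cp] + tail[::-1])


print(utf8_encode(0x4E2D).hex())  # e4b8ad
```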
So
{
Take the first byte and test whether its highest bit is 0 (AND it with 0x80). If it is 0, the character is a single byte. If it is not 0, shift left and count the leading 1 bits: n leading 1 bits mean the character is encoded in n bytes.
}
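
The lead-byte test can be sketched in Python like this. The function names utf8_char_length and split_utf8 are made up for the example, and the check on the following bytes (each must start with the bits 10) is an added safeguard based on rule 2.

```python
def utf8_char_length(lead: int) -> int:
    """Return the byte length of the character that starts with this lead byte."""
    if (lead & 0x80) == 0:
        return 1  # highest bit is 0: single-byte character
    n = 0
    while lead & 0x80:  # count leading 1 bits by shifting left
        n += 1
        lead = (lead << 1) & 0xFF
    if n == 1 or n > 4:
        # 10xxxxxx is a following byte, not a lead byte; more than 4 is invalid.
        raise ValueError("not a valid UTF-8 lead byte")
    return n


def split_utf8(data: bytes) -> list[bytes]:
    """Split UTF-8 bytes into per-character chunks."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_char_length(data[i])
        chunk = data[i:i + n]
        if len(chunk) != n or any((b & 0xC0) != 0x80 for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        chars.append(chunk)
        i += n
    return chars


sample = "a\u4e2d\U0001F600".encode("utf-8")
print([len(chunk) for chunk in split_utf8(sample)])  # [1, 3, 4]
```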
