General information
GB2312 contains a total of 7445 characters: 6763 Chinese characters and 682 other symbols. The Chinese characters occupy the range 0xB0A1-0xF7FE, of which 5 positions, 0xD7FA-0xD7FE, are unassigned.
GBK contains 21886 symbols: 21003 Chinese characters and 883 other symbols. Its code range is 0x8140-0xFEFE.
The Unicode ranges for Chinese characters in Windows are U+4E00-U+9FA5 and U+F900-U+FA2D.
Character set: http://okuc.net/SoftWare/UniFonts6.0.exe
GB2312 is a national standard from 1980; GBK is an industry standard from 1995.
Unicode is the encoding scheme (it assigns a code point to each character); UTF-8, UTF-16LE, UTF-16BE, and so on are the implementations (they turn code points into bytes).
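To make the split between code point and byte-level implementation concrete, here is a minimal Python sketch. The helper name `encodings_of` and the sample character are illustrative assumptions; the codec names are the ones Python's standard library provides.

```python
def encodings_of(ch: str) -> dict[str, str]:
    """Return the hex bytes of one character under several encodings.

    Hypothetical helper for illustration only.
    """
    names = ("utf-8", "utf-16-le", "utf-16-be", "gbk")
    return {name: ch.encode(name).hex() for name in names}


sample = "中"  # sample character, code point U+4E2D
print(f"code point: U+{ord(sample):04X}")
for name, hex_bytes in encodings_of(sample).items():
    print(f"{name:10} {hex_bytes}")
```

The code point stays U+4E2D throughout, while the bytes differ: e4b8ad in UTF-8, 2d4e in UTF-16LE, 4e2d in UTF-16BE, and d6d0 in GBK.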
Details
Chinese characters in Unicode, GB2312, GBK, and GB18030
http://blog.csdn.net/fmddlmyy/archive/2007/11/05/1868313.aspx
Discussion on text encoding and Unicode (I)
http://blog.csdn.net/fmddlmyy/archive/2007/02/14/1510189.aspx
Discussion on text encoding and Unicode (II)
http://blog.csdn.net/fmddlmyy/archive/2007/02/14/1510193.aspx
http://www.fmddlmyy.cn
------------------ Split line ------------------
How do we tell multi-byte characters (for example, Chinese characters) apart in text that mixes Chinese and English?
For the ISO 8859 series of single-byte encodings (SBCS), every character is exactly 1 byte.
For UTF-16LE/UTF-16BE double-byte encoding (DBCS), every character in the Basic Multilingual Plane is exactly 2 bytes; characters outside it are stored as a surrogate pair of 4 bytes.
For GBK encoding (multi-byte encoding, MBCS, 1-2 bytes per character), single-byte characters fall in the range 0x00-0x80, and the first byte of a double-byte character falls in the range 0x81-0xFE. We read the bytes in sequence: if a byte is between 0x81 and 0xFE, it is the first byte of a double-byte character, and the byte after it belongs to the same character. GBK defines the second byte of a double-byte character to be in the range 0x40-0x7E or 0x80-0xFE, which can be used to check for illegal characters.
For UTF-8 encoding (multi-byte encoding, MBCS, 1-4 bytes per character):
Because
{
The UTF-8 encoding rules are very simple. There are only two:
1) For a single-byte symbol, the first bit is set to 0 and the remaining seven bits hold the Unicode code point of the symbol. For English letters, the UTF-8 encoding is therefore identical to ASCII.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits hold the Unicode code point of the symbol.
}
So
{
Take the first byte and test whether its highest bit is 0 (AND it with 0x80). If it is 0, the character is a single byte. If it is not 0, shift left repeatedly and count the leading 1 bits: if there are n of them, the character is encoded in n bytes.
}
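A minimal Python sketch of that first-byte test follows. The names `utf8_char_length` and `split_utf8` are hypothetical helpers, and the sketch only inspects lead bytes; it does not verify that the following bytes really start with 10, so it is not a full validator.

```python
def utf8_char_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if first_byte & 0x80 == 0:
        # Highest bit is 0: single-byte (ASCII) character.
        return 1
    n = 0
    # Shift left and count the leading 1 bits.
    while first_byte & 0x80:
        n += 1
        first_byte = (first_byte << 1) & 0xFF
    if n == 1 or n > 4:
        # 10xxxxxx is a continuation byte; more than four leading 1s is invalid.
        raise ValueError("not a valid UTF-8 lead byte")
    return n


def split_utf8(data: bytes) -> list[bytes]:
    """Split UTF-8 bytes into one bytes object per character (lead bytes only)."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_char_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


print([c.decode("utf-8") for c in split_utf8("a中😀".encode("utf-8"))])
```

For example, the first byte of "中" in UTF-8 is 0xE4 (binary 11100100), which has three leading 1 bits, so the character occupies three bytes.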