Information Transfer, Encoding, and Computer Representation (3) described Chinese encodings. This article describes Unicode encoding, the Unicode Transformation Format (UTF), and MBCS.
I. UCS and Unicode
Unicode was developed on the basis of the Universal Character Set (UCS) standard. UCS uses multiple bytes to encode characters in a unified way, so that the characters of all languages live in one code space. UCS-2 encodes each character in 2 bytes, and UCS-4 encodes each character in 4 bytes. At present, Unicode and UCS-2 assign the same code values to the characters they share.
UCS-2 uses 2 bytes and can in theory represent 2^16 = 65536 characters. The first 128 code values (0x0000-0x007F) correspond to ASCII (0x00-0x7F), and the first 256 (0x0000-0x00FF) correspond to ISO 8859-1. In practice, the current version of Unicode does not fill the 16-bit space completely; a large part of it is reserved for special use or future extension.
UCS-4 uses 4 bytes. It is a larger, 31-bit character set that is far from full; with the leading bit fixed at 0 it occupies 32 bits, that is, 4 bytes. In theory it can hold up to 2^31 = 2,147,483,648 characters, enough to cover the symbols of every language.
The 4 bytes of UCS-4 can express 2^32 = 2^16 * 2^16 = 65536 * 2^16 values. A block of 2^16 code values is called a plane, so 4 bytes could address 65536 planes (with the leading bit fixed at 0, 32768 of them are usable; the Unicode standard itself currently defines only 17, U+0000 to U+10FFFF). The first plane is UCS-2: its 16-bit code values form the Basic Multilingual Plane (BMP).
UCS-2 is the part of UCS-4 whose two high bytes are 0.
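The plane arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper names `plane_of` and `is_bmp` are made up for this example and do not come from any library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane index: everything above the low 16 bits."""
    return code_point >> 16


def is_bmp(code_point: int) -> bool:
    """A code point is in the BMP when its plane index is 0."""
    return plane_of(code_point) == 0


for cp in (0x0041, 0x4E59, 0x20000):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, in BMP: {is_bmp(cp)}")
```

A code point in plane 0 has all of its upper bits zero, which is exactly the "two high bytes are 0" condition stated above.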
To keep Unicode compatible with existing, widely used encodings, and in particular with the basic Latin letters supported by almost every computer system, the first 256 characters of Unicode are the characters defined by ISO 8859-1, so that converting existing Western European text needs no special handling. For the same reason, Unicode encodes a large number of essentially identical characters at more than one code value, so that older and more complicated encodings can be converted to Unicode directly without losing information. For example, the fullwidth forms block contains fullwidth versions of the main Latin letters. In Chinese, Japanese, and Korean text these characters are displayed at full width rather than in the usual halfwidth form, which matters for vertical and fixed-width layout.
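Both compatibility points can be checked with the Python standard library. The sketch below assumes Python's built-in `latin-1` codec as a stand-in for ISO 8859-1 and uses NFKC normalization to fold a fullwidth letter back to its ordinary form; the variable names are illustrative only.

```python
import unicodedata

# Every ISO 8859-1 byte value decodes to the Unicode code point
# with the same numeric value.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
same_values = all(ord(ch) == b for ch, b in zip(decoded, latin1_bytes))
print(same_values)

# U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A, a separate code point
# from the ordinary "A" (U+0041). NFKC maps it to the ordinary letter.
fullwidth_a = "\uFF21"
folded = unicodedata.normalize("NFKC", fullwidth_a)
print(unicodedata.name(fullwidth_a), "->", folded)
```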
II. Unicode implementation
The way Unicode is implemented is distinct from the way it is encoded. The Unicode code of a character is fixed. In actual transmission, however, system platforms are not designed alike, and there is an incentive to save space, so Unicode is implemented in more than one way. A Unicode implementation method is called a Unicode Transformation Format (UTF).
For example, if a Unicode file contains only 7-bit ASCII characters and each character is transmitted as its original 2-byte Unicode code, the 8 bits of the first byte are always 0, which is very wasteful. In this case UTF-8 can be used: it is a variable-length encoding that still represents the basic 7-bit ASCII characters in a single byte whose first bit is 0. Other Unicode characters mixed into the text are converted by a fixed algorithm; each BMP character takes 1 to 3 bytes, and the leading bits (0 or 1) identify the byte's role. This greatly reduces the size of documents that are mostly 7-bit ASCII (for details, see UTF-8). Similarly, supplementary-plane characters and other UCS-4 extended characters that need 4 bytes cannot be stored in a single 2-byte UTF-16 unit, so UTF-16 also converts them through a fixed algorithm.
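The space saving is easy to observe with Python's built-in codecs. The sample string is arbitrary, and U+20000 is used merely as one example of a supplementary-plane character.

```python
text = "Hello, Unicode"           # 14 characters, all 7-bit ASCII

utf8 = text.encode("utf-8")       # one byte per ASCII character
utf16 = text.encode("utf-16-be")  # two bytes per character, high byte first

print(len(utf8), len(utf16))
# In the 2-byte form, every high byte of an ASCII character is zero.
print(all(b == 0 for b in utf16[0::2]))

# A character outside the BMP needs 4 bytes in both formats:
# UTF-16 stores it as two 16-bit units (a surrogate pair).
supplementary = "\U00020000"
print(len(supplementary.encode("utf-8")), len(supplementary.encode("utf-16-be")))
```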
As another example, suppose UTF-16 is used directly, so that each BMP character is stored as the two bytes of its Unicode code. Machines disagree about byte order: PCs store the low byte first (little-endian), while the Macintosh machines of the time stored the high byte first (big-endian). The same byte stream can therefore be read as different content. Take the character "乙", U+4E59, which splits into the two bytes 4E and 59. Windows on a PC saves it low byte first, as 59 4E. A Mac that reads those two bytes high byte first sees 594E and displays "奎" (U+594E). In other words, a "乙" saved as UTF-16 in Windows can show up as "奎" when opened in Mac OS. This shows that UTF-16 is ambiguous unless the byte order is stated. UTF-16 therefore defines a big-endian form (UTF-16 BE) and a little-endian form (UTF-16 LE), plus an optional byte order mark (BOM) that records which one is in use. Windows and Linux systems on PCs currently default to UTF-16 LE when they use UTF-16. (For details, see UTF-16.)
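The byte-order mix-up can be reproduced with Python's `utf-16-le` and `utf-16-be` codecs. This is a minimal sketch using the characters 乙 (U+4E59) and 奎 (U+594E); nothing here depends on the machine it runs on except the last step, where the BOM-prefixed `utf-16` codec uses the native order.

```python
ch = "\u4e59"                        # the character U+4E59

le = ch.encode("utf-16-le")          # low byte first:  59 4E
be = ch.encode("utf-16-be")          # high byte first: 4E 59
print(le.hex(), be.hex())

# Reading little-endian bytes as if they were big-endian yields U+594E.
misread = le.decode("utf-16-be")
print(f"U+{ord(misread):04X}")

# The plain "utf-16" codec writes a byte order mark first, so a reader
# can tell which order follows and recover the original character.
with_bom = ch.encode("utf-16")
print(with_bom.decode("utf-16") == ch)
```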
In addition, Unicode implementations include UTF-7, Punycode, CESU-8, SCSU, UTF-32, and others. Some of these are used only in particular countries and regions, and some are planned for future use. At present, the implementations in general use are UTF-16 little-endian (LE), UTF-16 big-endian (BE), and UTF-8. In Notepad, shipped with Microsoft's Windows XP operating system, the "Save As" dialog offers four encodings. Apart from the non-Unicode "ANSI" option (the system code page: ASCII-based on English systems, GB2312 or Big5 on Chinese systems), the other three are Unicode encodings: "Unicode" (UTF-16 LE), "Unicode big endian" (UTF-16 BE), and "UTF-8".
At present, work on the supplementary planes concentrates on the CJK unified ideographs in the second and third planes, which has brought the coordination between Unicode and existing encodings such as GBK, GB18030, and Big5 (covering Simplified Chinese, Traditional Chinese, Japanese, Korean, and Vietnamese characters) to the fore. Since Unicode is ultimately meant to cover all characters, these encodings can in a sense be regarded as established facts that predate Unicode, much like ASCII and its extension Latin-1. For ASCII and Latin-1 the mapping is trivial: in the 16-bit Unicode code space the first byte is all 0 and the second byte is identical to the original code. The correspondence between the East Asian encodings above and Unicode is much more complex.
Note: The preceding sections are adapted from http://zh.wikipedia.org/zh-cn/unicode. For more information about Unicode, follow the link.
III. Relationship between Unicode and GB2312, GBK, and GB18030
Unicode (UCS-2) uses 2 bytes for both Chinese characters and ASCII characters.
GB2312, GBK, and GB18030 use 2 bytes for Chinese characters and 1 byte for ASCII characters (GB18030 additionally uses 4 bytes for characters outside the 2-byte range).
For the same Chinese character, the Unicode code and the GBK code are different.
A variable-length byte encoding such as GBK is called an MBCS (multi-byte character set). The CJK (Chinese, Japanese, Korean) encodings are MBCS encodings.
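A quick comparison using Python's built-in `gbk` and `utf-16-be` codecs illustrates the difference. The character 中 (U+4E2D) is chosen only as an example, and UTF-16 BE is used here to show the raw 2-byte Unicode code.

```python
han = "\u4e2d"                             # the Chinese character U+4E2D

unicode_bytes = han.encode("utf-16-be")    # the 2-byte Unicode code: 4E 2D
gbk_bytes = han.encode("gbk")              # the 2-byte GBK code:     D6 D0
print(unicode_bytes.hex(), gbk_bytes.hex())

# ASCII stays a single byte in GBK but takes two bytes in UCS-2/UTF-16.
print(len("A".encode("gbk")), len("A".encode("utf-16-be")))
```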
<! -- Reference from http://zh.wikipedia.org/zh-cn/UTF8 below -->
IV. Origin of UTF-8 and its encoding method
If UCS-2 is used to encode ASCII text, a 0x00 byte is simply inserted before each ASCII code. More generally, a UCS-2 byte stream can contain bytes that have a special meaning as 8-bit values, such as '\0' or '/', and these cause serious errors in UNIX and in some C functions. UCS-2 is therefore clearly unsuitable as an external encoding of Unicode, which led to the birth of UTF-8.
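The problem bytes are easy to find. In this sketch, `utf-16-be` stands in for UCS-2 (the two agree for BMP characters), and U+4E2F is just one example of a character whose 2-byte code happens to contain the byte 0x2F, the ASCII slash.

```python
ascii_text = "abc"
ucs2 = ascii_text.encode("utf-16-be")
print(ucs2)                      # contains a 0x00 byte before each letter
print(0x00 in ucs2)

# A non-ASCII character can also hide a '/' byte in its 2-byte code.
tricky = "\u4e2f"
print(0x2F in tricky.encode("utf-16-be"))

# UTF-8 never uses bytes below 0x80 for anything except ASCII itself.
print(0x2F in tricky.encode("utf-8"), 0x00 in ascii_text.encode("utf-8"))
```

A C function that treats 0x00 as the end of a string would stop after the first byte of the 2-byte form, while the UTF-8 form passes through unchanged.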
UTF-8 encodes in units of 8 bits, so it has no big-endian or little-endian variants. In every character stored as UTF-8, all bytes except the first begin with the two bits "10", which lets a text processor quickly find the starting position of each character.
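Finding character boundaries then reduces to a bit test on each byte. The function name `char_starts` below is hypothetical, chosen for this sketch.

```python
def char_starts(data: bytes) -> list:
    """Return the offsets of bytes that do NOT look like 10xxxxxx.

    Continuation bytes have their top two bits equal to 10, so masking
    with 0xC0 and comparing with 0x80 identifies them.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


sample = "a\u4e2db".encode("utf-8")   # 1-byte, 3-byte, 1-byte characters
print(sample.hex(" "))
print(char_starts(sample))
```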
To remain compatible with existing ASCII text (where each character is one byte), UTF-8 stores Unicode in a variable number of bytes:
Conversion between Unicode and UTF-8
UCS-4 code range | UTF-8 byte stream
U+00000000 - U+0000007F | 0xxxxxxx
U+00000080 - U+000007FF | 110xxxxx 10xxxxxx
U+00000800 - U+0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx
U+00010000 - U+001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U+00200000 - U+03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+04000000 - U+7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
- Within the ASCII range a character is expressed in one byte; beyond the ASCII range it is expressed in multiple bytes, which gives the UTF-8 layout shown above. The advantage is that when a Unicode file contains only ASCII characters, every character is stored in one byte, so the file is identical to an ordinary ASCII file. The same holds when reading, so UTF-8 is compatible with existing ASCII files.
- Beyond the ASCII range, the leading bits of the first byte indicate how many bytes the character occupies. For example, the leading bits 110 in 110xxxxx mark a 2-byte character, 1110xxxx marks a 3-byte character, and so on. The x positions are filled with the binary representation of the character's code value, and the further right an x is, the less significant it is. Only the shortest multi-byte sequence that can express the code value may be used. Note that in a multi-byte sequence, the number of leading "1" bits in the first byte equals the number of bytes in the whole sequence.
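The table and the two rules above can be turned into a small hand-written encoder. This is a sketch for illustration, not how any particular library implements UTF-8: the name `utf8_encode` is made up, it follows the original 1 to 6 byte table shown here, and it does not reject surrogate code points. Results for valid code points can be compared with Python's built-in codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code value following the 1- to 6-byte table."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code value out of the 31-bit range")
    if cp <= 0x7F:
        return bytes([cp])                     # 0xxxxxxx
    # Pick the shortest length whose range contains the code value.
    for limit, n in ((0x7FF, 2), (0xFFFF, 3), (0x1FFFFF, 4),
                     (0x3FFFFFF, 5), (0x7FFFFFFF, 6)):
        if cp <= limit:
            break
    # Peel off 6 bits at a time for the continuation bytes (10xxxxxx),
    # starting with the least significant bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Lead byte: n one-bits, a zero bit, then the remaining high bits.
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([lead_prefix | cp] + tail[::-1])


for cp in (0x41, 0xE9, 0x4E59, 0x20000):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```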
ASCII letters continue to be stored in 1 byte. Accented letters, Greek letters, and Cyrillic letters are stored in 2 bytes, while commonly used Chinese characters are stored in 3 bytes. Supplementary-plane characters take 4 bytes.
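These sizes can be confirmed with Python's built-in UTF-8 codec. The sample characters are arbitrary representatives of each group.

```python
samples = {
    "ASCII letter": "A",
    "accented letter": "\u00e9",          # e with acute accent
    "Greek letter": "\u03b1",             # alpha
    "Cyrillic letter": "\u0436",          # zhe
    "Chinese character": "\u4e2d",
    "supplementary-plane character": "\U00020000",
}

# Map each label to the number of bytes its character takes in UTF-8.
sizes = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}
for label, size in sizes.items():
    print(f"{label}: {size} byte(s)")
```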
A U+FEFF character (the bytes EF BB BF in UTF-8) is often placed at the beginning of a UTF-8 file to show that the text file is UTF-8 encoded.
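Python exposes this marker as `codecs.BOM_UTF8`, and its `utf-8-sig` codec removes it on decoding. The helper `has_utf8_bom` below is a hypothetical name used only for this sketch.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Report whether the data begins with the bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


marked = codecs.BOM_UTF8 + "text".encode("utf-8")
print(has_utf8_bom(marked), has_utf8_bom(b"text"))

# "utf-8-sig" drops a leading marker; plain "utf-8" keeps it as U+FEFF.
print(repr(marked.decode("utf-8-sig")), repr(marked.decode("utf-8")))
```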
<! -- End of reference -->
V. MBCS and UTF-8
Although both MBCS and UTF-8 are variable-length byte encodings, they differ in kind. An MBCS is itself a character encoding with its own codes, whereas UTF-8 is not: it is only a transformation format for the fixed-length Unicode codes. In an MBCS, a Chinese character takes two bytes and an ASCII character takes one byte. In UTF-8, a Chinese character takes three bytes and an ASCII character takes one byte.
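As a rough illustration, the sketch below encodes the same mixed string with Python's `gbk` codec (standing in for an MBCS) and with UTF-8. The sample string is arbitrary. Decoding either byte stream gives back the same Unicode code values, but only the UTF-8 bytes are derived from those code values by a fixed bit layout.

```python
text = "\u4e2d\u6587abc"              # two Chinese characters plus "abc"

as_gbk = text.encode("gbk")           # 2 + 2 + 1 + 1 + 1 bytes
as_utf8 = text.encode("utf-8")        # 3 + 3 + 1 + 1 + 1 bytes
print(len(as_gbk), len(as_utf8))

# Both byte streams decode to the same sequence of Unicode code values.
print(as_gbk.decode("gbk") == as_utf8.decode("utf-8"))
print([f"U+{ord(c):04X}" for c in text])
```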