Comparison of the Unicode, UTF-8, GB 18030, GB 2312, and GBK encodings

Source: Internet
Author: User

I wanted to understand the underlying principles, and to understand everything I cared about thoroughly. So I posted my question in several QQ groups one after another, but nobody responded. Somewhat discouraged, I turned to Google and taught myself. A detailed introduction follows.

I have some personal thoughts about asking for help. Nowadays few people dig into the theory; most people know the "what" but not the "why". For programming, I personally think this is sad and dangerous. I think it may also be one reason why China's IT lags behind that of the United States. I hope Chinese programmers will think about this carefully.

The material below was collected from the Internet.

Unicode encoding and implementation

Generally speaking, the Unicode system can be divided into two levels: the encoding method and the implementation method.

Encoding Method

The Unicode encoding method corresponds to the Universal Character Set (UCS) concept of ISO 10646. The Unicode version in common use corresponds to UCS-2, which uses a 16-bit encoding space, so each character occupies 2 bytes. In theory this allows a total of 2^16, or 65536, characters, which basically meets the needs of the various languages. In practice, the current version of Unicode does not completely fill the 16-bit space; a large part is reserved for special use or future extension.

These 16-bit Unicode characters make up the Basic Multilingual Plane (BMP). The latest (but not yet widely used) Unicode version defines 16 supplementary planes. Together with the BMP they need an encoding space of at least 21 bits, slightly less than 3 bytes. In practice, however, a supplementary-plane character still occupies 4 bytes of encoding space, consistent with UCS-4. Future versions will expand to ISO 10646-1 implementation level 3, covering all characters of UCS-4. UCS-4 is a larger, 31-bit character set that is far from full; with a leading bit that is always 0, it occupies 32 bits, that is, 4 bytes. In theory it can hold up to 2^31 characters, enough to cover the symbols used by every language.

The Unicode code of a BMP character is written U+hhhh, where each h represents a hexadecimal digit. It is identical to the UCS-2 encoding. In the corresponding 4-byte UCS-4 encoding, the last two bytes are the same and all bits of the first two bytes are 0.

For more information about Unicode, ISO 10646, and UCS, see Universal Character Set.
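The relationship between the U+hhhh notation, UCS-2, and UCS-4 can be illustrated with a short Python sketch. The helper name `ucs_forms` is made up for this example, and big-endian byte order is assumed for display.

```python
def ucs_forms(ch: str) -> tuple[str, bytes, bytes]:
    """Return the U+hhhh notation, UCS-2 bytes and UCS-4 bytes of a BMP character."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("not a BMP character")
    notation = f"U+{cp:04X}"
    ucs2 = cp.to_bytes(2, "big")   # the 16-bit value itself (big-endian, by assumption)
    ucs4 = cp.to_bytes(4, "big")   # same value, preceded by two zero bytes
    return notation, ucs2, ucs4

notation, ucs2, ucs4 = ucs_forms("A")
print(notation, ucs2.hex(), ucs4.hex())  # U+0041 0041 00000041
```

The last two bytes of the UCS-4 form always equal the UCS-2 form for a BMP character, which is the point the paragraph above makes.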

Implementation Method

The implementation of Unicode is distinct from its encoding. The Unicode code of a character is fixed, but in actual transmission the design of different system platforms is not necessarily consistent, and, to save space, Unicode codes are implemented in different ways. An implementation of Unicode is called a Unicode Transformation Format (UTF).

For example, if a Unicode file contains only 7-bit ASCII characters and each character is transmitted using its raw 2-byte Unicode code, the 8 bits of the first byte are always 0, which is very wasteful. In this case UTF-8 can be used instead: it is a variable-length encoding that still represents the basic 7-bit ASCII characters with a single byte whose first bit is 0. Other Unicode characters mixed in with them are converted according to a fixed algorithm: each BMP character is encoded in 1 to 3 bytes, and the leading bits of each byte identify its role. This greatly reduces the size of documents that consist mainly of 7-bit ASCII (see the UTF-8 section for details). Similarly, the 4-byte supplementary-plane characters and other UCS-4 extension characters that will appear in the future must be converted by an algorithm before they can be represented in the 2-byte UTF-16 encoding.

As another example, suppose UTF-16 is used directly; for BMP characters it is identical to the Unicode code. Because each character occupies two bytes, and Macintosh (Mac) machines and PCs interpret byte order differently, the same byte stream may be read as different content. Take the character with hexadecimal code 4E59, which is split into the two bytes 4E and 59. A Mac reads starting from the low byte, so Mac OS takes the code to be 594E and finds the character "奎", whereas Windows reads starting from the high byte and finds U+4E59, the character "乙". In other words, a "乙" saved as UTF-16 on Windows is displayed as "奎" when opened on Mac OS. This shows that the byte order of UTF-16 is ambiguous unless it is stated explicitly. UTF-16 therefore comes in two variants, big-endian (UTF-16 BE) and little-endian (UTF-16 LE), and a BOM (Byte Order Mark) can be prepended to indicate which one is in use. Windows and Linux on PCs currently default to UTF-16 LE when they use UTF-16 (for details, see UTF-16).

Other Unicode implementations include UTF-7, Punycode, CESU-8, SCSU, UTF-32, and so on. Some of these are used only in certain countries or regions, and some are planned for future use. The implementations in general use today are UTF-16 little-endian (with BOM), UTF-16 big-endian (with BOM), and UTF-8. In Notepad, shipped with Microsoft's Windows XP, the "Save As" dialog box offers four encodings. Apart from the non-Unicode ANSI option (ASCII on English systems; GB2312 or Big5 on Chinese systems), the other three are "Unicode" (UTF-16 LE), "Unicode big endian" (UTF-16 BE), and "UTF-8".

At present, work on the supplementary planes is concentrated on the CJK (Chinese, Japanese, Korean) unified ideographs in the second and third planes. This has brought to the fore the question of reconciling Unicode with the various encodings used for simplified Chinese, traditional Chinese, Japanese, Korean, and Vietnamese characters, including GBK, GB18030, and Big5. Since Unicode will eventually cover all of these characters, such encodings can in a sense be regarded as de facto implementations of Unicode, much like ASCII and its extension Latin-1. For the latter two, the first byte of each character in the 16-bit Unicode space is all zeros and the second byte is exactly the original code. The mapping between the East Asian encodings and Unicode, however, is far more complex.
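The byte-order confusion described above can be reproduced with Python's standard `utf-16-le`, `utf-16-be`, and `utf-16` codecs. This is only an illustrative sketch; the character U+4E59 is the one from the example, and the variable names are arbitrary.

```python
import codecs

ch = "\u4e59"                          # the character U+4E59
le = ch.encode("utf-16-le")            # low byte first:  59 4E
be = ch.encode("utf-16-be")            # high byte first: 4E 59
print(le.hex(), be.hex())

# Reading little-endian bytes as big-endian yields a different character.
misread = le.decode("utf-16-be")
print(f"U+{ord(misread):04X}")         # U+594E

# With a BOM in front, the generic "utf-16" decoder picks the right order.
with_bom = codecs.BOM_UTF16_LE + le
restored = with_bom.decode("utf-16")
print(f"U+{ord(restored):04X}")        # U+4E59
```

The BOM is consumed by the decoder and does not appear in the decoded string.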

UTF-8

UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte codes remain compatible with ASCII, so software written to process ASCII text can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for storing or transmitting text in e-mail, web pages, and other applications.

UTF-8 uses one to four bytes to encode each character:

  1. The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
  2. Latin letters with diacritics and the Greek, Cyrillic, Armenian, Hebrew, Arabic, and Syriac alphabets need two bytes (Unicode range U+0080 to U+07FF).
  3. The other characters in the Basic Multilingual Plane (BMP), which contains most commonly used characters, are encoded in three bytes.
  4. The rarely used characters in the Unicode supplementary planes are encoded in four bytes.

Using four bytes for the fourth category may seem wasteful. However, UTF-8 represents all commonly used characters in three bytes or fewer, and the alternative, UTF-16, also needs four bytes for the fourth category, so whether UTF-8 or UTF-16 is more efficient depends on the distribution of the characters used. If a traditional compression scheme such as DEFLATE is applied, the difference between the encodings becomes insignificant. When traditional compression algorithms are ineffective because the text is short, the Standard Compression Scheme for Unicode (SCSU) can be considered.

The bits of a Unicode character are split into several groups and placed in the low-order bit positions of the bytes of the UTF-8 sequence. Characters below U+0080 are encoded as a single byte holding the character's value; these codes correspond exactly to the 7-bit ASCII characters. In all other cases, up to four bytes may be needed to represent one character. The most significant bit of every byte in a multi-byte sequence is set to 1, which prevents confusion with 7-bit ASCII characters and keeps standard byte-oriented string handling working smoothly.
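How the size comparison depends on the text can be checked with a small Python sketch. The sample strings and the helper name `sizes` are assumptions made for illustration, and `zlib.compress` (DEFLATE inside a zlib wrapper) stands in for "a traditional compression system"; the highly repetitive samples say nothing about real-world compression ratios.

```python
import zlib

samples = {
    "ascii": "hello world " * 50,
    "cjk": "\u6c49\u5b57\u7f16\u7801" * 50,   # common BMP Chinese characters
    "supplementary": "\U00020000" * 50,       # a supplementary-plane character
}

def sizes(text: str) -> dict[str, int]:
    """Byte counts of text in UTF-8 and UTF-16, raw and DEFLATE-compressed."""
    u8 = text.encode("utf-8")
    u16 = text.encode("utf-16-le")
    return {
        "utf8": len(u8),
        "utf16": len(u16),
        "utf8_deflate": len(zlib.compress(u8)),
        "utf16_deflate": len(zlib.compress(u16)),
    }

for name, text in samples.items():
    print(name, sizes(text))
```

Per character, ASCII takes 1 byte in UTF-8 and 2 in UTF-16, common Chinese characters take 3 and 2, and supplementary-plane characters take 4 in both.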

Code range (hex)    Codes             Scalar value (binary)         UTF-8 (binary, with hex range)
000000-00007F       128               00000000 00000000 0zzzzzzz    0zzzzzzz (00-7F)
000080-0007FF       1920              00000000 00000yyy yyzzzzzz    110yyyyy (C2-DF) 10zzzzzz (80-BF)
000800-00D7FF and   61440 [Note 1]    00000000 xxxxyyyy yyzzzzzz    1110xxxx (E0-EF) 10yyyyyy 10zzzzzz
00E000-00FFFF
010000-10FFFF       1048576           000wwwxx xxxxyyyy yyzzzzzz    11110www (F0-F4) 10xxxxxx 10yyyyyy 10zzzzzz

Notes on each row:

  • 000000-00007F: the ASCII range; the single byte starts with 0 (seven z bits).
  • 000080-0007FF: the first byte starts with 110 and the next byte with 10 (five y bits, six z bits).
  • 000800-00FFFF: the first byte starts with 1110 and the following bytes with 10 (four x bits, six y bits, six z bits).
  • 010000-10FFFF: the first byte starts with 11110 and the following bytes with 10 (three w bits, six x bits, six y bits, six z bits).
Note 1: Unicode has no characters in the range D800-DFFF. Within the Basic Multilingual Plane this range is reserved for UTF-16 surrogates, which address the supplementary planes (two UTF-16 code units represent one supplementary-plane character). An encoding can of course mechanically produce values in this range, but in Unicode they do not represent valid characters.

The table above is the key to truncating a UTF-8 string in PHP: the leading bits of the first byte of each character determine how many bytes the character occupies (somewhat like the way the leading bits distinguish the five classes of IP addresses).

For example, the Hebrew letter aleph (א) has the Unicode code U+05D0. It is converted to UTF-8 as follows:
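The idea of truncating by lead byte can be sketched as follows. The text mentions PHP; this sketch uses Python instead, the function names `utf8_seq_len` and `truncate_utf8` are invented for the example, and the input is assumed to be valid UTF-8.

```python
def utf8_seq_len(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:            # 0xxxxxxx
        return 1
    if 0xC2 <= first_byte <= 0xDF:   # 110yyyyy
        return 2
    if 0xE0 <= first_byte <= 0xEF:   # 1110xxxx
        return 3
    if 0xF0 <= first_byte <= 0xF4:   # 11110www
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {first_byte:#04x}")

def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a character.

    Assumes data is well-formed UTF-8; no other validation is done.
    """
    i = 0
    while i < len(data):
        n = utf8_seq_len(data[i])
        if i + n > max_bytes:
            break
        i += n
    return data[:i]

text = "a\u00e9\u4e2d".encode("utf-8")   # 1 + 2 + 3 bytes
print(truncate_utf8(text, 5))            # b'a\xc3\xa9'
```

With a limit of 5 bytes the 3-byte character does not fit, so the result stops after the first two characters instead of ending in a broken sequence.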

  • It falls in the range U+0080 to U+07FF, which the table says uses two bytes: 110yyyyy 10zzzzzz.
  • Hexadecimal 0x05D0 is binary 101 1101 0000.
  • These 11 bits are placed in order into the "y" and "z" positions: 11010111 10010000.
  • The result is two bytes, written in hexadecimal as 0xD7 0x90. This is the UTF-8 encoding of the letter aleph (א).
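The steps above generalize to all four rows of the table. The following Python sketch applies the bit layout by hand; `encode_utf8` is a name chosen for this example, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 using the bit layout in the table."""
    if cp < 0x80:                                   # 0zzzzzzz
        return bytes([cp])
    if cp < 0x800:                                  # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110xxxx 10yyyyyy 10zzzzzz
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogates are not valid scalar values")
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                              # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

print(encode_utf8(0x05D0).hex())  # d790
```

For U+05D0 the five high bits 10111 go into the first byte (0xC0 | 0x17 = 0xD7) and the six low bits 010000 into the second (0x80 | 0x10 = 0x90), matching the worked example.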

So the first 128 characters (US-ASCII) need only one byte. The next 1920 characters need two bytes; they include Latin letters with diacritics and the Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic alphabets. The remaining characters in the Basic Multilingual Plane use three bytes, and all other characters use four bytes.

Design rationale of UTF-8 (UTF-8 features)

By design, UTF-8 byte sequences have the following properties:

  • The most significant bit of a single-byte character is always 0.
  • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence: 110 means a two-byte sequence, 1110 a three-byte sequence, and so on.
  • The two most significant bits of every remaining byte in a multi-byte sequence are 10.

These properties guarantee that the byte sequence of one character never occurs inside the byte sequence of another character. Byte-based string comparison (substring matching) can therefore be used to search for characters or words in text. Some older variable-length 8-bit encodings (such as Shift JIS) lack this property, which makes their string-matching algorithms quite complex. Although this adds redundancy to UTF-8 encoded strings, the advantages outweigh the disadvantages. Besides, data compression is not the purpose of Unicode, so the two should not be conflated. Even if some bytes are lost entirely through errors or interference in transmission, a decoder can resynchronize at the start of the next character, which limits the damage.

On the other hand, because of this byte-sequence design, if a sequence suspected of being text passes validation as UTF-8, we can be fairly confident that it really is a UTF-8 string. The probability that a random two-byte sequence happens to be valid UTF-8 rather than ASCII is about 1 in 32. For a three-byte sequence it is about 1 in 256, and for longer sequences it is lower still.
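Resynchronization after data loss can be sketched in a few lines of Python. The helper names `is_continuation` and `resync` are invented for this example; the only rule used is that continuation bytes have the form 10xxxxxx.

```python
def is_continuation(b: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (b & 0xC0) == 0x80

def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos

data = "\u4e2d\u6587abc".encode("utf-8")   # two 3-byte characters, then ASCII
damaged = data[1:]                          # the first byte was lost in transit
start = resync(damaged, 0)
recovered = damaged[start:].decode("utf-8")
print(len(recovered))                       # 4: only the damaged character is lost

# Byte-level substring search is safe for the same reason.
print("\u6587".encode("utf-8") in data)     # True
```

Only the character whose lead byte was lost is dropped; everything after it decodes normally.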
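The two-byte figure can be checked by brute force. The sketch below relies on Python's built-in strict UTF-8 decoder; the function names are made up for the example. Note that a strict decoder also rejects the overlong lead bytes 0xC0 and 0xC1, so it accepts 1920 pairs rather than the 2048 bit patterns of the form 110xxxxx 10xxxxxx that give exactly 1 in 32.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def count_non_ascii_valid_pairs() -> int:
    """Count two-byte sequences that are valid UTF-8 but not pure ASCII."""
    count = 0
    for hi in range(256):
        for lo in range(256):
            pair = bytes([hi, lo])
            if is_valid_utf8(pair) and not pair.isascii():
                count += 1
    return count

valid = count_non_ascii_valid_pairs()
print(valid, valid / 65536)   # 1920 0.029296875
```

That is roughly 1 in 34, in the same range as the figure quoted above.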

  • Characters in the ASCII range are represented in one byte, and characters beyond it in multiple bytes, which gives the UTF-8 layout shown above. The advantage is that when a Unicode file contains only ASCII characters, every character is stored in one byte, so the file is an ordinary ASCII file, and it reads back the same way. UTF-8 is therefore compatible with existing ASCII files.
  • For characters beyond ASCII, the leading bits of the first byte indicate the length of the character's encoding. For example, the first three bits of 110xxxxx indicate a 2-byte character, 1110xxxx indicates a 3-byte character, and so on. The x positions are filled with the binary representation of the character's code, with the rightmost x being the least significant bit. Only the shortest multi-byte sequence that can hold the character's code may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.

UTF-8 features

  • UCS characters U+0000 to U+007F (ASCII) are encoded as bytes 0x00 to 0x7F (ASCII compatible). This means that a file containing only 7-bit ASCII characters is identical under ASCII and UTF-8.
  • All UCS characters above U+007F are encoded as sequences of several bytes, each of which has its most significant bit set. Therefore an ASCII byte (0x00-0x7F) can never appear as part of another character.
  • The first byte of a multi-byte sequence representing a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes the character contains. All remaining bytes of the sequence are in the range 0x80 to 0xBF. This makes resynchronization easy and makes the encoding stateless and robust against lost bytes.
  • All possible 2^31 UCS codes can be encoded.
  • In theory, a UTF-8 encoded character can be up to 6 bytes long; 16-bit BMP characters, however, take at most 3 bytes. (The 6-byte limit and the 0xC0-0xFD range belong to the original UCS-oriented definition; the four-byte table above reflects the current limit of U+10FFFF.)
  • The sort order of big-endian UCS-4 byte strings is preserved.
  • The bytes 0xFE and 0xFF are never used in UTF-8. Also, because UTF-8 uses the byte as its code unit, its byte order is the same on every system; there is no byte-order problem, so it does not need a BOM.
  • Compared with UTF-16 or other Unicode encodings, UTF-8 is less likely to cause problems on systems that do not support Unicode and XML.
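The sort-order property in the list above can be demonstrated with a small Python sketch. The sample characters are arbitrary choices for illustration; they include one supplementary-plane character so that the contrast with UTF-16 becomes visible.

```python
# One character per string: ASCII, Latin-1, CJK, full-width, supplementary plane.
words = ["z", "\u00e9", "\u4e2d", "\U00020000", "\uff21", "a"]

by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_ucs4 = sorted(words, key=lambda s: s.encode("utf-32-be"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8)    # True: byte order matches code-point order
print(by_code_point == by_ucs4)    # True
print(by_code_point == by_utf16)   # False: surrogates (D800-DFFF) sort before U+FF21
```

Sorting raw UTF-8 bytes therefore gives the same order as sorting big-endian UCS-4, while UTF-16 byte order can differ once supplementary-plane characters are involved.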

GB 18030

GB 18030 is fully compatible with GB 2312-1980 and basically compatible with GBK. It supports all the unified Han characters of GB 13000 and Unicode, and includes 70244 Chinese characters in total.

GB 18030 has the following features:

  • Multi-byte encoding: each character is made up of 1, 2, or 4 bytes (variable-length encoding).
  • A huge encoding space, with room for up to about 1.61 million characters.
  • Support for the scripts of China's ethnic minorities, without resorting to the user-defined character area.

Byte Structure

  • Single byte: values range from 0x00 to 0x7F.
  • Double byte: the first byte ranges from 0x81 to 0xFE, and the second byte from 0x40 to 0xFE (excluding 0x7F).
  • Four bytes: the first byte ranges from 0x81 to 0xFE, the second from 0x30 to 0x39, the third from 0x81 to 0xFE, and the fourth from 0x30 to 0x39.
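The byte ranges above are enough to find character boundaries and to see where the "1.61 million" figure comes from. In this Python sketch the function name `gb18030_seq_len` is an assumption for illustration, and only the first two bytes are inspected; the third and fourth bytes of a four-byte sequence are not validated.

```python
def gb18030_seq_len(data: bytes, i: int = 0) -> int:
    """Length (1, 2 or 4) of the GB 18030 sequence starting at data[i]."""
    b1 = data[i]
    if b1 <= 0x7F:
        return 1
    if not 0x81 <= b1 <= 0xFE or i + 1 >= len(data):
        raise ValueError("invalid or truncated sequence")
    b2 = data[i + 1]
    if 0x40 <= b2 <= 0xFE and b2 != 0x7F:
        return 2
    if 0x30 <= b2 <= 0x39:
        return 4          # bytes 3 and 4 are not checked in this sketch
    raise ValueError("invalid second byte")

# Size of the code space implied by the byte ranges.
single = 0x7F - 0x00 + 1                                  # 128
double = (0xFE - 0x81 + 1) * (0xFE - 0x40 + 1 - 1)        # 126 * 190, skipping 0x7F
four = ((0xFE - 0x81 + 1) * (0x39 - 0x30 + 1)) ** 2       # (126 * 10) ** 2
total = single + double + four
print(total)   # 1611668
```

Python ships a `gb18030` codec, so `gb18030_seq_len("\u4e2d".encode("gb18030"))` can be used to try the function on real data.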

GB 2312

GB 2312 is used in mainland China, and also in Singapore. Almost all Chinese-language systems and internationalized software in mainland China support GB 2312.

The GB 2312 standard contains 6763 Chinese characters: 3755 level-1 characters and 3008 level-2 characters. It also contains 682 full-width characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

GB 2312 basically satisfies the needs of computer processing of Chinese characters. The characters it includes cover 99.75% of usage frequency in mainland China.

GB 2312 cannot handle the rare characters found in personal names, classical Chinese, and similar contexts, which led to the later GBK and GB 18030 character sets.

GB 2312 divides the characters it contains into areas, each holding 94 characters or symbols. This representation is also called the location code (区位码).

  • Areas 01-09 contain special symbols.
  • Areas 16-55 contain the level-1 Chinese characters, sorted by pinyin.
  • Areas 56-87 contain the level-2 Chinese characters, sorted by radical and stroke count.

Areas 10-15 and 88-94 are not assigned.

For example, the character "啊" is the first Chinese character in GB 2312, and its location code is 1601.

Byte Structure

Programs that use GB 2312 usually store it in EUC form, which keeps it compatible with ASCII. The "GB2312" entry in a browser's encoding menu usually refers to this EUC-CN representation.

Each Chinese character or symbol is represented by two bytes. The first is called the "high byte" and the second the "low byte".

The high byte uses 0xA1-0xF7 (the area number, 01-87, plus 0xA0), and the low byte uses 0xA1-0xFE (the position within the area, 01-94, plus 0xA0). For the Chinese character areas, the high byte ranges from 0xB0 to 0xF7 and the low byte from 0xA1 to 0xFE, giving 72 × 94 = 6768 code positions. Five of these, D7FA-D7FE, are unassigned.

For example, most programs store "啊" as the two bytes 0xB0 (first byte) and 0xA1 (second byte). Compare with the location code: 0xB0 = 0xA0 + 16 and 0xA1 = 0xA0 + 1.
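The conversion from a location code to the stored bytes is simple arithmetic, sketched below in Python. The function name `location_code_to_euc_cn` is made up for this example; the result is checked against Python's standard `gb2312` codec.

```python
def location_code_to_euc_cn(code: int) -> bytes:
    """Convert a 4-digit location code such as 1601 to its two EUC-CN bytes."""
    area, position = divmod(code, 100)
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must both be in 01-94")
    # Add 0xA0 to both parts, as described in the text.
    return bytes([0xA0 + area, 0xA0 + position])

raw = location_code_to_euc_cn(1601)
print(raw.hex())                          # b0a1
print(raw == "\u554a".encode("gb2312"))   # True: U+554A is the character at 1601
```

Adding 0xA0 sets the high bit of both bytes, which is what keeps them from colliding with 7-bit ASCII.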

EUC-CN (the form shown in online encoding tables; GB 2312 strings are truncated according to it)

EUC-CN is the most commonly used representation of GB 2312.

Characters in GB 2312 are represented in two bytes.

The first byte uses 0xA1-0xF7.

The second byte uses 0xA1-0xFE.

For example, "啊", the first Chinese character in GB 2312, has the location code 1601. In EUC-CN it is encoded as 0xB0A1, since 0xA0 + 16 = 0xB0 and 0xA0 + 1 = 0xA1.
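Because every GB 2312 character in EUC-CN has its high bit set and ASCII bytes do not, a string can be truncated safely by walking it byte by byte. This is a minimal Python sketch under that assumption; `truncate_euc_cn` is a hypothetical helper name, and the input is assumed to be well-formed EUC-CN.

```python
def truncate_euc_cn(data: bytes, max_bytes: int) -> bytes:
    """Cut EUC-CN data to at most max_bytes without splitting a 2-byte character."""
    i = 0
    while i < len(data):
        n = 1 if data[i] < 0x80 else 2   # ASCII is 1 byte, GB 2312 characters are 2
        if i + n > max_bytes:
            break
        i += n
    return data[:i]

text = b"a\xb0\xa1b"                # "a", the character at location 1601, "b"
print(truncate_euc_cn(text, 2))     # b'a'
print(truncate_euc_cn(text, 3))     # b'a\xb0\xa1'
```

Cutting blindly at 2 bytes would leave a dangling 0xB0; walking from the start avoids that.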
