But I care about the principle behind this feature and want to understand things properly, so I posted the question in several QQ groups one after another. Nobody responded. Depressing. I had to Google it and teach myself. A detailed description follows.
With no one to ask, I have some personal thoughts. Very few people today dig into theory; most are content to muddle along, knowing what works without knowing why. For programming, I think this is both sad and dangerous. It may be one reason China's IT industry lags behind that of the United States, and I hope Chinese programmers will think about it.
The following material was found on the Internet.
Encoding and implementation of Unicode
Roughly speaking, the Unicode coding system can be divided into two levels: encoding and implementation.
Encoding Method
The Unicode encoding corresponds to the Universal Character Set (UCS) defined by ISO 10646. The Unicode version in practical use today corresponds to UCS-2, which uses a 16-bit coding space; that is, each character occupies 2 bytes. In theory this can represent at most 2^16, or 65,536, characters, which basically meets the needs of the various languages. In fact, the current version of Unicode has not filled this 16-bit space and reserves a lot of room for special uses or future expansion.
These 16-bit Unicode characters form the Basic Multilingual Plane (BMP). The most recent (but not yet widely used) version of Unicode defines 16 supplementary planes; together with the BMP they need at least 21 bits of coding space, slightly less than 3 bytes. In practice, supplementary-plane characters still occupy 4 bytes of encoded space, consistent with UCS-4. Future versions will extend to ISO 10646-1 implementation level 3, which covers all UCS-4 characters. UCS-4 is a larger, not yet fully populated 31-bit character set; with a leading bit that is always 0 it takes 32 bits, or 4 bytes. In theory it can represent up to 2^31 = 2,147,483,648 characters, enough to cover the symbols used in all languages.
The Unicode encoding of a BMP character is written U+hhhh, where each h is a hexadecimal digit; it is identical to the UCS-2 encoding. The corresponding 4-byte UCS-4 encoding keeps the same value in its last two bytes, and all bits of its first two bytes are 0.
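To see the relationship between the 2-byte and 4-byte forms, here is a minimal Python sketch. It assumes that the standard `utf-16-be` and `utf-32-be` codecs can stand in for UCS-2 and UCS-4, which holds only for BMP characters; the sample character is an arbitrary choice of mine.

```python
# For a BMP character, UTF-16 big-endian bytes equal the UCS-2 form and
# UTF-32 big-endian bytes equal the UCS-4 form (assumption: BMP only).
ch = "\u4e2d"  # U+4E2D, an arbitrary BMP character used for illustration

ucs2 = ch.encode("utf-16-be")  # two bytes
ucs4 = ch.encode("utf-32-be")  # four bytes

print(ucs2.hex())  # 4e2d
print(ucs4.hex())  # 00004e2d

# The 4-byte form is the 2-byte form preceded by two zero bytes.
assert ucs4 == b"\x00\x00" + ucs2
```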
For the detailed relationship between Unicode, ISO 10646, and UCS, see Universal Character Set.
Implementation mode
The implementation of Unicode is distinct from its encoding. The Unicode code of a character is fixed, but in actual transmission, because platforms are designed differently and because space is worth saving, Unicode is implemented in different ways. An implementation of Unicode is called a Unicode Transformation Format (UTF).
For example, if a Unicode file contains only basic 7-bit ASCII characters and each character is transmitted in the original 2-byte Unicode encoding, all 8 bits of the first byte are always 0, which is quite wasteful. In this case UTF-8 can be used: it is a variable-length encoding that still represents basic 7-bit ASCII characters in 7 bits, taking one byte each (with the high bit set to 0). When other Unicode characters are mixed in, they are converted by a fixed algorithm so that each character takes 1-3 bytes, with leading 0 or 1 bits identifying the role of each byte. This greatly shortens Western-language documents that consist mainly of 7-bit ASCII characters (see UTF-8 for details). Similarly, for the 4-byte supplementary-plane characters and other UCS-4 extensions that will appear in the future, the 2-byte UTF-16 encoding also needs conversion by a fixed algorithm.
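The waste described above is easy to measure. The following is only a rough sketch: the sample string is my own, and `utf-16-be` (without BOM) is used as a stand-in for the "original 2-byte Unicode encoding".

```python
text = "Hello, Unicode"  # 14 ASCII characters, arbitrary sample

utf8 = text.encode("utf-8")       # one byte per ASCII character
utf16 = text.encode("utf-16-be")  # two bytes per BMP character, no BOM

print(len(utf8))   # 14
print(len(utf16))  # 28

# For ASCII input, the first (high) byte of every 2-byte unit is zero.
high_bytes = utf16[0::2]
print(all(b == 0 for b in high_bytes))  # True
```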
Again, if you use the
In addition, implementations of Unicode include UTF-7, Punycode, CESU-8, SCSU, UTF-32, and others; some are used only in certain countries and regions, and some are planned for future use. The common implementations at present are UTF-16 little-endian (with BOM), UTF-16 big-endian (with BOM), and UTF-8. In Notepad, shipped with Microsoft Windows XP, the "Save As" dialog offers four encodings. Apart from the non-Unicode ANSI option (ASCII on an English system, GB2312 or Big5 on a Chinese system), the other three are "Unicode" (UTF-16 LE), "Unicode big endian" (UTF-16 BE), and "UTF-8".
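The difference between the little-endian and big-endian forms, and what the BOM looks like in each, can be inspected with Python's standard `codecs` module. This is just an illustrative sketch with a sample character of my choosing; it does not claim to reproduce exactly what Notepad writes.

```python
import codecs

ch = "A"  # U+0041

le = ch.encode("utf-16-le")  # low byte first
be = ch.encode("utf-16-be")  # high byte first
print(le.hex())  # 4100
print(be.hex())  # 0041

# The BOM is U+FEFF serialized in the byte order of the file.
print(codecs.BOM_UTF16_LE.hex())  # fffe
print(codecs.BOM_UTF16_BE.hex())  # feff
print(codecs.BOM_UTF8.hex())      # efbbbf
```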
At present, work on the supplementary planes concentrates on the unified ideographs of the second and third planes, so the various encodings of Simplified Chinese, Traditional Chinese, Japanese, Korean, and Vietnamese characters, including GBK, GB18030, and Big5, are a focus of Unicode harmonization. Given that Unicode ultimately aims to cover all characters, these encodings can in a sense be seen as implementations of Unicode that existed as a fait accompli before it. The same holds for ASCII and its extension Latin-1: in the 16-bit Unicode coding space, the first byte of each of their characters is all zeros and the second byte is identical to the original encoding. The correspondence between the East Asian encodings above and Unicode, however, is much more complicated.
UTF-8
UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its one-byte codes remain compatible with ASCII, so software that originally handled ASCII text can continue to be used with little or no change. As a result, it has gradually become the preferred encoding for e-mail, Web pages, and other applications that store or transmit text.
UTF-8 uses one to four bytes to encode each character:
- The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
- Latin letters with diacritics, Greek, Cyrillic, Armenian, Hebrew, Arabic, and Syriac characters need two bytes (Unicode range U+0080 to U+07FF).
- The other characters of the Basic Multilingual Plane (BMP), which include most characters in common use, are encoded in three bytes.
- The rarely used characters of the Unicode supplementary planes are encoded in four bytes.
For the fourth category of characters above, four bytes per character may seem expensive. However, UTF-8 represents all commonly used characters in at most three bytes, and its alternative, UTF-16, also needs four bytes for that fourth category, so whether UTF-8 or UTF-16 is more efficient depends on the range of characters you use. If a traditional compression scheme such as DEFLATE is applied, the difference between the encodings becomes negligible. If traditional compression does not help much with short texts, consider the Standard Compression Scheme for Unicode (SCSU).
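How the balance between UTF-8 and UTF-16 shifts with the character range can be checked directly. The sample strings below are arbitrary picks of mine, one per category, and the `sizes` helper is a hypothetical name introduced only for this sketch.

```python
samples = {
    "ASCII": "encoding",                # 8 chars, 1 byte each in UTF-8
    "Hebrew": "\u05d0" * 4,             # 2 bytes each in UTF-8
    "CJK (BMP)": "\u4e2d" * 4,          # 3 bytes each in UTF-8
    "Supplementary": "\U00020000" * 4,  # 4 bytes each in both encodings
}

def sizes(s: str) -> tuple[int, int]:
    """Return (UTF-8 length, UTF-16 length without BOM) in bytes."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-be"))

for name, s in samples.items():
    u8, u16 = sizes(s)
    print(f"{name:14} utf-8={u8:2}  utf-16={u16:2}")
```

On these samples UTF-8 wins for ASCII, ties for Hebrew and supplementary-plane characters, and loses for BMP CJK text.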
The bits of a Unicode character are split into several groups and placed in the low-order bit positions of the UTF-8 bytes. Characters below U+0080 use a single byte that holds their code value; these encodings correspond exactly to 7-bit ASCII. In all other cases, up to 4 bytes may be needed to represent one character. The most significant bit of every byte in these multi-byte sequences is set to 1, which prevents confusion with 7-bit ASCII characters and keeps standard byte-oriented string routines working.
| Code range (hexadecimal) | Scalar value (binary) | UTF-8 (binary / hexadecimal) | Notes |
|---|---|---|---|
| 000000-00007F (128 codes) | 00000000 00000000 0zzzzzzz | 0zzzzzzz (00-7F) | ASCII range; the byte starts with 0. Payload: seven z bits. |
| 000080-0007FF (1,920 codes) | 00000000 00000yyy yyzzzzzz | 110yyyyy (C2-DF) 10zzzzzz (80-BF) | The first byte starts with 110; the following byte starts with 10. Payload: five y bits, six z bits. |
| 000800-00D7FF, 00E000-00FFFF (61,440 codes) [Note 1] | 00000000 xxxxyyyy yyzzzzzz | 1110xxxx (E0-EF) 10yyyyyy 10zzzzzz | The first byte starts with 1110; the following bytes start with 10. Payload: four x bits, six y bits, six z bits. |
| 010000-10FFFF (1,048,576 codes) | 000wwwxx xxxxyyyy yyzzzzzz | 11110www (F0-F4) 10xxxxxx 10yyyyyy 10zzzzzz | The first byte starts with 11110; the following bytes start with 10. Payload: three w bits, six x bits, six y bits, six z bits. |
Note
- 1: Unicode assigns no characters in the range D800-DFFF. Within the Basic Multilingual Plane this range is reserved for UTF-16 surrogates, which identify supplementary-plane characters (two UTF-16 code units represent one supplementary-plane character). Any encoding can of course be made to produce values in this range, but they do not represent any legal Unicode value.
The table above is the key to truncating a UTF-8 string in PHP: the leading bits of each byte tell you what kind of byte it is and how many bytes the character occupies. (In this respect UTF-8 is somewhat like the five classes of IP addresses, which are also told apart by their leading bits.)
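Here is how that truncation rule might look in code. The sketch is in Python on raw bytes rather than PHP, the function name `utf8_truncate` is my own, and it assumes the input is already valid UTF-8. The idea is only to check the top two bits: a byte of the form 10xxxxxx is a continuation byte, so a cut must never fall just before one.

```python
def utf8_truncate(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut is inside a character,
    # so step back until the cut falls before a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

sample = "a\u05d0\u4e2d".encode("utf-8")  # 1 + 2 + 3 = 6 bytes
print(utf8_truncate(sample, 2))  # b'a'
print(utf8_truncate(sample, 5).decode("utf-8"))  # the first two characters
```

Since PHP strings are also byte sequences, the same bit test should carry over, but I have not shown PHP code here.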
For example, the Unicode code of the Hebrew letter aleph (א) is U+05D0, which is converted to UTF-8 as follows:
- It falls in the range U+0080 to U+07FF, which the table shows uses two bytes, 110yyyyy 10zzzzzz.
- Hexadecimal 0x05D0 in binary is 101-1101-0000.
- These 11 bits are placed in order into the "y" and "z" positions: 10111 and 010000, giving 11010111 10010000.
- The final result is two bytes, written in hexadecimal as 0xD7 0x90, which is the UTF-8 encoding of aleph (א).
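The steps above generalize to all four rows of the table. Below is a minimal hand-written encoder that follows the bit layouts given in the table; `utf8_encode` is a name I made up, and in real code you would simply call `str.encode("utf-8")`. The point is only to make the bit shuffling visible.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following the table's bit layout."""
    if cp < 0x80:                      # 0zzzzzzz
        return bytes([cp])
    if cp < 0x800:                     # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10yyyyyy 10zzzzzz
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("D800-DFFF are surrogates, not characters")
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

print(utf8_encode(0x05D0).hex())  # d790
```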
So the first 128 characters (US-ASCII) need only one byte, and the next 1,920 characters need two bytes; these include Latin letters with diacritics and the Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic alphabets. The remaining characters of the Basic Multilingual Plane use three bytes, and all other characters use four bytes.
Reasons for designing UTF-8 (UTF-8 features)
UTF-8 byte sequences are designed to have the following properties:
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence: a leading 110 marks a two-byte sequence, 1110 a three-byte sequence, and so on.
- The two most significant bits of every remaining byte in a multi-byte sequence are 10.
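These three rules are enough to walk through a byte string one character at a time. The sketch below is limited to sequences of up to four bytes and assumes well-formed input; both function names are hypothetical.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a lead byte."""
    if (first_byte & 0x80) == 0x00:   # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:   # 11110xxx
        return 4
    raise ValueError("continuation byte or invalid lead byte")

def split_chars(data: bytes) -> list[bytes]:
    """Split well-formed UTF-8 data into per-character byte strings."""
    chars, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars

print(split_chars("a\u05d0\u4e2d".encode("utf-8")))
# [b'a', b'\xd7\x90', b'\xe4\xb8\xad']
```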
These properties of UTF-8 ensure that the byte sequence of one character never appears inside the byte sequence of another. Byte-based substring matching can therefore be used to search for characters or words in text. Some older variable-length 8-bit encodings (such as Shift JIS) lack this property, which makes their string-matching algorithms considerably more complex. Although this design adds redundancy to UTF-8 encoded strings, the benefits outweigh the cost. Besides, data compression is not the purpose of Unicode, so the two should not be conflated. Even if some bytes are lost entirely through errors or interference during transfer, a decoder can resynchronize at the start of the next character, which limits the extent of the damage.
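Both claims, safe byte-wise searching and resynchronization after a lost byte, can be tried out with a few lines. This is only a sketch: the Hebrew sample is arbitrary, and `next_boundary` is a hypothetical helper that skips continuation bytes until it reaches the start of a character.

```python
def next_boundary(data: bytes, i: int = 0) -> int:
    """Advance i past continuation bytes (10xxxxxx) to the next character start."""
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

original = "\u05d0\u05d1\u05d2".encode("utf-8")  # d7 90 d7 91 d7 92

# Byte-wise substring search finds the second letter at a character boundary.
needle = "\u05d1".encode("utf-8")
print(original.find(needle))  # 2

# Simulate losing the first byte in transit: the stream now starts mid-character.
damaged = original[1:]
start = next_boundary(damaged)
recovered = damaged[start:].decode("utf-8")
print(len(recovered))  # 2 (only the damaged first character is lost)
```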
On the other hand, because of this byte-sequence design, if a string of uncertain encoding passes validation as UTF-8, we can say with reasonable confidence that it is a UTF-8 string. The probability that a random two-byte sequence happens to be legal UTF-8 rather than ASCII is 1 in 32; for a three-byte sequence it is 3 in 256, and for longer sequences it is lower still.
- Code points in the ASCII range are represented by one byte, and those beyond the ASCII range by several bytes; this gives the UTF-8 representation seen above. The advantage is that when a Unicode file contains only ASCII codes, it is stored one byte per character and reads exactly like an ordinary ASCII file, so it stays compatible with existing ASCII files.
- For code points beyond ASCII, the leading bits of the first byte give the length of the character: a first byte of the form 110xxxxx announces a two-byte character, 1110xxxx a three-byte character, and so on. The x positions are filled with the bits of the binary representation of the character's code number; the further right an x is, the less significant it is. Only the shortest multi-byte sequence sufficient to express a code number may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.
Characteristics of UTF-8
- UCS characters U+0000 to U+007F (ASCII) are encoded as bytes 0x00 to 0x7F (ASCII-compatible), so a file containing only 7-bit ASCII characters is identical under ASCII and UTF-8.
- All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set. As a result, an ASCII byte (0x00-0x7F) can never be part of any other character.
- The first byte of a multi-byte sequence representing a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes the character contains. The remaining bytes are all in the range 0x80 to 0xBF. This makes resynchronization easy and makes the encoding stateless and robust against lost bytes.
- All possible 2^31 UCS codes can be encoded.
- In this original design, a UTF-8 encoded character can theoretically be up to 6 bytes long (code points up to U+10FFFF need at most 4), while 16-bit BMP characters take at most 3 bytes.
- The sorting order of big-endian UCS-4 byte strings is preserved.
- Bytes 0xFE and 0xFF are never used in UTF-8. Because UTF-8 is encoded byte by byte, its byte order is the same on every system, so there is no byte-order problem and a BOM is not actually required.
- Compared with UTF-16 or other Unicode encodings, UTF-8 is less likely to cause problems for systems that do not support Unicode and XML.
GB18030
GB 18030 is fully compatible with GB 2312-1980, largely compatible with GBK, and supports all the unified Han characters of GB 13000 and Unicode, 70,244 Chinese characters in total.
GB 18030 has the following main characteristics:
- It is a multi-byte encoding: each character consists of 1, 2, or 4 bytes (variable-length code).
- The coding space is large; up to about 1.61 million characters can be defined.
- It supports the scripts of China's ethnic minorities without resorting to the user-defined character area.
Byte structure
- Single byte: values from 0x00 to 0x7F.
- Two bytes: the first byte ranges from 0x81 to 0xFE, and the second byte from 0x40 to 0xFE (excluding 0x7F).
- Four bytes: the first byte ranges from 0x81 to 0xFE, the second from 0x30 to 0x39, the third from 0x81 to 0xFE, and the fourth from 0x30 to 0x39.
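From these ranges, the length of a GB 18030 character follows from its first two bytes: the second byte decides between the two-byte and four-byte forms. The sketch below applies only the ranges listed above; the function name is my own, and it does not validate the third and fourth bytes.

```python
def gb18030_char_length(data: bytes, i: int = 0) -> int:
    """Return the byte length (1, 2, or 4) of the GB 18030 character at data[i]."""
    b1 = data[i]
    if b1 <= 0x7F:
        return 1
    if 0x81 <= b1 <= 0xFE and i + 1 < len(data):
        b2 = data[i + 1]
        if 0x30 <= b2 <= 0x39:                 # digit-range second byte
            return 4
        if 0x40 <= b2 <= 0xFE and b2 != 0x7F:  # ordinary two-byte form
            return 2
    raise ValueError("not a valid GB 18030 lead sequence")

print(gb18030_char_length(b"A"))                 # 1
print(gb18030_char_length(b"\xb0\xa1"))          # 2
print(gb18030_char_length(b"\x81\x30\x81\x30"))  # 4
```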
GB2312
GB 2312 is used in mainland China, Singapore, and elsewhere. Almost all Chinese-language systems and internationalized software in mainland China support GB 2312.
The GB 2312 standard contains 6,763 Chinese characters: 3,755 level-1 characters and 3,008 level-2 characters. It also contains 682 full-width characters, including the Latin alphabet, the Greek alphabet, Japanese hiragana and katakana, and the Russian Cyrillic alphabet.
GB 2312 basically met the needs of computer processing of Chinese characters; the characters it contains cover 99.75% of usage by frequency in mainland China.
GB 2312 cannot handle the rare characters found in personal names, classical Chinese, and similar contexts, which led to the later GBK and GB 18030 character sets.
In GB 2312 the characters are arranged in "zones", each holding 94 characters or symbols. This representation is also called the location code.
- Zones 01-09 hold special symbols.
- Zones 16-55 hold the level-1 Chinese characters, sorted by pinyin.
- Zones 56-87 hold the level-2 Chinese characters, sorted by radical and stroke count.
- Zones 10-15 and 88-94 are not assigned.
For example, the character 啊 ("a") is the first Chinese character in GB 2312, and its location code is 1601.
Byte structure
Programs that use GB 2312 usually store it in EUC form for compatibility with ASCII. The "GB2312" entry in a browser's encoding menu usually refers to this "EUC-CN" representation.
Each character or symbol is represented by two bytes. The first byte is called the "high byte" and the second the "low byte".
The high byte uses 0xA1-0xF7 (the zone numbers 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (positions 01-94 plus 0xA0). Because the level-1 Chinese characters start at zone 16, the high byte of a Chinese character ranges over 0xB0-0xF7 and the low byte over 0xA1-0xFE, giving 72*94=6768 code positions, of which the five positions D7FA-D7FE are vacant.
For example, the character 啊 is stored in most programs as two bytes, 0xB0 (first byte) and 0xA1 (second byte). Compared with the location code: 0xB0 = 0xA0 + 16 and 0xA1 = 0xA0 + 1.
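The arithmetic is simple enough to write down. In the sketch below `quwei_to_euc_cn` is a hypothetical helper name, and the result is checked against Python's built-in `gb2312` codec.

```python
def quwei_to_euc_cn(code: int) -> bytes:
    """Convert a four-digit location code (zone * 100 + position) to EUC-CN bytes."""
    zone, pos = divmod(code, 100)  # 1601 -> zone 16, position 1
    if not (1 <= zone <= 94 and 1 <= pos <= 94):
        raise ValueError("zone and position must both be in 1..94")
    return bytes([0xA0 + zone, 0xA0 + pos])

encoded = quwei_to_euc_cn(1601)
print(encoded.hex())             # b0a1
print(encoded.decode("gb2312"))  # 啊
```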
EUC-CN (this is the code table to go by; truncating a GB2312 string also follows this table)
EUC-CN is the most commonly used representation of GB 2312. The "GB2312" entry in a browser's encoding menu usually refers to "EUC-CN".
GB 2312 characters are represented using two bytes.
- The first byte uses 0xA1-0xF7.
- The second byte uses 0xA1-0xFE.
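Because the first-byte and second-byte ranges overlap, you cannot tell from a single byte whether it starts a character, so truncating an EUC-CN string safely means scanning from the beginning. Here is a rough sketch of that scan; `euc_cn_truncate` is a hypothetical name, and the input is assumed to be well-formed EUC-CN mixed with ASCII.

```python
def euc_cn_truncate(data: bytes, max_bytes: int) -> bytes:
    """Cut EUC-CN data to at most max_bytes without splitting a two-byte character."""
    i = 0
    while i < len(data):
        # A byte in 0xA1-0xF7 at a character start opens a two-byte character.
        step = 2 if 0xA1 <= data[i] <= 0xF7 else 1
        if i + step > max_bytes:
            break
        i += step
    return data[:i]

sample = "a\u554ab".encode("gb2312")  # 61 b0 a1 62
print(euc_cn_truncate(sample, 2))  # b'a'
print(euc_cn_truncate(sample, 3))  # b'a\xb0\xa1'
```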
For example, the character 啊 is the first Chinese character in GB 2312, and its location code is 1601.
In EUC-CN, 0xA0 + 16 = 0xB0 and 0xA0 + 1 = 0xA1, which gives 0xB0A1.