ASCII, UTF-8, and Unicode character encoding specifications

Source: Internet
Author: User

ASCII

ASCII defines 128 characters in total. Each one uses only the low seven bits of a byte, and the highest bit is always set to 0. For example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). The 128 characters include 33 non-printing control characters (codes 0 to 31, plus DEL at 127).
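
The two values above can be checked directly. The following is a minimal Python sketch; the sample string "Hello" and the variable names are only illustrative.

```python
# Print the code and the 8-bit binary form of the two characters from the text.
for ch in (" ", "A"):
    code = ord(ch)
    print(repr(ch), code, format(code, "08b"))

# Every byte produced by ASCII encoding is below 128, so its highest bit is 0.
ascii_bytes = "Hello".encode("ascii")
high_bit_clear = all(b < 0x80 for b in ascii_bytes)
print(high_bit_clear)
```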

Unicode

Many encoding methods exist in the world, and the same binary number can be interpreted as different symbols under each of them. To read a text file correctly, you must therefore know which encoding it uses; decoding it with the wrong one produces garbled characters. This is why emails often arrive garbled: the sender and the receiver use different encodings.
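
As a small illustration of how a mismatch produces garbled text, the Python sketch below encodes one Chinese character with UTF-8 and then decodes the same bytes with Latin-1. The choice of character and of the two encodings is an assumption made for the example, not something the article prescribes.

```python
text = "严"                       # sample character, chosen for illustration
raw = text.encode("utf-8")        # bytes written by the sender
wrong = raw.decode("latin-1")     # receiver assumes a different encoding
right = raw.decode("utf-8")       # receiver uses the sender's encoding

print(raw.hex())   # the same three bytes in both cases
print(wrong)       # three unrelated Latin-1 symbols
print(right)       # the original character
```

The bytes never change; only the interpretation does.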

Imagine a single encoding that included every symbol in the world and gave each one a unique code. The garbling problem would then disappear. This is Unicode: as its name suggests, it is an encoding for all symbols.

Unicode is, of course, a very large collection, with room for more than 1 million symbols. Each symbol has its own code point. For example, U+0639 is the Arabic letter ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). You can look up specific symbols in the code charts at unicode.org or in a dedicated Chinese character table.
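
The three code points above can be inspected with Python's standard library. This is only a sketch for exploring the values; `unicodedata` reports the names from whichever Unicode version the interpreter ships with.

```python
import sys
import unicodedata

# Code points taken from the paragraph above.
for cp in (0x0639, 0x0041, 0x4E25):
    ch = chr(cp)
    print(f"U+{cp:04X}", ch, unicodedata.name(ch))

# Size of the Unicode code space (0 through 0x10FFFF inclusive).
code_space = sys.maxunicode + 1
print(code_space)
```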

Unicode Problems

Note that Unicode is just a character set. It specifies only the binary code of each symbol, not how that code should be stored.

For example, the Unicode code point of the Chinese character 严 ("strict") is hexadecimal 4E25, which is 15 bits long in binary (100111000100101). Representing this symbol therefore takes at least two bytes. Symbols with larger code points may need three or four bytes.
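
The bit count can be reproduced with a short Python sketch; the variable names are illustrative.

```python
cp = 0x4E25                    # code point of the example character
bits = format(cp, "b")         # binary digits without leading zeros
n_bits = cp.bit_length()       # number of significant bits
n_bytes = (n_bits + 7) // 8    # whole bytes needed to hold that many bits

print(bits, n_bits, n_bytes)
```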

This raises two serious problems. First, how can Unicode be distinguished from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? Second, one byte is already enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, the first two or three bytes of each English letter would all be 0, which is a huge waste of storage. An English text file would become two or three times larger, which is unacceptable.
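
To make the storage cost concrete, the sketch below compares a one-byte-per-character encoding with a fixed four-byte one. UTF-32 is used here only as a convenient stand-in for a "uniform width" scheme; the article does not name a specific one.

```python
text = "Unicode"                        # sample English text

one_byte = text.encode("ascii")         # 1 byte per character
four_byte = text.encode("utf-32-be")    # fixed 4 bytes per character, no BOM

print(len(one_byte), len(four_byte))
print(four_byte[:4].hex())              # first letter: three zero bytes, then the ASCII byte
```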

The consequences were: 1) Multiple ways of storing Unicode appeared, that is, many different binary formats can be used to represent Unicode. 2) Unicode could not be widely adopted for a long time, until the Internet emerged.
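
As an illustration of point 1, the Python sketch below stores the same code point, U+4E25, in several binary formats. The four codecs listed are common examples chosen for this sketch, not a complete list.

```python
ch = chr(0x4E25)   # the example character from the previous section

# Same code point, different byte sequences depending on the storage format.
encoded = {
    codec: ch.encode(codec)
    for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")
}

for codec, data in encoded.items():
    print(f"{codec:10} {data.hex()}")
```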

UTF-8

UTF-8 is one implementation of Unicode, which means it prescribes a specific byte structure. When we say that Chinese characters range from 0x4E00 to 0x9FA5, we are referring to Unicode values; in UTF-8, each of those values is encoded as three bytes. In other words, Unicode is a character set that defines the code value of each character, and there are multiple ways to implement it.

UTF-8 is a variable-length encoding. If a character is encoded in a single byte, the highest bit of that byte is 0. If it is encoded in multiple bytes, the number of consecutive 1 bits at the top of the first byte gives the total number of bytes, and every following byte starts with 10. The original UTF-8 design allowed up to 6 bytes; since RFC 3629, UTF-8 is limited to 4 bytes, which is enough for every Unicode code point up to U+10FFFF.
For example:
1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Therefore, at most 31 bits of a UTF-8 sequence can carry the character code (in the 6-byte form); these are the bits marked x in the table above. Apart from the control bits (the length prefix of the first byte and the 10 at the start of each following byte), the x bits correspond one-to-one to the bits of the Unicode code, in the same order.
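
The correspondence between the x bits and the Unicode code can be sketched as a small decoder. The Python code below is a simplified illustration, not a validating decoder: the function names are hypothetical, only the 1 to 4 byte forms are handled, and overlong forms and surrogates are not rejected.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by the leading byte."""
    if first_byte < 0x80:              # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:       # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:      # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:     # 11110xxx
        return 4
    raise ValueError(f"not a valid leading byte: {first_byte:#04x}")


def decode_one(data: bytes) -> int:
    """Decode the first UTF-8 sequence in data and return its code point."""
    n = utf8_sequence_length(data[0])
    if n == 1:
        return data[0]
    if len(data) < n:
        raise ValueError("truncated sequence")
    # Keep only the x bits of the leading byte (drop the length prefix).
    cp = data[0] & (0xFF >> (n + 1))
    for b in data[1:n]:
        if b >> 6 != 0b10:             # continuation bytes must start with 10
            raise ValueError(f"not a continuation byte: {b:#04x}")
        cp = (cp << 6) | (b & 0x3F)    # append the six x bits
    return cp


print(hex(decode_one(b"\xe4\xb8\xa5")))
```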
Unicode to UTF-8: to convert a Unicode code to UTF-8, first drop its leading zero bits, then use the number of remaining bits to choose the smallest UTF-8 form (the fewest bytes) whose x bits can hold them.
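
The conversion can be sketched in Python as follows. This is a minimal hand-written encoder for illustration, assuming the modern 4-byte limit (code points up to U+10FFFF); the function name `encode_code_point` is hypothetical, and in practice `str.encode("utf-8")` does this work.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if cp <= 0x7F:        # up to 7 bits  -> 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # up to 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # up to 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_code_point(0x4E25).hex())
```

Comparing the result with Python's built-in encoder is a quick way to check the sketch against a known implementation.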

Therefore, the characters in the basic ASCII character set (7 bits) need only one byte in UTF-8.

Chinese characters range from 0x4E00 to 0x9FA5 (that is, 0100 1110 0000 0000 to 1001 1111 1010 0101, which needs 15 to 16 data bits), so they must be expressed in three bytes.
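
The three-byte claim can be checked over the whole range with Python's built-in encoder. The sketch below also reports how many bits the two endpoints of the range need; the variable names are illustrative.

```python
FIRST, LAST = 0x4E00, 0x9FA5   # range quoted in the text

# Collect the distinct UTF-8 lengths that occur anywhere in the range.
lengths = {len(chr(cp).encode("utf-8")) for cp in range(FIRST, LAST + 1)}

# Significant bits needed by the first and last code point.
endpoint_bits = (FIRST.bit_length(), LAST.bit_length())

print(lengths, endpoint_bits)
```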
