UTF-8 Implementation of ASCII, MBCS (DBCS) and Unicode

Source: Internet
Author: User

From: http://blog.csdn.net/stone_kingnet/article/details/3998761

1. ASCII code

We know that inside a computer, all information is ultimately stored as binary values. Each binary bit has two states, 0 and 1, so eight bits can be combined into 256 states. This group of eight bits is called a byte. In other words, a single byte can represent 256 different states, each corresponding to one symbol, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that defines the relationship between English characters and binary values. It is called ASCII, and it is still in use today.

ASCII defines a total of 128 characters. For example, the space is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) occupy only the low seven bits of a byte; the highest bit is always set to 0.
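
The values above can be checked with a short sketch. Python is used here only for illustration, and the helper name ascii_bits is hypothetical, not something from the original text:

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary form of a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")

print(ord(" "), ascii_bits(" "))  # 32 00100000
print(ord("A"), ascii_bits("A"))  # 65 01000001
```

Because every ASCII code is below 128, the leftmost bit of the returned string is always 0.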

2. Non-ASCII Encoding

128 symbols are enough to encode English, but not enough for other languages. For example, French places accent marks above some letters, and these cannot be represented in ASCII. As a result, some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encoding systems used by these European countries can represent up to 256 symbols.

However, this creates a new problem. Different countries use different letters, so even though they all use 256-symbol encodings, the same value stands for different letters. For example, 130 represents é in the French encoding, the letter gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In every case, though, the symbols 0-127 are the same in all of these encodings; only the range 128-255 differs.
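
A small Python sketch can make this visible. The text does not name the specific encodings, so the three DOS code pages used here (cp437 for Western European, cp862 for Hebrew, cp866 for Russian) are assumptions chosen because they ship with Python's standard codecs:

```python
# Assumed code pages, for illustration only.
CODE_PAGES = ("cp437", "cp862", "cp866")

raw = bytes([130])  # the single byte 10000010

# Decode the same byte under each code page.
decoded = {name: raw.decode(name) for name in CODE_PAGES}
for name, ch in decoded.items():
    print(name, ch, f"U+{ord(ch):04X}")

# Bytes 0-127 decode to the same characters in all three code pages.
same_low_half = all(
    len({bytes([b]).decode(name) for name in CODE_PAGES}) == 1
    for b in range(128)
)
print(same_low_half)
```

The same byte comes back as three different letters, while the lower half of the table is shared.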

Asian languages use far more symbols; there are roughly 100,000 Chinese characters alone. One byte can represent only 256 symbols, which is clearly not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.

Chinese encoding deserves an article of its own and is not covered in this note. It is only pointed out here that, although these encodings also use multiple bytes to represent one symbol, GB-style Chinese character encodings have nothing to do with the Unicode and UTF-8 discussed below.

3. Unicode

As mentioned in the previous section, many encodings exist in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; otherwise, reading it with the wrong encoding produces garbled text. Why do emails often appear garbled? Because the sender and the receiver use different encodings.
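
The effect is easy to reproduce. In this minimal Python sketch, the sample word and the choice of UTF-8 and Latin-1 as the two mismatched encodings are assumptions for illustration:

```python
text = "café"

sent = text.encode("utf-8")     # the sender writes UTF-8 bytes
wrong = sent.decode("latin-1")  # the receiver wrongly assumes Latin-1
right = sent.decode("utf-8")    # the receiver uses the sender's encoding

print(wrong)  # cafÃ©
print(right)  # café
```

The bytes never changed; only the interpretation did.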

Imagine a single encoding that includes every symbol in the world, with each symbol assigned a unique code. The garbling problem would then disappear. This is Unicode: as its name suggests, it is an encoding of all symbols.

Unicode is, of course, a very large collection; at its current size it can hold more than one million symbols. Each symbol has its own code. For example, U+0639 is the Arabic letter ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). The full table of symbols can be looked up at unicode.org, or in specialized character tables.
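
As a quick check of these code points, here is a small Python sketch. The helper name code_point is hypothetical:

```python
def code_point(ch: str) -> str:
    """Format a character's Unicode code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ("\u0639", "A", "\u4e25"):  # Arabic ain, Latin A, Chinese 严
    print(ch, code_point(ch))

# chr() goes the other way, from code point to character.
print(chr(0x4E25))
```

In Python, ord() and chr() work on Unicode code points directly, independent of how the text is later stored.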

4. Unicode Problems

It should be noted that Unicode is only a set of symbols. It specifies the binary code of each symbol, but not how that code should be stored.

For example, the Unicode code of the Chinese character 严 ("strict") is hexadecimal 4E25, which is 15 bits long in binary (100111000100101). In other words, representing this symbol requires at least two bytes. Larger symbols may need three or four bytes, or even more.

There are two serious problems here. The first is how to distinguish Unicode from ASCII: how does a computer know that three bytes represent one symbol rather than three separate symbols? The second is that one byte is already enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would be preceded by two or three bytes of zeros. That is a great waste of storage, and text files would become two or three times larger, which is unacceptable.
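
To make the waste concrete, here is a small Python sketch. It uses UTF-32 as a stand-in for a scheme with a fixed four bytes per symbol; that choice is an assumption for illustration, not something the text above prescribes:

```python
text = "Hello"

one_byte = text.encode("ascii")        # 1 byte per letter
four_bytes = text.encode("utf-32-be")  # fixed 4 bytes per symbol, no byte order mark

print(len(one_byte), len(four_bytes))  # 5 20
print(four_bytes[:4].hex())            # 00000048: three zero bytes, then 'H'
```

Every English letter carries three leading zero bytes in the fixed-width form.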

The result is:

  • 1. Multiple ways of storing Unicode appeared; that is, there are many different binary formats that can represent Unicode.

  • 2. Unicode could not be widely adopted for a long time, until the rise of the Internet.

5. UTF-8

With the spread of the Internet, a unified encoding became strongly needed. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one of the implementations of Unicode.

The biggest feature of UTF-8 is that it is a variable-length encoding. It uses one to four bytes to represent a symbol, and the number of bytes varies with the symbol.

The UTF-8 encoding rules are very simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining seven bits hold the Unicode code of the symbol. So for English letters, UTF-8 encoding is identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits, not mentioned above, hold the Unicode code of the symbol.

The following table summarizes the encoding rules. The letter x marks the bits available for encoding.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
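
The row lookup in this table can be sketched in Python as follows. The function name utf8_length is hypothetical, and the sketch only picks the row; it does not produce the bytes:

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a code point, per the table's ranges."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return 1  # 0xxxxxxx
    if code_point <= 0x7FF:
        return 2  # 110xxxxx 10xxxxxx
    if code_point <= 0xFFFF:
        return 3  # 1110xxxx 10xxxxxx 10xxxxxx
    return 4      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

print(utf8_length(0x41), utf8_length(0x4E25))  # 1 3
```

The result can be compared with len(ch.encode("utf-8")) from Python's built-in codec.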

Next, we take the Chinese character 严 ("strict") as an example to show how UTF-8 encoding is done.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严, fill in the x positions of the format from back to front, and pad the leftover positions with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
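
The back-to-front filling step can be sketched in Python. This is a minimal illustration under stated assumptions: the function name utf8_encode is hypothetical, and the sketch does not reject surrogate code points (D800-DFFF) as a production encoder would:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the matching format."""
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000
    elif code_point <= 0x10FFFF:
        n, lead = 4, 0b11110000
    else:
        raise ValueError("code point out of range")

    tail = []
    for _ in range(n - 1):
        # Take the lowest six bits and prefix them with 10.
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Whatever is left goes into the first byte, after its length marker.
    first = lead | code_point
    return bytes([first] + tail[::-1])

print(utf8_encode(0x4E25).hex())  # e4b8a5
```

The loop collects the trailing bytes last-to-first, which is why the list is reversed before the first byte is prepended.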

6. Conversion between Unicode and UTF-8

From the example in the previous section, we can see that the Unicode code of 严 ("strict") is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. The conversion between them can be done by a program.
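
As one possible shape for such a program, here is a Python sketch of the UTF-8 to Unicode direction. The function name utf8_decode_one is hypothetical, and the sketch skips some checks a strict decoder would make (for example, it does not reject overlong forms):

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode the first UTF-8 sequence in data and return its code point."""
    first = data[0]
    if first < 0b10000000:       # 0xxxxxxx
        return first
    if first >> 5 == 0b110:      # 110xxxxx
        n, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:   # 1110xxxx
        n, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:  # 11110xxx
        n, value = 4, first & 0b00000111
    else:
        raise ValueError("invalid first byte")
    if len(data) < n:
        raise ValueError("truncated sequence")
    for byte in data[1:n]:
        if byte >> 6 != 0b10:    # following bytes must look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0b00111111)
    return value

print(hex(utf8_decode_one(bytes.fromhex("e4b8a5"))))  # 0x4e25
```

In everyday Python code the built-in str.encode("utf-8") and bytes.decode("utf-8") do this work; the sketch only spells out the bit handling.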

On Windows, one of the simplest ways to convert is to use the built-in Notepad program, notepad.exe. After opening a file, choose the "Save As" command in the "File" menu. A dialog box appears with an "Encoding" drop-down at the bottom.

It offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII, and for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, which stores a character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option, but in big endian format. The next section explains what little endian and big endian mean.

4) UTF-8 is the encoding described in the previous section.

After selecting an encoding, click the "Save" button, and the file's encoding is converted immediately.

7. Little endian and big endian

As mentioned in the previous section, Unicode codes can be stored directly in UCS-2 format. Take the Chinese character 严 ("strict") as an example. Its Unicode code is 4E25, which needs two bytes of storage: one byte is 4E and the other is 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.

These two odd names come from Jonathan Swift's Gulliver's Travels. In the book, a civil war breaks out in Lilliput, the country of tiny people, over whether an egg should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). Six wars were fought over this dispute; one emperor lost his life, and another lost his throne.

Therefore, storing the first byte first is "big endian", and storing the second byte first is "little endian".
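
The two byte orders can be seen directly with Python's int.to_bytes. This is a minimal sketch using the code 4E25 from the example above:

```python
code = 0x4E25

big = code.to_bytes(2, "big")        # most significant byte first
little = code.to_bytes(2, "little")  # least significant byte first

print(big.hex(), little.hex())  # 4e25 254e
```

Reading the bytes back with int.from_bytes and the matching byte order recovers the same number in both cases.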

Naturally, a question arises: how does a computer know which byte order a file uses?

The Unicode specification defines a character that can be placed at the very beginning of a file to indicate the byte order. This character is named "zero width no-break space" (ZERO WIDTH NO-BREAK SPACE) and is represented by FEFF. That is exactly two bytes, and FF is one greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian. If the first two bytes are FF FE, the file is little endian.
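
A check of the first two bytes can be sketched in Python as follows. The function name detect_byte_order is hypothetical, and the sketch looks only at the two-byte mark described above; it does not try to recognize any other encoding:

```python
import codecs

def detect_byte_order(data: bytes) -> str:
    """Report the byte order implied by the first two bytes, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little endian"
    return "unknown"

print(detect_byte_order(bytes.fromhex("feff4e25")))  # big endian
print(detect_byte_order(bytes.fromhex("fffe254e")))  # little endian
```

The constants codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE are part of Python's standard library and hold exactly these two byte pairs.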

8. Example

The following is an example.

Open Notepad (notepad.exe) and create a new text file containing the single character 严 ("strict"). Save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the "hex mode" feature of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is two bytes, "D1 CF", which is the GB2312 encoding of 严 ("strict"). This also implies that GB2312 is stored with the high byte first.

2) Unicode: the file is four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the file is four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8 encoding. The last three, "E4 B8 A5", are the actual encoding of 严, stored in the same order as the encoding itself (note: UTF-8 is defined as a sequence of bytes, so its byte order is fixed and does not depend on the machine).
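
The four byte sequences can be reproduced without Notepad. In this Python sketch, the mapping from Notepad's option names to Python codec names is an assumption: "ANSI" is taken to mean GB2312 (as on Simplified Chinese Windows), and the two "Unicode" options are taken to mean UTF-16 with a byte order mark:

```python
import codecs

ch = "\u4e25"  # the character 严

saved = {
    "ANSI": ch.encode("gb2312"),  # assumed to be GB2312
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),  # utf-8-sig prepends EF BB BF
}

for name, data in saved.items():
    print(f"{name}: {data.hex().upper()}")
```

The printed hexadecimal strings can be compared with the four results listed above.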

9. Further reading

* The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (basic knowledge about character sets)

* Unicode encoding

* RFC 3629: UTF-8, a transformation format of ISO 10646

Conclusion:

Unicode is an encoding standard: it only assigns an integer to each character in its table, while UTF-8 and UTF-16 are concrete ways of implementing it.

UTF-16 uses two bytes for most characters and four bytes for characters above U+FFFF. What is called "Unicode" encoding is usually UTF-16 (or UCS-2). UTF-8 is a multi-byte encoding in which the number of bytes per character varies (for example, English characters take 1 byte, and Chinese characters take 3 bytes, or 4 for rarer ones); the leading bits of the first byte indicate the number of bytes.

For example, the UTF-8 encoding (binary) of a 3-byte Chinese character looks like this:

1110xxxx 10xxxxxx 10xxxxxx

The three leading 1 bits in the first byte indicate that the character takes 3 bytes.

In other words, ASCII characters are unchanged in UTF-8 and take one byte. Characters outside the ASCII range take multiple bytes, and the number of bytes is determined by the first byte. The original UTF-8 design allowed up to 6 bytes, but RFC 3629 limits it to 4. As follows:

UTF-8:

1 byte: 0xxxxxxx (ASCII)

2 bytes: 110xxxxx 10xxxxxx

3 bytes: 1110xxxx 10xxxxxx 10xxxxxx

4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

5 bytes: ...

UTF-16 (two-byte form, low byte first):

All: xxxxxxxx xxxxxxxx

ASCII: xxxxxxxx 00000000
