UTF-8 and Unicode

Source: Internet
Author: User

Character encoding starts with ASCII, which was developed in the United States to standardize how characters are represented. Each character occupies one byte, but ASCII uses only 7 of the 8 bits, so it defines 128 characters and the highest bit is always 0. Later, as computers spread, European countries ran into a new problem. French, for example, uses accented letters that ASCII does not contain, so more than 128 characters were needed. They still fit in one byte (8 bits, 256 values) once the bit that ASCII leaves unused is put to work. However, each country assigned the upper 128 values differently. The first 128 characters meant the same thing everywhere, but the last 128 differed from country to country, which made exchanging text between countries difficult. The problem became more serious as computers spread to Asia and the rest of the world, where languages need far more than 256 characters. How can this problem be solved?
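The conflict in the upper 128 values can be seen directly in Python. This is only a small illustration: the three code pages below (Latin-1, Windows-1251 and ISO 8859-7) are picked as examples and are not named in the text above.

```python
# One byte value, interpreted under three different legacy 8-bit code pages.
raw = bytes([0xE9])
code_pages = ("latin-1", "cp1251", "iso8859_7")

for codec in code_pages:
    # The same byte decodes to a different character in each code page:
    # a Latin letter with an accent, a Cyrillic letter, and a Greek letter.
    print(codec, raw.decode(codec))

# Values below 128 are plain ASCII and agree in all three code pages.
ascii_results = {codec: b"A".decode(codec) for codec in code_pages}
print(ascii_results)
```

The byte 0xE9 comes out as U+00E9 under Latin-1, U+0439 under Windows-1251 and U+03B9 under ISO 8859-7, while the ASCII byte for "A" decodes identically everywhere.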

In principle the answer is simple: enlarge the code space until it can hold every character in the world. This is how Unicode came into being. Unicode assigns every character a number (its code point), but if each code point is stored in a fixed 4 bytes, text becomes much larger. That is hard to accept for the United States and Western European countries, whose characters mostly fit in a single byte: a fixed 4-byte representation would waste three quarters of the storage for such text.
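The size difference is easy to measure. The sketch below assumes that the fixed 4-byte form corresponds to what Python calls `utf-32-be` (UTF-32 without a byte order mark); the sample string is arbitrary.

```python
text = "Hello, world"  # 12 ASCII characters

# Fixed width: every character takes 4 bytes, no byte order mark is added.
utf32_size = len(text.encode("utf-32-be"))

# Variable width: every ASCII character takes 1 byte.
utf8_size = len(text.encode("utf-8"))

print(utf32_size, utf8_size)  # 48 12
```

For pure ASCII text the 4-byte form is four times as large, which matches the "three quarters wasted" estimate.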

UTF-8, a variable-length way of storing Unicode code points, solves this problem well. UTF-16 and UTF-32 also exist, but they are not discussed here.

 

UTF-8 encodes a character in one of two ways. Assume that the UTF-8 encoding of the character occupies n bytes.

1: If n = 1, the first bit of the byte is 0 and the remaining 7 bits hold the Unicode code point. This is identical to ASCII.

2: If n > 1, the first n bits of the first byte are 1 and bit n + 1 is 0. Each of the following n - 1 bytes begins with the two bits 10. All remaining bits hold the bits of the Unicode code point.
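These two rules mean that the first byte alone tells a decoder how long the sequence is. A minimal sketch follows; the function name `utf8_sequence_length` is made up for this illustration, and it only handles the lengths 1 to 4 that UTF-8 actually uses.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the number of bytes announced by a UTF-8 first byte."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: single byte, same as ASCII
    if (first_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx: two leading 1 bits, then a 0
    if (first_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx: three leading 1 bits, then a 0
    if (first_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx: four leading 1 bits, then a 0
    # 10xxxxxx is a continuation byte; anything else is not valid here.
    raise ValueError(f"not a valid UTF-8 first byte: {first_byte:#04x}")


print(utf8_sequence_length(0x41))  # 1
print(utf8_sequence_length(0xE4))  # 3
```

A byte that starts with 10 can never begin a character, which is what lets a decoder resynchronize in the middle of a stream.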

 

From the above definition, we can see that UTF-8 does not define characters of its own. It stores Unicode code points, spreading their bits across the free positions of one to four bytes. UTF-8 is therefore one of the ways of encoding Unicode.

 

Unicode code point range | UTF-8 encoding
(hexadecimal)            | (binary)
-------------------------+-------------------------------------
0000 0000-0000 007F      | 0xxxxxxx
0000 0080-0000 07FF      | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF      | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF      | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

 

The table above shows which UTF-8 byte layout applies to each range of Unicode code points. The x positions are filled with the bits of the code point.
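The table can be turned into code almost row by row. The following is a sketch for illustration, not a replacement for Python's built-in encoder; the name `encode_utf8` is made up here, and the rejection of the surrogate range D800-DFFF is an addition that the table itself does not mention.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point by following the table row by row."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption beyond the table: surrogates are not encodable in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")

    continuation = 0b1000_0000  # every following byte starts with 10
    low6 = 0b0011_1111          # mask for 6 payload bits

    if code_point <= 0x7F:      # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([
            0b1100_0000 | (code_point >> 6),
            continuation | (code_point & low6),
        ])
    if code_point <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b1110_0000 | (code_point >> 12),
            continuation | ((code_point >> 6) & low6),
            continuation | (code_point & low6),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b1111_0000 | (code_point >> 18),
        continuation | ((code_point >> 12) & low6),
        continuation | ((code_point >> 6) & low6),
        continuation | (code_point & low6),
    ])


# Compare against the built-in encoder for one code point from each row.
for cp in (0x41, 0xE9, 0x4E25, 0x1F600):
    print(f"U+{cp:04X}", encode_utf8(cp).hex(), chr(cp).encode("utf-8").hex())
```

Each pair of hex strings printed by the loop should be identical, since both encoders follow the same four layouts.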

 

Using this rule, let's see how the UTF-8 encoding of a character is generated.

The Unicode code point of the Chinese character 严 ("strict") is 4E25 (binary 100111000100101). It falls in the range 0000 0800-0000 FFFF, so the third form applies: 1110xxxx 10xxxxxx 10xxxxxx. That form has 16 x positions, so the 15-bit value is padded with one leading zero to 0100 111000 100101 and filled into the x positions from left to right. The UTF-8 encoding of 严 is therefore:

11100100 10111000 10100101 (hexadecimal E4 B8 A5).
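The worked example can be checked with Python's built-in encoder. The snippet only uses `str.encode` and format specifiers; the character is written as the escape `\u4e25` so the source stays ASCII.

```python
ch = "\u4e25"  # the character with code point U+4E25

encoded = ch.encode("utf-8")

print(f"{ord(ch):b}")                          # 100111000100101
print(" ".join(f"{b:08b}" for b in encoded))   # 11100100 10111000 10100101
print(encoded.hex())                           # e4b8a5
```

Python reports the three bytes E4 B8 A5 for this character, with the first byte starting in 1110 and both following bytes starting in 10.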

 

Windows Notepad offers four encodings when saving a file:

1: ANSI // the legacy system code page

2: Unicode // two bytes per character, little-endian

3: Unicode big endian // two bytes per character, big-endian

4: UTF-8

 

Note that files saved in either of the two "Unicode" formats begin with two bytes that indicate the byte order: FE FF means big-endian and FF FE means little-endian. What does this mean?

Take 严 again, whose code point is 4E25. It needs two bytes, 4E and 25. If the file stores 4E first and then 25, the high byte comes first and the order is big-endian. If it stores 25 first and then 4E, the order is little-endian.

Intel processors use little-endian byte order.
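Both the byte order and the two marker bytes can be reproduced in Python. This sketch assumes the two-byte format discussed here corresponds to Python's `utf-16-be` and `utf-16-le` codecs; the helper `detect_utf16_byte_order` is made up for the illustration and only looks at the first two bytes.

```python
import codecs

ch = "\u4e25"  # code point U+4E25

big = ch.encode("utf-16-be")     # high byte first: 4E 25
little = ch.encode("utf-16-le")  # low byte first:  25 4E
print(big.hex(), little.hex())   # 4e25 254e


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order from the marker at the start (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return "unknown"


# Simulated file contents: marker bytes followed by the encoded character.
file_be = codecs.BOM_UTF16_BE + big
file_le = codecs.BOM_UTF16_LE + little
print(detect_utf16_byte_order(file_be))  # big-endian
print(detect_utf16_byte_order(file_le))  # little-endian
```

Without the marker the two byte sequences are ambiguous, since 25 4E read in the wrong order would be taken for a different character.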
