UTF-8 and Unicode

Last Update:2018-12-04 Source: Internet

Author: User

Character encoding starts with ASCII, which the United States defined to represent English text. Each character is stored in one byte, but ASCII uses only 7 of the 8 bits, so it defines 128 characters and leaves the highest bit unused. As computers spread, European countries ran into new encoding problems. French, for example, adds accent marks to Latin letters, so the characters it needs do not fit in 128 values. Because a byte has 8 bits, the bit that ASCII leaves unused was put to work, giving 256 possible values. This, however, caused conflicts between countries: the first 128 values mean the same thing everywhere, but the upper 128 were assigned differently from country to country, which made exchanging text between countries difficult. The problem grew worse as computers reached Asia and the rest of the world. How can it be solved?
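The conflict in the upper 128 values can be illustrated with a short Python sketch. The two code pages chosen here (ISO 8859-1 for Western Europe and Windows-1251 for Cyrillic) are only examples picked for illustration; the article does not name specific code pages.

```python
# One byte above 0x7F, decoded under two different 8-bit code pages.
raw = bytes([0xE9])

as_western = raw.decode("latin-1")   # ISO 8859-1: Latin small letter e with acute
as_cyrillic = raw.decode("cp1251")   # Windows-1251: Cyrillic small letter short i

# Print the resulting Unicode code points rather than the glyphs.
print(hex(ord(as_western)), hex(ord(as_cyrillic)))

# A byte below 0x80 is plain ASCII and decodes identically in both.
ascii_same = b"A".decode("latin-1") == b"A".decode("cp1251")
print(ascii_same)
```

The same byte yields two unrelated characters, while the ASCII byte is unaffected.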

In principle the answer is simple: enlarge the character set so that it can hold every character in the world. That is how Unicode came into being. However, if every character is stored at a fixed width of 4 bytes (as in UTF-32), text in the United States and much of Europe, where one byte per character is usually enough, wastes three quarters of its storage. That was unacceptable to those countries.
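As a rough illustration of the storage cost, the following Python sketch compares a fixed 4-byte encoding with UTF-8 for a short English string. The sample text is arbitrary, and `utf-32-be` is used only because it is a fixed-width codec that writes no byte order mark.

```python
text = "hello"

fixed_width = text.encode("utf-32-be")  # 4 bytes per character, no BOM
variable_width = text.encode("utf-8")   # 1 byte per ASCII character

print(len(fixed_width), len(variable_width))  # 20 5
```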

The UTF-8 representation of Unicode solves this problem well. UTF-16 and UTF-32 also exist, but they are not discussed here.

UTF-8 encodes a character according to two rules. Assume that the UTF-8 encoding of the character occupies n bytes.

1: If n = 1, the first bit of the byte is 0 and the remaining 7 bits hold the character's Unicode code point.

2: If n > 1, the first n bits of the first byte are 1 and bit n + 1 is 0. Each of the following bytes begins with 10. All remaining bits hold the character's Unicode code point.

From this definition we can see that UTF-8 does not define character numbers of its own. It stores Unicode code points in a variable-length form that avoids the unused bytes of a fixed-width layout. UTF-8 is therefore one encoding of Unicode.

Unicode symbol range | UTF-8 encoding method
(hexadecimal)        | (binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The table above maps each range of Unicode code points to its UTF-8 byte layout.
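The table can be turned into a small hand-written encoder. This is a minimal sketch for illustration, not how any particular library implements it; the function name `utf8_encode` is hypothetical, and the rejection of surrogates (D800-DFFF) is a detail of the Unicode standard that the article does not mention.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")

    # Row 1: a single byte whose first bit is 0.
    if code_point <= 0x7F:
        return bytes([code_point])

    # Rows 2-4: pick the byte count and the leading-byte prefix.
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000
    else:
        n, lead = 4, 0b11110000

    # Fill the continuation bytes from the lowest 6 bits upward.
    out = []
    for _ in range(n - 1):
        out.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Whatever bits remain go into the leading byte.
    out.append(lead | code_point)
    return bytes(reversed(out))


# Compare against Python's built-in encoder for one code point per row.
for cp in (0x41, 0xE9, 0x4E25, 0x1F600):
    print(hex(cp), utf8_encode(cp).hex(), chr(cp).encode("utf-8").hex())
```

Each line printed by the loop should show the same hexadecimal string twice.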

Following this rule, let's see how the UTF-8 encoding of a character is generated.

The Unicode code point of "严" (meaning "strict") is 4E25 (binary 100111000100101), which falls in the range 0000 0800-0000 FFFF, so it uses the third form, 1110xxxx 10xxxxxx 10xxxxxx. This form has 16 x positions and the code point has 15 bits, so fill the x positions starting from the last one and pad the unused leading position with 0. The resulting UTF-8 encoding of "严" is:

11100100 10111000 10100101 (hexadecimal E4 B8 A5).
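The bit filling for code point 4E25 can be reproduced step by step in Python. This is only a sketch of the arithmetic for the three-byte form; the variable names are chosen for illustration.

```python
code_point = 0x4E25

# Pad to 16 bits, the number of x positions in the three-byte form.
bits = format(code_point, "016b")

# Split into the 4 + 6 + 6 bit groups and add the fixed prefixes.
first, second, third = bits[:4], bits[4:10], bits[10:]
encoded_bits = ["1110" + first, "10" + second, "10" + third]
encoded = bytes(int(b, 2) for b in encoded_bits)

print(encoded_bits)   # ['11100100', '10111000', '10100101']
print(encoded.hex())  # e4b8a5
```

The result matches what Python's built-in encoder produces for the same character, `"\u4e25".encode("utf-8")`.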

Windows Notepad offers four encodings when saving a file:

1: ASCII // labeled ANSI in Notepad

2: Unicode // UTF-16, little endian

3: Unicode big endian // UTF-16, big endian

4: UTF-8

Note that a file saved in either of the two Unicode formats begins with two bytes that indicate the byte order: FE FF means big endian and FF FE means little endian. What does byte order mean?
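Checking those two leading bytes can be sketched in Python as follows. The helper name `detect_utf16_byte_order` is hypothetical; the constants `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` come from the standard library. The sketch assumes the data is UTF-16 and does not try to distinguish other encodings.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on the leading two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return "unknown"


# Build two sample byte strings for code point 4E25, one per byte order.
sample_be = codecs.BOM_UTF16_BE + "\u4e25".encode("utf-16-be")
sample_le = codecs.BOM_UTF16_LE + "\u4e25".encode("utf-16-le")

print(sample_be.hex(), detect_utf16_byte_order(sample_be))  # feff4e25 big
print(sample_le.hex(), detect_utf16_byte_order(sample_le))  # fffe254e little
```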

For example, the Unicode code point of "严" ("strict") is 4E25, which takes two bytes. If the bytes are stored as 4E 25, with the high-order byte first, the order is big endian; if they are stored as 25 4E, the order is little endian.

Intel processors use little-endian order.
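The two byte orders for code point 4E25 can be compared directly in Python, and `sys.byteorder` reports the order used by the machine running the script. This is a small illustrative sketch; the output of the last line depends on the hardware.

```python
import sys

ch = "\u4e25"

big = ch.encode("utf-16-be")     # high-order byte first
little = ch.encode("utf-16-le")  # low-order byte first

print(big.hex(), little.hex())   # 4e25 254e
print(sys.byteorder)             # 'little' on Intel x86 machines
```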

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
