Unicode (UTF-8, UTF-16): Confusing Concepts

Source: Internet
Author: User

Why Unicode is required

We know that a computer is actually quite stupid: it only understands strings of 0s and 1s. Of course, staring at raw binary makes us dizzy too, which is why we usually write the same values in decimal, hexadecimal, or octal; these notations are all equivalent, there is no real difference. Anything else, such as text or pictures, the computer does not understand directly. To represent that kind of information on a computer it must be converted into numbers, and the conversion cannot be arbitrary: there has to be an agreed rule. The ASCII character set (American Standard Code for Information Interchange) uses 7 bits per character, for a total of 128 characters. Since we generally use the byte (8 bits) as the basic unit, an ASCII character is stored with the first bit always 0 and the remaining 7 bits carrying the actual value. IBM later extended this scheme to use all 8 bits per character, for a total of 256 characters: when the first bit is 0 it still represents the original characters, and when the first bit is 1 it represents supplemental characters.
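
To see this concretely, here is a tiny sketch (Python, used purely for illustration and not something the original article relies on) that prints a few characters, their ASCII numbers, and their 8-bit binary form, where the leading bit is always 0:

    # Each ASCII character is just a number that fits in 7 bits.
    for ch in ["A", "a", "0", " "]:
        code = ord(ch)                        # the numeric code of the character
        print(ch, code, format(code, "08b"))  # 8-bit binary: the leading bit is always 0
    # A 65 01000001
    # a 97 01100001
    # 0 48 00110000
    #   32 00100000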

The English letters plus common punctuation come to well under 256 characters, so one byte per character is plenty for them. But other languages, such as Chinese, have tens of thousands of characters, so other character sets appeared. That created a problem: different character sets disagree with each other. One character set might use a certain number to represent the character A while another set uses a different number for it, which makes exchanging text between them troublesome. So organizations such as the Unicode Consortium and ISO set out to unify a standard in which every character corresponds to exactly one number. ISO named its standard UCS (Universal Character Set); the other is simply called Unicode.

To summarize why Unicode is needed: it exists to keep up with globalization and to make interaction between different languages easy and compatible; ASCII is no longer up to that task.

A Detailed Introduction to Unicode

1. "Two bytes" is easy to misunderstand

The first version of Unicode used two bytes (16 bits) to represent every character.

This statement is easy to misread: people tend to assume it means two bytes on disk, that any character represented in Unicode is saved in the computer as exactly two bytes. That is actually a mistake.

In fact, Unicode involves two steps. The first is to define a standard that gives every character a unique corresponding number; this is a purely abstract mapping and has nothing whatsoever to do with how computers store anything. The second step is to decide how that number is saved in the computer, and that is where the actual number of bytes comes in.

So we can also put it this way: the first version of Unicode uses a number between 0 and 65535 to represent every character, where the numbers 0 to 127 represent the same characters as ASCII (65536 is 2^16). That is the first step. The second step is deciding how to turn those numbers from 0 to 65535 into strings of 0s and 1s and save them to the computer, and there can be more than one way to do that. This is where UTF (Unicode Transformation Format) comes in, with UTF-8 and UTF-16.
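
A small illustration of the two steps, assuming Python just for demonstration:

    ch = "汉"
    code_point = ord(ch)            # step 1: the abstract number Unicode assigns to the character
    print(hex(code_point))          # 0x6c49 (27721 in decimal)

    # step 2: the same number can be stored with different byte layouts
    print(ch.encode("utf-8"))       # b'\xe6\xb1\x89'  -> 3 bytes
    print(ch.encode("utf-16-be"))   # 2 bytes: 6C 49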

2. The difference between UTF-8 and UTF-16

UTF-16 is easy to understand: the number corresponding to any character is simply stored in two bytes. (The common misconception about Unicode is precisely equating Unicode with UTF-16.) But this is clearly a bit wasteful if the text is all English letters: why spend two bytes on a character that fits comfortably in one?

Hence UTF-8. The "8" here is very easy to misread: it does not mean one byte per character. With UTF-8 the length of a character is variable: it may take one, two, or three bytes (never more than three in this first version), determined by how large the character's number is.

So the pros and cons of UTF-8 versus UTF-16 are easy to see. If the text is all English, or mixed but mostly English, UTF-8 saves a lot of space compared with UTF-16. If the text is all Chinese, or mixed but overwhelmingly Chinese, UTF-16 has the advantage and saves a lot of space. There is also a question of fault tolerance, which we come back to below.
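
For a rough feel of the size difference (Python used only for illustration; the Chinese sample sentence is made up):

    english = "Hello, world"
    chinese = "汉字编码很有趣"   # sample Chinese text, roughly "character encoding is fun"

    for text in (english, chinese):
        print(len(text),
              len(text.encode("utf-8")),       # bytes when stored as UTF-8
              len(text.encode("utf-16-be")))   # bytes when stored as UTF-16 (no BOM)

    # English: 12 characters -> 12 bytes in UTF-8, 24 in UTF-16
    # Chinese:  7 characters -> 21 bytes in UTF-8, 14 in UTF-16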

If this looks a little dizzying, here is an example. The Chinese character "汉" (Han) corresponds to the Unicode number 6C49 in hexadecimal, which is 27721 in decimal. (Why not just use decimal? Hexadecimal is simply shorter; the two are equivalent, like telling you 60 minutes versus 1 hour.) You might also ask: when a program opens a file, how does it know whether the file is UTF-8 or UTF-16? Naturally there is a flag: a few bytes at the beginning of the file serve as that flag.

EF BB BF represents UTF-8

FE FF represents UTF-16 (big endian).

Using UTF-16 to represent "汉" (Han)

If you use UTF-16, "汉" is stored as 01101100 01001001 (two bytes in total, i.e. 6C 49). When a program parses UTF-16 it simply takes every two bytes as one unit, which is straightforward.
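
A quick check with Python (used here only for illustration) confirms those two bytes:

    ch = "汉"
    utf16 = ch.encode("utf-16-be")                 # big-endian, without a BOM
    print(utf16.hex())                             # 6c49
    print(" ".join(f"{b:08b}" for b in utf16))     # 01101100 01001001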

Using UTF-8 to represent "汉" (Han)

With UTF-8 there is a complication, because the program reads one byte at a time and then uses the flag bits at the start of that byte to decide whether to treat one, two, or three bytes as a unit (a small decoder sketch follows the patterns below).

0xxxxxxx: if a byte looks like this, starting with 0 (the x's can be any bits), it is treated as a one-byte unit. This is exactly the same as ASCII.

110xxxxx 10xxxxxx: if the bytes match this format, two bytes are treated as one unit.

1110xxxx 10xxxxxx 10xxxxxx: if the bytes match this format, three bytes are treated as one unit.
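
As a sketch of how a decoder might apply these patterns (Python for illustration only; utf8_length is a hypothetical helper name, and only the one- to three-byte forms above are handled):

    def utf8_length(lead_byte: int) -> int:
        # How many bytes the character occupies, judged from its first byte alone.
        if lead_byte >> 7 == 0b0:         # 0xxxxxxx -> 1 byte, same as ASCII
            return 1
        if lead_byte >> 5 == 0b110:       # 110xxxxx -> 2 bytes
            return 2
        if lead_byte >> 4 == 0b1110:      # 1110xxxx -> 3 bytes
            return 3
        raise ValueError("not a valid lead byte (continuation bytes start with 10)")

    print(utf8_length(0x41))   # 1  -- 'A'
    print(utf8_length(0xE6))   # 3  -- the first byte of "汉" in UTF-8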

These are the agreed rules, and anything that uses UTF-8 must obey them. Notice that UTF-16 does not need to spend any bits on flags, so its two bytes can represent 2^16 = 65,536 characters.

UTF-8, because it carries this extra flag information, can represent only 2^7 = 128 characters with one byte, 2^11 = 2,048 characters with two bytes, and 2^16 = 65,536 characters with three bytes.

Since the number for "汉" is 27721, which is greater than 2,048, two bytes are not enough, and it can only be represented with three bytes.

So we use the 1110xxxx 10xxxxxx 10xxxxxx format and fill the binary form of 27721 into the x positions from left to right, that is, from the highest-order bit down (see the sketch below).
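
A minimal sketch of that filling, checked against Python's built-in encoder (Python is used here only for illustration):

    code_point = 0x6C49                      # 27721, the number assigned to "汉"
    bits = format(code_point, "016b")        # '0110110001001001'

    # split the 16 bits into 4 + 6 + 6 and drop them into 1110xxxx 10xxxxxx 10xxxxxx
    byte1 = 0b11100000 | int(bits[:4], 2)
    byte2 = 0b10000000 | int(bits[4:10], 2)
    byte3 = 0b10000000 | int(bits[10:], 2)

    print(bytes([byte1, byte2, byte3]).hex())   # e6b189
    print("汉".encode("utf-8").hex())           # e6b189 -- same answer from the built-in encoder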

Related to this is the pair of terms big-endian and little-endian, which describe the order in which the bytes of a multi-byte unit are stored: big-endian puts the most significant byte first, little-endian puts the least significant byte first. The question arises for UTF-16 and UTF-32, whose units span two or four bytes; the byte order of UTF-8 is fixed, so it has no endianness problem.

From the above we can see that UTF-8 marks the start of every character with flag bits. This actually helps with fault tolerance: if a byte is damaged in transit, the decoder can resynchronize at the next lead byte, so the damage stays local to that one character. UTF-16 has no such per-byte flags, so if a byte is lost, every two-byte unit after it is misaligned and parses wrongly.
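
Here is a small sketch of that behavior (Python for illustration only; the exact replacement output depends on the decoder, so treat the printed results as approximate):

    utf8 = bytearray("汉字ABC".encode("utf-8"))
    utf8[1] = 0xFF                                       # damage one byte inside "汉"
    print(utf8.decode("utf-8", errors="replace"))        # the damaged character is lost, but 字ABC still decode

    utf16 = bytearray("汉字ABC".encode("utf-16-be"))
    del utf16[0]                                         # drop a single byte
    print(utf16.decode("utf-16-be", errors="replace"))   # every character after the lost byte is misaligned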

Unicode version 2

Everything said so far describes the first version of Unicode. But 65,536 is obviously not a very large number: it is enough for the commonly used characters, yet once you add the many rare and special ones it no longer suffices. So from 1996 onward came the second version, which enlarges the code space so that a character's number may need up to four bytes. Accordingly there are UTF-8, UTF-16, and UTF-32, and the principle is exactly the same as before. UTF-32 stores every character in 32 bits, that is, four bytes. UTF-8 and UTF-16 remain variable-length: UTF-8 uses one to four bytes per character, and UTF-16 uses either two bytes or four bytes (a surrogate pair). Since the principle is unchanged from version 1, there is not much more to say.
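
As a quick check (Python again, only for illustration), here is a character outside the original 16-bit range and the number of bytes each encoding spends on it:

    ch = "😀"                              # U+1F600, outside the original 16-bit range
    print(hex(ord(ch)))                    # 0x1f600
    print(ch.encode("utf-8").hex())        # f09f9880  -> 4 bytes in UTF-8
    print(ch.encode("utf-16-be").hex())    # d83dde00  -> 4 bytes in UTF-16 (a surrogate pair)
    print(ch.encode("utf-32-be").hex())    # 0001f600  -> always 4 bytes in UTF-32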

Before you can know which encoding a file uses, you need to check the flag at the beginning of the text. Below are the byte-order marks (BOM headers) corresponding to each encoding:

EF BB BF       UTF-8
FE FF          UTF-16/UCS-2, big endian
FF FE          UTF-16/UCS-2, little endian
00 00 FE FF    UTF-32/UCS-4, big endian
FF FE 00 00    UTF-32/UCS-4, little endian

UCS is the standard set out by ISO mentioned earlier; it is essentially identical to Unicode, only the name differs. UCS-2 corresponds to UTF-16 and UCS-4 corresponds to UTF-32; UTF-8 has no separate UCS name.
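
Putting the table to work, here is a rough sketch of BOM sniffing (Python for illustration only; sniff_bom is a hypothetical helper name):

    def sniff_bom(data: bytes) -> str:
        # The 4-byte UTF-32 marks must be checked before the 2-byte UTF-16 ones,
        # because the UTF-32 little-endian BOM starts with the same FF FE bytes.
        if data.startswith(b"\x00\x00\xfe\xff"):
            return "utf-32-be"
        if data.startswith(b"\xff\xfe\x00\x00"):
            return "utf-32-le"
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be"
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le"
        return "unknown (no BOM)"

    print(sniff_bom("汉".encode("utf-8-sig")))   # utf-8   ('utf-8-sig' writes the EF BB BF mark)
    print(sniff_bom("汉".encode("utf-16")))      # utf-16-le on most machines (Python writes a native-order BOM)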
