Unicode (UTF-8, UTF-16): a confusing concept

Why Unicode is required

We know that the computer is, at bottom, a very simple machine: all it understands is strings of 0s and 1s. Humans get dizzy staring at long binary strings, so for convenience we often write the same values in decimal, hexadecimal, or octal; these notations are all equivalent, just different ways of writing the same number. Everything else, such as text or pictures, is something the computer does not understand directly. To represent that kind of information on a computer, it must be converted into numbers, and not in any way you please: there have to be agreed rules. The ASCII character set (American Standard Code for Information Interchange) uses 7 bits per character and defines 128 characters in total. Since we generally use the byte (8 bits) as the basic unit, a character is stored in one byte whose first bit is always 0, with the remaining seven bits carrying the actual value. IBM later extended this scheme to use all 8 bits, giving 256 characters in total: when the first bit is 0, the byte still represents the original ASCII characters; when it is 1, it represents additional supplementary characters.

For English letters plus some punctuation, 256 values are more than enough, so one byte per character works fine. But there are far more characters in the world than that; Chinese alone has tens of thousands. So various other character sets appeared, and different character sets had trouble exchanging data: one set might use a certain number to represent the character A while another set used a different number, which made interoperation painful. Organizations such as the Unicode Consortium and ISO therefore set out to unify a single standard in which every character corresponds to exactly one number. ISO named its standard UCS (Universal Character Set); the other is called Unicode.

To summarize: Unicode is needed to keep up with globalization and to make interaction between different languages compatible, a task ASCII is no longer up to.

A detailed introduction to Unicode

1. Two bytes: an easy source of confusion

The first version of Unicode used a two-byte (16-bit) number to represent every character.

Put that way it is easy to get confused: we tend to think that "two bytes" means every character, when saved in the computer, always occupies exactly two bytes. That is in fact a mistake.

Unicode actually involves two separate steps. The first is to define a standard that assigns every character a unique number; this is pure bookkeeping and has nothing whatsoever to do with how computers store anything. The second step is to decide how that number is actually saved in the computer, and only here does the real number of bytes come in.

So you can think of it this way: (the first version of) Unicode uses a number between 0 and 65535 to represent every character, where the values 0 to 127 represent the same characters as ASCII (65536 is 2 to the 16th power). That is the first step. The second step is deciding how to turn a number between 0 and 65535 into a string of 0s and 1s and save it to the computer; there has to be an agreed storage format. This is where UTF (Unicode Transformation Format) comes in, with UTF-8 and UTF-16.
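To make the two steps concrete, here is a minimal Python sketch (using the character 汉 that serves as the example later in this article): the abstract number assigned by Unicode is one thing, and the bytes actually stored under a given UTF are another.

    ch = "汉"                              # Unicode assigns this character the number 0x6C49 (27721)
    print(hex(ord(ch)))                    # 0x6c49   -- step 1: the abstract code point
    print(ch.encode("utf-8").hex())        # e6b189   -- step 2: three bytes under UTF-8
    print(ch.encode("utf-16-be").hex())    # 6c49     -- step 2: two bytes under UTF-16 (big-endian)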

2. The difference between UTF-8 and UTF-16

UTF-16 is the easier one to understand: the number corresponding to any character is simply saved in two bytes. The common misconception about Unicode is precisely to equate Unicode with UTF-16. But clearly, if the text is all English letters this is a bit wasteful: why spend two bytes on a character that needs only one?

Hence UTF-8. The 8 here easily misleads people: it does not mean one byte per character. With UTF-8 the length of a character is variable; it may be one byte, two bytes, or three (for the characters covered by this first version, no more than three). How many bytes are used is determined by how large the character's number is.

The trade-off between UTF-8 and UTF-16 is therefore easy to see. If the text is all English, or mostly English with some other characters mixed in, UTF-8 saves a lot of space compared with UTF-16. If the text is all Chinese, or overwhelmingly Chinese, UTF-16 has the advantage and saves a lot of space. There is also a fault-tolerance difference, discussed below.
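A rough Python illustration of that space trade-off (the sample strings are just made-up examples):

    english = "Hello, world!" * 100
    chinese = "汉字编码" * 100

    print(len(english.encode("utf-8")), len(english.encode("utf-16-le")))   # 1300 2600: UTF-8 wins
    print(len(chinese.encode("utf-8")), len(chinese.encode("utf-16-le")))   # 1200 800:  UTF-16 wins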

Still a little dizzying? Let's take an example. The Chinese character "汉" ("Han") corresponds to the Unicode number 6C49 (that is hexadecimal; in decimal it is 27721. Why not use decimal? Simply because hexadecimal is shorter to write; the two are equivalent, just as "60 minutes" and "1 hour" describe the same thing). You might ask: when a program opens a file, how does it know whether the file is UTF-8 or UTF-16? Naturally there is a flag: a few bytes at the very beginning of the file serve as the marker.

EF BB BF represents UTF-8

FE FF represents UTF-16 (big-endian).
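These flag bytes are also exposed as constants in Python's standard codecs module, which gives a quick way to double-check the values quoted above:

    import codecs
    print(codecs.BOM_UTF8.hex())      # efbbbf
    print(codecs.BOM_UTF16_BE.hex())  # feff
    print(codecs.BOM_UTF16_LE.hex())  # fffe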

Use UTF-16 to express "Han"

If you use UTF-16 (big-endian), it is 01101100 01001001, two bytes in total. When a program knows it is parsing UTF-16, it simply takes two bytes at a time as one unit. Simple.
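A quick Python check of those two bytes:

    b = "汉".encode("utf-16-be")
    print(b.hex())                                  # 6c49
    print(" ".join(f"{byte:08b}" for byte in b))    # 01101100 01001001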

Use UTF-8 to express "Han"

With UTF-8 it is a little more complicated, because the program reads one byte at a time and then uses the flag bits at the beginning of that byte to decide whether one, two, or three bytes should be treated as a unit.

0xxxxxxx: if the byte starts with 0 (the x characters stand for arbitrary bits), that single byte is a unit by itself. This is exactly the same as ASCII.

110xxxxx 10xxxxxx: if the bytes match this format, two bytes form a unit.

1110xxxx 10xxxxxx 10xxxxxx: if the bytes match this format, three bytes form a unit.

These are the agreed rules, and anything using UTF-8 must obey them. As noted, UTF-16 needs no such markers, so its two bytes give 2^16 = 65,536 representable characters.
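A small Python sketch of these lead-byte rules: given the first byte of a UTF-8 sequence, decide how many bytes form the unit (this only covers the three formats listed above).

    def utf8_unit_length(lead_byte: int) -> int:
        if lead_byte >> 7 == 0b0:        # 0xxxxxxx -> one byte, same as ASCII
            return 1
        if lead_byte >> 5 == 0b110:      # 110xxxxx -> two bytes
            return 2
        if lead_byte >> 4 == 0b1110:     # 1110xxxx -> three bytes
            return 3
        raise ValueError("not a lead byte in the three-byte scheme above")

    print(utf8_unit_length("A".encode("utf-8")[0]))    # 1
    print(utf8_unit_length("汉".encode("utf-8")[0]))   # 3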

UTF-8, on the other hand, carries those extra flag bits, so one byte has only 7 payload bits and can represent 2^7 = 128 characters; two bytes have 11 payload bits, for 2^11 = 2,048 characters; and three bytes have 16 payload bits, for 2^16 = 65,536 characters.

Since the number for "汉" is 27,721, which is greater than 2,048, two bytes are not enough; it can only be represented with three bytes.

So the format 1110xxxx 10xxxxxx 10xxxxxx is used: write 27,721 in binary (0110110001001001) and fill its bits into the x positions, from the highest bit down to the lowest.
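Here is that bit-filling worked out in Python and checked against the standard library's own UTF-8 encoder:

    code_point = 0x6C49                                   # 27721, the number for 汉
    b1 = 0b11100000 | (code_point >> 12)                  # 1110xxxx  <- top 4 bits
    b2 = 0b10000000 | ((code_point >> 6) & 0b111111)      # 10xxxxxx  <- middle 6 bits
    b3 = 0b10000000 | (code_point & 0b111111)             # 10xxxxxx  <- low 6 bits
    print(bytes([b1, b2, b3]).hex())                      # e6b189
    print("汉".encode("utf-8").hex())                     # e6b189 -- the same bytes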

This is also a good place to mention the terms big-endian and little-endian. They describe the order in which the bytes of a multi-byte unit are stored: big-endian puts the most significant byte first, little-endian puts the least significant byte first. This matters for UTF-16 and UTF-32, where one unit spans several bytes; UTF-8 fixes its byte layout in the format itself, so it has no byte-order problem.
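A quick Python illustration of byte order; it only shows up where a unit spans more than one byte, as in UTF-16:

    print("汉".encode("utf-16-be").hex())   # 6c49   -- most significant byte first
    print("汉".encode("utf-16-le").hex())   # 496c   -- least significant byte first
    print("汉".encode("utf-8").hex())       # e6b189 -- identical either way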

From the above we can also see why fault tolerance differs. Every UTF-8 byte carries flag bits that say what role it plays, so a decoder can always find the start of the next character: if one byte is corrupted in transit, only that character is affected and decoding resynchronizes right after it. UTF-16 has no such per-byte markers, so if a byte is lost, every two-byte unit after it is misaligned and the rest of the text is garbled. UTF-8 is therefore the more fault-tolerant of the two, as the later sections spell out.
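A small Python experiment makes the difference visible: drop one byte from the middle of each encoding and see how much of the text survives decoding (the sample string is arbitrary).

    text = "汉字ABC汉字"
    u8, u16 = text.encode("utf-8"), text.encode("utf-16-be")

    broken8 = u8[:4] + u8[5:]      # lose one byte inside the second character
    broken16 = u16[:4] + u16[5:]   # lose one byte at the same position

    print(broken8.decode("utf-8", errors="replace"))       # only one character is garbled
    print(broken16.decode("utf-16-be", errors="replace"))  # everything after the loss is garbled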

Unicode version 2

Everything above describes the first version of Unicode. But 65,536 numbers is not that many after all: it is enough for commonly used characters, yet once you add the many rarer and special ones it no longer suffices. So from 1996 onwards a second version appeared, in which a character's number may need up to four bytes, and accordingly there are UTF-8, UTF-16, and UTF-32. The principle is exactly the same as before. UTF-32 stores every character in 32 bits, that is, four bytes. UTF-8 and UTF-16 remain variable-length: UTF-8 uses one to four bytes per character (the original design allowed up to six, but it was later restricted to four), while UTF-16 uses either two bytes or four bytes (a surrogate pair). Since the principle is the same as in the first version, there is not much more to say about it.
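A Python illustration of a character outside the original 16-bit range, using the emoji U+1F600 purely as a convenient example:

    ch = "\U0001F600"
    print(hex(ord(ch)))                    # 0x1f600  -- beyond the old 65,536 limit
    print(ch.encode("utf-8").hex())        # f09f9880 -- four bytes in UTF-8
    print(ch.encode("utf-16-be").hex())    # d83dde00 -- a surrogate pair in UTF-16
    print(ch.encode("utf-32-be").hex())    # 0001f600 -- always four bytes in UTF-32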

Before you can tell which encoding a text uses, you need to check the flag at its beginning. Here are the flags for all the encodings:

EF BB BF      UTF-8
FE FF         UTF-16/UCS-2, big-endian
FF FE         UTF-16/UCS-2, little-endian
00 00 FE FF   UTF-32/UCS-4, big-endian
FF FE 00 00   UTF-32/UCS-4, little-endian

UCS is the standard from ISO mentioned earlier; it is essentially identical to Unicode, only the name differs. UCS-2 corresponds to UTF-16 and UCS-4 corresponds to UTF-32; UTF-8 has no corresponding UCS name.
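Based on the table above, a minimal BOM sniffer might look like the following Python sketch; real files may of course carry no flag at all, in which case it returns None.

    import codecs

    def sniff_bom(data: bytes):
        # Check the longer marks first so UTF-32 is not mistaken for UTF-16.
        for bom, name in [(codecs.BOM_UTF32_BE, "utf-32-be"),
                          (codecs.BOM_UTF32_LE, "utf-32-le"),
                          (codecs.BOM_UTF8, "utf-8"),
                          (codecs.BOM_UTF16_BE, "utf-16-be"),
                          (codecs.BOM_UTF16_LE, "utf-16-le")]:
            if data.startswith(bom):
                return name
        return None

    print(sniff_bom(codecs.BOM_UTF8 + "汉".encode("utf-8")))   # utf-8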


UTF-16 is not a perfect choice; it has problems in several respects:
    1. The fixed two-byte form of UTF-16 can represent only about 65,000 characters, but Unicode 5.0 already contains 99,024 characters, far beyond that range. This leaves UTF-16 in a rather awkward position: anyone hoping to use it as a simple fixed-width encoding is going to be disappointed.
    2. UTF-16 has the big-endian/little-endian byte-order problem, which is especially prominent when exchanging data. If the byte order is not agreed on, the result is garbled text; if it is agreed on but one side is big-endian and the other little-endian, then one side has to convert, and the performance loss is unavoidable. (The byte-order problem is not as simple as it looks; it can involve the hardware, the operating system, and the software layers above them, and a value may be converted more than once.)
    3. In addition, low fault tolerance is sometimes a big problem: a local byte error, especially a lost or extra byte, can garble every character that follows, and recovering from the garbling may be simple or may be very difficult. (This rarely matters in everyday use, but under many special circumstances it is a huge flaw.)
The reason we keep using UTF-16 is that it is (mostly) double-byte, which makes computing the length of a string and indexing into it very fast. UTF-32 has all these advantages too, of course, but many people still feel UTF-32 takes up too much space.

UTF-8, in turn, is not perfect either; it has some problems of its own:
    1. Cultural imbalance. For the English-speaking countries of Europe and America UTF-8 is wonderful: as with ASCII, a character takes only one byte and there is no extra storage burden. But for countries such as Japan and South Korea, UTF-8 is redundant: a character takes three bytes, so storage and transmission efficiency goes down rather than up. Europeans and Americans therefore adopt UTF-8 without hesitation, while we always have to hesitate for a while.
    2. The efficiency problem of variable-length encoding. Because UTF-8 is a variable-length representation, counting characters or indexing to the Nth character is inefficient. The usual workaround is to convert the UTF-8 text to UTF-16 or UTF-32 first, operate on that, and convert back when done, which is obviously a performance burden (a sketch of this follows below).
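A Python sketch of that convert-then-index workaround: in UTF-32 every character is exactly four bytes, so the Nth character is a direct offset instead of a scan through variable-length UTF-8 units.

    utf8_bytes = "汉字ABC".encode("utf-8")                    # what arrived from disk or the network

    utf32 = utf8_bytes.decode("utf-8").encode("utf-32-le")    # fixed width: 4 bytes per character
    n = 1
    print(utf32[4 * n : 4 * n + 4].decode("utf-32-le"))       # 字 -- direct offset, no scanning
    print(len(utf32) // 4)                                    # 5 characters, trivial to compute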


Of course, the advantages of UTF-8 should not be forgotten:
    1. The character space is large enough that when future Unicode standards include more characters, UTF-8 can accommodate them gracefully, so it will never end up in the awkward position UTF-16 is in.
    2. There is no big-endian/little-endian byte-order problem, which makes exchanging data very convenient.
    3. High fault tolerance: a local byte error (a lost, extra, or changed byte) does not cause cascading errors, because character boundaries in UTF-8 are easy to detect. This is a huge advantage (and it is precisely to get it that we put up with spending three bytes on one character).

So which one should we choose?

Because UTF-8 and UTF-16/32 each have their advantages and disadvantages, the choice should depend on the actual application scenario. For example, my own habit is to use UTF-8 when storing on disk or exchanging data over the network, and to convert to UTF-16/32 for processing inside the program. For most simple programs this keeps data exchange easy and mutually compatible, while internal processing stays relatively simple and performance is good. (Basically you can do this as long as your program is not I/O-intensive. Of course, this is only my own limited experience, and I may well be fiercely rebutted.)
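A sketch of that habit in Python, with "notes.txt" as a purely hypothetical file name: UTF-8 on disk, a decoded string (whole characters rather than raw bytes) inside the program.

    with open("notes.txt", "r", encoding="utf-8") as f:    # UTF-8 bytes -> str on the way in
        text = f.read()

    processed = text.upper()                               # work on characters, not bytes

    with open("notes.txt", "w", encoding="utf-8") as f:    # str -> UTF-8 bytes on the way out
        f.write(processed)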

Let me unfold this just a little further...

In some special fields the choice of character encoding becomes a key issue, especially in high-performance network processing programs, where special design techniques are used to ease the tension between performance and character-set choice. For example, a content detection/filtering system has to cope with every character encoding it might possibly meet. If it handles this by converting each encoding to one common encoding and then processing that, the performance loss is significant. If instead it uses a finite state machine that supports multiple encodings directly, it can skip the conversion entirely and still process the data at very high speed. Of course, how to generate the finite state machine from a list of rules, how to make it support multiple encodings, and what restrictions that brings then become problems of their own.
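As a toy illustration of that idea, here is a small Python sketch that matches one keyword directly against raw bytes in several encodings at once, without decoding the stream first. The keyword, the encodings, and the sample stream are all made-up examples, and a real system would compile many rules into a single automaton (for example with Aho-Corasick) rather than track each pattern separately.

    KEYWORD = "汉字"
    PATTERNS = [KEYWORD.encode(enc) for enc in ("utf-8", "utf-16-le", "utf-16-be")]

    def feed(states, byte):
        """Advance every pattern by one input byte; return (new_states, hit)."""
        hit, next_states = False, []
        for pattern, matched in states:
            if pattern[matched] == byte:
                matched += 1                                 # this byte continues the pattern
            else:
                matched = 1 if pattern[0] == byte else 0     # crude restart on mismatch
            if matched == len(pattern):
                hit, matched = True, 0                       # full match in this encoding
            next_states.append((pattern, matched))
        return next_states, hit

    stream = "...汉字...".encode("utf-8")                    # pretend this arrived off the wire
    states = [(p, 0) for p in PATTERNS]
    for b in stream:
        states, hit = feed(states, b)
        if hit:
            print("keyword found without decoding the stream")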
