Unicode (UTF-8, UTF-16) confusing concepts

Source: Internet
Author: User

At bottom, a computer is quite dumb: it only understands strings of bits such as 0101. Long runs of 0s and 1s are hard for people to read, so we usually write the same values in decimal, hexadecimal, or octal. These notations are all equivalent; they are just different ways of writing the same number. A computer does not understand anything else, such as text or images. To represent such information, it has to be converted into numbers, and that conversion needs agreed rules. The first widely used set of rules was the ASCII character set (American Standard Code for Information Interchange). ASCII uses 7 bits per character, for a total of 128 characters. Since the byte (a string of eight 0s and 1s) is the basic unit of storage, a character stored in one byte always has its first bit set to 0, and the remaining seven bits carry the actual content. Later, IBM extended this scheme to use all 8 bits, which allows 256 characters: when the first bit is 0, the byte still means the same character as before, and when it is 1, it means one of the additional characters.

English letters plus digits and punctuation come to far fewer than 256 characters, so one byte is more than enough. Other scripts are a different matter: Chinese alone has tens of thousands of characters. So other character sets were created, and that caused problems when data was exchanged between them. One character set might use a given number for the character "a", while another uses a different number for the same character, which makes interchange troublesome. As a result, the Unicode Consortium and ISO worked out a unified standard in which each character corresponds to exactly one number. ISO calls its standard the "Universal Character Set" (UCS); the Unicode Consortium calls its standard "Unicode".

To sum up, Unicode was needed to keep pace with globalization and to let text in different languages be exchanged compatibly, a task that ASCII could no longer handle.

Unicode

1. "Two bytes": a source of ambiguity

The first version of Unicode used two bytes (16 bits) to represent all characters.

This statement easily leads to a misunderstanding. We tend to read "two bytes" as two bytes of storage, and conclude that any character stored as Unicode occupies two bytes. That conclusion is incorrect.

Unicode actually involves two separate steps. The first step is to define a specification that assigns a unique number to every character. This is purely a mathematical matter and has nothing to do with computers. The second step is to decide how the number for each character is stored in a computer, and only this step determines how many bytes a character actually occupies.

So we can read the statement this way: the first version of Unicode represented all characters with numbers from 0 to 65535 (65536 is 2 to the power of 16), and the numbers 0 to 127 stand for the same characters as in ASCII. That is the first step. The second step is how to turn the numbers 0 to 65535 into strings of 0s and 1s and store them, and there is more than one way to do that. This is where the UTFs (Unicode Transformation Formats) come in, such as UTF-8 and UTF-16.
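The two steps can be seen side by side in a short Python sketch. It assumes Python 3, where a string holds character numbers and `str.encode` performs the storage step; the variable names are only illustrative.

```python
# U+6C49 is the Chinese character "Han" used later in this article.
han = "\u6c49"

# Step 1: the character's number; no bytes are involved yet.
code_point = ord(han)
print(code_point, hex(code_point))  # 27721 0x6c49

# Step 2: the same number stored under three different encodings.
for name in ("utf-8", "utf-16-be", "utf-32-be"):
    data = han.encode(name)
    print(name, data.hex(), len(data), "bytes")
```

The number stays the same throughout; only the stored bytes and their count change with the encoding.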

2. Differences between UTF-8 and UTF-16

UTF-16 is the easier one to understand: in this first version, the number for every character is stored in two bytes. The common misunderstanding about Unicode is to equate Unicode with UTF-16. For English letters this is clearly wasteful: one byte is enough to represent such a character, so why spend two?

Hence UTF-8. The "8" is misleading: does it mean that 8 bits, one byte, represent one character? It does not. UTF-8 is a variable-length encoding: a character takes one, two, or three bytes, depending on the size of its number. For the 16-bit range discussed here, it never takes more than three bytes.

The trade-off between UTF-8 and UTF-16 is now easy to see. If the text is all English, or mixed with English in the vast majority, UTF-8 saves a lot of space compared with UTF-16. If the text is all Chinese or similar characters, or mixed with Chinese in the vast majority, UTF-16 has the advantage, since it needs two bytes where UTF-8 needs three. There is also the question of fault tolerance, discussed later.
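The space trade-off is easy to measure. The following is a minimal sketch with two made-up sample strings; `encoded_sizes` is a helper name invented for this example, and the sizes exclude any mark at the start of the file.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a leading mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


english = "hello world"          # 11 ASCII characters
chinese = "\u6c49\u5b57" * 5     # 10 Chinese characters

print(encoded_sizes(english))    # (11, 22)
print(encoded_sizes(chinese))    # (30, 20)
```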

 

This may still seem a bit abstract, so take an example. The Unicode number of the Chinese character "Han" is 6C49 in hexadecimal, or 27721 in decimal. Why not write it in decimal? Simply because hexadecimal is shorter; the two are equivalent, just like 60 minutes and 1 hour. You might ask how a program knows whether a file is UTF-8 or UTF-16 when it opens it. Files can carry a mark for this purpose in their first few bytes.

EF BB BF indicates UTF-8.

FE FF indicates UTF-16 (big-endian; the reversed order FF FE indicates little-endian).

Representing "Han" in UTF-16

In UTF-16, "Han" is stored as 01101100 01001001 (16 bits in total, two bytes). Once the program knows the data is UTF-16, it parses it two bytes at a time, which is very simple.
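The bit pattern above can be reproduced with Python's built-in codecs. This sketch assumes the big-endian form without a leading mark, which is what the `utf-16-be` codec produces.

```python
han = "\u6c49"
utf16 = han.encode("utf-16-be")   # big-endian, no mark at the start

# Show each byte as eight binary digits.
bits = " ".join(f"{byte:08b}" for byte in utf16)
print(bits)         # 01101100 01001001
print(utf16.hex())  # 6c49
```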

Representing "Han" in UTF-8

UTF-8 is more complicated. The program reads the data byte by byte and uses the flag bits at the start of each leading byte to decide whether one, two, or three bytes form a unit.

0xxxxxxx: a byte that starts with 0 stands alone as one unit (x stands for any bit). This is exactly the same as ASCII.

110xxxxx 10xxxxxx: in this format, two bytes form one unit.

1110xxxx 10xxxxxx 10xxxxxx: in this format, three bytes form one unit.
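The three patterns translate directly into bit masks. The sketch below is an illustration only, not a full decoder: `utf8_unit_length` and `split_units` are hypothetical helper names, the continuation bytes are not validated, and longer sequences are deliberately rejected.

```python
def utf8_unit_length(lead: int) -> int:
    """Return how many bytes form one unit, judged from the leading byte."""
    if (lead & 0b1000_0000) == 0b0000_0000:   # 0xxxxxxx
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:   # 110xxxxx
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:   # 1110xxxx
        return 3
    raise ValueError(f"not a leading byte covered here: {lead:08b}")


def split_units(data: bytes) -> list:
    """Cut a UTF-8 byte string into per-character units."""
    units, i = [], 0
    while i < len(data):
        n = utf8_unit_length(data[i])
        units.append(data[i:i + n])
        i += n
    return units


sample = "a\u00e9\u6c49".encode("utf-8")   # 1-, 2-, and 3-byte characters
print([u.hex() for u in split_units(sample)])  # ['61', 'c3a9', 'e6b189']
```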

This is an agreed rule, and any data expressed in UTF-8 must follow it. UTF-16, as described so far, needs no such marker bits, so its two bytes can represent 2^16 = 65536 characters.

Because of the extra marker bits in UTF-8, one byte can represent only 2^7 = 128 characters and two bytes only 2^11 = 2048 characters. Three bytes can represent 2^16 = 65536 characters.
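These limits can be checked by counting the free x bits in each pattern and by encoding the numbers on either side of each boundary. This is a small sketch; `utf8_length` is a helper name made up for the example.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes Python's UTF-8 codec uses for this character number."""
    return len(chr(code_point).encode("utf-8"))


# Free x bits per pattern: 7, then 5 + 6 = 11, then 4 + 6 + 6 = 16.
for free_bits in (7, 11, 16):
    print(free_bits, "bits ->", 2 ** free_bits, "values")

# The largest number that fits in each length, and the first that does not.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF):
    print(hex(cp), utf8_length(cp), "byte(s)")
```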

Because the number for "Han" is 27721, which is greater than 2047, two bytes are not enough, and it must be represented with three bytes.

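So the format 1110xxxx 10xxxxxx 10xxxxxx must be used, with the binary digits of 27721 filled into the x positions. In UTF-8 the fill order is fixed: the highest bits go into the first byte. The 16 bits 0110 1100 0100 1001 are split into groups of 4, 6, and 6 bits, giving 11100110 10110001 10001001, or E6 B1 89 in hexadecimal.

This raises the related question of byte order, which applies to UTF-16 and UTF-32 rather than to UTF-8. The two bytes of a UTF-16 unit can be stored in either order, and this is where the terms big-endian and little-endian come from: big-endian stores the most significant byte first (6C 49 for "Han"), and little-endian stores the least significant byte first (49 6C).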
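Filling the x positions by hand looks like this. It is a minimal sketch that assumes the highest bits go into the first byte, which is the order UTF-8 defines; `encode_three_byte` is a hypothetical name, and the function skips checks a real encoder would make, such as rejecting the surrogate range.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Fill the 16 bits of code_point into 1110xxxx 10xxxxxx 10xxxxxx."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles three-byte characters")
    first = 0b1110_0000 | (code_point >> 12)            # top 4 bits
    second = 0b1000_0000 | ((code_point >> 6) & 0x3F)   # middle 6 bits
    third = 0b1000_0000 | (code_point & 0x3F)           # low 6 bits
    return bytes([first, second, third])


encoded = encode_three_byte(0x6C49)
print(" ".join(f"{b:08b}" for b in encoded))  # 11100110 10110001 10001001
print(encoded == "\u6c49".encode("utf-8"))    # True

# Byte order only shows up in UTF-16 and UTF-32.
print("\u6c49".encode("utf-16-be").hex())     # 6c49
print("\u6c49".encode("utf-16-le").hex())     # 496c
```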

The marker bits also matter for fault tolerance. Because every byte in UTF-8 announces whether it starts a unit or continues one, a parser that hits a damaged or missing byte can skip ahead to the next leading byte and carry on, so the error stays confined to the affected character. UTF-16 has no such marks: if a byte is lost in transfer, every later two-byte unit is misaligned and parsed incorrectly. If a byte is merely altered rather than lost, the damage stays local in both encodings.
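A small experiment makes the fault-tolerance behavior concrete: lose one byte from each encoding and see how much of the text survives. This is a sketch using Python's built-in decoders with `errors="replace"`; `drop_byte` is a helper invented for the example, and other decoders may report the damage differently.

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate a transfer error that loses one byte."""
    return data[:index] + data[index + 1:]


text = "\u6c49\u5b57ab"   # two Chinese characters followed by "ab"

damaged_utf8 = drop_byte(text.encode("utf-8"), 1)
damaged_utf16 = drop_byte(text.encode("utf-16-be"), 1)

# errors="replace" substitutes U+FFFD for anything that cannot be decoded.
recovered_utf8 = damaged_utf8.decode("utf-8", errors="replace")
recovered_utf16 = damaged_utf16.decode("utf-16-be", errors="replace")

print(recovered_utf8.endswith("\u5b57ab"))   # True: only the first character is lost
print("ab" in recovered_utf16)               # False: every later unit is misaligned
```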

Unicode Version 2

Everything above describes the first version of Unicode. But 65536 is not a very large number: it is enough for the characters in common use, but not once many rarer ones are added. So a second version was introduced in 1996, in which the range of numbers extends beyond 16 bits and a character may need up to four bytes. This gives UTF-8, UTF-16, and UTF-32. The principle is the same as before. UTF-32 stores every character in 32 bits, that is, 4 bytes. UTF-8 and UTF-16 vary with the character: UTF-8 uses one to four bytes (the original design allowed longer sequences, but the current standard stops at four), and UTF-16 uses either two or four bytes. Since the principles are the same as in the first version, I will not go into more detail.
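A character outside the 16-bit range shows the difference. The sketch below picks U+20000, a rare CJK ideograph, purely as an example; any character above U+FFFF would behave the same way. It assumes Python 3.8 or later for the separator argument of `bytes.hex`.

```python
# U+20000 lies outside the 16-bit range (chosen here only as an example).
rare = "\U00020000"

for name in ("utf-8", "utf-16-be", "utf-32-be"):
    data = rare.encode(name)
    print(name, data.hex(" "), len(data), "bytes")

# utf-8     f0 a0 80 80   4 bytes
# utf-16-be d8 40 dc 00   4 bytes (two 2-byte units)
# utf-32-be 00 02 00 00   4 bytes
```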

As mentioned above, to find out which encoding a text uses, a program checks the mark at the beginning of the text. The marks for each encoding are listed below.

EF BB BF UTF-8
FE FF UTF-16/UCS-2, big-endian
FF FE UTF-16/UCS-2, little-endian
FF FE 00 00 UTF-32/UCS-4, little-endian
00 00 FE FF UTF-32/UCS-4, big-endian
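A program could check for these marks roughly as follows. This is a minimal sketch: `sniff_encoding` and `BOM_TABLE` are names made up for the example, while the mark constants come from Python's standard `codecs` module. Note that the four-byte marks have to be tested first, because FF FE 00 00 begins with FF FE.

```python
import codecs
from typing import Optional

# Longer marks come first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading mark, or None if there is none."""
    for mark, name in BOM_TABLE:
        if data.startswith(mark):
            return name
    return None


print(sniff_encoding(b"\xef\xbb\xbfabc"))    # utf-8
print(sniff_encoding(b"\xfe\xff\x6c\x49"))   # utf-16-be
print(sniff_encoding(b"abc"))                # None
```

A mark is optional, so a result of None does not say which encoding the data uses; it only says that the data does not announce one.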

UCS is the ISO standard. Its character assignments are the same as Unicode's; only the name differs. UCS-2 corresponds to UTF-16 and UCS-4 corresponds to UTF-32. UTF-8 has no corresponding UCS form.

 
