Character Encoding Series, Part 4: UTF-8, an Implementation of Unicode

Source: Internet
Author: User

Before starting this article, make sure you are clear on the distinction between a Unicode code point (the abstract number assigned to a character) and a Unicode encoding implementation (how that number is stored as bytes). Otherwise, what follows will not make much sense.


We know that the ISO 10646 committee defines a super character set called the Universal Character Set (UCS), intended to encompass all the writing systems in the world. The UCS code space was defined with up to 4 bytes per code point, and implementations such as UTF-16 and UTF-32 (note: these are encoding implementations, not the character set itself) use multi-byte units, which makes them incompatible with systems built around US-ASCII. UTF-8 was created to solve this. One of UTF-8's goals is to encode ASCII characters with exactly the same bytes that ASCII uses.

Historically, UTF-8 used 1 to 6 bytes to encode a character, meaning that the corresponding abstract code point could reach U+7FFFFFFF. However, it turned out that 21 bits are enough to cover all the world's writing systems, so no code point needs more than 4 bytes. The valid Unicode code point range is therefore U+0000 to U+10FFFF. RFC 3629 declared the previous UTF-8 standard (RFC 2279) obsolete, and the current UTF-8 standard uses 1 to 4 bytes to encode a character.

There are five Unicode-related encodings: UTF-8, UCS-2, UTF-16, UCS-4, and UTF-32. If you want to count UTF-7 as well, that is fine too.
Subsequent articles in this series will cover each of the encodings listed above.


UTF-8, as the name implies, works in units of 8 bits, that is, bytes. This does not mean that a character encoded in UTF-8 takes one byte, but that the encoded length grows one byte at a time: 1 byte, 2 bytes, 3 bytes, or 4 bytes.
UTF-16, by contrast, grows in 16-bit units: 2 bytes or 4 bytes.
UTF-8 and UTF-16 are both variable-length encodings. The following table shows how UTF-8 is implemented:

   Code point range (hexadecimal) | UTF-8 byte sequence (binary)
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Each x on the right stands for one binary digit (bit) of the code point.
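The table can be turned into code almost row by row. The following is a minimal Python sketch, not a production encoder; the function name `encode_utf8` is made up for this illustration. It picks the row by code point range, then fills the x positions with bits of the code point, highest bits first.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single code point by following the table, one row per branch."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # This range is reserved for UTF-16 surrogates and is not encodable.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        # Row 1: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Row 2: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Row 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Row 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

A quick way to check such a sketch is to compare its output with Python's built-in encoder, for example `encode_utf8(0x4E2D) == "中".encode("utf-8")`.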

First, the table makes it clear that for US-ASCII characters, the UTF-8 and ASCII encodings are exactly the same. This is what American (ASCII-based) software likes best about it.

The table also shows that UTF-8 is a self-synchronizing encoding. When we read a byte, if it starts with 0, that byte alone is the encoding of a character; if it starts with 110, that byte together with the next byte is the encoding of a character; and so on. In other words, the first byte tells us how many bytes the character's encoding occupies. Bytes that start with 10 are always continuation bytes and never begin a character. Because UTF-8 is self-synchronizing, we can start reading at any position in the byte stream and, after skipping at most 3 bytes, find a character boundary.

In UTF-8, the bytes C0, C1, and F5 through FF are never allowed to appear. That is the rule. In addition, UTF-8 refuses to encode the code points from U+D800 to U+DFFF. This is because UTF-16 uses that range for surrogate pairs, so those code points do not directly represent characters. The rule exists to keep a one-to-one correspondence between UTF-8 and UTF-16. (As an aside, the range U+D800 to U+DFFF contains 2048 code points, that is, 2K.) Conversion between UTF-8 and UTF-16 is normally done by going through the intermediate Unicode code points.
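The surrogate restriction and the conversion through code points can both be observed with Python's built-in codecs. This is only a small sketch: the helper names are made up for illustration, and "utf-16-be" is chosen here simply to get UTF-16 bytes without a BOM.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 bytes to big-endian UTF-16 bytes via code points."""
    code_points = data.decode("utf-8")        # UTF-8 bytes -> code points (str)
    return code_points.encode("utf-16-be")    # code points -> UTF-16 bytes


def is_encodable_in_utf8(code_point: int) -> bool:
    """Return True if Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Size of the surrogate range U+D800..U+DFFF.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1  # 2048
```

A character outside the Basic Multilingual Plane such as U+1F600 takes 4 bytes in UTF-8 and becomes the surrogate pair D83D DE00 in UTF-16, while the surrogate code points themselves are rejected by the UTF-8 encoder.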

Decoding also requires care. For example, the byte sequence C0 80 appears to follow the UTF-8 pattern, but it is actually an illegal sequence and must be rejected when decoding:

1100 0000 1000 0000

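If we decode it mechanically according to the table above, the code point we get is U+0000, which is obviously wrong: U+0000 must be encoded as the single byte 00, and a longer ("overlong") form of the same code point is not valid UTF-8.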
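The difference between mechanical and strict decoding of C0 80 can be sketched in Python as follows. Both functions are hypothetical helpers that handle only the two-byte row of the table; a real decoder would also verify the bit patterns of every byte.

```python
def naive_decode_two_bytes(data: bytes) -> int:
    """Apply the 110xxxxx 10xxxxxx row mechanically, with no range check."""
    return ((data[0] & 0x1F) << 6) | (data[1] & 0x3F)


def strict_decode_two_bytes(data: bytes) -> int:
    """Same as above, but reject overlong forms.

    A two-byte sequence must encode a code point of at least U+0080;
    anything smaller should have used the one-byte form.
    """
    code_point = naive_decode_two_bytes(data)
    if code_point < 0x80:
        raise ValueError("overlong encoding")
    return code_point
```

The naive version returns 0 for `b"\xc0\x80"`, while the strict version raises an error. Python's built-in `bytes.decode("utf-8")` also rejects this sequence with a `UnicodeDecodeError`.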

Once more: anyone implementing a UTF-8 converter should pay attention to the security issues that illegal UTF-8 sequences can cause.

Speaking of UTF-8, we also have to talk about the BOM (byte order mark).
Among Unicode code points, U+FEFF is named ZERO WIDTH NO-BREAK SPACE. In the other encodings, this is the famous BOM; under the current standard, U+FEFF is used only at the start of a character stream to indicate the byte order (big-endian or little-endian).
For UTF-8, however, which works in units of single bytes, indicating byte order is meaningless, because byte order describes how the bytes within a multi-byte unit are stored. In UTF-8, U+FEFF is simply encoded as the three bytes EF BB BF. This is the UTF-8 BOM: a stream that begins with EF BB BF is signaling that its encoding is UTF-8.
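In Python, the UTF-8 BOM is available as the constant `codecs.BOM_UTF8`, and the standard `utf-8-sig` codec drops a leading BOM when decoding. The helper below is a hypothetical example of stripping the BOM by hand.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading EF BB BF from the data, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# U+FEFF encoded in UTF-8 is exactly the three BOM bytes.
bom_bytes = "\ufeff".encode("utf-8")
```

Decoding with plain `"utf-8"` keeps U+FEFF as the first character of the result, so `"utf-8-sig"` or an explicit strip is usually what is wanted when reading files that may start with a BOM.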



The characters of the CJK (Chinese, Japanese, Korean) or CJKV (Chinese, Japanese, Korean, and Vietnamese) languages occupy roughly U+3000 to U+9FFF in Unicode, and the conversion table above shows that the UTF-8 encoding for this range is three bytes. So a Chinese character in this range, encoded in UTF-8, always takes three bytes (the less common ideographs assigned beyond U+FFFF take four). This is a pitfall to keep in mind. (As an aside, I hope readers who have made it this far are now clear on the difference between Unicode code points and the UTF family of implementations.)
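The three-byte claim is easy to check with Python's built-in encoder. This is a small sketch; the helper name `utf8_length` is made up for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Every code point from U+3000 through U+9FFF encodes to exactly three bytes.
all_three_bytes = all(utf8_length(chr(cp)) == 3 for cp in range(0x3000, 0xA000))
```

By contrast, `utf8_length("A")` is 1, and an ideograph outside the Basic Multilingual Plane such as U+20000 takes 4 bytes.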
