More on UTF-8 Encoding


Earlier today, Node.js released an update that changes how invalid UTF-8 strings are handled when converted to a Buffer. That sent me back to the UTF-8 validation code in websocket-driver, where I discovered I had forgotten how the regular expression it uses for validation actually works; I had originally copied it off a web page. After a while I figured it out again, and since anyone who writes text-processing programs may need to understand this too, I thought I should write it down.

First you need to know that Unicode and UTF-8 are not the same thing. Unicode is a standard that aims to assign a number to every character in all of the world's writing systems. For example, the number 65, or U+0041, refers to the uppercase letter 'A'; 90, or U+005A, is the uppercase letter 'Z'; and 32, or U+0020, is a space. U+02A4 is the character 'ʤ', U+046C is 'Ѭ', U+0BF5 is '௵', and so on. The numbers, or 'code points', range up to U+10FFFF, which is 1,114,111 in decimal.
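To make the character-to-number mapping concrete, here is a minimal sketch using standard JavaScript behaviour (these calls work the same in any JS runtime):

  'A'.charCodeAt(0);        // 65, i.e. U+0041
  String.fromCharCode(90);  // 'Z', i.e. U+005A
  '\u0020';                 // ' ', a space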

A Unicode string, that is, a sequence of characters, is really a sequence of numbers between 0 and 1,114,111. How those numbers turn into the characters you see on screen depends on the font used to render them. When we send text over a TCP connection or save it to disk, we store it as a sequence of bytes. But an 8-bit byte can only represent 256 values, so how do we represent 1,114,112 possible code points? This is where encodings come in.

UTF-8 is one of many Unicode encodings. An encoding defines a mapping between byte sequences and code point sequences, and tells us how to convert between the two. UTF-8 is the most common encoding on the Web, and it is the encoding used for text messages in the WebSocket protocol.

So how does UTF-8 work? The first thing to recognise is that we cannot map every code point to a single byte: most code point values are far too large. We cannot even use all 256 byte values from 00 to FF for the first 256 code points, because then no byte values would be left over to express the higher ones. What we can do is use the range 00 to 7F (0 to 127) for the first 128 code points, leaving 80 to FF to help represent the rest. The first 128 code points are each represented by a single byte using its low 7 bits (in the tables below, each line shows the smallest and largest allowed byte value, in both binary and hex):

  U+0000 to U+007F:

  00000000 00 -- 7F 01111111

This is the distinctive thing about UTF-8: it does not use 3 bytes for every code point (1,114,111 needs 21 bits), but a variable number of bytes, from 1 to 4. The first 128 code points take one byte each, and the remaining code points are expressed through combinations involving the other 128 byte values (note: each 8-bit byte has 256 possible values; single-byte UTF-8 encoding uses the lower 128, leaving the upper 128 for the other code points). There are two advantages to this, although one of them mainly benefits programmers and English speakers. The first is that UTF-8 is backward compatible with ASCII: every valid ASCII document is a valid UTF-8 document with exactly the same meaning, byte for byte. The second advantage follows from the first: we never need two or three bytes per character when transmitting English text, so it stays compact, as the sketch below shows.
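A quick illustration of the ASCII-compatibility point in Node (using new Buffer, the Buffer API of the era; later Node versions use Buffer.from instead):

  // Plain English text encodes to one byte per character, and the
  // bytes are exactly the ASCII values:
  new Buffer('Hello', 'utf8');  // <Buffer 48 65 6c 6c 6f>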

Within the single-byte range we only get seven usable bits. To express larger values we need more bytes. The two-byte sequences defined by UTF-8 take the form 110xxxxx 10yyyyyy. The x and y bits are variable, which gives us 11 bits, enough to reach U+07FF.

  U+0080 to U+07FF:

  11000010 C2 -- DF 11011111
  10000000 80 -- BF 10111111

That is to say, the code point U+0080 becomes the bytes C2 80, and the code point U+07FF becomes DF BF. Note that using more space than a code point actually needs is an error: C1 BF, or 11000001 10111111, would decode to U+007F, but that code point can be represented with a single byte. Such over-long encodings are forbidden, so C1 BF is not a legal byte sequence.
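We can check the two boundary values in Node (a quick sketch; the exact Buffer display may vary between versions):

  new Buffer('\u0080', 'utf8');  // <Buffer c2 80>
  new Buffer('\u07ff', 'utf8');  // <Buffer df bf>
  // A decoder must reject the over-long sequence C1 BF rather than
  // treat it as another spelling of U+007F.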

In general, a multi-byte code point is made up of a special leading byte followed by one or more bytes of the form 10xxxxxx, that is, bytes of 80 or above, with their high bit set; the usable range for these continuation bytes is 80 to BF. Bytes below 80 serve as single-byte code points, and it is an error for them to appear inside a multi-byte sequence. The value of the leading byte tells us how many bytes follow it. A small sketch after this paragraph shows the structure in code.
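To make the structure concrete, here is a minimal sketch of an encoder for the one- and two-byte forms we have seen so far (illustrative only; a real encoder must also handle the longer sequences described below, and reject surrogates):

  function encodeCodePoint(cp) {
    if (cp <= 0x7F) return [cp];        // single byte: 0xxxxxxx
    if (cp <= 0x7FF) {
      return [0xC0 | (cp >> 6),         // leading byte: 110xxxxx
              0x80 | (cp & 0x3F)];      // continuation byte: 10yyyyyy
    }
    throw new Error('3- and 4-byte sequences not covered by this sketch');
  }

  encodeCodePoint(0x41);   // [0x41]         -> 'A'
  encodeCodePoint(0x80);   // [0xC2, 0x80]   -> C2 80
  encodeCodePoint(0x7FF);  // [0xDF, 0xBF]   -> DF BF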

Next come the three-byte code points, which take the form 1110xxxx 10yyyyyy 10zzzzzz. That gives us 16 bits of data, enough to take our code points up to U+FFFF. But here we run into a historical problem. Unicode was first described in the Unicode 88 white paper, which said:

The idea of expanding the basis for character encoding from 8 to 16 bits is so sensible that the mind initially recoils from it. 16 bits can provide at most 65,536 distinct code values. Are these sufficient to encode all the characters of the world? Since the definition of 'character' is itself part of the design of a text encoding scheme, the question is meaningless unless it is restated as: is it possible to engineer a reasonable definition of 'character' such that the world's writing systems contain fewer than 65,536 of them? The answer is Yes. -Joseph D. Becker PhD, 'Unicode 88'

Of course, the answer turned out to be No, as you might guess from the fact that there are 1,114,112 code points. By the time UTF-16 was designed (an encoding originally intended to be a fixed-width two-byte format), it was clear that 16 bits could not encode all known characters. So the Unicode standard reserves a special range of code points that allows UTF-16 to encode values above FFFF. Such a value is encoded in four bytes, that is, as two ordinary 16-bit units: the first unit lies in the range D800 to DBFF, and the second in the range DC00 to DFFF. Code points in the range U+D800 to U+DFFF are called surrogates, and UTF-16 uses surrogate pairs to represent the larger values. No characters will ever be assigned to these code points, and no encoding should ever represent them.
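JavaScript strings are themselves sequences of UTF-16 code units, so we can watch surrogate pairs at work directly (a sketch; U+1F600 is just an example code point above FFFF):

  var s = '\uD83D\uDE00';          // the surrogate pair for U+1F600
  s.length;                        // 2: two 16-bit units, not one character
  s.charCodeAt(0).toString(16);    // 'd83d', in the high range D800-DBFF
  s.charCodeAt(1).toString(16);    // 'de00', in the low range DC00-DFFF
  // Recombining the pair into a single code point:
  ((0xD83D - 0xD800) * 0x400 + (0xDE00 - 0xDC00) + 0x10000).toString(16);  // '1f600'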

Therefore, with 3-byte sequences we can only encode U+0800 to U+D7FF and U+E000 to U+FFFF:

  U+0800 to U+D7FF:

  11100000 E0 -- ED 11101101
  10100000 A0 -- 9F 10011111
  10000000 80 -- BF 10111111

  U+E000 to U+FFFF:

  11101110 EE -- EF 11101111
  10000000 80 -- BF 10111111
  10000000 80 -- BF 10111111

Now we finally arrive at the 4-byte sequences, which take the form 11110www 10xxxxxx 10yyyyyy 10zzzzzz. That gives us 21 bits, enough to reach U+1FFFFF. There are no holes in this part of the range, but since the code points stop at U+10FFFF we do not need all of those values to cover the remaining characters, so the final result is:

  U+010000 to U+10FFFF:

  11110000 F0 -- F4 11110100
  10010000 90 -- 8F 10001111
  10000000 80 -- BF 10111111
  10000000 80 -- BF 10111111

Now we have covered all the valid byte sequences that represent a single character in UTF-8. They are:

  [00-7F]
  [C2-DF] [80-BF]
  E0 [A0-BF] [80-BF]
  [E1-EC] [80-BF] [80-BF]
  ED [80-9F] [80-BF]
  [EE-EF] [80-BF] [80-BF]
  F0 [90-BF] [80-BF] [80-BF]
  [F1-F3] [80-BF] [80-BF] [80-BF]
  F4 [80-8F] [80-BF] [80-BF]

These can be matched using a regular expression, but remember that regular expressions operate on characters, not bytes. In Node we can use buffer.toString('binary') to convert a Buffer into a string whose characters carry the literal values of the bytes as code points (that is, 0 to 255), and then validate that string with the regular expression.
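Putting the nine patterns together gives a validation regex. The following is a sketch assembled directly from the list above; it follows the same approach as websocket-driver, but is not guaranteed to be byte-for-byte identical to that library's code:

  var VALID_UTF8 = new RegExp('^(?:' +
    '[\\x00-\\x7F]'                      + '|' +  // 1 byte
    '[\\xC2-\\xDF][\\x80-\\xBF]'         + '|' +  // 2 bytes
    '\\xE0[\\xA0-\\xBF][\\x80-\\xBF]'    + '|' +  // 3 bytes...
    '[\\xE1-\\xEC][\\x80-\\xBF]{2}'      + '|' +
    '\\xED[\\x80-\\x9F][\\x80-\\xBF]'    + '|' +
    '[\\xEE-\\xEF][\\x80-\\xBF]{2}'      + '|' +
    '\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}' + '|' +  // 4 bytes...
    '[\\xF1-\\xF3][\\x80-\\xBF]{3}'      + '|' +
    '\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}' +
  ')*$');

  function isValidUTF8(buffer) {
    // 'binary' maps each byte to the character with the same value (0-255)
    return VALID_UTF8.test(buffer.toString('binary'));
  }

  isValidUTF8(new Buffer([0x61, 0xC2, 0x80]));  // true:  'a' then U+0080
  isValidUTF8(new Buffer([0xC1, 0xBF]));        // false: over-long encoding
  isValidUTF8(new Buffer([0xED, 0xA0, 0x80]));  // false: the surrogate U+D800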

Now that we understand how UTF-8 works, we can also understand what Node changed.

  // Prior to these releases:
  new Buffer('ab\ud800cd', 'utf8');
  // <Buffer 61 62 ed a0 80 63 64>

  // After this release:
  new Buffer('ab\ud800cd', 'utf8');
  // <Buffer 61 62 ef bf bd 63 64>

The character \ud800 is a surrogate with no corresponding character, so on its own it is an invalid character. JavaScript, however, allows such a string to exist without throwing an error, so Node cannot simply raise one when asked to convert it to a Buffer. Previously it wrote the surrogate out as though it were an ordinary code point; now the character is replaced with '\ufffd', the replacement character. To stop your program from sending a string that JavaScript considers valid but that the receiving end will reject as malformed UTF-8, Node substitutes a non-surrogate character, avoiding errors in downstream programs. When dealing with strange input I usually advise against guessing what the programmer really meant, but since Unicode provides a code point expressly 'used to replace an incoming character whose value is unknown or unrepresentable in Unicode', it seems a good choice here.
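As a final check, the replacement character itself encodes to exactly the bytes we saw in the new output (a one-line sketch):

  new Buffer('\ufffd', 'utf8');  // <Buffer ef bf bd>: U+FFFD in UTF-8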
