Decoding UTF-8

For more information about the UTF-8 encoding rules, see my article: http://blog.csdn.net/sheismylife/article/details/8570015

In my other article "UTF-8 encoding test" http://blog.csdn.net/sheismylife/article/details/8571726, I used code from the Boost.Locale library to decode UTF-8. Now I'll take a closer look at the decoding algorithm:

How can we distinguish a leading byte from continuation bytes? The key is that every continuation byte starts with the bits 10. The following function determines whether a value is a continuation byte:

bool is_trail(char ci) {
    unsigned char c = ci;
    return (c & 0xC0) == 0x80;
}

0xC0 is 1100 0000 in binary, so the bitwise AND clears the low six bits of c and keeps only its two high bits.

0x80 is 1000 0000 in binary. If the masked value equals 0x80, the two high bits of c are 10, so c is a continuation byte and the function returns true.

With this function, it is easy to determine whether a byte is a leading byte:

bool is_lead(char ci) {
    return !is_trail(ci);
}
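One use of these two predicates can be sketched as follows: counting the characters in a string by counting the bytes that are not continuation bytes. The helper name count_characters is hypothetical, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same test as in the article: the top two bits must be 10.
bool is_trail(char ci) {
    unsigned char c = ci;
    return (c & 0xC0) == 0x80;
}

bool is_lead(char ci) {
    return !is_trail(ci);
}

// Hypothetical helper: every character contributes exactly one
// non-continuation byte, so counting those counts the characters.
// Assumes the input is valid UTF-8.
std::size_t count_characters(const std::string& utf8) {
    std::size_t n = 0;
    for (char ch : utf8) {
        if (is_lead(ch)) ++n;
    }
    return n;
}
```

For the string "a" followed by the three bytes E2 82 AC (the euro sign), count_characters should report 2, since only 'a' and 0xE2 are leading bytes.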

Let's take a look at the trail_length function. It analyzes a leading byte to determine how many continuation bytes follow it.

int trail_length(char ci) {
    unsigned char c = ci;
    if(c < 128)
        return 0;
    if(BOOST_LOCALE_UNLIKELY(c < 194))
        return -1;
    if(c < 224)
        return 1;
    if(c < 240)
        return 2;
    if(BOOST_LOCALE_LIKELY(c <=244))
        return 3;
    return -1;
}

If c < 128, it is an ASCII character, expressed in one byte. Therefore, the number of continuation bytes is 0, and the total length of the UTF-8 encoding is 1.

Because 11011111 is 223, a two-byte leading byte (110xxxxx) lies in the range [192,224). The code accepts only [194,224): values in [128,192) start with 10 and are continuation bytes, not leading bytes, and 192 and 193 could only begin overlong two-byte encodings of ASCII characters, so trail_length returns -1 for all of them. For c in [194,224), there is 1 continuation byte, and the total length of the UTF-8 encoding is 2.

Continuing, 11101111 equals 239. Therefore, in the range [224,240), there are 2 continuation bytes, and the total length of the UTF-8 encoding is 3.

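11110111 is equal to 247, so in the range [240,248) there would be 3 continuation bytes, and the total length of the UTF-8 encoding would be 4. However, the range accepted here is actually [240,244]. The reason is that Unicode ends at U+10FFFF, and RFC 3629 limits UTF-8 to that range: the largest code point is encoded as F4 8F BF BF, so a leading byte of 245 (0xF5) or higher could only begin a code point beyond the limit.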
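The boundaries discussed above can be checked with a small stand-in for trail_length. This is a sketch rather than the Boost source: it assumes the BOOST_LOCALE_LIKELY and BOOST_LOCALE_UNLIKELY macros are only branch hints and drops them, and it takes an unsigned char directly.

```cpp
#include <cassert>

// Stand-in for the article's trail_length, without the BOOST_LOCALE_*
// macros (assumed here to be branch hints that do not change the result).
int trail_length(unsigned char c) {
    if (c < 128) return 0;   // 0xxxxxxx: ASCII, no continuation bytes
    if (c < 194) return -1;  // 0x80..0xBF are continuation bytes; 0xC0, 0xC1 are rejected
    if (c < 224) return 1;   // 110xxxxx (0xC2..0xDF)
    if (c < 240) return 2;   // 1110xxxx (0xE0..0xEF)
    if (c <= 244) return 3;  // 11110xxx, capped at 0xF4
    return -1;               // 0xF5..0xFF are never valid leading bytes
}
```

With this stand-in, 0xC1 and 0xF5 both yield -1, while their neighbours 0xC2 and 0xF4 yield 1 and 3.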

Now let's take a look at the code in utf.hpp: http://www.boost.org/doc/libs/1_53_0/libs/locale/doc/html/utf_8hpp_source.html

00192         template<typename Iterator>
00193         static code_point decode(Iterator &p,Iterator e)
00194         {
00195             if(BOOST_LOCALE_UNLIKELY(p==e))
00196                 return incomplete;
00197
00198             unsigned char lead = *p++;
00199
00200             // First byte is fully validated here
00201             int trail_size = trail_length(lead);
00202
00203             if(BOOST_LOCALE_UNLIKELY(trail_size < 0))
00204                 return illegal;
00205
00206             //
00207             // Ok as only ASCII may be of size = 0
00208             // also optimize for ASCII text
00209             //
00210             if(trail_size == 0)
00211                 return lead;
00212
00213             code_point c = lead & ((1<<(6-trail_size))-1);
00214
00215             // Read the rest
00216             unsigned char tmp;
00217             switch(trail_size) {
00218             case 3:
00219                 if(BOOST_LOCALE_UNLIKELY(p==e))
00220                     return incomplete;
00221                 tmp = *p++;
00222                 if (!is_trail(tmp))
00223                     return illegal;
00224                 c = (c << 6) | ( tmp & 0x3F);
00225             case 2:
00226                 if(BOOST_LOCALE_UNLIKELY(p==e))
00227                     return incomplete;
00228                 tmp = *p++;
00229                 if (!is_trail(tmp))
00230                     return illegal;
00231                 c = (c << 6) | ( tmp & 0x3F);
00232             case 1:
00233                 if(BOOST_LOCALE_UNLIKELY(p==e))
00234                     return incomplete;
00235                 tmp = *p++;
00236                 if (!is_trail(tmp))
00237                     return illegal;
00238                 c = (c << 6) | ( tmp & 0x3F);
00239             }
00240
00241             // Check code point validity: no surrogates and
00242             // valid range
00243             if(BOOST_LOCALE_UNLIKELY(!is_valid_codepoint(c)))
00244                 return illegal;
00245
00246             // make sure it is the most compact representation
00247             if(BOOST_LOCALE_UNLIKELY(width(c)!=trail_size + 1))
00248                 return illegal;
00249
00250             return c;
00251
00252         }

The decode function parses one UTF-8 encoded character (1 to 4 bytes) starting at p and returns the corresponding code point, a uint32_t integer.

It always assumes that the first byte is a leading byte, then uses trail_length to get the number of continuation bytes. If that is 0, the byte is an ASCII character and is returned directly.

Lines 213 to 250 handle non-ASCII characters. Look at the switch/case first: there is no break statement. That is to say, if trail_size is 3, the statements under case 3 run first, and then execution falls through into case 2 and case 1 in turn (no further matching is performed). This fall-through is standard switch/case behavior. Reference: http://msdn.microsoft.com/en-US/library/k0t5wee3.aspx

A loop would express this more clearly, and I don't know why Artyom chose this form. Is it faster? At the very least, a code scanning tool would flag the three duplicated blocks with a warning. :)
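For comparison, a loop-based variant might look like the sketch below. It is not the Boost code: the name decode_loop is hypothetical, and the sentinel values for illegal and incomplete, as well as the bodies of width and is_valid_codepoint, are assumptions made to keep the example self-contained (they follow the comments in the listing: shortest form, no surrogates, valid range).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

typedef std::uint32_t code_point;
// Sentinel values assumed for this sketch; both lie outside the Unicode range.
const code_point illegal = 0xFFFFFFFFu;
const code_point incomplete = 0xFFFFFFFEu;

bool is_trail(unsigned char c) { return (c & 0xC0) == 0x80; }

int trail_length(unsigned char c) {
    if (c < 128) return 0;
    if (c < 194) return -1;
    if (c < 224) return 1;
    if (c < 240) return 2;
    if (c <= 244) return 3;
    return -1;
}

// Number of bytes the shortest UTF-8 encoding of v needs.
int width(code_point v) {
    if (v <= 0x7F) return 1;
    if (v <= 0x7FF) return 2;
    if (v <= 0xFFFF) return 3;
    return 4;
}

// Rejects values above U+10FFFF and the surrogate range U+D800..U+DFFF.
bool is_valid_codepoint(code_point v) {
    if (v > 0x10FFFF) return false;
    return v < 0xD800 || v > 0xDFFF;
}

template <typename Iterator>
code_point decode_loop(Iterator& p, Iterator e) {
    if (p == e) return incomplete;
    unsigned char lead = *p++;
    int trail_size = trail_length(lead);
    if (trail_size < 0) return illegal;
    if (trail_size == 0) return lead;

    code_point c = lead & ((1 << (6 - trail_size)) - 1);
    // One iteration per continuation byte replaces the three case labels.
    for (int i = 0; i < trail_size; ++i) {
        if (p == e) return incomplete;
        unsigned char tmp = *p++;
        if (!is_trail(tmp)) return illegal;
        c = (c << 6) | (tmp & 0x3F);
    }
    if (!is_valid_codepoint(c)) return illegal;          // surrogates, out of range
    if (width(c) != trail_size + 1) return illegal;      // overlong encoding
    return c;
}
```

Whether the unrolled switch is measurably faster than this loop is not something the sketch settles; a compiler may well unroll the loop itself.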

Now let's look at the Wiki example referenced in the previous article, which demonstrates how to encode the character € in UTF-8:

Step 1: Get the Unicode code point of €, which is U+20AC.
Step 2: U+20AC is above U+07FF and at most U+FFFF, so it is represented in three bytes.
Step 3: The binary form of 0x20AC is 10000010101100, 14 bits long. A 3-byte encoding carries 16 bits, so two leading zeros are added to make a 16-bit string: 0010000010101100. Call this the bit string.
Step 4: According to the rule, the leading byte starts with 1110, leaving four bits to fill. Take the four highest bits of the bit string, so the leading byte becomes 11100010, and the bit string becomes 000010101100.
Step 5: The first continuation byte starts with 10, leaving six bits to fill. Take the six highest bits of the bit string, so the first continuation byte is 10000010, and the bit string becomes 101100.
Step 6: The second continuation byte also starts with 10, leaving six bits to fill. Take the remaining six bits, so the second continuation byte is 10101100.

The final three-byte encoding is 11100010 10000010 10101100, written in hexadecimal as 0xE2 0x82 0xAC.
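The encoding steps for € can be sketched in a few lines. The helper name encode_3byte is hypothetical, and the sketch assumes the code point is in the three-byte range (above U+07FF, at most U+FFFF); it performs no validation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a code point from the three-byte range
// as UTF-8. No validation (for example of surrogates) is attempted.
std::string encode_3byte(std::uint32_t cp) {
    std::string out;
    out += static_cast<char>(0xE0 | (cp >> 12));          // 1110 + highest 4 bits
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10 + middle 6 bits
    out += static_cast<char>(0x80 | (cp & 0x3F));         // 10 + lowest 6 bits
    return out;
}
```

Calling encode_3byte(0x20AC) should produce the bytes E2 82 AC, matching the worked example.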

Decoding is the inverse of encoding. For the three-byte example above, it extracts the low 4 bits of the leading byte and the low 6 bits of each of the two continuation bytes, concatenates them into a 16-bit value, and returns that value as a uint32_t integer.

The following code extracts the low bits of the leading byte:

code_point c = lead & ((1<<(6-trail_size))-1);

This is a good technique. To generalize it, you can write a function that takes two parameters: the value x to extract bits from, and the number n of low bits to extract.

uint8_t GetLowNBit(uint8_t x, uint8_t n) {
    return x & ((1<<n)-1);
}

Why 6-trail_size? It is purely an observed pattern.

For a 2-byte UTF-8 encoding, the leading byte starts with 110. Therefore, we need to extract the low 5 bits, and 6-trail_size = 6-1 = 5, which is exactly the low 5 bits.

For a 3-byte UTF-8 encoding, the leading byte starts with 1110. Therefore, we need to extract the low 4 bits, and 6-trail_size = 6-2 = 4.

For a 4-byte UTF-8 encoding, the leading byte starts with 11110. Therefore, we need to extract the low 3 bits, and 6-trail_size = 6-3 = 3.

That is why 6 is used here. Artyom's observation is very keen.
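The 6-trail_size pattern can be tried out with the GetLowNBit helper from above. The wrapper name lead_payload is hypothetical and only combines the two pieces; it assumes trail_size is 1, 2 or 3.

```cpp
#include <cassert>
#include <cstdint>

std::uint8_t GetLowNBit(std::uint8_t x, std::uint8_t n) {
    return static_cast<std::uint8_t>(x & ((1 << n) - 1));
}

// Hypothetical wrapper: payload bits of a leading byte that is followed
// by trail_size continuation bytes (trail_size must be 1, 2 or 3).
std::uint8_t lead_payload(std::uint8_t lead, int trail_size) {
    return GetLowNBit(lead, static_cast<std::uint8_t>(6 - trail_size));
}
```

For example, the leading byte 0xE2 (1110 0010) with trail_size 2 keeps its low 4 bits, 0010, while 0xC3 (1100 0011) with trail_size 1 keeps its low 5 bits, 00011.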

After extracting the low bits of the leading byte, we need to extract the low bits of the continuation bytes, which is what the switch/case block does.

This is simple. tmp & 0x3F keeps the low 6 bits, because a continuation byte always starts with 10. Each time, c is shifted left by 6 bits and the 6 new bits are merged in with a bitwise OR.

We explained the switch/case without break above. For a three-byte sequence such as the € example, execution enters at case 2, so this step runs twice, extracting the low 6 bits of each of the two continuation bytes and merging them in.
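To make the intermediate values concrete, the three-byte case can be written out by hand. The helper name decode_3byte is hypothetical, and unlike the Boost function it performs no validation at all; the comments trace the bytes E2 82 AC of the € example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decodes exactly three bytes, without validation.
// The values in the comments are for the input 0xE2 0x82 0xAC.
std::uint32_t decode_3byte(unsigned char b0, unsigned char b1, unsigned char b2) {
    std::uint32_t c = b0 & 0x0F;   // low 4 bits of the leading byte: 0x2
    c = (c << 6) | (b1 & 0x3F);    // shift, then merge 6 bits: 0x82
    c = (c << 6) | (b2 & 0x3F);    // shift, then merge 6 bits: 0x20AC
    return c;
}
```

The result for E2 82 AC is 0x20AC, the code point the encoding example started from.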

UTF-8 decoding algorithm analysis is complete.
