For more information about the UTF-8 encoding rules, see my article: http://blog.csdn.net/sheismylife/article/details/8570015
In my other article "UTF-8 encoding test" http://blog.csdn.net/sheismylife/article/details/8571726, I used the Boost.Locale library to decode UTF-8. Now I'll take a closer look at the decoding algorithm.
How can we distinguish a leading byte from a continuation byte? The key is that every continuation byte starts with the bits 10. The following function tells you whether a byte is a continuation byte:
bool is_trail(char ci) { unsigned char c = ci; return (c & 0xC0) == 0x80;}
0xC0 in binary is 1100 0000, so the bitwise AND clears the lower six bits of c and keeps only its two high bits.
0x80 in binary is 1000 0000. If the result equals 0x80, the two high bits of c are 10, so c is a continuation byte and the function returns true.
With this function, it is easy to test whether a byte can start a character, that is, whether it is anything other than a continuation byte:
bool is_lead(char ci) { return !is_trail(ci);}
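As a small illustration of what this test makes possible, here is a minimal sketch that counts the characters in a string by counting the bytes that are not continuation bytes. The helper count_lead_bytes is a hypothetical name for this example and is not part of Boost.Locale; the count is only meaningful if the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Same bit test as is_trail above: a continuation byte has the form 10xxxxxx.
inline bool is_trail(char ci) {
    unsigned char c = static_cast<unsigned char>(ci);
    return (c & 0xC0) == 0x80;
}

// Hypothetical helper (not in Boost.Locale): counts the bytes of s that are
// not continuation bytes. For well-formed UTF-8 this is the character count.
inline std::size_t count_lead_bytes(const std::string& s) {
    std::size_t n = 0;
    for (char ch : s) {
        if (!is_trail(ch)) ++n;
    }
    return n;
}
```

For the five-byte string "a", 0xE2 0x82 0xAC, "b" this counts three characters, because the two continuation bytes of the euro sign are skipped.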
Let's take a look at the trail_length function. It examines a leading byte to determine how many continuation bytes follow it.
int trail_length(char ci)
{
    unsigned char c = ci;
    if(c < 128)
        return 0;
    if(BOOST_LOCALE_UNLIKELY(c < 194))
        return -1;
    if(c < 224)
        return 1;
    if(c < 240)
        return 2;
    if(BOOST_LOCALE_LIKELY(c <=244))
        return 3;
    return -1;
}
If c < 128, it is an ASCII character, encoded in a single byte. The number of continuation bytes is therefore 0, and the total length of the UTF-8 encoding is 1.
11011111 is 223, so a leading byte of the form 110xxxxx is always below 224. The code accepts only the range [194,224) for this case: bytes 128 to 191 are continuation bytes, and 192 and 193 could only start overlong encodings of ASCII values, so trail_length returns -1 for all of them. In [194,224) there is 1 continuation byte, and the total length of the UTF-8 encoding is 2.
By the same reasoning, 11101111 is 239. In the range [224,240) there are 2 continuation bytes, and the total length of the UTF-8 encoding is 3.
11110111 is 247, so the pattern 11110xxx covers the range [240,248), with 3 continuation bytes and a total length of 4. However, the range accepted here is only [240,244]. The reason is that Unicode code points end at U+10FFFF (RFC 3629 restricts UTF-8 to that range), and the leading bytes 245 to 247 could only start sequences that decode to larger values.
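To make the byte ranges concrete, here is a minimal sketch of the same classification written with hexadecimal bounds and without the Boost branch-prediction macros. The names trail_length_sketch and min_code_point_for_lead4 are hypothetical, chosen for this example only. The second function shows one way to see where the 244 cutoff comes from: it computes the smallest value a four-byte sequence with a given leading byte could decode to.

```cpp
#include <cstdint>

// Sketch of the classification above, without BOOST_LOCALE_LIKELY/UNLIKELY.
inline int trail_length_sketch(unsigned char c) {
    if (c < 0x80) return 0;   // 0xxxxxxx: ASCII
    if (c < 0xC2) return -1;  // 10xxxxxx continuation bytes, plus 0xC0 and 0xC1
    if (c < 0xE0) return 1;   // 110xxxxx
    if (c < 0xF0) return 2;   // 1110xxxx
    if (c <= 0xF4) return 3;  // 11110xxx, capped at 0xF4
    return -1;                // 0xF5 and above
}

// Smallest value a 4-byte sequence starting with `lead` could decode to:
// the 3 payload bits of the leading byte, followed by 18 zero bits.
inline std::uint32_t min_code_point_for_lead4(unsigned char lead) {
    return static_cast<std::uint32_t>(lead & 0x07) << 18;
}
```

With a leading byte of 0xF4 the smallest result is 0x100000, which is still within the Unicode range ending at U+10FFFF. With 0xF5 it is already 0x140000, which is beyond it.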
Now let's take a look at the code in utf.hpp: http://www.boost.org/doc/libs/1_53_0/libs/locale/doc/html/utf_8hpp_source.html
00192 template<typename Iterator>
00193 static code_point decode(Iterator &p,Iterator e)
00194 {
00195     if(BOOST_LOCALE_UNLIKELY(p==e))
00196         return incomplete;
00197
00198     unsigned char lead = *p++;
00199
00200     // First byte is fully validated here
00201     int trail_size = trail_length(lead);
00202
00203     if(BOOST_LOCALE_UNLIKELY(trail_size < 0))
00204         return illegal;
00205
00206     //
00207     // Ok as only ASCII may be of size = 0
00208     // also optimize for ASCII text
00209     //
00210     if(trail_size == 0)
00211         return lead;
00212
00213     code_point c = lead & ((1<<(6-trail_size))-1);
00214
00215     // Read the rest
00216     unsigned char tmp;
00217     switch(trail_size) {
00218     case 3:
00219         if(BOOST_LOCALE_UNLIKELY(p==e))
00220             return incomplete;
00221         tmp = *p++;
00222         if (!is_trail(tmp))
00223             return illegal;
00224         c = (c << 6) | ( tmp & 0x3F);
00225     case 2:
00226         if(BOOST_LOCALE_UNLIKELY(p==e))
00227             return incomplete;
00228         tmp = *p++;
00229         if (!is_trail(tmp))
00230             return illegal;
00231         c = (c << 6) | ( tmp & 0x3F);
00232     case 1:
00233         if(BOOST_LOCALE_UNLIKELY(p==e))
00234             return incomplete;
00235         tmp = *p++;
00236         if (!is_trail(tmp))
00237             return illegal;
00238         c = (c << 6) | ( tmp & 0x3F);
00239     }
00240
00241     // Check code point validity: no surrogates and
00242     // valid range
00243     if(BOOST_LOCALE_UNLIKELY(!is_valid_codepoint(c)))
00244         return illegal;
00245
00246     // make sure it is the most compact representation
00247     if(BOOST_LOCALE_UNLIKELY(width(c)!=trail_size + 1))
00248         return illegal;
00249
00250     return c;
00251
00252 }
The decode function decodes one UTF-8 encoded character (1 to 4 bytes) and returns the corresponding code point, a uint32_t integer.
It always treats the first byte as the leading byte and uses trail_length to get the number of continuation bytes. If that number is 0, the byte is an ASCII character and is returned directly.
Lines 213 to 250 handle non-ASCII characters. First, pay attention to how switch/case is used: there are no break statements. That is to say, if trail_size is 3, the statements under case 3 are executed first, and then execution falls through to case 2 and case 1 in sequence, without any further matching. This fall-through is a standard feature of switch/case. Reference: http://msdn.microsoft.com/en-US/library/k0t5wee3.aspx
A loop would express this more clearly, and I don't know why Artyom chose this form. Is it for performance? A code scanning tool would certainly report the three duplicated blocks as a warning. :)
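The fall-through behaviour can be checked in isolation. The following is a minimal sketch with the same case layout as the decode function, where each case only increments a counter; the function name steps_executed is hypothetical.

```cpp
// Returns how many case bodies run when the switch is entered with trail_size.
inline int steps_executed(int trail_size) {
    int steps = 0;
    switch (trail_size) {
    case 3:
        ++steps;  // falls through
    case 2:
        ++steps;  // falls through
    case 1:
        ++steps;
    }
    return steps;
}
```

Entering at case 3 runs all three bodies, entering at case 2 runs two, and a value with no matching case runs none. This is why one continuation byte is read per case in decode.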
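For comparison, here is a rough sketch of how the continuation bytes could be read with a loop instead. This is not Boost.Locale's code: read_trail_bytes, kIllegal and kIncomplete are names assumed for this example, and the two sentinel values merely stand in for the library's own illegal and incomplete constants.

```cpp
#include <cstdint>

// Sentinel values assumed for this sketch only.
const std::uint32_t kIllegal = 0xFFFFFFFFu;
const std::uint32_t kIncomplete = 0xFFFFFFFEu;

// Reads trail_size continuation bytes from [p, e) and appends the 6 payload
// bits of each one to c. Returns the combined value or a sentinel on error.
template <typename Iterator>
std::uint32_t read_trail_bytes(Iterator& p, Iterator e,
                               std::uint32_t c, int trail_size) {
    for (int i = 0; i < trail_size; ++i) {
        if (p == e)
            return kIncomplete;           // input ended mid-character
        unsigned char tmp = static_cast<unsigned char>(*p++);
        if ((tmp & 0xC0) != 0x80)
            return kIllegal;              // not a 10xxxxxx byte
        c = (c << 6) | (tmp & 0x3F);
    }
    return c;
}
```

The loop body is exactly the block that appears three times in the switch. Whether the unrolled switch is actually faster would have to be measured.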
Now let's look at the Wikipedia example quoted in my previous article on UTF-8 encoding, which shows how the character € is encoded in UTF-8:
Step 1: Get the Unicode code point of €, which is U+20AC.
Step 2: U+20AC lies between U+0800 and U+FFFF, so it is encoded in three bytes.
Step 3: U+20AC in binary is 10000010101100, which is 14 bits long. A three-byte encoding carries 16 payload bits, so two leading zeros are added to pad it to 16 bits: 0010000010101100. Call this the bit string.
Step 4: By the rule, the leading byte starts with 1110, which leaves four bits to fill. Take the four highest bits of the bit string, giving the leading byte 11100010. The bit string is now 000010101100.
Step 5: The first continuation byte starts with 10, which leaves six bits to fill. Take the six highest bits of the bit string, giving the first continuation byte 10000010. The bit string is now 101100.
Step 6: The second continuation byte also starts with 10 and also needs six bits. Take the remaining six bits, giving the second continuation byte 10101100.
The final encoding is the three bytes 11100010 10000010 10101100, written in hexadecimal as 0xE282AC.
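The encoding steps for € can be written as a few shifts and masks. The following is a minimal sketch for the three-byte case only; encode3 is a hypothetical helper for this example, and it assumes the code point is in the range U+0800 to U+FFFF without checking it.

```cpp
#include <array>
#include <cstdint>

// Encodes a code point assumed to be in U+0800..U+FFFF as three UTF-8 bytes.
// No range or surrogate checks are done in this sketch.
inline std::array<std::uint8_t, 3> encode3(std::uint32_t cp) {
    return {{
        static_cast<std::uint8_t>(0xE0 | (cp >> 12)),          // 1110 + top 4 bits
        static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),  // 10 + middle 6 bits
        static_cast<std::uint8_t>(0x80 | (cp & 0x3F))          // 10 + low 6 bits
    }};
}
```

Calling encode3(0x20AC) yields the bytes 0xE2, 0x82 and 0xAC, matching the result of the manual steps.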
Decoding is the inverse of encoding. For this three-byte sequence, it extracts the low 4 bits of the leading byte and the low 6 bits of each of the two continuation bytes, concatenates them into a 16-bit value, and stores the result as a uint32_t integer.
The following line extracts the low bits of the leading byte (four bits in the three-byte case):
code_point c = lead & ((1<<(6-trail_size))-1);
This is a useful technique, and it can be generalized into a function that takes two parameters: the value x to extract bits from, and the number n of low bits to extract.
uint8_t GetLowNBit(uint8_t x, uint8_t n) { return x & ((1<<n)-1);}
Why 6-trail_size? It is simply a pattern that can be observed in the encoding.
In a 2-byte UTF-8 encoding, the leading byte starts with 110, so we need to extract the low 5 bits: 6-trail_size = 6-1 = 5, which is exactly the 5 bits we want.
In a 3-byte UTF-8 encoding, the leading byte starts with 1110, so we need to extract the low 4 bits: 6-trail_size = 6-2 = 4.
In a 4-byte UTF-8 encoding, the leading byte starts with 11110, so we need to extract the low 3 bits: 6-trail_size = 6-3 = 3.
That is why 6 is used here. Artyom's observation is very keen.
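The 6-trail_size rule is easy to check on real leading bytes. Here is a minimal sketch; low_bits and lead_payload are hypothetical names for this example, and lead_payload assumes trail_size is 1, 2 or 3.

```cpp
#include <cstdint>

// Keeps the low n bits of x (the same idea as GetLowNBit), for n in 0..8.
inline std::uint8_t low_bits(std::uint8_t x, unsigned n) {
    return static_cast<std::uint8_t>(x & ((1u << n) - 1u));
}

// Payload bits of a leading byte that is followed by trail_size
// continuation bytes (assumed to be 1, 2 or 3).
inline std::uint8_t lead_payload(std::uint8_t lead, int trail_size) {
    return low_bits(lead, static_cast<unsigned>(6 - trail_size));
}
```

For example, the leading byte 0xE2 (11100010) with trail_size 2 gives a mask of 0x0F and a payload of 0x02, which is the value the decode function starts from for the € character.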
After extracting the low bits of the leading byte, we need to extract the low bits of the continuation bytes. That is the job of the switch/case statement.
This part is simple. tmp & 0x3F retrieves the low 6 bits, because a continuation byte always starts with 10. Each time 6 bits are extracted, c is shifted left by 6 bits and the new bits are merged in with a bitwise OR.
We have already explained how switch/case behaves without break. For the three-byte € example, the block runs twice (case 2, then case 1), extracting the low 6 bits of each of the two continuation bytes and merging them into c.
This completes the analysis of the UTF-8 decoding algorithm.
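To follow the numbers through, here is a minimal sketch that decodes one three-byte sequence with the same mask, shift and OR steps. decode3 is a hypothetical helper for this example; it does none of the validation that the real decode function performs. The comments trace the values for the bytes 0xE2 0x82 0xAC.

```cpp
#include <cstdint>

// Decodes a 3-byte sequence assumed to be well-formed. No validation is done.
inline std::uint32_t decode3(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2) {
    std::uint32_t c = b0 & 0x0F;   // low 4 bits of the leading byte: 0x02
    c = (c << 6) | (b1 & 0x3F);    // append 6 bits of 0x82: c becomes 0x82
    c = (c << 6) | (b2 & 0x3F);    // append 6 bits of 0xAC: c becomes 0x20AC
    return c;
}
```

The result for these three bytes is 0x20AC, the code point of € that the encoding example started from.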