In C++, char* and std::string store a plain byte stream, that is, sizeof(char) == 1.
In other words, C++ strings are not encoding-aware.
A legal UTF-8 character may occupy 1 to 4 bytes.
Given an input string in UTF-8 encoding, how can I locate the boundary of each UTF-8 character without splitting a character in the wrong place?
Refer to this page: http://www.nubaria.com/en/blog/?p=289
You can use the following function:
#include <string>

const unsigned char kFirstBitMask  = 128; // 10000000
const unsigned char kSecondBitMask = 64;  // 01000000 (unused here, kept from the linked article)
const unsigned char kThirdBitMask  = 32;  // 00100000
const unsigned char kFourthBitMask = 16;  // 00010000
const unsigned char kFifthBitMask  = 8;   // 00001000 (unused here, kept from the linked article)

int utf8_char_len(char firstByte) {
    std::string::difference_type offset = 1;
    if (firstByte & kFirstBitMask) {            // above 127, so beyond the ASCII range
        if (firstByte & kThirdBitMask) {        // lead byte is at least 224: a three- or four-octet code point
            if (firstByte & kFourthBitMask)     // lead byte is at least 240: a four-octet code point
                offset = 4;
            else
                offset = 3;
        } else {
            offset = 2;                         // otherwise a two-octet code point
        }
    }
    return offset;
}
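For reference, here is a minimal sketch (not from the linked page; the sample text is just illustrative) of how utf8_char_len above might be used to walk a UTF-8 std::string one character at a time. It assumes the input is well-formed UTF-8 and simply advances by the length reported for each lead byte.

#include <cstddef>
#include <iostream>
#include <string>

int main() {
    // Explicit UTF-8 bytes for "a", "é", "中" and an emoji (1, 2, 3 and 4 bytes respectively).
    const std::string text = "a\xC3\xA9\xE4\xB8\xAD\xF0\x9F\x98\x80";

    for (std::size_t i = 0; i < text.size(); ) {
        int len = utf8_char_len(text[i]);                 // byte length of the code point starting here
        std::cout << text.substr(i, len) << " -> " << len << " byte(s)\n";
        i += len;                                         // jump to the next lead byte, never mid-character
    }
    return 0;
}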