std::string: Implementing substr for Multi-Byte Character Sets
Yesterday I wrote "A cross-platform (PC, Android, iOS, WP) encoding/decoding method using multi-byte character sets" and mentioned that the server uses std::string to process strings. std::string's support for multi-byte character sets is incomplete: its member functions work on bytes and do not directly support multi-byte encodings.
For example, calling std::string's substr function directly can, in some cases, leave an invalid character at the end of the extracted string.
Basic facts about the GB family of multi-byte character sets:
In the VC environment, when the project is set to use a multi-byte character set, GBK encoding is used by default. GB2312, GBK, and GB18030 are all Chinese encodings, and each later one is backward compatible with the earlier ones.
1. GB2312 contains more than 7000 Chinese characters and symbols, GBK contains more than 21000 characters, and GB18030 contains more than 27000 characters.
2. Chinese characters in GBK are encoded as two bytes, and English characters are encoded in ASCII, that is, as a single byte.
3. The GBK code table also contains double-byte forms of the English letters, so an English letter can have two GBK representations.
4. In GBK, the highest bit of the first byte of a Chinese character is 1, and the highest bit of a single-byte English character is 0.
5. When decoding GBK, if the highest bit of a byte is 0, that byte is decoded with the ASCII table. If the highest bit is 1, that byte and the one after it are decoded together with the GBK table.
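Rules 4 and 5 can be sketched as a small scanner. This is only a minimal illustration, not code from the server: the names IsGbkLeadByte and CountGbkChars are hypothetical, and the sketch relies solely on the high-bit rule above without checking that a byte pair is really present in the GBK table.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true when the byte has its highest bit set,
// i.e. it starts a double-byte GBK character (rule 4).
inline bool IsGbkLeadByte(unsigned char b) {
    return (b & 0x80) != 0;
}

// Hypothetical helper: counts characters (not bytes) in a GBK string
// by walking it the way a decoder would (rule 5).
std::size_t CountGbkChars(const std::string& s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (IsGbkLeadByte(b) && i + 1 < s.size()) {
            i += 2;  // lead byte plus its trailing byte
        } else {
            i += 1;  // ASCII byte, or a lead byte with nothing after it
        }
        ++count;
    }
    return count;
}
```

For the four bytes 'A', 0xD6, 0xD0, 'B', the function counts three characters, because 0xD6 0xD0 is consumed as one double-byte character.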
These five points explain why std::string's substr can produce an invalid character at the end of its result: substr only counts bytes and knows nothing about the multi-byte encoding, so it can cut a double-byte character in half.
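The effect can be reproduced with a short check. The sketch below is an illustration under the same high-bit assumption as the rules above; HasDanglingLeadByte is a hypothetical name, not a standard library function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` ends with the first byte of a
// double-byte character whose second byte is missing.
bool HasDanglingLeadByte(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0x80) == 0) {
            i += 1;        // single-byte ASCII character
        } else if (i + 1 < s.size()) {
            i += 2;        // complete double-byte character
        } else {
            return true;   // lead byte is the last byte of the string
        }
    }
    return false;
}
```

Take a string made of 'A' followed by the GBK byte pairs 0xD6 0xD0 and 0xCE 0xC4 (five bytes, three characters). The whole string passes the check, but substr(0, 2) and substr(0, 4) both end in the middle of a double-byte character, so HasDanglingLeadByte returns true for them. substr(0, 3) ends on a character boundary and is fine.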
On iOS, NSString initialization fails for strings truncated by substr in this way, while the String type on Android tolerates the invalid character.
To solve this platform compatibility problem completely, you have to implement the truncation function yourself:
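#include <stddef.h>  /* NULL */

/* Returns the largest byte count <= iLeft that does not split a
   double-byte GBK character in s. */
int GbkSubString(const char *s, int iLeft)
{
    int len = 0, i = 0;
    if( s == NULL || *s == 0 || iLeft <= 0 ) return(0);
    while( *s )
    {
        if( (*s & 0x80) == 0 )   /* highest bit 0: single-byte ASCII */
        {
            i ++; s ++; len ++;
        }
        else                     /* highest bit 1: double-byte character */
        {
            if( *(s + 1) == 0 ) break;   /* second byte missing: stop */
            i += 2; s += 2; len += 2;
        }
        if( i == iLeft ) break;
        else if( i > iLeft ) { len -= 2; break; }   /* overshot: drop the last character */
    }
    return(len);
}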
First call GbkSubString to compute a safe length, then call substr with the exact length it returns.
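The two steps can be wrapped into one call. The following is a sketch rather than the server's actual code: GbkSafePrefix is a hypothetical name, and GbkSubString is repeated from above only so that the block compiles on its own.

```cpp
#include <cstddef>
#include <string>

// Same logic as the GbkSubString function shown above.
int GbkSubString(const char* s, int iLeft) {
    int len = 0, i = 0;
    if (s == NULL || *s == 0 || iLeft <= 0) return 0;
    while (*s) {
        if ((*s & 0x80) == 0) {           // single-byte ASCII
            i++; s++; len++;
        } else {                          // double-byte character
            if (*(s + 1) == 0) break;     // second byte missing
            i += 2; s += 2; len += 2;
        }
        if (i == iLeft) break;
        else if (i > iLeft) { len -= 2; break; }
    }
    return len;
}

// Hypothetical wrapper: returns at most maxBytes bytes from the start of s
// without cutting a double-byte character in half.
std::string GbkSafePrefix(const std::string& s, int maxBytes) {
    const int len = GbkSubString(s.c_str(), maxBytes);
    return s.substr(0, static_cast<std::size_t>(len));
}
```

For the five-byte string 'A', 0xD6 0xD0, 0xCE 0xC4, a limit of 2 yields just "A" and a limit of 4 yields the first three bytes, so the result always ends on a character boundary. Because the wrapper goes through c_str(), it treats an embedded zero byte as the end of the string.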
Keeping a record, to become a better self!