Detecting UTF-8 with a regular expression (PHP version): thoughts after an encoding-detection problem
I recently ran into the following problem: an interface can receive input in either of two encodings, UTF-8 or GBK. Anyone who has worked with encodings knows that a string carries no flag saying which encoding it uses. UTF-8, however, has a distinctive byte structure, so it can be checked with a regular expression. The plan: if the input is detected as UTF-8, convert it; anything that is not UTF-8 is treated as GBK. For some common encoding problems, see: "Digging in from garbled web pages (BOM header, character set and garbled text)".
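The plan above can be sketched as follows. This is a minimal sketch, not the author's code: the function name `to_gbk` is hypothetical, and it assumes the mbstring extension is loaded and uses `mb_check_encoding` for the UTF-8 test rather than a regular expression.

```php
<?php
// Hypothetical helper: normalise input that may be UTF-8 or GBK to GBK.
// Assumes the mbstring extension is available.
function to_gbk(string $input): string
{
    // Anything that validates as UTF-8 is converted to GBK ...
    if (mb_check_encoding($input, 'UTF-8')) {
        return mb_convert_encoding($input, 'GBK', 'UTF-8');
    }
    // ... and everything else is assumed to be GBK already.
    return $input;
}
```

Pure ASCII input passes through unchanged, because ASCII bytes mean the same thing in both encodings.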
With that principle in mind, I took the task and started right away. PHP has the mbstring module, which can detect and convert encodings:
<?php
// This source file is saved as GBK
$str = "中国"; // "China"
$aStrList = array($str, iconv('gbk', 'utf-8', $str));
foreach ($aStrList as $v) {
    echo mb_convert_encoding($v, 'gbk', 'utf-8,gbk'), "\r\n";
}
Result: the two differently encoded copies of "China" (中国) are both converted to GBK by the single function mb_convert_encoding. It first tries to decode the input as UTF-8 and, if that fails, falls back to GBK. The problem seemed to be solved, haha ...
- Problem:
After the release, things were quiet for a few days. Then feedback suddenly arrived: one Chinese string, "Yuan small" in translation, was decoded incorrectly. I started to wonder: is there a problem with PHP's built-in detection, or is there something I am missing?
It does look like a real problem, so I checked the manual.
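This kind of failure can be reproduced with explicit bytes instead of a GBK-saved source file. The sketch below is an illustration, not the author's code. It assumes the mbstring extension and uses the two bytes D0 A1, which are the GBK encoding of 小 ("small") and also happen to form a well-formed two-byte UTF-8 sequence.

```php
<?php
// GBK bytes for 小; the same two bytes are also well-formed UTF-8 (U+0421).
$bytes = "\xD0\xA1";

// Validity is checked against each encoding independently.
$validAsUtf8 = mb_check_encoding($bytes, 'UTF-8');
$validAsGbk  = mb_check_encoding($bytes, 'GBK');

var_dump($validAsUtf8, $validAsGbk);
```

Both checks succeed, so a detector that tries UTF-8 first will pick UTF-8 for this GBK input.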
- Problem:
Could I write my own regular-expression check to see what is going on? To write the expression, you first need to understand the UTF-8 encoding specification; see http://zh.wikipedia.org/zh/UTF-8
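The encoding has only six length classes, one for each sequence length from 1 to 6 bytes (RFC 3629 later limited UTF-8 to 4 bytes, but the original six-class scheme is used here). This PHP script prints the byte range of each class: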
<?php
// Print the byte range of each UTF-8 length class, in hexadecimal
echo base_convert('1111111', 2, 16), "\r\n";                                         // class 1 (upper bound)
echo base_convert('10000000', 2, 16), '-', base_convert('10111111', 2, 16), "\r\n"; // continuation bytes
echo base_convert('11000000', 2, 16), '-', base_convert('11011111', 2, 16), "\r\n"; // class 2
echo base_convert('11100000', 2, 16), '-', base_convert('11101111', 2, 16), "\r\n"; // class 3
echo base_convert('11110000', 2, 16), '-', base_convert('11110111', 2, 16), "\r\n"; // class 4
echo base_convert('11111000', 2, 16), '-', base_convert('11111011', 2, 16), "\r\n"; // class 5
echo base_convert('11111100', 2, 16), '-', base_convert('11111101', 2, 16), "\r\n"; // class 6
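Result (in hexadecimal): single-byte characters end at 7f, continuation bytes span 80-bf, and the lead bytes of the 2- to 6-byte classes span c0-df, e0-ef, f0-f7, f8-fb and fc-fd.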
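These byte ranges can also be read the other way round, as a lookup from a lead byte to the length of the sequence it starts. The following is a minimal sketch under the original six-class scheme; the function name `utf8_lead_class` is hypothetical and not part of the author's code.

```php
<?php
// Hypothetical helper: map a byte value (0-255) to the length class of the
// UTF-8 sequence it would start, using the six-class ranges.
// Returns 0 for continuation bytes (80-bf) and for fe/ff, which start nothing.
function utf8_lead_class(int $byte): int
{
    if ($byte >= 0x00 && $byte <= 0x7F) return 1; // single byte (ASCII)
    if ($byte >= 0xC0 && $byte <= 0xDF) return 2;
    if ($byte >= 0xE0 && $byte <= 0xEF) return 3;
    if ($byte >= 0xF0 && $byte <= 0xF7) return 4;
    if ($byte >= 0xF8 && $byte <= 0xFB) return 5;
    if ($byte >= 0xFC && $byte <= 0xFD) return 6;
    return 0;
}
```

A caller would pass `ord($str[0])` to learn how many bytes the first character should occupy.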
- The six length classes above give the corresponding regular expression:
[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5}
Each alternative matches one of the six length classes, in order.
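<?php
// This source file is saved as GBK
$str = "袁"; // "Yuan"; the exact character is an assumption
echo urlencode($str);
echo is_utf8($str);

function is_utf8($str) {
    // UTF-8 detection by regular expression
    // copyright qq:8292669 http://www.cnblogs.com/chengmo
    // The listing is cut off after "$re = '". The pattern and return below are
    // reconstructed from the expression above and assume a whole-string match.
    $re = '/^([\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5})+$/';
    return preg_match($re, $str);
}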
The call returns 1 even though "Yuan" is GBK-encoded here, so this function still cannot reliably identify UTF-8. The reason follows from the ranges above: the six UTF-8 length classes cover 1 to 6 bytes per character, while GBK uses 1 or 2 bytes, so the two encodings can collide on 1-byte and 2-byte sequences. For a single byte, GBK and UTF-8 map the same byte to the same character, so the collision is harmless. For two bytes, the same byte pair stands for different characters in each encoding.
The GBK code table (http://www.knowsky.com/resource/gb2312tbl.htm) further confirms that the overlapping range is:
[C0-DF][A0-BF]. A string made up purely of Chinese characters from this range cannot be classified. Once the string also contains a non-ASCII character outside this range, it can usually be classified correctly.
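The undecidable case can at least be flagged. The following is a minimal sketch, assuming the overlap range [C0-DF][A0-BF] given in the text; the function name `is_ambiguous_gbk_utf8` is hypothetical and not part of the author's code.

```php
<?php
// Hypothetical helper: true when the whole string consists of byte pairs
// from the overlap range [C0-DF][A0-BF], so it passes as GBK and as
// two-byte UTF-8 alike. The pattern has no /u modifier, so it matches bytes.
function is_ambiguous_gbk_utf8(string $bytes): bool
{
    return preg_match('/^(?:[\xC0-\xDF][\xA0-\xBF])+\z/', $bytes) === 1;
}
```

A caller could use this to route ambiguous input to a fallback rule, such as a default encoding per client. That fallback is a suggestion and is not described in the original article.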
The GBK characters that fall into this overlap with UTF-8 can be looked up in the GBK code table.