Detecting utf8 encoding with regular expressions (PHP version)

    • Cause:

Recently I ran into an interface that can receive input in two encodings: UTF-8 and GBK. Anyone who has done encoding conversion knows that a string carries no marker identifying its encoding. UTF-8, however, has a distinctive byte structure, so it can be checked with a regular expression: if the input is recognized as UTF-8, handle it as UTF-8; otherwise, process it as GBK. For more background on encodings, see: "Digging in from garbled text in web programs (BOM header, character set, and garbled text)"

    • Action:

I got the task and started on it right away. PHP has the mbstring module, which can be used for encoding detection and conversion:

 
<?php
// This file is saved as GBK
$str = "中国";
$aStrList = array($str, iconv('GBK', 'UTF-8', $str));
foreach ($aStrList as $v) {
    // Try UTF-8 first, then GBK, and convert the result to GBK
    echo mb_convert_encoding($v, 'GBK', 'UTF-8, GBK'), "\r\n";
}
 
 
 
Running result: both encodings of the string are converted to GBK by the mb_convert_encoding function. It first tries to decode the input as UTF-8 and, if that fails, falls back to GBK for transcoding. The problem seemed to be solved, so I called it a day.
 
 
    • Problem:
 
A few quiet days after the release, I suddenly received a report that the Chinese string "Yuan Xiao" was decoded incorrectly. Was there a problem with PHP's built-in detection module, or had I missed something?
 
 
 
It did look like a real problem, so I checked the manual: the encoding check in the mbstring module only examines the bytes of the string, and if they fit a character set, the string is considered to be in that encoding. This is not a bug in mbstring: the string itself carries no encoding identifier, so no language can detect the encoding with complete certainty.
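The ambiguity can be illustrated with a minimal sketch. It assumes the two-byte sequence D4 AA (the GBK bytes of the character 元, chosen here only as a sample) and two hypothetical helpers that are not part of the original code: `is_valid_utf8`, which relies on PCRE's `/u` modifier, and `is_plausible_gbk`, a rough structural check for GBK byte pairs.

```php
<?php
// Structural UTF-8 check: with the /u modifier, preg_match() returns false
// when the subject is not valid UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: rough structural check for GBK
// (single bytes 00-7F, or a lead byte 81-FE followed by a trail byte 40-7E / 80-FE).
function is_plausible_gbk(string $bytes): bool
{
    return preg_match('/^(?:[\x00-\x7f]|[\x81-\xfe][\x40-\x7e\x80-\xfe])*$/', $bytes) === 1;
}

$sample = "\xD4\xAA"; // assumed sample: GBK bytes of the character 元

var_dump(is_valid_utf8($sample));    // bool(true)
var_dump(is_plausible_gbk($sample)); // bool(true)
```

Both checks accept the same two bytes, so a detector that only looks at byte structure has no basis for preferring one encoding over the other.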
 
 
    • Solution:
 
Can we write our own regular expression to do the check? To do that, we first need to understand the utf8 encoding specification; see: http://zh.wikipedia.org/zh/UTF-8

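The utf8 encoding scheme has only the following six length classes, from 1 to 6 bytes per character (RFC 3629 later restricted UTF-8 to at most 4 bytes). The following PHP code computes the byte range of each class: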

<?php
// Obtain the byte ranges of the utf8 length classes
echo base_convert('1111111', 2, 16), "\r\n";                                   // class 1 (single byte, upper bound)
echo base_convert('10000000', 2, 16), base_convert('10111111', 2, 16), "\r\n"; // continuation bytes
echo base_convert('11000000', 2, 16), base_convert('11011111', 2, 16), "\r\n"; // class 2 lead bytes
echo base_convert('11100000', 2, 16), base_convert('11101111', 2, 16), "\r\n"; // class 3 lead bytes
echo base_convert('11110000', 2, 16), base_convert('11110111', 2, 16), "\r\n"; // class 4 lead bytes
echo base_convert('11111000', 2, 16), base_convert('11111011', 2, 16), "\r\n"; // class 5 lead bytes
echo base_convert('11111100', 2, 16), base_convert('11111101', 2, 16), "\r\n"; // class 6 lead bytes

Running result:

7f
80bf
c0df
e0ef
f0f7
f8fb
fcfd
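As a rough illustration of how these ranges are used, the lead byte alone announces how many bytes a character occupies. The helper name `utf8_lead_length` below is hypothetical and only sketches the six-class scheme described here.

```php
<?php
// Hypothetical helper: returns the sequence length announced by a lead byte
// under the six-class scheme, or 0 for a continuation byte (80-BF) or an
// unused value (FE, FF).
function utf8_lead_length(int $byte): int
{
    if ($byte <= 0x7f) return 1; // 0xxxxxxx
    if ($byte <= 0xbf) return 0; // 10xxxxxx, continuation byte
    if ($byte <= 0xdf) return 2; // 110xxxxx
    if ($byte <= 0xef) return 3; // 1110xxxx
    if ($byte <= 0xf7) return 4; // 11110xxx
    if ($byte <= 0xfb) return 5; // 111110xx
    if ($byte <= 0xfd) return 6; // 1111110x
    return 0;                    // FE and FF are never used
}

foreach ([0x41, 0xd4, 0xe5, 0xaa] as $b) {
    printf("%02x => %d\n", $b, utf8_lead_length($b));
}
// 41 => 1
// d4 => 2
// e5 => 3
// aa => 0
```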

From the six classes above, we get the corresponding regular expression:

[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5}

Each alternative covers the byte range of one class.

<?php
// This file is saved as GBK
$str = "Yuan"; // a GBK-encoded Chinese character in the original (romanized here)
echo urlencode($str);
echo is_utf8($str);

function is_utf8($str)
{
    // utf8 encoding regular expression detection function
    // Copyright QQ: 8292669 http://www.cnblogs.com/chengmo
    $re = '/^([\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5})+$/';
    return preg_match($re, $str);
}
 
The result returned is 1, even though "Yuan" is actually GBK encoded, so this function still cannot reliably detect utf8. Let's analyze why. As the regular expression shows, a utf8 character in the six classes is 1 to 6 bytes long, while a GBK character is 1 to 2 bytes long, so the two encodings can only coincide in 1-byte and 2-byte sequences. For a single byte, GBK and utf8 map the same code to the same character; for two bytes, the same code maps to different characters.
 
 
 
Looking up the GBK encoding table at http://www.knowsky.com/resource/gb2312tbl.htm further confirms that the Chinese characters in the range [c0-df][a0-bf] are the problematic ones. If a string consists only of Chinese characters from this range, its encoding cannot be determined. If it also contains characters from outside the range, it is detected correctly.
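A small sketch of both cases, under the assumption that the sample bytes are D4 AA (GBK 元, inside the overlap) and D6 D0 (GBK 中, trail byte outside 80-BF). The function name `matches_utf8_pattern` is hypothetical; it wraps the six-class pattern from this article.

```php
<?php
// The six-class pattern from the article, wrapped in a hypothetical helper.
function matches_utf8_pattern(string $bytes): bool
{
    $re = '/^([\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}'
        . '|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}'
        . '|[\xfc-\xfd][\x80-\xbf]{5})+$/';
    return preg_match($re, $bytes) === 1;
}

$yuan  = "\xD4\xAA"; // assumed sample, GBK 元: lead D4, trail AA, inside the overlap
$zhong = "\xD6\xD0"; // assumed sample, GBK 中: trail D0 is outside 80-BF

var_dump(matches_utf8_pattern($yuan));          // bool(true): GBK bytes mistaken for utf8
var_dump(matches_utf8_pattern($yuan . $zhong)); // bool(false): falls through to GBK handling
```

A single character from outside the overlap is enough to make the whole string fail the utf8 pattern, which is why only strings made purely of overlap characters are misdetected.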
 
 
The GBK and utf8 encodings overlap on the following characters (GBK encoding table, rows c0a0 through ddb0, columns +0 to +F): the Chinese characters whose lead byte is in c0-df and whose trail byte is in a0-bf.

A string made of any combination of Chinese characters from these ranges also matches the utf8 pattern, so it is wrongly decoded as utf8; this is the root cause of the utf8 decoding failures. To check for utf8 reliably, we need to exclude this interference.

Final PHP code:

<?php
// This file is saved as GBK. The function distinguishes between GBK and utf8.
$str = "Yuan Xiao"; // GBK-encoded Chinese characters in the original (romanized here)
echo checkutf8($str);
echo checkutf8(iconv('GBK', 'UTF-8', $str));
$str = "Zookeeper";
echo checkutf8($str);
echo checkutf8(iconv('GBK', 'UTF-8', $str));

/**
 * Check whether the string is utf8 encoded
 *
 * @param string $str    input string
 * @param int    $extzh  exclude the overlapping Chinese characters (1 = exclude)
 * @return int 1 = utf8, 0 = not utf8
 */
function checkutf8($str, $extzh = 1)
{
    // utf8 encoding regular expression detection function
    // Copyright QQ: 8292669
    // Author CHENG Mo http://www.cnblogs.com/chengmo

    // GBK and utf8 overlap in the range [c0-df][a0-bf]. Read as utf8, these
    // bytes give characters that have no counterpart in GBK, so converting
    // them to GBK produces "?".
    if ($extzh == 1) {
        $re = '/^([\x01-\x7f]|[\xc0-\xdf][\xa0-\xbf])+$/';
        // If the whole string lies in the shared range, treating it as utf8
        // would produce "?" when converting to GBK, so report "not utf8".
        if (preg_match($re, $str)) {
            return 0; // not utf8
        }
    }
    $re = '/^([\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5})+$/';
    return preg_match($re, $str);
}

This is a compromise. Chinese programs mostly use two encodings, GBK and utf8, and the method above basically solves the problem: it avoids mistaking GBK for utf8 and then getting "?" when transcoding from utf8 to GBK. If you know a better method, please get in touch!
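One possible refinement, offered only as a sketch and not as the author's method: a stricter pattern that follows the well-formedness rules of RFC 3629 (at most 4 bytes, no overlong forms, no surrogates). The function name `is_strict_utf8` is hypothetical. It narrows the overlap with GBK slightly, because lead bytes C0 and C1 are rejected, but it does not remove it, so the exclusion step above is still needed.

```php
<?php
// Hypothetical helper: byte-level check for well-formed UTF-8 as defined by
// RFC 3629. This is a swapped-in stricter pattern, not the six-class one.
function is_strict_utf8(string $bytes): bool
{
    $re = '/^(?:'
        . '[\x00-\x7f]'                        // ASCII
        . '|[\xc2-\xdf][\x80-\xbf]'            // 2 bytes, overlong C0 and C1 excluded
        . '|\xe0[\xa0-\xbf][\x80-\xbf]'        // 3 bytes, overlong forms excluded
        . '|[\xe1-\xec\xee\xef][\x80-\xbf]{2}' // 3 bytes, general case
        . '|\xed[\x80-\x9f][\x80-\xbf]'        // 3 bytes, surrogates excluded
        . '|\xf0[\x90-\xbf][\x80-\xbf]{2}'     // 4 bytes, overlong forms excluded
        . '|[\xf1-\xf3][\x80-\xbf]{3}'         // 4 bytes, general case
        . '|\xf4[\x80-\x8f][\x80-\xbf]{2}'     // 4 bytes, up to U+10FFFF
        . ')*\z/';
    return preg_match($re, $bytes) === 1;
}

var_dump(is_strict_utf8("\xC0\xAF"));     // bool(false): overlong form rejected
var_dump(is_strict_utf8("\xE5\x85\x83")); // bool(true):  utf8 bytes of 元
var_dump(is_strict_utf8("\xD4\xAA"));     // bool(true):  GBK 元 still passes
```

For very long inputs a repeated group like this can run into PCRE's backtracking limits, so the sketch is best suited to short strings such as request parameters.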

Author: chengmo QQ: 8292669
Source: http://www.cnblogs.com/chengmo
The copyright of this article is shared by the author and the blog. You are welcome to repost it. Please add the original article link.

 
 
 
 
 
