Consider UTF-8 regular expressions (PHP version) when you run into encoding-detection problems

Source: Internet
Author: User

    • Cause:

Recently I ran into a case where an interface may receive input in either of two encodings: UTF-8 or GBK. Anyone who has worked with encodings knows that a string carries no marker saying which encoding it uses. UTF-8, however, has a distinctive byte structure, so it can be checked with a regular expression. The plan: if the input is detected as UTF-8, convert it; anything that is not UTF-8 is treated as GBK. For some common encoding problems, see: "Digging into garbled text in web programs (BOM header, character set, and garbled characters)".

    • Let's go:

With the principle clear, I took the task and got to work straight away. PHP has the mbstring module, which can detect an encoding and convert it:

  
    PHP//Current encoding is GBK$str = " China "; $aStrList =array ($str, Iconv ('gbk', 'utf-8', $str)); foreach ($aStrList as $v) {echo mb_convert_encoding ($v, 'gbk', 'utf-8,gbk'), "\ r \ n";}

Result: both differently encoded copies of the string are converted to GBK by the single function mb_convert_encoding. It first tries to decode the input as UTF-8 and, if that fails, falls back to GBK. The problem seemed to be solved.

    1. Problem:
A few quiet days after the release, feedback suddenly arrived: the Chinese string "Yuan small" was decoded incorrectly. Was there a problem with PHP's built-in detection module, or had I missed something? It did look like a real problem, so I checked the manual.

    1. Problem:
Could I write my own regular-expression check to see what is going on? To write the regular expression, you first need to understand the UTF-8 encoding specification; see: http://zh.wikipedia.org/zh/UTF-8

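The original UTF-8 scheme has only six sequence lengths, from 1 to 6 bytes (the current standard, RFC 3629, restricts valid sequences to at most 4 bytes). The following PHP code gets the byte range for each length: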

    <?php
    // Get the byte range of each UTF-8 sequence length
    echo base_convert('1111111', 2, 16), "\r\n";                                         // length 1
    echo base_convert('10000000', 2, 16), '-', base_convert('10111111', 2, 16), "\r\n"; // continuation bytes
    echo base_convert('11000000', 2, 16), '-', base_convert('11011111', 2, 16), "\r\n"; // length 2
    echo base_convert('11100000', 2, 16), '-', base_convert('11101111', 2, 16), "\r\n"; // length 3
    echo base_convert('11110000', 2, 16), '-', base_convert('11110111', 2, 16), "\r\n"; // length 4
    echo base_convert('11111000', 2, 16), '-', base_convert('11111011', 2, 16), "\r\n"; // length 5
    echo base_convert('11111100', 2, 16), '-', base_convert('11111101', 2, 16), "\r\n"; // length 6

Result:

    7f
    80-bf
    c0-df
    e0-ef
    f0-f7
    f8-fb
    fc-fd
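
The byte ranges computed by the script above can also be read as a lookup from a byte to the length of the sequence it starts. The following is only a sketch of that reading; the function name `utf8_lead_length` and its return conventions are hypothetical and do not come from the original article.

```php
<?php
// Hypothetical helper: map one byte value to the length of the sequence it
// starts under the six-length scheme described in the text.
// Returns 0 for a continuation byte (80-bf) and -1 for fe/ff, which start nothing.
function utf8_lead_length(int $byte): int
{
    if ($byte <= 0x7f) return 1; // 0xxxxxxx: single byte
    if ($byte <= 0xbf) return 0; // 10xxxxxx: continuation byte, not a lead
    if ($byte <= 0xdf) return 2; // 110xxxxx
    if ($byte <= 0xef) return 3; // 1110xxxx
    if ($byte <= 0xf7) return 4; // 11110xxx
    if ($byte <= 0xfb) return 5; // 111110xx
    if ($byte <= 0xfd) return 6; // 1111110x
    return -1;                   // fe, ff
}
```

For instance, `utf8_lead_length(0xE4)` gives 3, which matches the three-byte form that most Chinese characters take in UTF-8.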

    1. The six ranges above give the corresponding regular expression:

[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5}

The six alternatives correspond to sequence lengths 1 through 6, respectively.
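
To make the expression usable on a whole string, it has to be anchored and repeated. Here is a minimal sketch under that assumption; the name `looks_like_utf8` is hypothetical, and the byte strings in the usage note are written as escapes so the result does not depend on how the file itself is saved.

```php
<?php
// Sketch only: wrap the six alternatives in ^(?:...)+\z so that every byte
// of the input must belong to a well-formed sequence.
function looks_like_utf8(string $bytes): bool
{
    $re = '/^(?:[\x01-\x7f]'
        . '|[\xc0-\xdf][\x80-\xbf]'
        . '|[\xe0-\xef][\x80-\xbf]{2}'
        . '|[\xf0-\xf7][\x80-\xbf]{3}'
        . '|[\xf8-\xfb][\x80-\xbf]{4}'
        . '|[\xfc-\xfd][\x80-\xbf]{5})+\z/';
    // No /u modifier: the pattern is matched byte by byte.
    return preg_match($re, $bytes) === 1;
}
```

With this sketch, `looks_like_utf8("\xE4\xB8\xAD\xE5\x9B\xBD")` (the UTF-8 bytes of 中国) is true, while `looks_like_utf8("\xD6\xD0\xB9\xFA")` (the GBK bytes of the same two characters) is false, because D0 is not a valid continuation byte after D6.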

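    <?php
    // The current file encoding is GBK
    $str = "袁"; // rendered as "Yuan" in translation; the exact character is an assumption
    echo urlencode($str);
    echo is_utf8($str);

    // UTF-8 detection with a regular expression
    // copyright qq:8292669 http://www.cnblogs.com/chengmo
    function is_utf8($str)
    {
        // The listing is cut off after "$re = '"; the rest is reconstructed from the expression above.
        $re = '/^([\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\xf8-\xfb][\x80-\xbf]{4}|[\xfc-\xfd][\x80-\xbf]{5})+$/';
        return preg_match($re, $str);
    }
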
Run as above, the function returns 1, even though "Yuan" here is GBK-encoded. So this function still cannot reliably identify UTF-8. The reason follows from the ranges above: UTF-8 sequences are 1 to 6 bytes long, while GBK characters are 1 to 2 bytes long, so the two can overlap at lengths 1 and 2. For 1-byte characters GBK and UTF-8 agree on both the byte and the character it stands for, but for 2-byte sequences the same bytes stand for different characters.
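
The overlap can be made concrete by decoding a GBK byte pair as if it were a two-byte UTF-8 sequence. The sketch below uses the bytes D4 AC, which are the GBK encoding of 袁; that character is chosen purely as an illustration, and the helper name `decode_two_byte` is hypothetical.

```php
<?php
// Sketch: interpret two bytes as a 110xxxxx 10xxxxxx UTF-8 sequence and
// return the code point they would encode. No validation is performed.
function decode_two_byte(string $pair): int
{
    $lead = ord($pair[0]); // 110xxxxx: keep the low 5 bits
    $cont = ord($pair[1]); // 10xxxxxx: keep the low 6 bits
    return (($lead & 0x1f) << 6) | ($cont & 0x3f);
}

// D4 AC is a well-formed two-byte UTF-8 sequence, so a purely structural
// check accepts it even though the bytes were produced by a GBK encoder.
printf("U+%04X\n", decode_two_byte("\xD4\xAC")); // prints U+052C
```

The bytes are structurally valid in both encodings and simply name different characters, which is why a pattern-based check cannot reject them.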

Checking the GBK code table (http://www.knowsky.com/resource/gb2312tbl.htm) further confirms that the overlapping range is:

[C0-DF][A0-BF]

If a string consists purely of Chinese characters from this range (possibly alongside ASCII, which is identical in both encodings), the two encodings cannot be told apart. If it also contains characters from other GBK ranges, it can usually be judged correctly.

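
A caller that wants to know when the UTF-8 check cannot be trusted could test for exactly this range. What follows is a rough sketch under the assumptions stated in the text (lead byte C0-DF, trail byte A0-BF); the function name `in_gbk_utf8_overlap` is hypothetical.

```php
<?php
// Sketch: true when the input is made up only of ASCII bytes and
// [C0-DF][A0-BF] pairs, and contains at least one such pair. Such input is
// well-formed as both GBK and UTF-8, so a structural check cannot decide.
function in_gbk_utf8_overlap(string $bytes): bool
{
    $onlyOverlap = preg_match('/^(?:[\x01-\x7f]|[\xc0-\xdf][\xa0-\xbf])+\z/', $bytes) === 1;
    $hasPair     = preg_match('/[\xc0-\xdf][\xa0-\xbf]/', $bytes) === 1;
    return $onlyOverlap && $hasPair;
}
```

For example, `in_gbk_utf8_overlap("\xD4\xAC\xD0\xA1")` is true (two pairs inside the range), while `in_gbk_utf8_overlap("\xD6\xD0\xB9\xFA")` is false because the trail byte D0 lies outside A0-BF. When the function returns true, some other signal, such as a declared charset, is needed to choose between the two encodings.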

The GBK characters that overlap with the UTF-8 byte ranges are: (GBK code table)

