1. Matching Chinese characters loosely with metacharacters. A pattern such as /.*?/s can match a run of Chinese text, and this works in program code saved as either ANSI (gb2312) or UTF-8. However, \w does not match Chinese characters. I once read in the book "Proficient in Regular Expressions" (edited by SHA Jin, People's Posts and Telecommunications Press) that "\w" can be used to match Chinese characters. I would like to correct that here: at least in PHP, it does not work. You can instead use patterns such as "/./", "/[^\d]/", or "/[^a]/" to match Chinese characters.
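As a rough sketch of the loose matching described above, the following assumes the script is saved as UTF-8. The variable names and sample strings are illustrative and do not come from the original article.

```php
<?php
// Assumption: this file is saved as UTF-8.
$subject = '123中文456';

// [^\d]+ matches any run of non-digit bytes, which here is the Chinese text.
preg_match('/[^\d]+/', $subject, $nonDigits);
echo $nonDigits[0], "\n"; // 中文

// .*? with the s modifier also crosses line breaks inside the brackets.
preg_match('/\[(.*?)\]/s', "[中文\n字符]", $bracketed);
echo $bracketed[1], "\n";
```

Patterns like these match Chinese text only because they match almost anything, so they say nothing about whether the matched text actually is Chinese.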
2. If you want to match Chinese characters precisely, that is, match only Chinese characters, or match Chinese characters together with full-width punctuation, you need different methods for different encoding environments. The following covers the two commonly used encodings, gb2312 and UTF-8:
In the ANSI (gb2312) environment, you can match with a byte range of the form [chr(0xnn)-chr(0xmm)]. For example, one article uses "/[".chr(0xb0)."-".chr(0xf7)."]+/". This works, but it is too broad: the expression matches all characters in the gb2312 code table, including Chinese characters, punctuation marks, Japanese hiragana, and various other symbols. The code table shows that Chinese characters occupy the range 0xb0a1-0xf7fe, and that gb2312 uses two bytes per character, each with its highest bit set to 1. You can therefore write a regular expression that matches only Chinese characters:
"/([".chr(0xb0)."-".chr(0xf7)."][".chr(0xa1)."-".chr(0xfe)."])/"
This expression matches one Chinese character, and a quantifier can easily be added to match more than one.
In addition, if you want to match full-width punctuation without matching Chinese characters, you can write:
"/([".chr(0xa1)."-".chr(0xa3)."][".chr(0xa1)."-".chr(0xff)."])/"
This matches the symbols in the encoding range 0xa1a1-0xa3ff. Other ranges can be handled in the same way.
3. Next, Chinese matching in the UTF-8 environment. As above, you can consult the Unicode code table to determine the range. The table shows that Chinese characters occupy 0x4e00-0x9fa5, so the regular expression can be written as:
"/[\x{4e00}-\x{9fa5}]/u"
Here \x{nnnn} gives the character's code point in hexadecimal; see the PHP manual for details. Note the pattern modifier u. The PHP manual describes u (PCRE_UTF8) as follows: this modifier turns on additional functionality of PCRE that is incompatible with Perl, and pattern strings are treated as UTF-8. It is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on win32, and the UTF-8 validity of the pattern is checked from PHP 4.3.5. This is exactly what correct matching requires. I would also suggest, purely from experience, adding the u modifier whenever you match strings with metacharacters in a UTF-8 environment.
Here are two examples:
(1) ANSI programming environment:
$ Strtest = "yyg Chinese Character yyg ";
$ Pregstr = "/([". chr (0xb0 ). "-". chr (0xf7 ). "] [". chr (0xa1 ). "-". chr (0xfe ). "]) +/I ";
If (preg_match ($ pregstr, $ strtest, $ matchArray )){
Echo $ matchArray [0];
}
// Output: Chinese Characters
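The gb2312 byte ranges from point 2 can also be tried without saving the script in gb2312. The sketch below builds the subject from raw bytes: 中 is 0xD6D0, 文 is 0xCEC4, and the full-width comma is 0xA3AC in gb2312. The variable names are hypothetical, and the patterns use PCRE's \xNN byte escapes in place of chr(), which should be equivalent here.

```php
<?php
// Raw gb2312 bytes: "ab" + 中 (D6D0) + 文 (CEC4) + full-width comma (A3AC) + "cd".
$subject = "ab\xD6\xD0\xCE\xC4\xA3\xACcd";

// Lead byte 0xB0-0xF7 followed by trail byte 0xA1-0xFE: gb2312 Chinese characters.
$hanziPattern = '/(?:[\xB0-\xF7][\xA1-\xFE])+/';

// Lead byte 0xA1-0xA3: the symbol and full-width punctuation range from the text.
$punctPattern = '/[\xA1-\xA3][\xA1-\xFF]/';

preg_match($hanziPattern, $subject, $hanzi);
preg_match($punctPattern, $subject, $punct);

echo bin2hex($hanzi[0]), "\n"; // d6d0cec4
echo bin2hex($punct[0]), "\n"; // a3ac
```

One caveat: these patterns work on bytes and know nothing about character boundaries, so an unanchored match can start on the second byte of a character. Treat this as a sketch, not a robust gb2312 parser.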
(2) Utf-8 programming environment:
$ Strtest = "yyg Chinese Character yyg ";
$ Pregstr = "/[\ x {4e00}-\ x {9fa5}] +/u ";
If (preg_match ($ pregstr, $ strtest, $ matchArray )){
Echo $ matchArray [0];
}
// Output: Chinese Characters
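To illustrate the UTF-8 range and the effect of the u modifier discussed in point 3, here is a minimal sketch. It assumes the script is saved as UTF-8; the sample sentence and variable names are made up for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8.
$subject = 'PHP 正则 matches 中文 text';

// Collect every run of characters in the range U+4E00..U+9FA5.
preg_match_all('/[\x{4e00}-\x{9fa5}]+/u', $subject, $runs);
print_r($runs[0]); // two runs: 正则 and 中文

// Effect of the u modifier on the "." metacharacter.
preg_match('/./', '中', $byteMatch);  // no u: "." matches a single byte
preg_match('/./u', '中', $charMatch); // with u: "." matches the whole character

echo strlen($byteMatch[0]), "\n"; // 1
echo strlen($charMatch[0]), "\n"; // 3
```

Without u, the "." match returns only the first byte of the three-byte character, which is why the modifier matters when metacharacters are applied to UTF-8 strings.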
Author: zdrjlamp