Improving PHP Regular Expressions for Matching Chinese Characters

Last Update:2013-11-22 Source: Internet

Author: User

1. To match Chinese characters loosely, use metacharacters. For example, /.*?/s can match a run of Chinese text between other parts of a pattern, and it works in program code under both ANSI (gb2312) and UTF-8. However, \w does not match Chinese characters. The book "Proficient in Regular Expressions" (edited by Sha Jin, People's Posts and Telecommunications Publishing House) says that "\w" can be used to match Chinese characters. I want to correct that here: at least in PHP, when the pattern has no u modifier, it does not. Patterns such as "/./", "/[^\d]/", and "/[^a]/" do match Chinese characters, although without the u modifier each of them consumes a single byte rather than a whole character.
2. To match Chinese characters precisely, that is, to match only Chinese characters, or Chinese characters together with full-width punctuation, you need different methods for different encodings. The following covers the two most common encodings, gb2312 and UTF-8.
In the ANSI (gb2312) environment, you can match with a byte range of the form [chr(0xnn)-chr(0xmm)]. For example, one article uses "/[".chr(0xb0)."-".chr(0xf7)."]+/". This works, but it is too general: the class is applied to single bytes, so besides Chinese characters it can also match bytes that belong to punctuation marks, Japanese hiragana, and other symbols in the gb2312 table. The encoding table shows that Chinese characters occupy the range 0xb0a1-0xf7fe, and that gb2312 uses two bytes per character, each with its highest bit set to 1. A regular expression that matches Chinese characters can therefore be written as:
"/([".chr(0xb0)."-".chr(0xf7)."][".chr(0xa1)."-".chr(0xfe)."])/"
This expression matches one Chinese character, and a quantifier can easily be added to match several.
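As a minimal sketch of the two-byte pattern, the example below writes the gb2312 bytes as hex escapes so that it does not depend on how the script file is saved. The sample bytes (D6D0 CEC4, the gb2312 encoding of 中文) and the variable names are chosen only for illustration.

```php
<?php
// "yyg" + 中 (D6 D0) + 文 (CE C4) + "yyg", written as raw gb2312 bytes.
$gbText = "yyg\xD6\xD0\xCE\xC4yyg";

// Lead byte 0xB0-0xF7, trail byte 0xA1-0xFE.
// There is no u modifier, so PCRE compares single bytes.
$hanzi = '/(?:[\xB0-\xF7][\xA1-\xFE])+/';

if (preg_match($hanzi, $gbText, $m)) {
    echo bin2hex($m[0]), "\n"; // d6d0cec4
}
```

The \x.. escapes inside the single-quoted pattern are interpreted by PCRE itself, which has the same effect as building the character class with chr().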
In addition, if you want to match full-width punctuation without matching Chinese characters, you can write:
"/([".chr(0xa1)."-".chr(0xa3)."][".chr(0xa1)."-".chr(0xfe)."])/"
This matches the symbols in the encoding range 0xa1a1-0xa3fe (gb2312 trail bytes end at 0xfe). Other ranges can be handled in the same way.
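One caveat with byte ranges is alignment: searched on its own, the punctuation pattern can also match the trail byte of one Chinese character followed by the lead byte of the next. The sketch below shows one possible way around this and is not part of the original method. A hypothetical helper, gb2312_punctuation(), first splits the string into whole characters and only then tests each one. It assumes the input is valid gb2312 mixed with ASCII, and it uses 0xFE as the upper trail byte, since gb2312 trail bytes end there.

```php
<?php
// Hypothetical helper: return the characters of a gb2312 string that fall
// in the symbol rows 0xA1A1-0xA3FE.
function gb2312_punctuation(string $bytes): array
{
    // Split into ASCII bytes and two-byte gb2312 characters, left to right.
    preg_match_all('/[\x00-\x7F]|[\xA1-\xF7][\xA1-\xFE]/', $bytes, $chars);

    // Keep only the two-byte characters whose lead byte is 0xA1-0xA3.
    $isPunct = fn(string $c): bool
        => preg_match('/\A[\xA1-\xA3][\xA1-\xFE]\z/', $c) === 1;

    return array_values(array_filter($chars[0], $isPunct));
}

// 中 (D6D0) ， (A3AC) 文 (CEC4) 。 (A1A3)
$found = gb2312_punctuation("\xD6\xD0\xA3\xAC\xCE\xC4\xA1\xA3");
echo implode(' ', array_map('bin2hex', $found)), "\n"; // a3ac a1a3

// 啊啊 (B0A1 B0A1): an unanchored search wrongly matches the bytes A1 B0.
var_dump(preg_match('/[\xA1-\xA3][\xA1-\xFE]/', "\xB0\xA1\xB0\xA1")); // int(1)
var_dump(gb2312_punctuation("\xB0\xA1\xB0\xA1"));                     // empty array
```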
3. Chinese matching in the UTF-8 environment works in a similar way: use the Unicode table to determine the range. The table shows that Chinese characters occupy 0x4e00-0x9fa5, so the regular expression can be written as follows:
"/[\x{4e00}-\x{9fa5}]/u"
Here \x{nnnn} gives the character's code point in hexadecimal; see the PHP manual for details. Note the pattern modifier u. The PHP manual describes u (PCRE_UTF8) as follows: this modifier turns on additional functionality of PCRE that is incompatible with Perl; pattern strings are treated as UTF-8; it is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on win32; and UTF-8 validity of the pattern is checked since PHP 4.3.5. This is exactly what correct matching requires. From experience, I would also suggest adding the u modifier whenever you match strings with metacharacters in a UTF-8 environment.
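A short sketch of the UTF-8 pattern in use follows. It assumes the script is saved as UTF-8; the sample sentence and variable names are invented for the example.

```php
<?php
// Assumption: this file is saved as UTF-8.
$text = "PHP 正则 matches 中文字符 here";

// With u, pattern and subject are read as UTF-8, and \x{...} names a code point.
preg_match_all('/[\x{4e00}-\x{9fa5}]+/u', $text, $runs);
print_r($runs[0]); // the two runs: 正则 and 中文字符

// Keep only the Chinese characters and drop everything else.
$onlyHanzi = preg_replace('/[^\x{4e00}-\x{9fa5}]+/u', '', $text);
echo $onlyHanzi, "\n"; // 正则中文字符
```

When the subject is not valid UTF-8, preg_match() with the u modifier returns false rather than 0, so comparing the result with === false is one way to notice an encoding problem.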
Here are two examples:
(1) ANSI programming environment:
$ Strtest = "yyg Chinese Character yyg ";
$ Pregstr = "/([". chr (0xb0 ). "-". chr (0xf7 ). "] [". chr (0xa1 ). "-". chr (0xfe ). "]) +/I ";
If (preg_match ($ pregstr, $ strtest, $ matchArray )){
Echo $ matchArray [0];
}
// Output: Chinese Characters
(2) Utf-8 programming environment:
$ Strtest = "yyg Chinese Character yyg ";
$ Pregstr = "/[\ x {4e00}-\ x {9fa5}] +/u ";
If (preg_match ($ pregstr, $ strtest, $ matchArray )){
Echo $ matchArray [0];
}
// Output: Chinese Characters
Author: zdrjlamp

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
