Introduction to regular expressions for matching Chinese (GB2312 and UTF-8)

Last Update:2016-07-25 Source: Internet

Author: User

Tags: character classes, modifiers

This article covers regular expressions for matching Chinese text in the GB2312 and UTF-8 encodings, and the PCRE pattern modifiers they rely on. Readers who need them can use it as a reference.

The following is a list of the modifiers currently available in PCRE. The internal PCRE name of each modifier is given in parentheses. Spaces and line breaks among the modifiers are ignored; any other character causes an error.

I hope this article can help you to understand and master the relevant concepts of regular expressions in more depth.

i (PCRE_CASELESS) If this modifier is set, letters in the pattern match both uppercase and lowercase letters.
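A small sketch of the case-insensitive modifier, using a made-up subject string:

```php
<?php
$subject = "Hello PCRE";

// Without i the comparison is case-sensitive.
var_dump(preg_match('/pcre/', $subject));   // int(0)

// With i, "pcre" in the pattern also matches "PCRE" in the subject.
var_dump(preg_match('/pcre/i', $subject));  // int(1)
```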

m (PCRE_MULTILINE) By default, PCRE treats the subject string as a single "line" of characters, even if it contains line breaks. The start-of-line metacharacter (^) matches only at the start of the string, and the end-of-line metacharacter ($) matches only at the end of the string or before a final newline (unless the D modifier is set). This is the same as Perl.

When this modifier is set, the start-of-line and end-of-line metacharacters also match immediately after and immediately before any newline in the subject string, as well as at the very start and end. This is equivalent to Perl's /m modifier. If the subject string contains no "\n" characters, or the pattern contains no ^ or $, setting this modifier has no effect.
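The difference can be sketched by counting how often ^ finds a line start in a three-line string (the sample data is invented):

```php
<?php
$subject = "first\nsecond\nthird";

// Default: ^ matches only at the very start of the subject.
$withoutM = preg_match_all('/^\w+/', $subject, $single);

// With m: ^ also matches immediately after each newline.
$withM = preg_match_all('/^\w+/m', $subject, $all);

var_dump($withoutM);               // int(1)
var_dump($withM);                  // int(3)
echo implode(',', $all[0]), "\n";  // first,second,third
```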

s (PCRE_DOTALL) If this modifier is set, the dot metacharacter (.) in the pattern matches all characters, including newlines. Without it, newlines are excluded. This is equivalent to Perl's /s modifier. A negated character class such as [^a] always matches a newline, regardless of whether this modifier is set.
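A minimal sketch of both statements, the dot behaviour and the negated class, on an invented two-line string:

```php
<?php
$subject = "a\nb";

// Default: the dot does not match the newline between "a" and "b".
var_dump(preg_match('/a.b/', $subject));     // int(0)

// With s: the dot matches the newline too.
var_dump(preg_match('/a.b/s', $subject));    // int(1)

// A negated class matches the newline even without s.
var_dump(preg_match('/a[^x]b/', $subject));  // int(1)
```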

x (PCRE_EXTENDED) If this modifier is set, whitespace characters in the pattern are ignored completely, except when they are escaped or inside a character class. Characters between an unescaped # outside a character class and the next newline, inclusive, are ignored as well. This is equivalent to Perl's /x modifier and makes it possible to add comments to complicated patterns. Note, however, that this applies only to data characters. Whitespace may never appear inside a special character sequence in a pattern, for example inside the sequence (?( that introduces a conditional subpattern.
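As an illustration, here is a hypothetical date pattern written once in compact form and once with the x modifier and comments. The pattern is an example chosen for this sketch, not one from the article.

```php
<?php
$compact  = '/^(\d{4})-(\d{2})-(\d{2})$/';
$extended = '/
    ^ (\d{4})   # year
    - (\d{2})   # month
    - (\d{2})   # day
    $
/x';

var_dump(preg_match($compact, '2016-07-25'));   // int(1)
var_dump(preg_match($extended, '2016-07-25'));  // int(1)

// Under x an unescaped space is dropped, so the pattern below is really "ab".
var_dump(preg_match('/a b/x', 'a b'));   // int(0)

// An escaped space is kept as a literal space.
var_dump(preg_match('/a\ b/x', 'a b'));  // int(1)
```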

e If this modifier is set, preg_replace() performs the normal substitution of backreferences in the replacement string, evaluates the result as PHP code, and uses the outcome to replace the matched text.

Only preg_replace() uses this modifier; the other PCRE functions ignore it.

Note: This modifier is not available in PHP 3.
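Readers on current PHP versions should be aware that the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0. The usual substitute is preg_replace_callback(). The sketch below capitalizes the first letter of each word; the sample text and the old-style call in the comment are illustrative only.

```php
<?php
// Old style, roughly (only ran on PHP versions before 7.0):
//   preg_replace('/\b(\w)(\w*)/e', 'strtoupper("$1") . "$2"', $text);

$text = "hello regex world";

// The callback receives the match array: $m[0] is the whole match,
// $m[1] and $m[2] are the two captured groups.
$result = preg_replace_callback(
    '/\b(\w)(\w*)/',
    function (array $m): string {
        return strtoupper($m[1]) . $m[2];
    },
    $text
);

echo $result, "\n";  // Hello Regex World
```

The callback form avoids evaluating a string as code, which is the reason the e modifier was retired.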

A (PCRE_ANCHORED) If this modifier is set, the pattern is forced to be "anchored", that is, it can match only at the start of the subject string. The same effect can be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
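A short sketch of anchoring with the modifier versus anchoring in the pattern, on an invented subject:

```php
<?php
$subject = "abc 123";

// Unanchored: the digits are found later in the string.
var_dump(preg_match('/\d+/', $subject));   // int(1)

// With A the match must begin at the start of the subject.
var_dump(preg_match('/\d+/A', $subject));  // int(0)
var_dump(preg_match('/\w+/A', $subject));  // int(1)

// The same restriction written inside the pattern.
var_dump(preg_match('/^\d+/', $subject));  // int(0)
```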

D (PCRE_DOLLAR_ENDONLY) If this modifier is set, a dollar metacharacter in the pattern matches only at the very end of the subject string. Without it, a dollar also matches immediately before the final character if that character is a newline (but not before any other newline). This modifier is ignored if the m modifier is set. There is no equivalent in Perl.
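This matters mostly when validating input, where a trailing newline should usually be rejected. A minimal sketch with made-up input:

```php
<?php
$subject = "abc\n";

// Default: $ also matches just before a final newline.
var_dump(preg_match('/^abc$/', $subject));   // int(1)

// With D: $ matches only at the very end, so the trailing newline is rejected.
var_dump(preg_match('/^abc$/D', $subject));  // int(0)
var_dump(preg_match('/^abc$/D', "abc"));     // int(1)
```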

S When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up matching. If this modifier is set, this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY) This modifier inverts the "greediness" of the quantifiers, so that they are not greedy by default but become greedy when followed by "?". It is not compatible with Perl. The same option can be turned on inside the pattern with (?U), and an individual quantifier can be made non-greedy by following it with a question mark (for example .*?).
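The inversion can be sketched with four variants of one pattern run against an invented HTML fragment:

```php
<?php
$subject = "<b>one</b><b>two</b>";

preg_match('/<b>.*<\/b>/', $subject, $greedy);     // default: greedy
preg_match('/<b>.*<\/b>/U', $subject, $ungreedy);  // U: not greedy
preg_match('/<b>.*?<\/b>/', $subject, $lazy);      // ? makes it lazy
preg_match('/<b>.*?<\/b>/U', $subject, $flipped);  // with U, ? makes it greedy

echo $greedy[0], "\n";    // <b>one</b><b>two</b>
echo $ungreedy[0], "\n";  // <b>one</b>
echo $lazy[0], "\n";      // <b>one</b>
echo $flipped[0], "\n";   // <b>one</b><b>two</b>
```

Because U changes the meaning of every quantifier in the pattern, writing .*? on the specific quantifier is often easier to read.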

X (PCRE_EXTRA) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter with no special meaning causes an error, which reserves these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as the letter itself. No other features are currently controlled by this modifier.

u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on Win32. Since PHP 4.3.5 the pattern is also checked for UTF-8 validity.

That is all for the modifiers involved in matching Chinese content with PHP regular expressions. I hope it helps, and good luck with your studies.
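As a closing sketch of why the u modifier matters for Chinese text, the example below counts what the dot matches with and without it. The sample string is invented, and the source file is assumed to be saved as UTF-8.

```php
<?php
$text = "中文";  // 2 characters, 6 bytes in UTF-8

// Without u the dot matches one byte at a time.
$bytes = preg_match_all('/./s', $text);

// With u the dot matches one UTF-8 encoded character at a time.
$chars = preg_match_all('/./su', $text);

var_dump($bytes);  // int(6)
var_dump($chars);  // int(2)

// A subject that is not valid UTF-8 makes the match fail with an error
// instead of returning 0 or 1.
$invalid = preg_match('/./u', "\xD6\xD0");
var_dump($invalid);                                   // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

The last two lines are the reason GB2312 data should be matched without u, or converted to UTF-8 first.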



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
