Regular Expressions for Matching Chinese Text (GB2312 and UTF-8)

Source: Internet
Author: User
Tags: character classes, modifiers
This article covers regular expressions for matching Chinese text in GB2312 and UTF-8 encodings, as a reference for readers who need it.

The following modifiers are currently available in PCRE. The internal PCRE name of each modifier is given in parentheses. Spaces and line breaks in the modifier string are ignored; any other character causes an error.

I hope this article can help you to understand and master the relevant concepts of regular expressions in more depth.

i (PCRE_CASELESS) If this modifier is set, letters in the pattern match both uppercase and lowercase letters.
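
As a minimal sketch of the case-insensitive modifier (written as a lowercase i after the closing delimiter in PHP), the following compares the same pattern with and without it. The pattern and subject string are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical sample data, not taken from the article.
$subject = 'Say HELLO';

// Without i, the lowercase pattern does not match the uppercase text.
$caseSensitive = preg_match('/hello/', $subject);

// With i, letters match regardless of case.
$caseInsensitive = preg_match('/hello/i', $subject);

printf("sensitive=%d insensitive=%d\n", $caseSensitive, $caseInsensitive);
```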

m (PCRE_MULTILINE) By default, PCRE treats the subject string as a single "line" of characters (even if it contains line breaks). The start-of-line metacharacter (^) matches only at the start of the string, and the end-of-line metacharacter ($) matches only at the end of the string, or just before a final newline (unless the D modifier is set). This is the same as Perl.

When this modifier is set, the start-of-line and end-of-line metacharacters also match immediately after and immediately before any newline in the subject string, in addition to the very start and end. This is equivalent to Perl's /m modifier. If the subject string contains no "\n" characters, or the pattern contains no ^ or $, setting this modifier has no effect.
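
The effect of the multiline modifier (lowercase m in PHP) on ^ can be sketched as follows. The three-line subject string is a hypothetical example.

```php
<?php
// Hypothetical three-line subject; "\n" is a real newline in double quotes.
$text = "first\nsecond\nthird";

// Without m, ^ matches only at the very start of the string.
$countWithoutM = preg_match_all('/^\w+/', $text, $withoutM);

// With m, ^ also matches immediately after each newline.
$countWithM = preg_match_all('/^\w+/m', $text, $withM);

printf("without m: %d match(es), with m: %d match(es)\n", $countWithoutM, $countWithM);
```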

s (PCRE_DOTALL) If this modifier is set, the dot metacharacter (.) in the pattern matches all characters, including newlines. Without it, newlines are excluded. This is equivalent to Perl's /s modifier. A negated character class such as [^a] always matches a newline, regardless of whether this modifier is set.
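
A short sketch of the dot-matches-newline modifier (lowercase s in PHP), including the negated character class behaviour described above. The subject string is a hypothetical example.

```php
<?php
// Hypothetical two-line subject.
$text = "line one\nline two";

// Without s, the dot does not match the newline between the lines.
$dotWithoutS = preg_match('/one.line/', $text);

// With s, the dot matches any character, including the newline.
$dotWithS = preg_match('/one.line/s', $text);

// A negated class matches the newline even without s.
$negatedClass = preg_match('/one[^a]line/', $text);

printf("%d %d %d\n", $dotWithoutS, $dotWithS, $negatedClass);
```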

x (PCRE_EXTENDED) If this modifier is set, whitespace characters in the pattern are ignored completely, except when they are escaped or appear inside a character class. Characters between an unescaped # outside a character class and the next newline, inclusive, are also ignored. This is equivalent to Perl's /x modifier and makes it possible to add comments to complicated patterns. Note, however, that this applies only to data characters. Whitespace may never appear inside a special character sequence in a pattern, for example inside the sequence (?( that introduces a conditional subpattern.
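
The following sketch shows how the extended modifier (lowercase x in PHP) allows a pattern to be laid out over several lines with comments. The date format and sample strings are hypothetical.

```php
<?php
// Whitespace and # comments inside the pattern are ignored under x.
$datePattern = '/
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
/x';

preg_match($datePattern, '2024-01-31', $parts);

// An unescaped space is dropped from the pattern; an escaped one is kept.
$unescapedSpace = preg_match('/a b/x', 'a b');   // pattern is effectively "ab"
$escapedSpace   = preg_match('/a\ b/x', 'a b');  // pattern is "a b"

print_r($parts);
```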

e If this modifier is set, preg_replace() performs the normal substitution of backreferences in the replacement string, evaluates the result as PHP code, and uses the return value to replace the matched text.

Only preg_replace() uses this modifier; the other PCRE functions ignore it.

Note: This modifier is not available in PHP 3.
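
The e modifier was deprecated in PHP 5.5.0 and removed in PHP 7.0.0, so it cannot be demonstrated on current PHP versions. As a sketch of the same idea (computing each replacement with PHP code), the example below uses preg_replace_callback() instead; this is a substitute technique, not the e modifier itself, and the sample string is hypothetical.

```php
<?php
// Capitalise the first letter of each word by computing the replacement
// in a callback rather than evaluating a replacement string as code.
$result = preg_replace_callback(
    '/\b[a-z]/',
    function (array $match): string {
        // $match[0] holds the text matched by the whole pattern.
        return strtoupper($match[0]);
    },
    'hello big world'
);

echo $result, "\n";
```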

A (PCRE_ANCHORED) If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the subject string. The same effect can be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
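
A minimal sketch of the A modifier, compared with an explicit ^ anchor in the pattern. The subject string is a hypothetical example.

```php
<?php
// Hypothetical subject: letters first, then digits.
$subject = 'abc123';

// Unanchored: the digits are found in the middle of the string.
$unanchored = preg_match('/\d+/', $subject);

// With A, the match must begin at the start of the subject.
$anchoredDigits  = preg_match('/\d+/A', $subject);
$anchoredLetters = preg_match('/[a-z]+/A', $subject);

// The same constraint written inside the pattern itself.
$explicitCaret = preg_match('/^\d+/', $subject);

printf("%d %d %d %d\n", $unanchored, $anchoredDigits, $anchoredLetters, $explicitCaret);
```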

D (PCRE_DOLLAR_ENDONLY) If this modifier is set, a dollar metacharacter in the pattern matches only at the very end of the subject string. Without it, if the last character is a newline, the dollar also matches immediately before that newline (but not before any other newline). This modifier is ignored if the m modifier is set. There is no equivalent modifier in Perl.
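
The difference the D modifier makes for a subject that ends in a newline can be sketched like this. The subject strings are hypothetical.

```php
<?php
// Subject with a trailing newline ("\n" is a real newline in double quotes).
$withNewline = "abc\n";

// By default, $ also matches just before a final newline.
$defaultDollar = preg_match('/abc$/', $withNewline);

// With D, $ matches only at the very end, so the trailing newline blocks it.
$dollarEndOnly = preg_match('/abc$/D', $withNewline);

// Without a trailing newline, D makes no difference.
$noTrailingNewline = preg_match('/abc$/D', 'abc');

// D is ignored when m is also set.
$withMultiline = preg_match('/abc$/mD', $withNewline);

printf("%d %d %d %d\n", $defaultDollar, $dollarEndOnly, $noTrailingNewline, $withMultiline);
```

This matters for input validation: without D, a pattern ending in $ accepts a value that carries one trailing newline.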

S When a pattern is going to be used several times, it is worth spending extra time analyzing it to speed up matching. If this modifier is set, this additional analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY) This modifier inverts the "greediness" of quantifiers, so that they are not greedy by default but become greedy when followed by "?". It is not compatible with Perl. The same option can be set inside the pattern with (?U), and an individual quantifier can be made non-greedy by following it with a question mark (such as .*?).
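
A short sketch of how the U modifier inverts quantifier greediness, next to the per-quantifier question mark form. The HTML-like subject is a hypothetical example, and # is used as the pattern delimiter so that the slash needs no escaping.

```php
<?php
// Hypothetical subject with two adjacent tagged items.
$html = '<b>one</b><b>two</b>';

// Default: .* is greedy and runs to the last </b>.
preg_match('#<b>.*</b>#', $html, $greedy);

// With U: .* becomes non-greedy and stops at the first </b>.
preg_match('#<b>.*</b>#U', $html, $ungreedy);

// Same result without U, using a question mark after the quantifier.
preg_match('#<b>.*?</b>#', $html, $lazy);

// With U, the question mark flips the quantifier back to greedy.
preg_match('#<b>.*?</b>#U', $html, $flipped);

echo $greedy[0], "\n", $ungreedy[0], "\n";
```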

X (PCRE_EXTRA) This modifier turns on additional PCRE functionality that is incompatible with Perl. Any backslash in the pattern followed by a letter that has no special meaning causes an error, which reserves these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as the letter itself. No other features are currently controlled by this modifier.

u (PCRE_UTF8) This modifier turns on additional PCRE functionality that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on Win32. Since PHP 4.3.5, the pattern is also checked for UTF-8 validity.

That concludes this overview of PHP regular expressions for matching Chinese content. Hopefully it is helpful to everyone.
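
To connect the UTF-8 modifier (lowercase u in PHP) with the article's topic, here is a minimal sketch of matching Chinese characters in both encodings. The character ranges are assumptions that do not appear in the text above: U+4E00 to U+9FA5 is a commonly used range for common CJK ideographs in UTF-8, and the byte ranges B0-F7 followed by A1-FE are the usual lead and trail bytes for GB2312 Chinese characters. The sample strings are written as byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "中文" encoded as UTF-8 and as GB2312, surrounded by ASCII text.
$utf8Text   = "PHP \xE4\xB8\xAD\xE6\x96\x87 regex";
$gb2312Text = "PHP \xD6\xD0\xCE\xC4 regex";

// UTF-8: the u modifier makes PCRE work on code points, so \x{...}
// can name a range of characters (assumed range, see the note above).
preg_match_all('/[\x{4e00}-\x{9fa5}]+/u', $utf8Text, $utf8Matches);

// GB2312: no u modifier, so the pattern works on raw bytes and matches
// lead byte + trail byte pairs (assumed ranges, see the note above).
preg_match_all('/(?:[\xB0-\xF7][\xA1-\xFE])+/', $gb2312Text, $gbMatches);

// Print the matched bytes as hex so the output is encoding-independent.
echo bin2hex($utf8Matches[0][0]), "\n";
echo bin2hex($gbMatches[0][0]), "\n";
```

The byte-oriented pattern should only be applied to text known to be GB2312; on UTF-8 input it would match unrelated byte pairs.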
