Differences Between POSIX and Perl-Compatible Regular Expressions in PHP

A regular expression (abbreviated regexp or regex) is a single string that describes or matches a set of strings conforming to a certain syntactic rule. Many text editors and other tools use regular expressions to search for and replace text that matches a pattern, and many programming languages support regular expressions for string manipulation; Perl, for example, has a powerful regular expression engine built in. The concept was first popularized by Unix tools such as sed and grep. (Excerpt from Wikipedia)

PHP supports two sets of regular expression rules. One follows the POSIX Extended regular expression syntax defined in IEEE standard 1003.2 (PHP's implementation does not conform to the standard perfectly). The other is the Perl-compatible syntax provided by the PCRE (Perl Compatible Regular Expressions) library, an open-source library written by Philip Hazel.

The functions that use POSIX-compatible rules are listed below. The ereg family, split(), spliti() and sql_regcase() were deprecated in PHP 5.3 and removed in PHP 7.0; the mb_ functions belong to the mbstring extension.
ereg_replace()
ereg()
eregi()
eregi_replace()
split()
spliti()
sql_regcase()
mb_ereg_match()
mb_ereg_replace()
mb_ereg_search_getpos()
mb_ereg_search_getregs()
mb_ereg_search_init()
mb_ereg_search_pos()
mb_ereg_search_regs()
mb_ereg_search_setpos()
mb_ereg_search()
mb_ereg()
mb_eregi_replace()
mb_eregi()
mb_regex_encoding()
mb_regex_set_options()
mb_split()

The functions that use Perl-compatible rules are:
preg_grep()
preg_replace_callback()
preg_match_all()
preg_match()
preg_quote()
preg_split()
preg_replace()


POSIX-compatible patterns have no delimiters; the entire pattern argument passed to the function is treated as the regular expression.

Perl-compatible patterns must be enclosed in delimiters. Any character that is not alphanumeric, a backslash (\), or whitespace can serve as the delimiter. If the delimiter character appears in the expression itself, it must be escaped with a backslash. The bracket pairs (), {}, [] and <> can also be used as delimiters, with the opening bracket at the start and the closing bracket at the end.
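The delimiter rules can be tried out with a short sketch. The subject strings and variable names below are made up for illustration; only preg_match() and preg_quote() come from PHP itself.

```php
<?php
// The same pattern written with three different delimiters.
$subject = 'path/to/file.txt';

// With "/" as the delimiter, every literal slash has to be escaped.
$withSlash = preg_match('/to\/file/', $subject);

// Choosing "#" as the delimiter avoids the escaping.
$withHash = preg_match('#to/file#', $subject);

// Bracket-style delimiters open with "{" and close with "}".
$withBraces = preg_match('{to/file}', $subject);

// preg_quote() escapes regex metacharacters and, when a second
// argument is given, the delimiter character as well.
$quoted = preg_quote('a/b.c', '/');
$matchesLiteral = preg_match('/^' . $quoted . '$/', 'a/b.c');
$matchesOther   = preg_match('/^' . $quoted . '$/', 'a/bxc');

var_dump($withSlash, $withHash, $withBraces, $quoted, $matchesLiteral, $matchesOther);
```

Picking a delimiter that does not occur in the pattern usually keeps the expression easier to read than escaping each occurrence.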


POSIX-compatible patterns have no modifiers.

Perl-compatible patterns accept the following modifiers after the closing delimiter (spaces and line breaks among the modifiers are ignored; any other character causes an error). Modifier letters are case-sensitive:

i (PCRE_CASELESS):
Letters in the pattern match both uppercase and lowercase letters.

m (PCRE_MULTILINE):
By default, the subject is treated as a single line: the line start (^) matches only at the beginning of the subject, and the line end ($) matches only at the end or before a final line break. When this modifier is set, ^ and $ also match immediately after and immediately before any line break (\n) inside the subject.

s (PCRE_DOTALL):
If this modifier is set, the dot character (.) in the pattern matches all characters, including line breaks. Without it, line breaks are excluded.

x (PCRE_EXTENDED):
If this modifier is set, whitespace characters in the pattern are ignored, except when they are escaped or appear inside a character class. Characters from an unescaped # outside a character class up to the next line break are ignored as well, which makes it possible to comment a complicated pattern.

e (PREG_REPLACE_EVAL):
If this modifier is set, preg_replace() performs the normal substitution of backreferences in the replacement string, evaluates the result as PHP code, and uses the return value as the replacement. Only preg_replace() honors this modifier; the other PCRE functions ignore it. It was deprecated in PHP 5.5 and removed in PHP 7.0; preg_replace_callback() is the replacement.

A (PCRE_ANCHORED):
If this modifier is set, the pattern is forced to be "anchored", that is, it can match only at the beginning of the subject string.

D (PCRE_DOLLAR_ENDONLY):
If this modifier is set, the dollar sign ($) in the pattern matches only at the very end of the subject string. Without it, $ also matches immediately before a line break that is the last character of the subject. This modifier is ignored if the m modifier is set.

S:
When a pattern is going to be used several times, it is worth spending more time analyzing it to speed up matching. If this modifier is set, the additional analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY):
Inverts the greediness of quantifiers: they are not greedy by default and become greedy only when followed by "?".

X (PCRE_EXTRA):
Any backslash in the pattern followed by a letter with no special meaning causes an error, which reserves these combinations for future expansion. By default, a backslash followed by a letter with no special meaning is treated as the letter itself.

u (PCRE_UTF8):
The pattern string and the subject string are treated as UTF-8.
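The effect of several of these modifiers is easiest to see side by side. The following is a minimal sketch with made-up sample strings; it only uses preg_match() and preg_match_all() and stores each result in a variable named after the case it demonstrates.

```php
<?php
$text = "first line\nsecond line\n";

// i: letters match regardless of case.
$caseless = preg_match('/FIRST/i', $text);

// m: ^ also matches right after an internal line break.
$lineStartsDefault   = preg_match_all('/^\w+/', $text);
$lineStartsMultiline = preg_match_all('/^\w+/m', $text);

// x: whitespace and #-comments inside the pattern are ignored.
$extended = preg_match('/ \d+   # first number
                          -     # literal hyphen
                          \d+   # second number
                        /x', '10-20');

// A: the match has to start at the beginning of the subject.
$unanchored = preg_match('/\d+/', 'abc123');
$anchored   = preg_match('/\d+/A', 'abc123');

// D: $ no longer matches before a trailing line break.
$dollarDefault = preg_match('/line$/', $text);
$dollarEndOnly = preg_match('/line$/D', $text);

// U: quantifiers are lazy unless followed by "?".
preg_match('/<.+>/', '<a><b>', $greedy);
preg_match('/<.+>/U', '<a><b>', $lazy);

// u: pattern and subject are read as UTF-8, so "." is one
// character rather than one byte.
$eAcute   = "\xC3\xA9"; // the letter e with acute accent, two bytes in UTF-8
$byteMode = preg_match('/^.$/', $eAcute);
$utf8Mode = preg_match('/^.$/u', $eAcute);

printf("m: %d vs %d line starts\n", $lineStartsDefault, $lineStartsMultiline);
printf("U: greedy=%s lazy=%s\n", $greedy[0], $lazy[0]);
printf("u: byte mode=%d, UTF-8 mode=%d\n", $byteMode, $utf8Mode);
```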

Grouping symbols:
The grouping symbols work exactly the same way in POSIX-compatible and Perl-compatible patterns:
[]: Encloses a set of characters, any one of which may match.
{}: Encloses a repetition count.
(): Encloses a subpattern (group), which can be referenced later.
|: Denotes "or"; [ab] and a|b are equivalent.
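As a small illustration of the last point, the sketch below filters a hypothetical word list once with a character class and once with alternation, then uses alternation on whole words, which a character class cannot express.

```php
<?php
$letters = ['a', 'b', 'c'];

// A character class and a single-character alternation select the same items.
$viaClass       = preg_grep('/^[ab]$/', $letters);
$viaAlternation = preg_grep('/^(a|b)$/', $letters);

// Alternation also works on longer subpatterns.
preg_match('/^(cat|dog)s?$/', 'dogs', $m);

var_dump($viaClass === $viaAlternation, $m[1]);
```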

Metacharacters related to "[]":

There are two different sets of metacharacters: those recognized anywhere in the pattern outside square brackets, and those recognized inside square brackets "[]".

Metacharacters outside "[]" that behave the same in POSIX-compatible and Perl-compatible patterns:
\ General escape character with several uses
^ Matches the beginning of the string
$ Matches the end of the string
? Matches the preceding element 0 or 1 times
* Matches the preceding element 0 or more times
+ Matches the preceding element 1 or more times

Metacharacters outside "[]" that behave differently:
. In Perl-compatible patterns, matches any character except a line break (by default)
. In POSIX-compatible patterns, matches any single character
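The Perl-compatible behavior of the dot can be checked directly. This is a minimal sketch using preg_match() on a made-up two-line subject; the s modifier and the [\s\S] class are two common ways to let the match cross a line break.

```php
<?php
$subject = "a\nb";

// By default "." refuses to match the line break.
$default = preg_match('/^a.b$/', $subject);

// With the s modifier "." accepts it.
$dotAll = preg_match('/^a.b$/s', $subject);

// Without the modifier, a class such as [\s\S] matches any character.
$explicit = preg_match('/^a[\s\S]b$/', $subject);

var_dump($default, $dotAll, $explicit);
```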

Metacharacters inside "[]" that behave the same in POSIX-compatible and Perl-compatible patterns:
\ General escape character with several uses
^ Negates the class, but only if it is the first character
- Specifies a character range in ASCII order; consulting an ASCII table shows that [W-c] covers the characters W X Y Z [ \ ] ^ _ ` a b c

Metacharacters inside "[]" that behave differently:
- In POSIX-compatible patterns, [a-c-e] raises an error.
- In Perl-compatible patterns, [a-c-e] is accepted: the hyphen that follows the range a-c is taken literally, so the class matches a, b, c, "-" and e.
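How PCRE reads a hyphen inside a class can be verified with preg_match(). In the sketch below, the test characters are chosen for illustration: the first loop probes [a-c-e], where the hyphen after the completed range a-c is treated as a literal character, and the second probes the ASCII range [W-c].

```php
<?php
// Probe the class [a-c-e] one character at a time.
$hits = [];
foreach (['a', 'b', 'c', 'd', 'e', '-'] as $ch) {
    $hits[$ch] = preg_match('/^[a-c-e]$/', $ch);
}

// Probe the ASCII range W..c, which spans punctuation between Z and a.
$inRange = [];
foreach (['W', '[', '^', '`', 'c', 'd'] as $ch) {
    $inRange[$ch] = preg_match('/^[W-c]$/', $ch);
}

var_dump($hits, $inRange);
```

To match a literal hyphen reliably, placing it first or last in the class or escaping it with a backslash avoids any ambiguity.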

Repetition counts with "{}":
Repetition counts work exactly the same way in POSIX-compatible and Perl-compatible patterns:
{2}: Matches the preceding element exactly 2 times
{2,}: Matches the preceding element 2 or more times; matching is greedy (as many as possible) by default
{2,4}: Matches the preceding element 2 to 4 times
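The three forms can be compared on strings of increasing length. The sketch below is illustrative only: the sample strings and array keys are made up, and the last two calls show the default greedy behavior next to the lazy form written with a trailing "?".

```php
<?php
$results = [];
foreach (['a', 'aa', 'aaa', 'aaaa', 'aaaaa'] as $s) {
    $results[$s] = [
        'exactly2' => preg_match('/^a{2}$/', $s),
        'atLeast2' => preg_match('/^a{2,}$/', $s),
        'from2to4' => preg_match('/^a{2,4}$/', $s),
    ];
}

// Greedy by default: takes as many characters as possible.
preg_match('/a{2,}/', 'aaaaa', $greedy);
// Lazy with "?": takes as few as possible.
preg_match('/a{2,}?/', 'aaaaa', $lazy);

var_dump($results, $greedy[0], $lazy[0]);
```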

Groups with "()":
The region enclosed in () is a group. A group serves two purposes: it expresses the logical order in which characters must appear, and it captures the matched text so that it can be referenced later. The second purpose is the more distinctive one:
$str = "";
POSIX-compatible:
echo ereg_replace("(.+)", "\\1", $str);
Perl-compatible:
echo preg_replace("/(.+)/", "$1", $str);
Both calls produce the same output.

Groups can be nested; they are numbered by the order in which their opening parentheses appear.
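Group numbering with nested parentheses can be inspected through the match array. The following sketch uses a made-up date string and a made-up two-word string; it relies only on preg_match() and preg_replace().

```php
<?php
// Groups are numbered by the position of their opening parenthesis:
// 1 = year-month, 2 = year, 3 = month, 4 = day.
preg_match('/((\d{4})-(\d{2}))-(\d{2})/', '2024-05-17', $m);

// References in the replacement string use the same numbers.
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');

var_dump($m, $swapped);
```

In the replacement string of preg_replace(), both the $1 form and the \\1 form refer to the first group.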

Character classes:
POSIX-compatible (used inside a bracket expression, for example [[:upper:]]):
[:upper:]: Matches all uppercase letters
[:lower:]: Matches all lowercase letters
[:alpha:]: Matches all letters
[:alnum:]: Matches all letters and digits
[:digit:]: Matches all digits
[:xdigit:]: Matches all hexadecimal digits, equivalent to [0-9a-fA-F]
[:punct:]: Matches all punctuation characters, such as . , " ' ? ! ; :
[:blank:]: Matches space and tab, equivalent to [ \t]
[:space:]: Matches all whitespace characters, equivalent to [ \t\n\r\f\v]
[:cntrl:]: Matches all control characters (ASCII 0 to 31, and 127)
[:graph:]: Matches all printable characters except the space
[:print:]: Matches all printable characters, including the space
[.c.]: Collating element c
[=c=]: Equivalence class of c (all characters that collate the same as c)
[[:<:]]: Matches the beginning of a word
[[:>:]]: Matches the end of a word
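The named classes such as [:upper:] and [:punct:] are also accepted inside brackets by the Perl-compatible functions, so they can be tried on current PHP versions without the ereg family. The sketch below is an illustration with made-up input strings, not a demonstration of the POSIX engine itself.

```php
<?php
// One uppercase letter followed by lowercase letters.
$capitalized = preg_match('/^[[:upper:]][[:lower:]]+$/', 'Regex');

// Hexadecimal digits in either case.
$isHex = preg_match('/^[[:xdigit:]]+$/', 'DEADbeef');

// Digits only; the letter makes the match fail.
$isNumber = preg_match('/^[[:digit:]]+$/', '12a');

// Strip punctuation characters.
$stripped = preg_replace('/[[:punct:]]/', '', 'a.b,c!');

// Several named classes can be combined in one bracket expression.
$alnumOrBlank = preg_match('/^[[:alnum:][:blank:]]+$/', "abc 123\tdef");

var_dump($capitalized, $isHex, $isNumber, $stripped, $alnumOrBlank);
```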

Perl-compatible (here you can see the power of Perl regular expressions):
\a Alarm, that is, the BEL character (hex 07)
\cx "Control-x", where x is any character
\e Escape (hex 1B)
\f Form feed (hex 0C)
\n Newline (hex 0A)
\r Carriage return (hex 0D)
\t Tab (hex 09)
\xhh Character with hexadecimal code hh
\ddd Character with octal code ddd, or a backreference
\d Any decimal digit
\D Any character that is not a decimal digit
\s Any whitespace character
\S Any character that is not whitespace
\w Any "word" character
\W Any "non-word" character
\b Word boundary
\B Not a word boundary
\A Start of the subject (independent of multiline mode)
\Z End of the subject, or a newline at the end (independent of multiline mode)
\z End of the subject (independent of multiline mode)
\G First matching position in the subject
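A few of these escapes in action, as a minimal sketch with made-up sample strings. Note that the lowercase and uppercase forms are opposites (\d versus \D, \b versus \B) and that \Z and \z differ only in how they treat a trailing newline.

```php
<?php
$s = "id 42, ref 7\n";

// \d collects digits, \D removes everything that is not a digit.
preg_match_all('/\d+/', $s, $digits);
$digitsOnly = preg_replace('/\D+/', '', $s);

// \W splits on runs of non-word characters.
$words = preg_split('/\W+/', 'one, two; three', -1, PREG_SPLIT_NO_EMPTY);

// \b requires a word boundary on both sides of "cat".
$insideWord = preg_match('/\bcat\b/', 'concatenate');
$wholeWord  = preg_match('/\bcat\b/', 'a cat sat');

// \Z tolerates a newline at the end of the subject, \z does not.
$upperZ = preg_match('/ref 7\Z/', $s);
$lowerZ = preg_match('/ref 7\z/', $s);

// \x41 is the character with hexadecimal code 41, the letter A.
$hexEscape = preg_match('/^\x41$/', 'A');

var_dump($digits[0], $digitsOnly, $words, $insideWord, $wholeWord, $upperZ, $lowerZ, $hexEscape);
```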
