Differences between POSIX and Perl-compatible regular expressions in PHP

Source: Internet
Author: User

A regular expression (abbreviated regexp or regex) describes or matches a set of strings that conform to a certain syntactic rule. In many text editors and other tools, regular expressions are used to search for and/or replace text that conforms to a pattern. Many programming languages support regular expressions for string manipulation; Perl, for example, has a powerful regular expression engine built in. The concept of regular expressions was first popularized by Unix tools such as sed and grep. (Excerpt from Wikipedia)

PHP uses two sets of regular expression rules side by side. One is the POSIX Extended 1003.2 compatible syntax defined by the Institute of Electrical and Electronics Engineers (IEEE); in practice, PHP's support for this standard is not perfect. The other comes from the PCRE (Perl Compatible Regular Expressions) library, which provides Perl-compatible regular expressions. PCRE is open-source software written by Philip Hazel.

Functions that use the POSIX-compatible rules:
ereg_replace()
ereg()
eregi()
eregi_replace()
split()
spliti()
sql_regcase()
mb_ereg_match()
mb_ereg_replace()
mb_ereg_search_getpos()
mb_ereg_search_getregs()
mb_ereg_search_init()
mb_ereg_search_pos()
mb_ereg_search_regs()
mb_ereg_search_setpos()
mb_ereg_search()
mb_ereg()
mb_eregi_replace()
mb_eregi()
mb_regex_encoding()
mb_regex_set_options()
mb_split()

Functions that use the Perl-compatible rules:
preg_grep()
preg_replace_callback()
preg_match_all()
preg_match()
preg_quote()
preg_split()
preg_replace()

Delimiter:

POSIX-compatible regular expressions have no delimiters; the corresponding function argument is treated as the expression itself.

Perl-compatible regular expressions can use any character that is not a letter, digit, backslash (\), or whitespace as a delimiter. If the delimiter character must appear in the expression itself, it has to be escaped with a backslash. The bracket pairs (), {}, [], and <> can also be used as delimiters.
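
As a minimal sketch of the delimiter rule, the three calls below match the same literal prefix with different delimiters. The sample URL is the one used later in this article; the variable names are illustrative only.

```php
<?php
// Illustrative only: three equivalent patterns that differ in their delimiters.
$url = "http://www.163.com/";

// "/" as the delimiter: every slash inside the pattern must be escaped.
$withSlash = preg_match('/^http:\/\//', $url);

// "#" as the delimiter: the slashes need no escaping.
$withHash = preg_match('#^http://#', $url);

// Bracket-style delimiters: the closing delimiter is the matching bracket.
$withBraces = preg_match('{^http://}', $url);

var_dump($withSlash, $withHash, $withBraces); // each call returns int(1)
```

Choosing a delimiter that does not occur in the pattern usually keeps the expression easier to read.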

Modifier:

POSIX-compatible regular expressions have no modifiers.

Modifiers available in Perl-compatible regular expressions (spaces and newlines among the modifiers are ignored; any other character causes an error):

i (PCRE_CASELESS):
Case is ignored when matching.

m (PCRE_MULTILINE):
When this modifier is set, line start (^) and line end ($) match not only at the start and end of the whole string, but also immediately after and immediately before each newline (\n) inside it.

s (PCRE_DOTALL):
If this modifier is set, the dot metacharacter (.) in the pattern matches all characters, including newlines. Without it, newlines are excluded.

x (PCRE_EXTENDED):
If this modifier is set, whitespace characters in the pattern are completely ignored, except when they are escaped or appear inside a character class.

e:
If this modifier is set, preg_replace() substitutes the backreferences in the replacement string as usual, evaluates the result as PHP code, and replaces the matched text with the value returned. Only preg_replace() uses this modifier; the other PCRE functions ignore it.

A (PCRE_ANCHORED):
If this modifier is set, the pattern is forced to be "anchored", that is, it can match only at the start of the subject string.

D (PCRE_DOLLAR_ENDONLY):
If this modifier is set, a dollar ($) in the pattern matches only at the very end of the subject string. Without it, a dollar also matches immediately before the final character if that character is a newline. This modifier is ignored if the m modifier is set.

S:
When a pattern is going to be used several times, it is worth analyzing it further to speed up matching. If this modifier is set, that extra analysis is performed. Currently, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY):
Inverts the greediness of quantifiers: they are not greedy by default and become greedy only when followed by "?".

X (PCRE_EXTRA):
Any backslash in the pattern that is followed by a letter with no special meaning causes an error, which reserves these combinations for future expansion. By default, a backslash followed by a letter with no special meaning is treated as the letter itself.

u (PCRE_UTF8):
The pattern and subject strings are treated as UTF-8.
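
The effect of several modifiers can be observed with a short sketch such as the one below. The sample strings and variable names are made up for illustration. Because the e modifier was removed in PHP 7.0, the last lines show preg_replace_callback() as one possible substitute rather than e itself.

```php
<?php
// Illustrative sample text with an embedded and a trailing newline.
$text = "First line\nsecond LINE\n";

// i: case is ignored.
$caseless = preg_match('/line/i', "LINE");                  // 1

// m: ^ also matches right after each embedded newline.
$lineCount = preg_match_all('/^\w+/m', $text, $lineStarts); // 2: "First", "second"

// s: without it the dot stops at a newline, with it the dot matches it.
$dotPlain  = preg_match('/line.second/', $text);            // 0
$dotAll    = preg_match('/line.second/s', $text);           // 1

// x: unescaped whitespace in the pattern is ignored.
$extended = preg_match('/ \d+ - \d+ /x', "10-20");          // 1

// A: the match has to start at the beginning of the subject.
$anchored = preg_match('/line/A', $text);                   // 0

// D: $ no longer matches before the trailing newline.
$dollarPlain = preg_match('/LINE$/', $text);                // 1
$dollarEnd   = preg_match('/LINE$/D', $text);               // 0

// U: quantifiers become non-greedy by default.
preg_match('/<.+>/', "<a><b>", $greedy);                    // $greedy[0] is "<a><b>"
preg_match('/<.+>/U', "<a><b>", $lazy);                     // $lazy[0] is "<a>"

// u: the two-byte UTF-8 sequence for "é" counts as one character.
$utf8 = preg_match('/^.$/u', "\xC3\xA9");                   // 1

// Substitute for e: compute the replacement in a callback.
$titled = preg_replace_callback(
    '/\b[a-z]/',
    function ($match) { return strtoupper($match[0]); },
    "hello world"
);                                                          // "Hello World"
```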

Logical grouping:
The grouping symbols work and are used in exactly the same way in POSIX-compatible and Perl-compatible regular expressions:
[]: contains the set of characters to choose from.
{}: contains the number of repetitions.
(): contains a logical group, which can also be used for references.
|: means "or"; [ab] and a|b are equivalent.
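
A small sketch of these symbols, using made-up inputs, compares a character class with the equivalent alternation and then combines () with {}:

```php
<?php
// Compare [ab] with the equivalent alternation a|b on three sample inputs.
$results = [];
foreach (["a", "b", "c"] as $input) {
    $results[$input] = [
        'class'       => preg_match('/^[ab]$/', $input),
        'alternation' => preg_match('/^(a|b)$/', $input),
    ];
}
print_r($results); // "a" and "b" match both patterns, "c" matches neither

// {} sets the repetition count; () captures the group for later reference.
preg_match('/(ab){2}/', "xababx", $groups);
// $groups[0] is "abab" (the whole match), $groups[1] is "ab" (the group)
```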

Metacharacters and "[]":

There are two different sets of metacharacters: those recognized anywhere in the pattern except within square brackets, and those recognized within square brackets "[]".

Metacharacters outside "[]" that behave the same in POSIX-compatible and Perl-compatible regular expressions:
\ general escape character with several uses
^ matches the start of the string
$ matches the end of the string
? matches 0 or 1 occurrence of the preceding item
* matches 0 or more occurrences of the preceding item
+ matches 1 or more occurrences of the preceding item

Metacharacters outside "[]" that behave differently in POSIX-compatible and Perl-compatible regular expressions:
. in a Perl-compatible regular expression matches any single character except a newline
. in a POSIX-compatible regular expression matches any single character

Metacharacters inside "[]" that behave the same in POSIX-compatible and Perl-compatible regular expressions:
\ general escape character with several uses
^ negates the character class, but only if it is the first character
- specifies a character range in ASCII order; a careful look at the ASCII table shows that [W-c] is equivalent to [WXYZ[\\\]^_`abc]

Metacharacters inside "[]" that behave differently in POSIX-compatible and Perl-compatible regular expressions:
- in a POSIX-compatible regular expression, [a-c-e] throws an error.
- in a Perl-compatible regular expression, [a-c-e] is accepted: a hyphen directly after a range is taken literally, so the class matches a, b, c, "-", and e.
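
The range behaviour on the Perl-compatible side can be checked with a sketch like this one, which simply tries each candidate character against the class. The variable names are illustrative.

```php
<?php
// Collect every ASCII character accepted by the range [W-c].
$accepted = "";
for ($code = 0; $code < 128; $code++) {
    $char = chr($code);
    if (preg_match('/^[W-c]$/D', $char)) {
        $accepted .= $char;
    }
}
echo $accepted, "\n"; // WXYZ[\]^_`abc

// In PCRE a hyphen directly after a range is a literal hyphen.
$hits = [];
foreach (["a", "b", "c", "d", "e", "-"] as $char) {
    $hits[$char] = preg_match('/^[a-c-e]$/', $char);
}
print_r($hits); // "d" is the only candidate that does not match
```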

Repetition counts and "{}":
Repetition counts are exactly the same in POSIX-compatible and Perl-compatible regular expressions:
{2}: matches the preceding item exactly 2 times
{2,}: matches the preceding item 2 or more times; the match is greedy by default (as many as possible)
{2,4}: matches the preceding item 2 to 4 times
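
As a quick illustration on an arbitrary sample string, the following sketch shows how much each form consumes, including the lazy variant that Perl-compatible expressions get by appending "?":

```php
<?php
$subject = "aaaaa";

preg_match('/a{2}/', $subject, $exactlyTwo);   // $exactlyTwo[0] is "aa"
preg_match('/a{2,}/', $subject, $twoOrMore);   // $twoOrMore[0] is "aaaaa" (greedy)
preg_match('/a{2,4}/', $subject, $twoToFour);  // $twoToFour[0] is "aaaa"
preg_match('/a{2,}?/', $subject, $lazy);       // $lazy[0] is "aa" (lazy)

echo $exactlyTwo[0], " ", $twoOrMore[0], " ", $twoToFour[0], " ", $lazy[0], "\n";
```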

Logical groups and "()":
The region enclosed in parentheses is a logical group. A group serves two purposes: it expresses the logical order in which characters appear, and it can be referenced (the text it matched can be used like a variable). The second purpose is the more distinctive one:
<?php
$str = "http://www.163.com/";
// POSIX-compatible regular expression (ereg_replace() exists only before PHP 7.0):
echo ereg_replace("(.+)", "<a href=\\1>\\1</a>", $str);
// Perl-compatible regular expression:
echo preg_replace("/(.+)/", "<a href=$1>$1</a>", $str);
// Each call outputs the same link, so two links are shown
?>

Parentheses can be nested; references are numbered according to the order in which the opening parentheses appear.
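
The numbering of nested groups can be seen in a sketch like the following, which uses a made-up date string:

```php
<?php
$date = "2024-03-15";
$pattern = '/((\d{4})-(\d{2}))-(\d{2})/';

// Groups are numbered by the position of their opening parenthesis.
preg_match($pattern, $date, $groups);
// $groups[1] is "2024-03", $groups[2] is "2024",
// $groups[3] is "03",      $groups[4] is "15"

// The same numbers are used as references in the replacement string.
$reordered = preg_replace($pattern, '$4/$3/$2', $date);
echo $reordered, "\n"; // 15/03/2024
```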

Type matching:
POSIX-compatible regular expressions:
[:upper:]: matches all uppercase letters
[:lower:]: matches all lowercase letters
[:alpha:]: matches all letters
[:alnum:]: matches all letters and digits
[:digit:]: matches all digits
[:xdigit:]: matches all hexadecimal characters, equivalent to [0-9A-Fa-f]
[:punct:]: matches all punctuation, such as [.,"'?!;:]
[:blank:]: matches space and tab, equivalent to [ \t]
[:space:]: matches all whitespace characters, equivalent to [ \t\n\r\f\v]
[:cntrl:]: matches all control characters, ASCII 0 to 31 (and 127)
[:graph:]: matches all printable characters except space, roughly equivalent to [^ \t\n\r\f\v]
[:print:]: matches all printable characters including space, roughly equivalent to [^\t\n\r\f\v]
[.c.]: collating symbol for the collating element c
[=c=]: equivalence class of c (all characters that collate as equal to c)
[:<:]: matches the beginning of a word
[:>:]: matches the end of a word
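
The named classes are written inside a bracket expression, for example [[:digit:]]. PCRE accepts the same class names inside "[]", so the sketch below uses preg_match() rather than the ereg functions, which are not available in PHP 7 and later. The sample strings are made up, and the word-boundary forms are left out.

```php
<?php
// Each entry pairs a pattern built from a named class with a sample that should match.
$samples = [
    'upper'  => ['/^[[:upper:]]+$/',  "ABC"],
    'digit'  => ['/^[[:digit:]]+$/',  "2024"],
    'xdigit' => ['/^[[:xdigit:]]+$/', "1aF"],
    'punct'  => ['/^[[:punct:]]+$/',  "?!;"],
    'blank'  => ['/^[[:blank:]]+$/',  " \t"],
    'space'  => ['/^[[:space:]]+$/',  " \t\n\r"],
];

$matched = [];
foreach ($samples as $name => [$pattern, $subject]) {
    $matched[$name] = preg_match($pattern, $subject);
}
print_r($matched); // every entry is 1

// A letter is not a digit, so this one does not match.
$notAllDigits = preg_match('/^[[:digit:]]+$/', "12a"); // 0
```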

Perl-compatible regular expressions (this is where Perl's strength shows):
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\xhh character with hexadecimal code hh
\ddd character with octal code ddd, or a backreference
\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any non-whitespace character
\w any "word" character
\W any "non-word" character
\b word boundary
\B not a word boundary
\A start of the subject (independent of multiline mode)
\Z end of the subject, or before a newline at the end (independent of multiline mode)
\z end of the subject (independent of multiline mode)
\G first matching position in the subject
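
A few of these escapes in use, as a minimal sketch with made-up sample strings:

```php
<?php
// \s: split on runs of whitespace.
$words = preg_split('/\s+/', "a  b\tc");                    // ["a", "b", "c"]

// \D: remove everything that is not a decimal digit.
$digits = preg_replace('/\D/', '', "tel: 555-0100");        // "5550100"

// \b: count "cat" only where it stands as a whole word.
$wholeWords = preg_match_all('/\bcat\b/', "cat concat cat."); // 2

// \xhh: hexadecimal 41 is the letter "A".
$hexA = preg_match('/^\x41$/', "A");                        // 1

// \Z accepts a trailing newline, \z does not.
$bigZ   = preg_match('/end\Z/', "end\n");                   // 1
$smallZ = preg_match('/end\z/', "end\n");                   // 0

// \A matches only at the very start, even in multiline mode.
$starts = preg_match_all('/\A\w+/m', "one\ntwo");           // 1

print_r($words);
echo $digits, "\n";
```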
