Differences between POSIX and Perl-Compatible Regular Expressions

Source: Internet
Author: User

A regular expression (abbreviated as regexp or regex) is a single string used to describe or match a set of strings that conform to a certain syntax rule. In many text editors and other tools, regular expressions are used to search for and/or replace text that matches a given pattern. Many programming languages support string operations using regular expressions; Perl, for example, has a powerful regular expression engine built in. The concept of regular expressions was first popularized by Unix tools such as sed and grep. (From Wikipedia)

PHP supports two sets of regular expression rules. One is POSIX Extended regular expressions, as defined in the POSIX 1003.2 standard published by the Institute of Electrical and Electronics Engineers (IEEE); in fact, PHP does not fully support this standard. The other is Perl-compatible regular expressions, provided by the PCRE (Perl Compatible Regular Expressions) library, open source software written by Philip Hazel.

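Functions using POSIX-compatible rules include the following (the ereg family, split(), spliti(), and sql_regcase() were deprecated in PHP 5.3.0 and removed in PHP 7.0.0):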
ereg_replace()
ereg()
eregi()
eregi_replace()
split()
spliti()
sql_regcase()
mb_ereg_match()
mb_ereg_replace()
mb_ereg_search_getpos()
mb_ereg_search_getregs()
mb_ereg_search_init()
mb_ereg_search_pos()
mb_ereg_search_regs()
mb_ereg_search_setpos()
mb_ereg_search()
mb_ereg()
mb_eregi_replace()
mb_eregi()
mb_regex_encoding()
mb_regex_set_options()
mb_split()

Functions using Perl-compatible rules include:
preg_grep()
preg_replace_callback()
preg_match_all()
preg_match()
preg_quote()
preg_split()
preg_replace()

Delimiters:

POSIX-compatible regular expressions have no delimiters; the whole of the corresponding function argument is treated as the regular expression.

Perl-compatible regular expressions can use any character that is not a letter, digit, backslash (\), or whitespace as the delimiter. If the delimiter character must appear in the expression itself, escape it with a backslash. You can also use the bracket pairs (), {}, [], and <> as delimiters, in which case the opening bracket starts the pattern and the matching closing bracket ends it.
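
As a minimal sketch of the delimiter rules above (the subject string and variable names are made up for illustration), the following three calls match the same literal text a/b with different delimiters:

```php
<?php
$subject = 'path a/b here';

// "/" as the delimiter: the slash inside the pattern must be escaped.
$withSlash = preg_match('/a\/b/', $subject);

// "#" as the delimiter: the slash needs no escaping.
$withHash = preg_match('#a/b#', $subject);

// Bracket-style delimiters: "{" opens the pattern and "}" closes it.
$withBraces = preg_match('{a/b}', $subject);

// Each call returns 1 because the text "a/b" is found.
```

Choosing a delimiter that does not occur in the pattern, as in the second call, usually keeps the expression easier to read.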

Modifiers:

POSIX-compatible regular expressions have no modifiers.

Possible modifiers in Perl-compatible regular expressions (spaces and newlines in the modifier list are ignored; any other character causes an error):

i (PCRE_CASELESS):
Matching is case-insensitive.

m (PCRE_MULTILINE):
When this modifier is set, in addition to matching at the very start and very end of the subject string, the start-of-line (^) and end-of-line ($) metacharacters also match immediately after and immediately before each newline (\n) inside the string.
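
The effect of the m modifier can be sketched as follows; the two-line sample text is an assumption chosen for illustration:

```php
<?php
$text = "first line\nsecond line";

// Without m, ^ matches only at the start of the whole string.
$plainCount = preg_match_all('/^\w+/', $text, $plain);

// With m, ^ also matches right after the newline.
$multiCount = preg_match_all('/^\w+/m', $text, $multi);

// $plain[0] is ['first'], $multi[0] is ['first', 'second'].
```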

s (PCRE_DOTALL):
If this modifier is set, the dot metacharacter (.) in the pattern matches all characters, including newlines. Without it, newlines are excluded.
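
A short sketch of the difference, using a made-up subject that contains a newline:

```php
<?php
$text = "start\nend";

// Without s, the dot refuses to match the newline between the words.
$withoutS = preg_match('/start.end/', $text);

// With s, the dot matches the newline as well.
$withS = preg_match('/start.end/s', $text);

// $withoutS is 0, $withS is 1.
```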

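x (PCRE_EXTENDED):
If this modifier is set, whitespace characters in the pattern are ignored except when escaped or inside a character class. Characters between an unescaped # outside a character class and the next newline are also ignored, which makes it possible to put comments inside a complicated pattern.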
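
In extended mode PCRE also skips everything from an unescaped # to the end of the line, so a pattern can be laid out over several lines with comments. A sketch with an illustrative year-month pattern (the pattern and subject are assumptions, not from the original text):

```php
<?php
$pattern = '/
    (\d{4})   # four-digit year
    -
    (\d{2})   # two-digit month
/x';

preg_match($pattern, 'Date: 2024-05', $m);

// $m is ['2024-05', '2024', '05']: the layout whitespace and the
// comments were ignored when the pattern was compiled.
```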

e:
If this modifier is set, preg_replace() performs the normal substitution of backreferences in the replacement string, evaluates the result as PHP code, and uses the value of that code to replace the matched string. Only preg_replace() uses this modifier; other PCRE functions ignore it. The modifier was deprecated in PHP 5.5.0 and removed in PHP 7.0.0.
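
On current PHP versions the same effect is usually obtained with preg_replace_callback() rather than the e modifier. A small sketch, with an illustrative callback that doubles every number in the subject:

```php
<?php
$result = preg_replace_callback(
    '/\d+/',
    function ($m) {
        // $m[0] holds the matched digits; return the replacement text.
        return (string) ($m[0] * 2);
    },
    'a1 b22 c3'
);

// $result is 'a2 b44 c6'.
```

Unlike the e modifier, the callback receives the match as data, so no part of the subject string is ever executed as code.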

A (PCRE_ANCHORED):
If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the subject string.
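
A brief sketch of anchoring, with sample subjects chosen for illustration:

```php
<?php
// Unanchored: the digits may be found anywhere in the subject.
$anywhere = preg_match('/\d+/', 'abc123');

// Anchored: the match must begin at the first character.
$anchoredMiss = preg_match('/\d+/A', 'abc123');
$anchoredHit = preg_match('/\d+/A', '123abc');

// $anywhere is 1, $anchoredMiss is 0, $anchoredHit is 1.
```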

D (PCRE_DOLLAR_ENDONLY):
If this modifier is set, the dollar metacharacter ($) in the pattern matches only at the very end of the subject string. Without this modifier, a dollar also matches immediately before the final character if that character is a newline. This modifier is ignored if the m modifier is set.
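
This matters when validating input that may carry a trailing newline. A sketch with made-up subjects:

```php
<?php
$line = "abc\n";

// Default: $ also matches just before a final newline.
$default = preg_match('/^abc$/', $line);

// With D, $ matches only at the very end, so the newline blocks the match.
$dollarEndOnly = preg_match('/^abc$/D', $line);

// Without a trailing newline both forms match.
$clean = preg_match('/^abc$/D', 'abc');

// $default is 1, $dollarEndOnly is 0, $clean is 1.
```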

S:
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up matching. If this modifier is set, this extra analysis is performed. At present, the analysis is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY):
Inverts the greediness of the quantifiers: they are not greedy by default, and become greedy only when followed by "?".
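
The inversion can be sketched with a small HTML-like subject (an assumption for illustration); "#" is used as the delimiter so the slashes need no escaping:

```php
<?php
$html = '<b>one</b><b>two</b>';

// Default: .* is greedy and runs to the last </b>.
preg_match('#<b>.*</b>#', $html, $greedy);

// With U: .* is lazy and stops at the first </b>.
preg_match('#<b>.*</b>#U', $html, $lazy);

// With U: .*? is greedy again.
preg_match('#<b>.*?</b>#U', $html, $inverted);

// $greedy[0] and $inverted[0] are the whole string; $lazy[0] is '<b>one</b>'.
```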

X (PCRE_EXTRA):
If this modifier is set, any backslash in the pattern followed by a letter that has no special meaning causes an error, so that these combinations are reserved for future expansion. By default, a backslash followed by a letter with no special meaning is treated as the letter itself.

u (PCRE_UTF8):
The pattern and subject strings are treated as UTF-8.
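
A sketch of what this changes in practice. The sample word is written with explicit byte escapes so the example does not depend on the encoding of the source file:

```php
<?php
// "héllo": the é is the two UTF-8 bytes C3 A9.
$word = "h\xC3\xA9llo";

// Without u, the dot matches one byte at a time: six bytes.
$sixBytes = preg_match('/^.{6}$/', $word);
$fiveBytes = preg_match('/^.{5}$/', $word);

// With u, the dot matches one character at a time: five characters.
$fiveChars = preg_match('/^.{5}$/u', $word);

// $sixBytes is 1, $fiveBytes is 0, $fiveChars is 1.
```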

Grouping symbols:
In POSIX-compatible and Perl-compatible regular expressions, the grouping symbols have exactly the same function and usage:
[]: encloses a set of characters, any one of which may match.
{}: encloses a repetition count.
(): encloses a logical group, which can also be used for backreferences.
|: "or"; [ab] and a|b are equivalent.
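
The equivalence of a character class and a single-character alternation can be checked with a short sketch (the word list is made up for illustration):

```php
<?php
$words = ['gray', 'grey', 'groy'];

// preg_grep() keeps the entries that match; array_values() reindexes them.
$withClass = array_values(preg_grep('/^gr[ae]y$/', $words));
$withAlternation = array_values(preg_grep('/^gr(a|e)y$/', $words));

// Both results are ['gray', 'grey'].
```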

Metacharacters:

There are two sets of metacharacters: those recognized anywhere in the pattern outside square brackets, and those recognized inside square brackets.

Metacharacters outside square brackets that behave the same in POSIX-compatible and Perl-compatible regular expressions:
\ general escape character with several uses
^ matches the start of the string
$ matches the end of the string
? matches 0 or 1 occurrence of the preceding item
* matches 0 or more occurrences of the preceding item
+ matches 1 or more occurrences of the preceding item

Metacharacters outside "[]" that behave differently in POSIX-compatible and Perl-compatible regular expressions:
. in Perl-compatible regular expressions, matches any character except a newline
. in POSIX-compatible regular expressions, matches any character

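Metacharacters inside square brackets that behave the same in POSIX-compatible and Perl-compatible regular expressions:
\ general escape character with several uses (strictly speaking, the POSIX standard treats a backslash inside a bracket expression as a literal character)
^ negates the character set, but only when it is the first character
- specifies a range of characters in ASCII order. Study the ASCII table carefully and you will find that [W-c] is equivalent to [WXYZ\[\\\]^_`abc], that is, the characters W X Y Z [ \ ] ^ _ ` a b c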
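
The contents of the [W-c] range can be listed by testing every ASCII character against it. A minimal sketch using preg_match() (the loop and variable names are illustrative):

```php
<?php
$members = '';
for ($code = 0; $code < 128; $code++) {
    $char = chr($code);
    // Keep every single character that the range accepts.
    if (preg_match('/^[W-c]$/', $char) === 1) {
        $members .= $char;
    }
}

// $members is the 13 characters from W (87) to c (99):
// W X Y Z [ \ ] ^ _ ` a b c
```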

Metacharacters inside square brackets that behave differently in POSIX-compatible and Perl-compatible regular expressions:
- In POSIX-compatible regular expressions, [a-c-e] raises an error.
- In Perl-compatible regular expressions, [a-c-e] is accepted. The hyphen that immediately follows a completed range is taken literally, so the class matches a, b, c, a hyphen, or e (but not d).
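
In PCRE a hyphen that directly follows a completed range is treated as a literal hyphen, so [a-c-e] is the range a-c plus the two single characters "-" and "e". A quick sketch that checks which characters the class accepts:

```php
<?php
$accepted = [];
foreach (['a', 'b', 'c', 'd', 'e', '-'] as $char) {
    if (preg_match('/^[a-c-e]$/', $char) === 1) {
        $accepted[] = $char;
    }
}

// $accepted is ['a', 'b', 'c', 'e', '-']: the letter d is not in the class.
```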

Repetition counts:
POSIX-compatible and Perl-compatible regular expressions are exactly the same in terms of repetition counts:
{2}: matches the preceding item exactly twice
{2,}: matches the preceding item two or more times. By default, the match is greedy (as many as possible).
{2,4}: matches the preceding item at least twice and at most four times
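
The three forms can be compared side by side with a sketch like the following; the list of subjects is an assumption for illustration:

```php
<?php
$subjects = ['a', 'aa', 'aaa', 'aaaa', 'aaaaa'];
$patterns = ['/^a{2}$/', '/^a{2,}$/', '/^a{2,4}$/'];

$results = [];
foreach ($patterns as $pattern) {
    // preg_grep() keeps the subjects that match the pattern.
    $results[$pattern] = array_values(preg_grep($pattern, $subjects));
}

// Greedy by default: {2,} consumes as many characters as it can.
preg_match('/a{2,}/', 'aaaaa', $greedy);

// $results['/^a{2}$/']   is ['aa']
// $results['/^a{2,}$/']  is ['aa', 'aaa', 'aaaa', 'aaaaa']
// $results['/^a{2,4}$/'] is ['aa', 'aaa', 'aaaa']
// $greedy[0] is 'aaaaa'
```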

Logical groups:
The region enclosed in () is a logical group. Its main purpose is to express the logical order of some characters; its other use is for references (the text matched by the group can be referred to later). The latter is particularly useful:
<?php
$str = "http://www.163.com/";
// POSIX-compatible regular expression
// (PHP < 7.0 only: ereg_replace() was removed in PHP 7.0):
echo ereg_replace("(.+)", "<a href=\\1>\\1</a>", $str);
// Perl-compatible regular expression:
echo preg_replace("/(.+)/", "<a href=$1>$1</a>", $str);
// Displays the link twice
?>

When referencing, parentheses can be nested, and groups are numbered in the order in which their opening parenthesis "(" appears.
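
The numbering of nested groups can be seen in a short sketch; the date-like pattern and subject are illustrative only:

```php
<?php
// Opening parentheses, left to right:
//   group 1 = outer group, group 2 = year, group 3 = month
preg_match('/((\d{4})-(\d{2}))T/', '2024-05T', $m);

// $m[0] is '2024-05T' (the whole match)
// $m[1] is '2024-05', $m[2] is '2024', $m[3] is '05'
```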

Character types:
POSIX-compatible regular expressions (these names are used inside a bracket expression, for example [[:upper:]]):
[:upper:]: matches all uppercase letters
[:lower:]: matches all lowercase letters
[:alpha:]: matches all letters
[:alnum:]: matches all letters and digits
[:digit:]: matches all digits
[:xdigit:]: matches all hexadecimal digits, equivalent to [0-9A-Fa-f]
[:punct:]: matches all punctuation characters, such as . , " ' ? ! ; :
[:blank:]: matches the space and the tab, equivalent to [ \t]
[:space:]: matches all whitespace characters, equivalent to [ \t\n\r\f\v]
[:cntrl:]: matches all control characters (ASCII 0 to 31, and 127)
[:graph:]: matches all visible characters, that is, printable characters other than the space; in ASCII, [\x21-\x7E]
[:print:]: matches all printable characters including the space; in ASCII, [\x20-\x7E]
[.c.]: collating symbol (a multi-character collating element treated as a single character)
[=c=]: equivalence class (all characters equivalent to c in the current locale)
[[:<:]]: matches the start of a word
[[:>:]]: matches the end of a word
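
The named classes such as [:upper:] and [:digit:] are also understood by PCRE when they appear inside square brackets, so they can be tried out with the preg_ functions. A sketch with made-up subjects:

```php
<?php
$text = 'Hello World 42';

// The class name goes inside an enclosing bracket expression.
preg_match_all('/[[:upper:]]/', $text, $upper);
preg_match_all('/[[:digit:]]+/', $text, $digits);

$isHex = preg_match('/^[[:xdigit:]]+$/', 'DEADbeef42');
$notHex = preg_match('/^[[:xdigit:]]+$/', 'xyz');

// $upper[0] is ['H', 'W'], $digits[0] is ['42'],
// $isHex is 1, $notHex is 0.
```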

Perl-compatible regular expressions (here you can see how powerful Perl regular expressions are):
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f form feed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\xhh the character with hexadecimal code hh
\ddd the character with octal code ddd, or a backreference
\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any character that is not whitespace
\w any "word" character
\W any "non-word" character
\b word boundary
\B not a word boundary
\A start of the subject (independent of multiline mode)
\Z end of the subject, or a newline at the end (independent of multiline mode)
\z end of the subject (independent of multiline mode)
\G first matching position in the subject
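
A few of these escapes in action, as a sketch with illustrative subjects:

```php
<?php
// \b: "cat" as a whole word occurs twice; the "cat" in "concat" is skipped.
$wholeWords = preg_match_all('/\bcat\b/', 'cat concat cat.', $m);

// \s: split on any run of whitespace (space, tab, newline).
$parts = preg_split('/\s+/', "a b\tc\nd");

// \Z allows a final newline after the match, \z does not.
$upperZ = preg_match('/end\Z/', "the end\n");
$lowerZ = preg_match('/end\z/', "the end\n");

// \x41 is the letter A, \d is any decimal digit.
$hexAndDigit = preg_match('/^\x41\d$/', 'A7');

// $wholeWords is 2, $parts is ['a', 'b', 'c', 'd'],
// $upperZ is 1, $lowerZ is 0, $hexAndDigit is 1.
```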
