I. Concept
The pattern syntax is similar to Perl's. The expression must be enclosed in delimiters, such as forward slashes (/).
The delimiter can be any non-alphanumeric, non-whitespace ASCII character other than the backslash (\) and the null byte.
If the delimiter appears inside the expression, it must be escaped with a backslash.
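The delimiter rule can be illustrated with a minimal sketch. The subject string and variable names here are made up for illustration; the same pattern is written once with / as the delimiter (so the slashes inside must be escaped) and once with # (so they need no escaping).

```php
<?php
// Hypothetical subject string, used only for this illustration.
$url = 'see http://example.com/ for details';

// "/" is the delimiter, so each "/" inside the pattern is escaped.
$withSlash = preg_match('/http:\/\//', $url);

// "#" is the delimiter, so "/" inside the pattern needs no escaping.
$withHash = preg_match('#http://#', $url);

var_dump($withSlash, $withHash);
```

Choosing a delimiter that does not occur in the pattern usually keeps the pattern easier to read.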
II. Composition
Metacharacters
The basic composition of a regular expression:
/atoms and metacharacters/pattern modifiers (the slash / represents the delimiter)
The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded with metacharacters, which do not stand for themselves but are interpreted in a special way.
There are two sets of metacharacters: those recognized outside square brackets and those recognized inside them.
1. Metacharacters outside square brackets
Metacharacter | Description
\ | General escape character
^ | Asserts the start of the target (or the start of a line in multiline mode)
$ | Asserts the end of the target (or the end of a line in multiline mode)
. | Matches any character except a line break (by default)
[ ] | Start and end of a character class definition
| | Starts an alternative branch
( ) | Start and end of a subgroup
? | As a quantifier, matches 0 or 1 times; placed after another quantifier, changes its greediness
* | Quantifier: 0 or more matches
+ | Quantifier: 1 or more matches
{ } | Start and end of a custom quantifier
2. Metacharacters inside square brackets (the part of a pattern in square brackets is called a "character class")
Metacharacter | Description
\ | General escape character
^ | Negates the character class, but only when it is the first character
- | Marks a character range
Examples of metacharacter usage
1. Escaping (backslash)
When \ is followed by a non-alphanumeric character, it removes any special meaning that character may have. This applies both inside and outside character classes.
To match a non-alphanumeric character literally, it is always safe to precede it with a backslash so that it stands for itself.
To match "*", which has a special meaning, write "\*" to remove that meaning.
To match ".", use "\.".
To match "\", use "\\".
But be aware that:
The backslash also has a special meaning in both single-quoted and double-quoted PHP strings, so to match a backslash the pattern must be written as "\\\\" or '\\\\'.
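A short sketch of the double escaping described above. The sample string is an assumption chosen for illustration; the four backslashes in the PHP source become two in the pattern that PCRE sees, which in turn match one literal backslash.

```php
<?php
// Single quotes: \t is not an escape here, so this is a backslash followed by "t".
$subject = 'C:\temp';

// PHP reduces '\\\\' to \\ and PCRE reads \\ as one literal backslash.
$found = preg_match('/\\\\/', $subject, $m);

var_dump($found, $m[0]);
```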
2. The second use of the backslash is to encode nonprinting characters in a visible way. Apart from the binary zero, which terminates a pattern, there is no restriction on nonprinting characters appearing literally in a pattern, but when a pattern is prepared in a text editor it is usually easier to use one of the following escape sequences than the binary character itself.
Symbol | Description
\a | Bell character (hex 07)
\cx | "Control-x", where x is any character
\e | Escape (hex 1B)
\f | Form feed (hex 0C)
\n | Line feed (hex 0A)
\p{xx} (lowercase p) | A character with the xx property
\P{xx} (uppercase P) | A character without the xx property
\r | Carriage return (hex 0D)
\t | Horizontal tab (hex 09)
\xhh | The character with hexadecimal code hh
\ddd | The character with octal code ddd, or a back reference
Sequence | Interpretation
\040 | Another way of writing a space
\40 | Also a space, provided there are fewer than 40 preceding capturing subgroups
\7 | Always a back reference
\11 | May be a back reference or a tab
\011 | Always a tab
\0113 | A tab followed by the character "3" (at most three octal digits are read)
\113 | The character with octal code 113
\377 | Octal 377 is decimal 255, so this is a byte consisting entirely of 1 bits
\81 | Either a back reference, or a binary zero followed by the two characters "8" and "1" (because 8 is not a valid octal digit)
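A few rows of the table above can be checked with a small sketch. The patterns are written in single-quoted strings so that PHP passes the backslash sequences to PCRE unchanged; the variable names are illustrative only.

```php
<?php
// \011 is always read as an octal code: the tab character.
$tab = preg_match('/^\011$/', "\t");

// \0113 is a tab followed by the literal character "3".
$tab3 = preg_match('/^\0113$/', "\t3");

// A single digit after the backslash is a back reference to subgroup 1.
$back = preg_match('/^(a)\1$/', 'aa');

var_dump($tab, $tab3, $back);
```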
3. The third use of the backslash is to specify generic character types
Symbol | Description
\d | Any decimal digit
\D | Any character that is not a decimal digit
\h | Any horizontal whitespace character (since PHP 5.2.4)
\H | Any character that is not horizontal whitespace (since PHP 5.2.4)
\s | Any whitespace character
\S | Any character that is not whitespace
\v | Any vertical whitespace character (since PHP 5.2.4)
\V | Any character that is not vertical whitespace (since PHP 5.2.4)
\w | Any word character
\W | Any non-word character
Each pair of escape sequences above partitions the complete character set into two disjoint parts: any given character matches exactly one member of the pair.
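The partition property can be seen in a minimal sketch using \d and \D. The sample string is an assumption; every character of it should land in exactly one of the two result lists.

```php
<?php
// Hypothetical sample string for illustration.
$s = 'a1 b2';

preg_match_all('/\d/', $s, $digits);     // characters that are decimal digits
preg_match_all('/\D/', $s, $nonDigits);  // all remaining characters

print_r($digits[0]);
print_r($nonDigits[0]);
```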
4. The fourth use of the backslash is for simple assertions
Symbol | Description
\b | Word boundary (note that inside a character class, \b means backspace)
\B | Not a word boundary
\A | Start of the target (independent of multiline mode)
\Z | End of the target, or a line break at the very end of it (independent of multiline mode)
\z | End of the target (independent of multiline mode)
\G | First matching position in the target
The \A, \Z, and \z assertions differ from the traditional ^ and $
in that they always match only at the very start and end of the target string, regardless of pattern modifiers.
The difference between \Z and \z is that when the string ends with a newline, \Z can match just before that newline as well as at the very end, whereas \z matches only at the very end of the string.
Code 1
$p = '#\A[a-z]{3}#m';
$str = "abc\ndef\nhijkl";
preg_match_all($p, $str, $all);
print_r($all);
With or without the m pattern modifier, the result is the same:
only abc is matched.
Code 2:
$p = '#^[a-z]{3}#m';
$str = "abc\ndef\nhijkl";
preg_match_all($p, $str, $all);
print_r($all);
Without the m modifier, only abc is matched.
With it, abc, def, and hij are matched.
So \A is indeed unaffected by pattern modifiers.
Comparing \Z and \z:
Code 3
$p = '#[a-z]\Z#';
$str = "a\n";
preg_match_all($p, $str, $all);
print_r($all);
With \Z in the pattern, a is matched.
When the pattern is changed to use \z, it matches only at the very end of the string and does not skip the trailing newline, so nothing is matched.
The \G assertion is true only when the current matching position is at the start point of the match, as specified by the $offset parameter of preg_match().
When $offset is not 0, \G differs from \A.
See the PHP manual.
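The difference between \G and \A with a non-zero offset can be shown with a small sketch. The subject string and variable names are made up for illustration; the fifth argument of preg_match() is the offset.

```php
<?php
// Hypothetical subject; matching starts at byte offset 2.
$subject = 'xxabc';

// \G anchors at the offset where matching starts, so this succeeds.
$withG = preg_match('/\Gabc/', $subject, $m1, 0, 2);

// \A anchors at the real start of the subject, so this fails.
$withA = preg_match('/\Aabc/', $subject, $m2, 0, 2);

var_dump($withG, $withA);
```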
Starting with PHP 4.3.3, \Q and \E can be used to ignore regular expression metacharacters in the pattern.
That is, characters with special meanings placed between \Q and \E are treated literally.
For example, Code 4
$p = '#\w+\Q.$.\E$#';
$str = 'a.$.';
preg_match_all($p, $str, $all);
print_r($all);
This matches a.$.
Starting from PHP 5.2.4, \K can be used to reset the start of the match. For example, foo\Kbar matches "foobar", but the reported match is "bar". However, \K does not interfere with the contents of capturing subgroups: (foo)\Kbar matches "foobar", and the first subgroup still contains "foo". \K has the same effect whether it is placed inside or outside a subgroup.
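A minimal sketch of the \K behaviour just described, using the same foo and bar example. The variable names are illustrative.

```php
<?php
// \K discards "foo" from the reported match.
preg_match('/foo\Kbar/', 'foobar', $plain);

// The capturing subgroup before \K still keeps what it matched.
preg_match('/(foo)\Kbar/', 'foobar', $grouped);

print_r($plain);
print_r($grouped);
```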
\p{Lu} matches uppercase letters.
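Period
Outside a character class, a period matches any single character except (by default) a newline.
\C can be used to match a single byte. This is meaningful in UTF-8 mode, where a period matches a whole character, which may consist of several bytes.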
Character classes
Name | Description
alnum | Letters and digits
alpha | Letters
ascii | ASCII characters with codes 0-127
blank | Space and horizontal tab
cntrl | Control characters
digit | Decimal digits (same as \d)
graph | Printing characters, excluding space
lower | Lowercase letters
print | Printing characters, including space
punct | Printing characters, excluding letters and digits
space | Whitespace characters (\s plus the vertical tab)
upper | Uppercase letters
word | Word characters (same as \w)
xdigit | Hexadecimal digits
For example, '#[[:upper:]]#' matches uppercase letters,
and '#[[:alpha:]]#' matches letters.
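The two class names above can be tried in a short sketch. The sample strings are assumptions for illustration; note that the class name sits inside an outer pair of square brackets.

```php
<?php
// Collect the uppercase letters of a made-up sample string.
preg_match_all('#[[:upper:]]#', 'aBcD', $upper);

// Check whether a string consists of letters only.
$allLetters  = preg_match('#^[[:alpha:]]+$#', 'abc');
$withADigit  = preg_match('#^[[:alpha:]]+$#', 'abc1');

print_r($upper[0]);
var_dump($allLetters, $withADigit);
```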
Alternation
The vertical bar is used to separate alternatives in a pattern. For example, the pattern gilbert|sullivan matches "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (it matches the empty string). The matching process tries each alternative in turn from left to right and uses the first one that succeeds. If the alternatives are inside a subgroup (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subgroup.
Code 5
$p = '#p(hp|ython|erl)#';
$str = "php python perl";
preg_match_all($p, $str, $all);
print_r($all);
Subgroups (subpatterns)
Subgroups are delimited by parentheses and can be nested. They serve two main purposes:
1. Localizing a set of alternatives. For example,
the pattern p(hp|ython|erl) matches one of php, python, or perl.
2. Setting up the subgroup as a capturing subgroup.
After the pattern matches, subgroups are numbered (starting at 1) by the order of their opening parentheses from left to right, and these numbers can be used to retrieve what each subgroup captured.
Code 6
$p = '#(\d)#';
$str = "abc123";
$r = preg_replace($p, '\1', $str);
echo $r;
But sometimes you want to group without capturing.
If the opening parenthesis of a subgroup is immediately followed by "?:", the subgroup does not capture, and it is not counted when numbering any subsequent capturing subgroups.
Code 7: matching digits and changing them to red
$p = '#.*(?:\d).*([a-z])#U';
$str = "3df5g";
$r = preg_replace($p, '\1', $str);
echo $r;
If the digit subgroup is written without ?:,
\1 refers to the matched digit; with ?: added, that group no longer captures, and \1 refers to the captured letter.
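The effect of ?: on subgroup numbering can be sketched with two shorter patterns. These patterns and the sample string are simplified assumptions, not the listing above.

```php
<?php
// Both subgroups capture: subgroup 1 is the digit, subgroup 2 the letter.
preg_match('/(\d)([a-z])/', '3d', $capturing);

// The digit group is non-capturing, so the letter becomes subgroup 1.
preg_match('/(?:\d)([a-z])/', '3d', $nonCapturing);

print_r($capturing);
print_r($nonCapturing);
```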
As a convenient shorthand, if options need to be set at the start of a non-capturing subgroup, the option letters may appear between the ? and the :, for example:
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
Here i is the pattern modifier for ignoring case.
These two forms are actually the same pattern. Because alternatives are tried from left to right and options are not reset until the end of the subgroup, an option set in one alternative carries over to the alternatives that follow, so the patterns above match "SUNDAY" as well as "Saturday".
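A small sketch that checks both spellings against the two strings mentioned above. The anchors ^ and $ and the helper variable names are additions for illustration, so that only whole-string matches count.

```php
<?php
$short = '/^(?i:saturday|sunday)$/';     // option letters between ? and :
$long  = '/^(?:(?i)saturday|sunday)$/';  // option set inside the first alternative

$results = [];
foreach (['SUNDAY', 'Saturday'] as $word) {
    // Record whether each form of the pattern matches the word.
    $results[$word] = [preg_match($short, $word), preg_match($long, $word)];
}

// The option does not leak outside its subgroup: "b" stays case-sensitive.
$leak = preg_match('/^(?i:a)b$/', 'AB');

print_r($results);
var_dump($leak);
```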
Since PHP 4.3.3, a subgroup can be named with the syntax (?P<name>pattern). Such a subgroup then appears in the match results both under its name and under its number. PHP 5.2.2 added two alternative naming syntaxes: (?<name>pattern) and (?'name'pattern).
Code 8:
$p = "#.*(?<alpha>[a-z]{3})(?'digit'\d{3}).*#";
$str = "abc123111def111g";
preg_match_all($p, $str, $arr);
print_r($arr);
Results:
Sometimes a regular expression needs several alternative capturing subgroups. To let such subgroups share a back reference number, the (?| syntax allows duplicate numbers. Consider the following regular expression matching Sunday:
(?:(Sat)ur|(Sun))day
Here Sun is stored in back reference 2, while back reference 1 is empty. When Saturday is matched instead, Sat is stored in back reference 1 and back reference 2 does not exist. Changing the pattern to use (?| fixes this problem.
Code 9:
$p = '#(?:(Sat)ur|(Sun))day#';
$str = "Sunday Saturday";
preg_match_all($p, $str, $arr);
print_r($arr);
Results:
(?|(Sat)ur|(Sun))day
With this pattern, both Sun and Sat are stored in back reference 1.
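The two numbering schemes can be put side by side in a minimal sketch. The variable names are illustrative; with preg_match_all() in its default ordering, a subgroup that did not take part in a match shows up as an empty string.

```php
<?php
$str = 'Sunday Saturday';

// Ordinary non-capturing group: Sat is subgroup 1, Sun is subgroup 2.
preg_match_all('#(?:(Sat)ur|(Sun))day#', $str, $plain);

// Branch reset group: both alternatives store their capture as subgroup 1.
preg_match_all('#(?|(Sat)ur|(Sun))day#', $str, $reset);

print_r($plain);
print_r($reset);
```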
Before trying this pattern, look at how subgroups are numbered in the code below.
Code 10-1
$p = '#(a|b)\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);
Result:
Array
(
    [0] => Array ( [0] => b2 [1] => a1 )
    [1] => Array ( [0] => b [1] => a )
)
Code 10-2
$p = '#((a)|b)\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);
Result:
Array
(
    [0] => Array ( [0] => b2 [1] => a1 )
    [1] => Array ( [0] => b [1] => a )
    [2] => Array ( [0] => [1] => a )
)
For Code 10-2: the first full match is b2, so the outer parentheses (the first subgroup) captured b, and the second subgroup is empty because (a) did not take part in the match. The second full match is a1; its first subgroup is a, and its second subgroup is also a, because (a) is contained in the outer parentheses of ((a)|b).
Code 10-3:
$p = '#((a)|(b))\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);
Result:
Array
(
    [0] => Array ( [0] => b2 [1] => a1 )
    [1] => Array ( [0] => b [1] => a )
    [2] => Array ( [0] => [1] => a )
    [3] => Array ( [0] => b [1] => )
)
Code 10-4:
$p = '#(?:(a)|(b))\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);
Result:
Array
(
    [0] => Array ( [0] => b2 [1] => a1 )
    [1] => Array ( [0] => [1] => a )
    [2] => Array ( [0] => b [1] => )
)
Code 10:
$p = '#(?|(Sat)ur|(Sun))day#';
$str = "Sunday Saturday";
preg_match_all($p, $str, $arr);
print_r($arr);
Results
Back references
If the number following the backslash is less than 10, it is always taken as a back reference. The pattern must then contain at least that many capturing subgroups.
A back reference matches whatever the referenced capturing subgroup actually captured in the target string, rather than anything that the subgroup's pattern could match.
(sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility".
Code 11
$p = '#(sens|respons)e and \1ibility#';
$str = "sense and sensibility response and responsibility sense and responsibility";
preg_match_all($p, $str, $arr);
print_r($arr);
Results
ab(?i)c matches abc and abC
(?i) + atom
The atoms following (?i) are matched case-insensitively.
If case-sensitive matching is in force at the point of the back reference, letter case matters:
((?i)abc)\s+\1
matches abc abc,
ABC ABC,
and Abc Abc,
that is, any pair in which both occurrences have the same case,
but it does not match ABC abc and the like.
The point to keep in mind is that a back reference expects exactly the same text as its subgroup captured.
Code 12:
$p = '#((?i)abc)\s+\1#';
$str = "abc abc | ABC ABC | Abc Abc | abc ABC";
preg_match_all($p, $str, $arr);
print_r($arr);
Results
More than one back reference may refer to the same subgroup. If a subgroup was not actually used in a particular match, any back reference to it always fails.
First look at Code 13:
$p = '#(a|(bc))#';
$str = "abc";
preg_match_all($p, $str, $arr);
print_r($arr);
There are 2 full matches:
[0][0] is the first full match
[1][0] is the first subgroup of the first match
[2][0] is the second subgroup of the first match
[0][1] is the second full match
[1][1] is the first subgroup of the second match
[2][1] is the second subgroup of the second match
From this we can see that, for the pattern
(a|(bc))
the outer parentheses are the first subgroup
and the inner parentheses are the second subgroup.
Now consider Code 14:
$p = '#(a|(bc))\2#';
$str = "aabcbc";
preg_match_all($p, $str, $arr);
print_r($arr);
Results
When a is matched first, the second subgroup has not been set,
so there is nothing for \2 to refer to.
For a full match to succeed, the second subgroup must take part, which means the inner parentheses must match, so bc must be matched and then repeated by \2.
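The reasoning above can be checked with a short sketch that reuses the pattern and subject from Code 14. Only the variable name $m is new.

```php
<?php
// At each "a", subgroup 2 is unset, so \2 fails and matching moves on.
// Only "bc" followed by a second "bc" can satisfy the whole pattern.
preg_match_all('#(a|(bc))\2#', 'aabcbc', $m);

print_r($m);
```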
Because there may be up to 99 back references, all digits following the backslash are taken as part of a potential back reference number. If the pattern continues with a digit character after the back reference, some delimiter must be used to terminate the reference.
For example, Code 15:
$p = '#([a-z]{3})\1 5#x';
$str = "aaaaaa5";
preg_match_all($p, $str, $arr);
print_r($arr);
If the back reference \1 in the code above were immediately followed by the digit, it would be read as back reference 15.
So we insert a space after it and use the x pattern modifier, which makes whitespace in the pattern be ignored, so that the match succeeds.
If a back reference appears inside the subgroup it refers to, it fails the first time the subgroup is used.
(a\1) never matches anything.
However, such a reference can be useful inside a repeated subgroup.
(a|b\1) matches "a" but not "b" (because the subgroup contains alternatives, one alternative can complete a match without the back reference, and only after that does the back reference have content to refer to).
Code 16:
$p = '#(a|b\1)+#';
$str = "abba";
preg_match_all($p, $str, $arr);
print_r($arr);
Results
In each iteration of the subgroup, the back reference matches the string that the subgroup captured in the previous iteration. For this to work, the pattern must be such that the first iteration does not need to match the back reference. This can be achieved with alternation, as in the example above, or by giving the back reference a quantifier with a minimum of zero.
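The iteration rule can be traced with a small sketch. The anchors and the test strings are assumptions added for illustration: in "aba" the second iteration matches b followed by the previous capture a, and in "ababba" the third iteration matches b followed by the previous capture ba.

```php
<?php
$p = '#^(a|b\1)+$#';

$aba    = preg_match($p, 'aba');     // a, then b + "a"
$ababba = preg_match($p, 'ababba');  // a, then b + "a", then b + "ba"
$b      = preg_match($p, 'b');       // \1 is unset in the first iteration
$abb    = preg_match($p, 'abb');     // b must be followed by "a", not "b"

var_dump($aba, $ababba, $b, $abb);
```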
Since PHP 5.2.2, the \g escape sequence can be used for absolute and relative references to subgroups. It must be followed by an unsigned or negative number, optionally enclosed in braces. The sequences \1, \g1, and \g{1} are synonymous. This form removes the ambiguity that arises when a back reference written with a backslash is immediately followed by a digit. It helps distinguish a back reference from an octal character code, and it makes it clear when a back reference is followed by a literal digit, as in \g{2}1.
Code 17:
$p = '#([a-z]{2})\g{1}5#';
$str = "abab5";
preg_match_all($p, $str, $arr);
print_r($arr);
Compare this with Code 15.
When the \g escape sequence is followed by a negative number, it is a relative back reference. For example, (foo)(bar)\g{-1} matches the string "foobarbar", and (foo)(bar)\g{-2} matches "foobarfoo". In a long pattern this is a useful alternative to keeping track of subgroup numbers in order to reference a particular earlier subgroup.
Code 18
$p = '#(foo)(bar)\g{-1}#';
$p1 = '#(foo)(bar)\g{-2}#';
$str = "foobarbar";
$str1 = "foobarfoo";
preg_match_all($p, $str, $arr);
preg_match_all($p1, $str1, $arr1);
print_r($arr);
print_r($arr1);
Results:
A back reference can also use a subgroup name, written as (?P=name); since PHP 5.2.2, \k<name> or \k'name' can be used as well. Support for \k{name} and \g{name} was added in PHP 5.2.4.
Code 19:
$p = "#(?'alpha'[a-z]{2})(?<digit>[0-9]{3})\k<digit>(?P=alpha)#";
$str = "aa123123aa";
preg_match_all($p, $str, $arr);
print_r($arr);
Results:
This can be compared with Code 8.
Note the following details:
In the definition, alpha is wrapped in quotation marks; in the back reference (?P=alpha) it is not.
The P is uppercase.
To be continued...