A summary of PHP regular expressions (I)

Source: Internet
Author: User
Tags: character classes, control characters, alphanumeric characters
I. Concept

The pattern syntax is similar to Perl's. The expression must be enclosed in delimiters, such as forward slashes (/).

The delimiter can be any non-alphanumeric, non-whitespace ASCII character except the backslash (\) and the null byte.

If the delimiter appears inside the expression, it must be escaped with a backslash.
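As a small illustration of the delimiter rules, the sketch below writes the same pattern with three different delimiters. The subject URL and the variable names are made up for the example.

```php
<?php
// Hypothetical subject string, used only for illustration.
$subject = 'http://example.com/path';

// With "/" as the delimiter, every literal slash in the pattern must be escaped.
$withSlash = preg_match('/http:\/\//', $subject);

// With "#" or "~" as the delimiter, the slashes need no escaping.
$withHash  = preg_match('#http://#', $subject);
$withTilde = preg_match('~http://~', $subject);

var_dump($withSlash, $withHash, $withTilde); // each is int(1)
```

Choosing a delimiter that does not occur in the pattern usually keeps the pattern easier to read.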

II. Composition

The basic composition of a regular expression:

/atoms and metacharacters/pattern modifiers

where / represents the delimiter.

The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by metacharacters, which do not stand for themselves but are interpreted in a special way.

Metacharacters

There are two kinds of metacharacters: those recognized outside square brackets and those recognized inside square brackets.

1. Metacharacters outside square brackets

Metacharacter   Description
\               Generally used to escape characters
^               Assert the start of the target (or the start of a line in multiline mode)
$               Assert the end of the target (or the end of a line in multiline mode)
.               Match any character except a line break (by default)
[ ]             Start and end a character class definition
|               Start an alternative branch
( )             Start and end a subgroup
?               As a quantifier, match 0 or 1 times; placed after another quantifier, change its greediness
*               Quantifier: 0 or more matches
+               Quantifier: 1 or more matches
{ }             Start and end a custom quantifier

2. Metacharacters inside square brackets (the part of a pattern in square brackets is called a "character class")

Metacharacter   Description
\               Escape character
^               Negate the character class, but only when it is the first character
-               Indicate a character range

Examples of metacharacter usage

1. Escaping (backslash)

When \ is followed by a non-alphanumeric character, it cancels any special meaning that character may have. This applies both inside and outside character classes.

To match a non-alphanumeric character literally, it is always safe to precede it with a backslash so that it represents itself.

To match "*", which has a special meaning, write "\*" to remove that meaning.

To match ".", use "\.".

To match "\", use "\\".

But be aware that:

The backslash also has a special meaning in both single-quoted and double-quoted PHP strings, so to match a backslash the pattern must be written as "\\\\" or '\\\\' in PHP code.
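The double escaping can be checked with a short sketch. The path string and variable names below are only illustrative.

```php
<?php
// In PHP source, '\\' is one backslash, so $path is the 7-character string C:\temp
$path = 'C:\\temp';

// Both literals below hand the regex engine the two characters \\,
// which the engine reads as "one literal backslash".
$single = preg_match('/\\\\/', $path);
$double = preg_match("/\\\\/", $path);

var_dump(strlen($path), $single, $double); // int(7), int(1), int(1)
```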

2. The second use of the backslash is to encode non-printing characters in a visible form. Apart from the binary zero, which terminates a pattern, there is no strict restriction on non-printing characters appearing literally in a pattern, but when a pattern is prepared in a text editor it is usually easier to use one of the following escape sequences than the binary character itself.

Symbol      Description
\a          Bell character (hex 07)
\cx         "Control-x", where x is any character
\e          Escape (hex 1B)
\f          Form feed (hex 0C)
\n          Newline (hex 0A)
\p{xx}      A character with the xx property (lowercase p)
\P{xx}      A character without the xx property (uppercase P)
\r          Carriage return (hex 0D)
\t          Horizontal tab (hex 09)
\xhh        The character with hexadecimal code hh
\ddd        The character with octal code ddd, or a back reference

Examples of how \ddd is interpreted:

\040        Another way of writing a space
\40         Also a space, provided there are fewer than 40 capturing subgroups
\7          Always a back reference
\11         May be a back reference, or a tab
\011        Always a tab
\0113       A tab followed by the character 3 (at most three octal digits are read)
\113        The character with octal code 113
\377        Octal 377 is decimal 255, so this is a byte with all bits set to 1
\81         Either a back reference, or a binary zero followed by the two characters 8 and 1 (8 is not a valid octal digit)
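A few of these escapes can be tried directly. This is a minimal sketch with arbitrary subject strings; the patterns are single-quoted so that PHP passes the escapes to the regex engine unchanged.

```php
<?php
$hexA  = preg_match('/\x41/', 'A');       // hex 41 is the letter A
$tab   = preg_match('/a\011b/', "a\tb");  // octal 011 is a tab
$space = preg_match('/a\040b/', 'a b');   // octal 040 is a space
$ctrl  = preg_match('/\cI/', "\t");       // control-I is hex 09, also a tab
$esc   = preg_match('/\e/', "\x1B");      // escape character, hex 1B

var_dump($hexA, $tab, $space, $ctrl, $esc); // each is int(1)
```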

3. The third use of the backslash is to specify generic character types

Symbol      Description
\d          Any decimal digit
\D          Any character that is not a decimal digit
\h          Any horizontal whitespace character (since PHP 5.2.4)
\H          Any character that is not horizontal whitespace (since PHP 5.2.4)
\s          Any whitespace character
\S          Any non-whitespace character
\v          Any vertical whitespace character (since PHP 5.2.4)
\V          Any character that is not vertical whitespace (since PHP 5.2.4)
\w          Any word character
\W          Any non-word character

Each of the above pairs of escape sequences represents two disjoint parts of the full character set, and any character must match one of them and must not match the other.
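The way each pair splits the character set can be seen with \d and \D. The sample text is arbitrary.

```php
<?php
$text = 'a1b22c';

preg_match_all('/\d/', $text, $digits);    // every decimal digit
preg_match_all('/\D/', $text, $nonDigits); // every character that is not a digit

print_r($digits[0]);    // 1, 2, 2
print_r($nonDigits[0]); // a, b, c

// Each character falls into exactly one of the two sets.
var_dump(count($digits[0]) + count($nonDigits[0]) === strlen($text)); // bool(true)
```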

4. The fourth use of the backslash is for simple assertions

Symbol      Description
\b          Word boundary (note that inside a character class \b means backspace)
\B          Non-word boundary
\A          Start of the target (independent of multiline mode)
\Z          End of the target, or a line break at the end (independent of multiline mode)
\z          End of the target (independent of multiline mode)
\G          First matching position in the target

The \A, \Z, and \z assertions differ from the traditional ^ and $ in that they always match at the very start and end of the target string, regardless of pattern modifiers.

The difference between \Z and \z is that \Z also matches just before a newline that is the last character of the string, whereas \z matches only at the very end of the string.

Code 1

$p = '#\A[a-z]{3}#m';
$str = "abc\ndefg\nhijkl";
preg_match_all($p, $str, $all);
print_r($all);

The result is the same with or without the pattern modifier m: only abc is matched.

Code 2:

$p = '#^[a-z]{3}#m';
$str = "abc\ndefg\nhijkl";
preg_match_all($p, $str, $all);
print_r($all);

Without m, only abc is matched.

With m, it matches abc, def, and hij.

So \A is indeed not affected by pattern modifiers.

Next, compare \Z with \z.

Code 3

$p = '#[a-z]\Z#';
$str = "a\n";
preg_match_all($p, $str, $all);
print_r($all);

With \Z in the pattern, a is matched.

With \Z changed to \z, nothing is matched, because \z matches only at the very end of the string and does not allow for the trailing newline.

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the $offset parameter of preg_match().

When the value of $offset is not 0, \G differs from \A.

See the PHP manual.
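The difference between \G and \A with a non-zero offset can be sketched as follows. The subject string and the offset of 2 are chosen only for illustration.

```php
<?php
$subject = 'xxabc';

// Start matching at byte offset 2. \G is true at that start point.
$withG = preg_match('/\Gabc/', $subject, $m1, 0, 2);

// \A still means the very beginning of the subject, so it cannot match at offset 2.
$withA = preg_match('/\Aabc/', $subject, $m2, 0, 2);

var_dump($withG, $withA); // int(1), int(0)
```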

Starting with PHP 4.3.3, \Q and \E can be used to ignore regular expression metacharacters in the pattern.

That is, characters with special meanings placed between \Q and \E are matched literally.

For example, Code 4

$p = '#\w+\Q.$.\E$#';
$str = 'a.$.';
preg_match_all($p, $str, $all);
print_r($all);

This matches a.$.

Starting from PHP 5.2.4, \K can be used to reset the start of the match. For example, foo\Kbar matches "foobar", but the reported match is "bar". However, \K does not interfere with the content of captured subgroups: with (foo)\Kbar matching "foobar", the first subgroup is still "foo". \K has the same effect whether it is placed inside or outside a subgroup.
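A minimal sketch of the two \K cases described above, using the same foo and bar strings:

```php
<?php
// \K drops everything matched so far from the reported match.
preg_match('/foo\Kbar/', 'foobar', $plain);

// The capturing subgroup still records what it matched.
preg_match('/(foo)\Kbar/', 'foobar', $grouped);

print_r($plain);   // [0] => bar
print_r($grouped); // [0] => bar, [1] => foo
```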

\p{Lu} matches uppercase letters

Period

Outside a character class, a period matches any single character except (by default) a line break. \C can be used to match a single byte, which matters in UTF-8 mode, where a period can match a multibyte character.

Character classes

Class       Description
alnum       Letters and digits
alpha       Letters
ascii       Character codes 0-127
blank       Space or horizontal tab only
cntrl       Control characters
digit       Decimal digits (same as \d)
graph       Printing characters, excluding space
lower       Lowercase letters
print       Printing characters, including space
punct       Printing characters, excluding letters and digits
space       Whitespace characters (includes the vertical tab, unlike \s)
upper       Uppercase letters
word        Word characters (same as \w)
xdigit      Hexadecimal digits

For example, '#[[:upper:]]#' matches uppercase letters

'#[[:alpha:]]#' matches letters
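The named classes are written inside a bracket expression and can be negated or combined like any other class member. A short sketch with a made-up sample string:

```php
<?php
$text = 'Hello World 42';

preg_match_all('#[[:upper:]]#', $text, $upper);     // uppercase letters
preg_match_all('#[[:digit:]]+#', $text, $numbers);  // runs of digits

// A named class inside a negated bracket expression: remove everything that is not a letter.
$lettersOnly = preg_replace('#[^[:alpha:]]#', '', $text);

print_r($upper[0]);   // H, W
print_r($numbers[0]); // 42
echo $lettersOnly;    // HelloWorld
```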

Alternation |

The vertical bar character is used to separate alternatives in a pattern. For example, the pattern gilbert|sullivan matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (it matches the empty string). The matching process tries each alternative in turn, from left to right, and uses the first one that succeeds. If the alternatives are inside a subgroup (defined below), a successful match means that both the alternative in the subpattern and the rest of the main pattern are matched.
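The left-to-right rule and the empty alternative can be seen in a small sketch. The patterns are illustrative only.

```php
<?php
// "a" is tried first and succeeds, so the longer alternative is never used.
preg_match('/a|ab/', 'ab', $first);

// With the order reversed, "ab" is tried first.
preg_match('/ab|a/', 'ab', $second);

// An empty alternative matches the empty string.
$empty = preg_match('/^(x|)$/', '');

var_dump($first[0], $second[0], $empty); // "a", "ab", int(1)
```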

Code 5

$p = '#p(hp|ython|erl)#';
$str = "php python perl";
preg_match_all($p, $str, $all);
print_r($all);

Subgroups (subpatterns)

Subgroups are delimited by parentheses and can be nested. They serve two main purposes:

1. They localize a set of alternatives. For example,

the pattern p(hp|ython|erl) matches one of php, python, or perl.

2. They set up the subgroup as a capturing subgroup.

After the pattern has matched, the capturing subgroups are numbered (starting at 1) in the order in which their opening parentheses appear from left to right, and these numbers can be used to retrieve what each subpattern captured.

Code 6

$p = '#(\d)#';
$str = "abc123";
$r = preg_replace($p, '\1', $str);
echo $r;

But when you only want to group and do not want to capture, put the string "?:" immediately after the opening parenthesis of the subgroup. The subgroup is then not captured, and it is not counted when numbering the subsequent capturing subgroups.

Code Listing 7: matching numbers changing numbers to red

$p = '#.*(?:\d).*([a-z])#U';
$str = "3df5g";
$r = preg_replace($p, '\1', $str);
echo $r;

If ?: is omitted from the subgroup that matches the digit,

then \1 refers to the matched digit. With ?: added, that group is not captured, so \1 refers to the captured letter.
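The effect on subgroup numbering can be checked with preg_match. This is a minimal sketch with an arbitrary two-character subject.

```php
<?php
// Both groups capture: the digit is group 1, the letter is group 2.
preg_match('/(\d)([a-z])/', '3d', $capturing);

// The digit group is non-capturing, so the letter becomes group 1.
preg_match('/(?:\d)([a-z])/', '3d', $nonCapturing);

print_r($capturing);    // [0] => 3d, [1] => 3, [2] => d
print_r($nonCapturing); // [0] => 3d, [1] => d
```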

As a convenient shorthand, if options need to be set at the start of a non-capturing subgroup, the option letters can be placed between the ? and the :, for example:

(?i:saturday|sunday)

(?:(?i)saturday|sunday)

Here i is the pattern modifier for case-insensitive matching.

The two forms above are in fact the same pattern. Alternatives are tried from left to right and options are not reset before the end of the subpattern, so an option set in one branch carries over into the branches that follow it. Both patterns therefore match "SUNDAY" as well as "Saturday".
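A quick sketch to confirm that the two forms behave the same on the strings mentioned above. The array names are made up for the example.

```php
<?php
$patterns = ['/(?i:saturday|sunday)/', '/(?:(?i)saturday|sunday)/'];
$subjects = ['SUNDAY', 'Saturday'];

$results = [];
foreach ($patterns as $pattern) {
    foreach ($subjects as $subject) {
        // 1 means the pattern matched the subject.
        $results[] = preg_match($pattern, $subject);
    }
}

print_r($results); // four entries, each 1
```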

Since PHP 4.3.3, a subgroup can be named using the syntax (?P<name>pattern). The subpattern then appears in the match results under both its name and its number. PHP 5.2.2 added two alternative syntaxes for naming subgroups: (?<name>pattern) and (?'name'pattern).

Code 8:

$p = "#.*(?<alpha>[a-z]{3})(?'digit'\d{3}).*#";
$str = "abc123111def111g";
preg_match_all($p, $str, $arr);
print_r($arr);

Results:

Sometimes a regular expression needs several alternative subgroups, only one of which can match at a time. To let such subgroups share a back reference number, the (?| syntax allows duplicate numbers. Consider the following regular expression matched against Sunday:

(?:(Sat)ur|(Sun))day

Here Sun is stored in back reference 2, while back reference 1 is empty. When Saturday is matched instead, Sat is stored in back reference 1 and back reference 2 does not exist.

Code Listing 9:

$p = '#(?:(Sat)ur|(Sun))day#';
$str = "Sunday Saturday";
preg_match_all($p, $str, $arr);
print_r($arr);

Results:

Changing the pattern to use (?| fixes the problem:

(?|(Sat)ur|(Sun))day

With this pattern, both Sun and Sat are stored in back reference 1.

Before looking at this pattern in action, consider the following code examples.

Code 10-1

$p = '#(a|b)\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);

The result is:

Array
(
    [0] => Array
        (
            [0] => b2
            [1] => a1
        )

    [1] => Array
        (
            [0] => b
            [1] => a
        )

)

Code 10-2

$p = '#((a)|b)\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);

Results:

Array
(
    [0] => Array
        (
            [0] => b2
            [1] => a1
        )

    [1] => Array
        (
            [0] => b
            [1] => a
        )

    [2] => Array
        (
            [0] => 
            [1] => a
        )

)

For code 10-2: the first complete match is b2, so the outer parentheses, which are the first subpattern, captured b. The second subpattern (a) did not take part in that match, so its entry is empty. The second complete match is a1: its first subpattern is a, and its second subpattern is also a, because (a) is contained in the outer parentheses of ((a)|b).

Code 10-3:

$p = '#((a)|(b))\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);

Results:

Array
(
    [0] => Array
        (
            [0] => b2
            [1] => a1
        )

    [1] => Array
        (
            [0] => b
            [1] => a
        )

    [2] => Array
        (
            [0] => 
            [1] => a
        )

    [3] => Array
        (
            [0] => b
            [1] => 
        )

)

Code Listing 10-4:

$p = '#(?:(a)|(b))\d#';
$str = "b2a1";
preg_match_all($p, $str, $arr);
print_r($arr);

Result:

Array
(
    [0] => Array
        (
            [0] => b2
            [1] => a1
        )

    [1] => Array
        (
            [0] => 
            [1] => a
        )

    [2] => Array
        (
            [0] => b
            [1] => 
        )

)

Code Listing 10:

$p = '#(?|(Sat)ur|(Sun))day#';
$str = "Sunday Saturday";
preg_match_all($p, $str, $arr);
print_r($arr);

Results

Back references

If the number that follows the backslash is less than 10, it is always taken as a back reference. The pattern must contain at least as many capturing subgroups as the number being referenced.

A back reference matches whatever the referenced capturing subgroup actually captured in the target string, rather than anything that matches the subgroup's pattern.

(sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility".

Code 11

$p = '#(sens|respons)e and \1ibility#';
$str = "sense and sensibility response and responsibility sense and responsibility";
preg_match_all($p, $str, $arr);
print_r($arr);

Results

ab(?i)c matches abc and abC

(?i) + atom

The atoms after (?i) in the expression are matched case-insensitively

If case-sensitive matching is in force at the point of the back reference, the case of the letters matters

((?i)abc)\s+\1

matches

abc abc

ABC ABC

Abc Abc

as long as both occurrences have the same case,

but it does not match ABC abc, and so on.

The point to note here is that the back reference must match exactly what the referenced subgroup captured.

Code Listing 12:

$p = '#((?i)abc)\s+\1#';
$str = "abc abc | ABC ABC | Abc Abc | abc ABC";
preg_match_all($p, $str, $arr);
print_r($arr);

Results

More than one back reference may refer to the same subgroup. If a subgroup was not actually used in a particular match, any back reference to it always fails.

First look at the following code 13

$p = '#(a|(bc))#';
$str = "abc";
preg_match_all($p, $str, $arr);
print_r($arr);

There are 2 complete matches:

[0][0] is the first complete match

[1][0] is what the first subpattern captured in the first match

[2][0] is what the second subpattern captured in the first match

[0][1] is the second complete match

[1][1] is what the first subpattern captured in the second match

[2][1] is what the second subpattern captured in the second match

From the above we can see that in the pattern

(a|(bc))

the outermost parentheses are the first subpattern

and the inner parentheses are the second subpattern.

So for the following code 14:

$p = '#(a|(bc))\2#';
$str = "aabcbc";
preg_match_all($p, $str, $arr);
print_r($arr);

Results

When a is matched first, the second subpattern has not taken part in the match,

so \2 has nothing to refer to and fails.

A complete match is therefore possible only when the second subpattern takes part, that is, when the inner parentheses match. So there must be a bc, followed by a second bc for \2 to match.
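The reasoning above can be checked with a short sketch that runs the same pattern and subject as code 14. The variable names are illustrative.

```php
<?php
// Attempts that begin with "a" leave subgroup 2 unset, so \2 fails there.
// Only the attempt that matches "bc" in subgroup 2 can go on to match \2.
$count = preg_match_all('#(a|(bc))\2#', 'aabcbc', $found);

var_dump($count);    // int(1)
print_r($found[0]);  // [0] => bcbc
print_r($found[2]);  // [0] => bc
```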

Because there can be as many as 99 back references, all digits immediately following the backslash are taken as part of a potential back reference number. If the pattern continues with a digit character after the back reference, some delimiter must be used to terminate the back reference.

For example, the following code 15:

$p = '#([a-z]{3})\1 5#x';
$str = "aaaaaa5";
preg_match_all($p, $str, $arr);
print_r($arr);

If the back reference \1 were immediately followed by the digit 5, it would be mistaken for a reference to subgroup 15.

We therefore put a space after it and use the pattern modifier x, which ignores whitespace in the pattern, so the match succeeds.

If a back reference appears inside the subgroup it refers to, it fails when the subgroup is first used

(a\1) never matches anything

However, such a reference can be useful inside a repeated subpattern

(a|b\1) matches "a" but does not match "b" (the subgroup contains alternatives, one of which can complete a match without the back reference; only after that match is complete does the back reference have content to refer to).

Code Listing 16:

$p = '#(a|b\1)+#';
$str = "abba";
preg_match_all($p, $str, $arr);
print_r($arr);

Results

In each iteration of the subpattern, the back reference matches the string that the subgroup matched in the previous iteration. For this to work, the pattern must be such that the first iteration does not need to match the back reference. This can be done with alternatives, as in the example above, or by giving the back reference a quantifier with a minimum of zero.
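The iteration behaviour can be sketched by anchoring the same subpattern so that the whole subject must be consumed. The anchors and the test strings are additions for illustration and are not part of the original example.

```php
<?php
$pattern = '/^(a|b\1)+$/';

// Iteration 1 matches "a"; iteration 2 matches "b" followed by the previous capture "a".
$aba = preg_match($pattern, 'aba');

// Iterations capture "a", then "ba", then "b" followed by "ba".
$ababba = preg_match($pattern, 'ababba');

// A lone "b" fails: on the first iteration the back reference has nothing to refer to.
$b = preg_match($pattern, 'b');

var_dump($aba, $ababba, $b); // int(1), int(1), int(0)
```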

As of PHP 5.2.2, the \g escape sequence can be used for absolute and relative references to subpatterns. It must be followed by an unsigned number or a negative number, optionally enclosed in braces. The sequences \1, \g1, and \g{1} are synonymous. This form removes the ambiguity that arises when a back reference written with a backslash is immediately followed by a digit. It also makes it easier to distinguish a back reference from an octal character, and to follow a back reference with a literal digit, as in \g{2}1.

Code Listing 17:

$p = '#([a-z]{2})\g{1}5#';
$str = "abab5";
preg_match_all($p, $str, $arr);
print_r($arr);

Compare this with code 15.

A negative number after \g denotes a relative back reference. For example, (foo)(bar)\g{-1} matches the string "foobarbar", and (foo)(bar)\g{-2} matches "foobarfoo". In a long pattern this is an alternative to keeping track of the number of the subgroup being referenced.

Code 18

$p = '#(foo)(bar)\g{-1}#';
$p1 = '#(foo)(bar)\g{-2}#';
$str = "foobarbar";
$str1 = "foobarfoo";
preg_match_all($p, $str, $arr);
preg_match_all($p1, $str1, $arr1);
print_r($arr);
print_r($arr1);

Results:

A back reference can also refer to a subgroup by name, using the syntax (?P=name) or, as of PHP 5.2.2, \k<name> or \k'name'. Support for \k{name} and \g{name} was added in PHP 5.2.4.

Code Listing 19:

$p = "#(?'alpha'[a-z]{2})(?<digit>[0-9]{3})\k<digit>(?P=alpha)#";
$str = "aa123123aa";
preg_match_all($p, $str, $arr);
print_r($arr);

Results:

This can be compared with code 8.

Note the following:

The first alpha is written with quotation marks, the later one without.

The P is uppercase.

To be continued...
