PHP regular expression notes

Source: Internet
Author: User
What is a regular expression

On a computer, we often use wildcards to find the files we need, for example *.doc, where * matches zero or more characters. Regular expressions are also a tool for text matching, but a more powerful one. To quote the PHP manual: a regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern and match the corresponding characters in the subject.

Below are a few simple examples to give a first feel for regular expressions.

hi // matches the characters "hi"; with the i (case-insensitive) modifier it also matches HI, Hi, and hI

\bhi\b // matches the English word "hi". \b is a special sequence (an assertion) in the regular expression that marks a word boundary.

\bhi\b.*\bLucy\b // matches, for example, "Hi my name is Lucy" (case-insensitively). "." matches any character other than a line break; "*" is a quantifier meaning zero or more repetitions.

0\d{2}-\d{8} // matches, for example, 020-12345678. \d matches a digit (0-9); {n} repeats the preceding item n times, here {2} and {8}.

In the examples above, \b, ., *, \d, and {2} all have special meanings, which are described below.
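
The introductory patterns can be tried with preg_match(). The following is a minimal sketch; the sample subjects, the variable names, and the choice to add the i modifier are assumptions made for illustration.

```php
<?php
// Patterns from the examples above, wrapped in / delimiters.
// The i modifier is added here to get case-insensitive matching.
$wordHi = '/\bhi\b/i';            // the whole word "hi", any case
$hiLucy = '/\bhi\b.*\bLucy\b/i';  // "hi", then anything, then "Lucy"
$phone  = '/0\d{2}-\d{8}/';       // 0, two digits, a hyphen, eight digits

$subjects = ['hi', 'Hi there', 'him', 'Hi my name is Lucy', '020-12345678'];

foreach ($subjects as $s) {
    // preg_match() returns 1 on a match and 0 on no match.
    printf("%-20s word-hi=%d hi-Lucy=%d phone=%d\n", $s,
        preg_match($wordHi, $s), preg_match($hiLucy, $s), preg_match($phone, $s));
}
```

Note that "him" does not match the first pattern, because \b requires the word to end right after "hi".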

Regular expression syntax in PHP

1. Introduction

PHP supports two kinds of regular expressions: POSIX and PCRE. The POSIX regular expression extension has been deprecated since PHP 5.3.0, so only PCRE patterns are discussed below. The PHP manual describes the differences between POSIX regular expressions and Perl-compatible ones.

2. Delimiters

When using the PCRE functions, the pattern must be enclosed by delimiters. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character. Commonly used delimiters are forward slashes (/), hash signs (#), and tildes (~). The following are all validly delimited patterns.

/foo bar/
#^[^0-9]$#
+php+
%[a-zA-Z0-9_-]%

If the delimiter needs to be matched inside the pattern, it must be escaped with a backslash. If the delimiter appears often inside the pattern, it is better to choose another delimiter to improve readability. Example:

/http:\/\//
#http://#
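
As a quick check, both delimiter choices behave the same when passed to preg_match(). This sketch uses an assumed sample URL; preg_quote() is shown as one way to escape a delimiter automatically.

```php
<?php
$url = 'http://php.net/';   // sample subject, chosen for illustration

// With / as the delimiter, every / inside the pattern must be escaped.
$withSlashes = preg_match('/http:\/\//', $url);

// With # as the delimiter, the same pattern needs no escaping.
$withHashes = preg_match('#http://#', $url);

// preg_quote() escapes regex metacharacters; its optional second
// argument names a delimiter that should be escaped as well.
$quoted    = preg_quote('http://', '/');
$viaQuoted = preg_match('/' . $quoted . '/', $url);
```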

3. Metacharacters

The power of regular expressions comes from the ability to include alternatives and repetitions in a pattern. Some characters are given special meanings so that they do not simply stand for themselves. Characters that are encoded in the pattern with such special meanings are called metacharacters.
There are two sets of metacharacters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized within square brackets.

The metacharacters used outside square brackets are as follows:

Code    Description
\       General escape character with several uses
^       Asserts the start of the subject (or of a line, in multiline mode)
$       Asserts the end of the subject (or of a line, in multiline mode)
.       Matches any character except newline (by default)
[       Starts a character class definition
]       Ends a character class definition
|       Starts an alternative branch
(       Starts a subpattern
)       Ends a subpattern
?       a: as a quantifier, matches 0 or 1 times. b: placed after a quantifier, changes its greediness.
*       Quantifier: 0 or more times
+       Quantifier: 1 or more times
{       Starts a custom (min/max) quantifier
}       Ends a custom (min/max) quantifier

The part of a pattern inside square brackets is called a "character class". Only the following metacharacters are available in a character class:

Code    Description
\       General escape character
^       Negates the class, but only if it is the first character inside the square brackets
-       Indicates a character range

Example:

  • \ba\w*\b matches a word that starts with the letter a. First comes the start of a word (\b), then the letter a, then any number of word characters (\w*; a word character is any letter, digit, or underscore), and finally the end of the word (\b).
  • \d+ matches one or more consecutive digits.
  • ^\d{5,12}$ matches a string of 5 to 12 digits. Because ^ and $ are used, the entire input string must match \d{5,12}; that is, the whole input must consist of 5 to 12 digits.
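
The three example patterns can be exercised as follows. This is a sketch under assumptions: the sample strings and the helper name isFiveToTwelveDigits are hypothetical.

```php
<?php
// Every word that starts with "a".
preg_match_all('/\ba\w*\b/', 'an apple a day', $words);

// The first run of consecutive digits.
preg_match('/\d+/', 'order 12345 shipped', $digits);

// Hypothetical helper: true when the input is 5 to 12 digits.
function isFiveToTwelveDigits(string $s): bool
{
    return preg_match('/^\d{5,12}$/', $s) === 1;
}

print_r($words[0]);                            // the words starting with "a"
echo $digits[0], "\n";                         // the digit run
var_dump(isFiveToTwelveDigits('1234'));        // too short
var_dump(isFiveToTwelveDigits('1234567890'));  // within range
```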
    4. Escape sequences (backslash)

    The backslash (\) has four uses. For details, see "Escape sequences" in the PHP manual.

    [1] As an escape character. For example, to match a literal * character, write \* in the pattern. This works whether or not the following character would otherwise have a special meaning, so it is always safe to put a backslash in front of a non-alphanumeric character that should stand for itself. To match a backslash, write \\ in the pattern.
    The backslash also has a special meaning in single-quoted and double-quoted PHP strings. Therefore, to match a literal backslash, the pattern in PHP code must be written as \\\\. The reason: the PHP string parser first reduces \\\\ to \\, and the regular expression engine then reads \\ as one escaped backslash. So four backslashes are needed to match a single backslash.
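
The double layer of escaping can be seen in a short sketch. The sample path and variable names are assumptions for illustration.

```php
<?php
// In single quotes, \t is not an escape, so this string holds one literal backslash.
$path = 'C:\temp';

// The PHP literal '/\\\\/' is the four-character string /\\/ ,
// which the regex engine reads as "match one literal backslash".
$foundBackslash = preg_match('/\\\\/', $path);

// Escaping a metacharacter: \* matches a literal asterisk.
$foundStar = preg_match('/2\*3/', '2*3=6');

// The PHP literal '\\\\' itself is only two characters long.
$literalLength = strlen('\\\\');
```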

    [2] As a way to write non-printing characters in a visible form (for example, \n for a newline and \t for a tab).

    [3] To specify generic character types.

    Code    Description
    \d      Any decimal digit
    \D      Any character that is not a decimal digit
    \h      Any horizontal whitespace character (since PHP 5.2.4)
    \H      Any character that is not horizontal whitespace (since PHP 5.2.4)
    \s      Any whitespace character
    \S      Any character that is not whitespace
    \v      Any vertical whitespace character (since PHP 5.2.4)
    \V      Any character that is not vertical whitespace (since PHP 5.2.4)
    \w      Any word character. A word character is any letter, digit, or underscore.
    \W      Any non-word character
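
A few of these character types in use, as a minimal sketch. The input string and variable names are made up for illustration.

```php
<?php
$raw = "Tel:\t(020) 1234-5678\n";   // sample input with a tab and a newline

// \D matches every non-digit, so replacing it with '' keeps only digits.
$digitsOnly = preg_replace('/\D/', '', $raw);

// \s+ matches a run of whitespace; split the trimmed string on it.
$tokens = preg_split('/\s+/', trim($raw));

// \w+ matches a run of word characters; preg_match_all() returns the count.
$wordRuns = preg_match_all('/\w+/', $raw);
```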

    [4] As certain simple assertions. An assertion specifies a condition that must be met at a particular point in a match, without consuming any characters from the subject string. The backslash assertions are:

  • \b word boundary
  • \B not a word boundary
  • \A start of the subject (independent of multiline mode)
  • \Z end of the subject, or a newline at the end (independent of multiline mode)
  • \z end of the subject (independent of multiline mode)
  • \G first matching position in the subject
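
The differences between these assertions are easiest to see side by side. The following sketch uses assumed sample strings.

```php
<?php
$text = "first\nsecond";

// In multiline mode ^ may match right after a newline, but \A never does.
$caretMultiline = preg_match('/^second/m', $text);
$startOnly      = preg_match('/\Asecond/m', $text);

$line = "abc\n";

// \Z also matches just before a final newline; \z only at the very end.
$upperZ = preg_match('/abc\Z/', $line);
$lowerZ = preg_match('/abc\z/', $line);

// \b needs a word boundary on both sides, so "cat" inside a longer word fails.
$wholeWord = preg_match('/\bcat\b/', 'concatenate');
```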
    5. Repetition (quantifiers)

    Code    Description
    *       Repeat zero or more times, equivalent to {0,}
    +       Repeat one or more times, equivalent to {1,}
    ?       Repeat zero or one time, equivalent to {0,1}
    {n}     Repeat exactly n times
    {n,}    Repeat n or more times
    {n,m}   Repeat n to m times

    By default, quantifiers are "greedy": they match as many characters as possible (up to the maximum number of permitted repetitions) without causing the rest of the pattern to fail. If a quantifier is followed by a question mark, however, it becomes lazy (non-greedy) and matches as few characters as possible instead.
    The following example shows the difference between the "greedy" and "non-greedy" modes.

    For the string "aa<div>test1</div>bb<div>test2</div>cc":

    Regular expression: <div>.*</div>
    Matching result:    <div>test1</div>bb<div>test2</div>

    Regular expression: <div>.*?</div>
    Matching result:    <div>test1</div>

    For more information about "greedy" and "non-greedy" patterns, see http://php.net/manual/zh/regexp.reference.repetition.php.
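
Greedy and lazy matching can be compared directly in PHP. This is a sketch only: the subject string with <div> markup and the variable names are assumptions made for the illustration.

```php
<?php
// Assumed sample subject containing two <div> elements.
$subject = 'aa<div>test1</div>bb<div>test2</div>cc';

// Greedy: .* grabs as much as it can, so the match runs to the LAST </div>.
preg_match('#<div>.*</div>#', $subject, $greedy);

// Lazy: .*? grabs as little as it can, so the match stops at the FIRST </div>.
preg_match('#<div>.*?</div>#', $subject, $lazy);

echo $greedy[0], "\n";
echo $lazy[0], "\n";
```

The # delimiter is used so that the / in </div> needs no escaping.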

    6. Character classes (square brackets)

    The PHP manual describes them as follows:

  • An opening square bracket introduces a character class, which is terminated by a closing square bracket. A closing square bracket on its own has no special meaning. If a closing square bracket is needed as a member of the class, it should be the first character in the class (or the second, after an initial ^) or be escaped with a backslash.

  • A character class matches a single character in the subject string. That character must be in the set defined by the class, unless ^ is used to negate the class. If ^ is needed as a member of the class, make sure it is not the first character, or escape it.

  • Example:

    [aeiou] // matches any lowercase vowel
    [^aeiou] // matches any character that is not a lowercase vowel
    [.?!] // matches a punctuation mark (. or ? or !)

    Note: ^ is just a convenient notation for specifying the characters in a class by enumerating those that are not in it. It is not an assertion: it still consumes a character from the subject string, and the match fails if the current match point is at the end of the subject.

    Character ranges are easy to specify, and they follow ASCII order. They can also be written with numeric character codes, for example [\000-\037].

    [0-9] // means exactly the same as \d
    [a-zA-Z0-9_] // exactly equivalent to \w if only English is considered
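
The class examples above can be checked with a short sketch. The sample strings and variable names are assumptions.

```php
<?php
$phrase = 'regular expression';   // sample subject

// Count the lowercase vowels, then strip them.
$vowelCount = preg_match_all('/[aeiou]/', $phrase);
$noVowels   = preg_replace('/[aeiou]/', '', $phrase);

// A negated class is not an assertion: [^b] must consume a character,
// so it cannot match at the very end of the subject.
$negatedAtEnd = preg_match('/a[^b]/', 'a');
$negatedOk    = preg_match('/a[^b]/', 'ac');

// A range class and its shorthand agree on ASCII digits.
$rangeDigits     = preg_replace('/[0-9]/', '', 'a1b22c');
$shorthandDigits = preg_replace('/\d/', '', 'a1b22c');
```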

    The following is a more complex expression: \(?0\d{2}[) -]?\d{8}
    This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678.
    A quick analysis: first comes an escaped \(, which may appear 0 or 1 times (?); then a 0 followed by two digits (\d{2}); then one of ), -, or a space, which may appear 0 or 1 times; and finally eight digits (\d{8}).

    7. Alternation (|)

    The vertical bar character separates alternative patterns. For example, the pattern gilbert|sullivan matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (it matches the empty string). The alternatives are tried from left to right, and the first one that succeeds is used. If the alternatives are inside a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.

    Consider an example: the regular expression \(?0\d{2}[) -]?\d{8} also matches "incorrect" formats such as 010)12345678 or (022-87654321. Alternation solves this problem, as shown below:

    \({1}0\d{2}\){1}[- ]?\d{8}|0\d{2}[- ]?\d{8}
    This expression matches phone numbers with a three-digit area code. The area code may or may not be enclosed in parentheses, and it may be separated from the local number by a hyphen, a space, or nothing at all.

    Pay attention to the order of the alternatives when using alternation.
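
The two phone-number patterns can be compared in a short sketch. Two things here are assumptions rather than part of the original patterns: the ^...$ anchors (with a non-capturing group around the alternation) are added so that the whole input must match, and the helper names are hypothetical.

```php
<?php
// Loose pattern: optional "(", area code, optional ")", "-" or space, 8 digits.
function looseMatch(string $s): bool   // hypothetical helper
{
    return preg_match('/^\(?0\d{2}[) -]?\d{8}$/', $s) === 1;
}

// Alternation: either a fully parenthesized area code, or a bare one.
function strictMatch(string $s): bool  // hypothetical helper
{
    return preg_match('/^(?:\({1}0\d{2}\){1}[- ]?\d{8}|0\d{2}[- ]?\d{8})$/', $s) === 1;
}

foreach (['(010)88886666', '022-22334455', '02912345678',
          '010)12345678', '(022-87654321'] as $number) {
    printf("%-14s loose=%d strict=%d\n", $number,
        looseMatch($number), strictMatch($number));
}
```

Without the anchors, the second alternative could still match the substring 022-87654321 inside (022-87654321, which is why the sketch anchors both patterns.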

    8. Internal option setting

    A regular expression may match differently under different pattern modifiers. Options can be set inside the pattern with the syntax (?modifier letters).

    For example, (?im) sets caseless, multiline matching. Options can also be unset by placing them after a hyphen: (?im-sx) sets PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED. If a letter appears both before and after the hyphen, the option is unset.

    The following is a simple example. For more information, see "Internal option setting" and "Pattern modifiers" in the PHP manual.

    Example: /ab(?i)c/ matches only "abc" and "abC".
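
A small sketch confirms which inputs the example pattern accepts. The list of test strings is an assumption.

```php
<?php
// (?i) switches on caseless matching only for the part of the pattern after it.
$pattern = '/ab(?i)c/';

$results = [];
foreach (['abc', 'abC', 'ABC', 'Abc'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
}

print_r($results);   // 1 for a match, 0 for no match
```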

    9. Subpatterns (subgroups)

    Subpatterns are delimited by parentheses and can be nested.

    Example:

    String: "the red king"
    Regular expression: ((red|white) (king|queen))
    Matching result: array("red king", "red king", "red", "king")
    Description: the first element is the result of the full pattern match; the three elements that follow are matched by the three subpatterns in order, with indexes 1, 2, and 3.

    We often need a subpattern for grouping without capturing it separately. If the opening parenthesis of a subpattern is followed by ?:, the subpattern is not captured on its own and is not counted when numbering the subsequent capturing subpatterns. For example:

    String: "the red king"
    Regular expression: ((?:red|white) (king|queen))
    Matching result: array("red king", "red king", "king")
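
Both results can be reproduced with preg_match(), which fills its third argument with the full match followed by the captured subpatterns. The variable names below are assumptions.

```php
<?php
$subject = 'the red king';

// Three capturing subpatterns: indexes 1, 2, and 3.
preg_match('/((red|white) (king|queen))/', $subject, $capturing);

// (?:...) groups without capturing, so only two subpatterns are numbered.
preg_match('/((?:red|white) (king|queen))/', $subject, $nonCapturing);

print_r($capturing);
print_r($nonCapturing);
```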

    As a convenient shorthand, if options need to be set at the start of a non-capturing subpattern, the option letters may appear between the ? and the :, for example:

    (?i:saturday|sunday)
    (?:(?i)saturday|sunday)

    The two patterns match exactly the same strings. Because alternatives are tried from left to right and options are not reset until the end of the subpattern, an option setting in one alternative also affects the alternatives after it, so the patterns above match "SUNDAY" as well as "Saturday".
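
The equivalence of the two forms can be checked as follows. The ^ and $ anchors and the variable names are additions for this sketch, not part of the original patterns.

```php
<?php
$shorthand = '/^(?i:saturday|sunday)$/';
$longForm  = '/^(?:(?i)saturday|sunday)$/';

$agree = true;
foreach (['SUNDAY', 'Saturday', 'sunday', 'Monday'] as $day) {
    $a = preg_match($shorthand, $day);
    $b = preg_match($longForm, $day);
    printf("%-9s shorthand=%d long-form=%d\n", $day, $a, $b);
    $agree = $agree && ($a === $b);   // the two forms should always agree
}
```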

    Finally, here is a regular expression that matches an IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
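
A sketch of how the IP address pattern might be used for validation. The ^ and $ anchors are an addition so that the whole input must be an address, and the helper name looksLikeIpv4 is hypothetical.

```php
<?php
// Hypothetical helper: true when $s is four dot-separated numbers from 0 to 255.
function looksLikeIpv4(string $s): bool
{
    // Each octet: 200-249, 250-255, or 0-199 (one to three digits).
    $octet = '(2[0-4]\d|25[0-5]|[01]?\d\d?)';
    return preg_match('/^(' . $octet . '\.){3}' . $octet . '$/', $s) === 1;
}

foreach (['192.168.0.1', '255.255.255.255', '256.1.1.1', '1.2.3'] as $candidate) {
    printf("%-16s %s\n", $candidate, looksLikeIpv4($candidate) ? 'valid' : 'invalid');
}
```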

    Conclusion

    The above covers the common PHP regular expression syntax. Some syntax is not covered in detail or at all, such as pattern modifiers, back references, assertions, and recursive patterns. See the PHP manual for these topics.

    Tip: regular expression functions generally run less efficiently than string functions that do the same job. If the task is simple, use string functions. However, for a task that a single regular expression can perform, using several string functions instead is the wrong choice. (From "PHP and MySQL Web Development".)

    References

    http://php.net/manual/zh/book.pcre.php
    https://msdn.microsoft.com/zh-cn/library/d9eze55x%28v=vs.80%29.aspx
    http://deerchao.net/tutorials/regex/regex.htm
    http://tool.chinaz.com/regex/
    http://www.regexlab.com/zh/regref.htm
