What is a regular expression
On a computer we often use wildcards to find the files we need, for example *.doc, where * matches zero or more characters. Regular expressions are also a tool for matching text, only far more powerful. To quote the PHP manual: a regular expression is a pattern that is matched against a subject string from left to right, and most characters stand for themselves in a pattern and match the corresponding characters in the subject.
Here are a few simple examples to give a first impression of regular expressions.
hi // with case ignored, matches the characters hi, Hi, hI, or HI
\bhi\b // matches the English word hi; \b is a special sequence (an assertion) that represents a word boundary
\bhi\b.*\blucy\b // with case ignored, matches text such as "Hi my name is Lucy"; . matches any character other than a line break, and * is a quantifier meaning "repeat 0 or more times"
0\d{2}-\d{8} // matches text such as 020-12345678; \d matches a digit (0-9), and {n} means "repeat n times", as in {2} and {8}
In the examples above, \b, ., *, \d, and {2} all have special meanings, which are explained below.
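As a rough sketch of how these first patterns behave in PHP, the snippet below runs them through preg_match. The / delimiters, the i modifier (used here to get the case-insensitive matching the examples describe), and the sample strings are assumptions added for illustration.

```php
<?php
// preg_match() returns 1 when the pattern matches and 0 when it does not.

// "Hi" stands alone as a word, so the word-boundary pattern matches.
$hiAsWord = preg_match('/\bhi\b/i', 'Hi, my name is Lucy');

// "hi" only occurs inside longer words ("This", "history"), so \b blocks the match.
$hiInsideWord = preg_match('/\bhi\b/i', 'This is history');

// The word hi, then anything on the same line, then the word lucy.
$hiThenLucy = preg_match('/\bhi\b.*\blucy\b/i', 'Hi my name is Lucy');

// A 0, two digits, a hyphen, and eight digits.
$phone = preg_match('/0\d{2}-\d{8}/', '020-12345678');

var_dump($hiAsWord, $hiInsideWord, $hiThenLucy, $phone);
```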
PHP regular expression syntax
1. Introduction
PHP supports two kinds of regular expressions: POSIX and PCRE. The POSIX regular expression extension has been deprecated since PHP 5.3.0, so the discussion below is based on PCRE. The PHP manual describes how PCRE differs from POSIX regular expressions and how it differs from Perl.
2. Delimiters
When using the PCRE functions, the pattern must be enclosed in delimiters. A delimiter can be any character that is not alphanumeric, not a backslash, and not whitespace. Commonly used delimiters are the forward slash /, the hash sign #, and the tilde ~. Each of the following is a pattern with valid delimiters:
/foo bar/
#^[^0-9]$#
+php+
%[a-zA-Z0-9_-]%
If the delimiter has to be matched inside the pattern, it must be escaped with a backslash. If the delimiter appears often inside the pattern, it is better to choose a different delimiter to improve readability. For example:
/http:\/\//
#http://#
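The two spellings of the http:// pattern can be compared directly in PHP. This is a minimal sketch; the sample URL is made up, and the preg_quote() call is an extra shown only as one possible way to avoid escaping by hand.

```php
<?php
$url = 'http://example.com/index.php';

// With / as the delimiter, every / inside the pattern has to be escaped.
$withSlashes = preg_match('/http:\/\//', $url);

// With # as the delimiter, the same pattern needs no escaping.
$withHashes = preg_match('#http://#', $url);

// preg_quote() escapes regex metacharacters and, when given, the delimiter too.
$quotedPattern = '/' . preg_quote('http://', '/') . '/';
$withQuote = preg_match($quotedPattern, $url);

var_dump($withSlashes, $withHashes, $withQuote);
```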
3. Metacharacters
The power of regular expressions comes from the ability to use alternation and repetition in a pattern. Some characters are given a special meaning so that they no longer simply stand for themselves; such characters are called metacharacters.
There are two sets of metacharacters: those recognized anywhere in the pattern except inside square brackets, and those recognized inside square brackets.
The metacharacters used outside the square brackets are as follows:
Code | Description
\ | General escape character
^ | Asserts the start of the subject (or the start of a line in multiline mode)
$ | Asserts the end of the subject (or the end of a line in multiline mode)
. | Matches any character except a line break (by default)
[ | Starts a character class definition
] | Ends a character class definition
| | Starts an alternative branch
( | Starts a subgroup
) | Ends a subgroup
? | a: as a quantifier, means 0 or 1 matches. b: placed after another quantifier, changes its greediness
* | Quantifier, 0 or more matches
+ | Quantifier, 1 or more matches
{ | Starts a custom quantifier
} | Ends a custom quantifier
The part of a pattern enclosed in square brackets is called a "character class". Only the following metacharacters are available inside a character class:
Code | Description
\ | Escape character
^ | Negates the class, but only when it is the first character inside the square brackets
- | Marks a character range
Examples:
\ba\w*\b matches a word that begins with the letter a: first the start of a word (\b), then the letter a, then any number of word characters (\w*, where a word character is any letter, digit, or underscore), and finally the end of the word (\b).
\d+ matches 1 or more consecutive digits.
^\d{5,12}$ matches a number of 5 to 12 digits. Because ^ and $ are used, the whole input string is matched against \d{5,12}, so the entire input must consist of 5 to 12 digits.
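The three example patterns can be tried out as follows. This is only a sketch: the sample strings are invented, and preg_match_all is used for the first two patterns to collect every match rather than just the first.

```php
<?php
// Words that begin with the letter a ("day" contains an a but does not start with one).
preg_match_all('/\ba\w*\b/', 'an apple a day', $words);

// Runs of one or more consecutive digits.
preg_match_all('/\d+/', 'room 42, floor 7', $numbers);

// The whole string must be 5 to 12 digits.
$pattern  = '/^\d{5,12}$/';
$valid    = preg_match($pattern, '12345');    // five digits
$tooShort = preg_match($pattern, '1234');     // only four digits
$mixed    = preg_match($pattern, '12345abc'); // trailing letters break the $ anchor

print_r($words[0]);   // an, apple, a
print_r($numbers[0]); // 42, 7
var_dump($valid, $tooShort, $mixed);
```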
4. Escape sequence (backslash)
The backslash \ has four uses; the "Escape sequences (backslash)" page of the PHP manual covers them in detail.
"1" It acts as an escape character. For example, to match a * character, write \* in the pattern. This applies whenever the character would otherwise have a special meaning. For any non-alphanumeric character, it is always safe to put a backslash in front of it to state that it stands for itself. To match a backslash, use \\ in the pattern.
The backslash also has a special meaning in both single-quoted and double-quoted PHP strings, so to match one backslash the pattern must be written as \\\\ in PHP source. The reason: the string literal is processed first, turning each \\ into a single backslash, and the regular expression engine then treats the remaining \\ as one escaped backslash. Four backslashes are therefore needed to match a single backslash.
"2" It provides a way to write non-printing characters in a visible form, for example \n for a newline or \t for a tab.
"3" It introduces generic character types:
Code | Description
\d | Any decimal digit
\D | Any character that is not a decimal digit
\h | Any horizontal whitespace character (since PHP 5.2.4)
\H | Any character that is not horizontal whitespace (since PHP 5.2.4)
\s | Any whitespace character
\S | Any non-whitespace character
\v | Any vertical whitespace character (since PHP 5.2.4)
\V | Any character that is not vertical whitespace (since PHP 5.2.4)
\w | Any word character, that is, any letter, digit, or underscore
\W | Any non-word character
"4" It introduces some simple assertions. An assertion specifies a condition that must hold at a particular position in the match, and it does not consume any characters from the subject string. The backslash assertions are:
\b Word boundary
\B Not a word boundary
\A Start of the subject (independent of multiline mode)
\Z End of the subject, or a newline at the end (independent of multiline mode)
\z End of the subject (independent of multiline mode)
\G First matching position in the subject
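The four uses of the backslash can be sketched in a few lines of PHP. The sample strings and variable names below are made up for illustration, and all patterns are written in single-quoted strings so that PHP passes sequences such as \t and \d through to the regex engine unchanged.

```php
<?php
// (1) Escaping. '\\\\' in PHP source becomes the regex \\, which matches one backslash.
//     The subject 'C:\\temp\\log' is the string C:\temp\log, with two backslashes.
$backslashes = preg_match_all('/\\\\/', 'C:\\temp\\log');
$stars       = preg_match_all('/\*/', '2*3*4'); // \* matches a literal *

// (2) Non-printing characters: the regex \t matches a tab character.
$columns = preg_split('/\t/', "a\tb\tc");

// (3) Generic character types.
preg_match_all('/\d/', 'id 42', $digits);           // every decimal digit
$masked = preg_replace('/\W/', '_', 'a-b c');       // replace each non-word character
$fields = preg_split('/\s+/', "one  two\tthree");   // split on runs of whitespace

// (4) Assertions. With the m modifier, ^ and $ match at each line,
//     while \A and \z still refer to the whole subject.
$lines = "first\nsecond";
preg_match_all('/^\w+$/m', $lines, $perLine);
$wholeSubject = preg_match('/\A\w+\z/m', $lines);

var_dump($backslashes, $stars, $masked, $wholeSubject);
print_r($columns);
print_r($digits[0]);
print_r($fields);
print_r($perLine[0]);
```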
5. Repetition/quantifier
Code | Description
* | Repeat 0 or more times, equivalent to {0,}
+ | Repeat 1 or more times, equivalent to {1,}
? | Repeat 0 or 1 times, equivalent to {0,1}
{n} | Repeat n times
{n,} | Repeat n or more times
{n,m} | Repeat n to m times
Quantifiers are "greedy" by default: they match as many characters as possible (up to the maximum allowed number of repetitions) without causing the rest of the pattern to fail. If a quantifier is followed by a ?, however, it becomes lazy (non-greedy) and matches as few characters as possible instead.
The example below shows how the "greedy" and "non-greedy" modes differ.
For the string "aa<div>test1</div>bb<div>test2</div>cc", the regular expression "<div>.*</div>" matches "<div>test1</div>bb<div>test2</div>", while the regular expression "<div>.*?</div>" matches "<div>test1</div>".
For more on the "greedy" and "non-greedy" modes, see http://php.net/manual/zh/regexp.reference.repetition.php
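The difference between greedy and lazy quantifiers is easiest to see by running both on the same input. In this sketch the <div> markup in the sample string is an assumption chosen for illustration, and # is used as the delimiter so that the / in the closing tag needs no escaping.

```php
<?php
$subject = 'aa<div>test1</div>bb<div>test2</div>cc';

// Greedy: .* runs as far as it can, so the match ends at the LAST </div>.
preg_match('#<div>.*</div>#', $subject, $greedy);

// Lazy: .*? stops as early as it can, so the match ends at the FIRST </div>.
preg_match('#<div>.*?</div>#', $subject, $lazy);

echo $greedy[0], "\n"; // <div>test1</div>bb<div>test2</div>
echo $lazy[0], "\n";   // <div>test1</div>
```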
6. Character class (square brackets)
The PHP manual describes it as follows:
A left square bracket starts a character class, and a right square bracket ends it. A right square bracket on its own has no special meaning. If a right square bracket is needed as a member of the class, it should be the first character in the class (or the second, after a negating ^) or be escaped.
A character class matches a single character of the subject. The character must be in the set defined by the class, unless the class begins with ^, in which case the character must not be in the set. If ^ itself is needed as a member of the class, make sure it is not the first character, or escape it.
Examples:
[aeiou] // matches any lowercase vowel
[^aeiou] // matches any character that is not a lowercase vowel
[.?!] // matches a punctuation mark (. or ? or !)
Note: ^ is just a convenient notation for specifying the characters that are in the class by enumerating those that are not. It is not an assertion: it still consumes one character of the subject string, and the match fails if the current position is at the end of the subject.
A hyphen (minus sign) can be used to specify a range of characters, with ranges following the ASCII collating sequence. Ranges can also be given with numeric character values, for example [\000-\037].
[0-9] // means exactly the same as \d
[a-z0-9A-Z_] // exactly equivalent to \w, if only English text is considered
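A short sketch of these character classes in PHP follows. The sample words are invented, and the last two checks are additions that illustrate the notes above: a negated class still has to consume a character, and a ] placed first in a class is taken literally.

```php
<?php
// Lowercase vowels, and everything that is not a lowercase vowel.
preg_match_all('/[aeiou]/', 'regular', $vowels);
preg_match_all('/[^aeiou]/', 'regular', $others);

// One of the punctuation marks . ? !
preg_match('/[.?!]/', 'Really?', $punct);

// The hand-written equivalent of \w for English text.
$wordChars = preg_match('/^[a-z0-9A-Z_]+$/', 'user_42');

// [^b] is not an assertion: with nothing after the a, there is no character to consume.
$atEnd = preg_match('/a[^b]/', 'a');

// A ] that comes first in the class is an ordinary member of the class.
$bracket = preg_match('/[]x]/', 'a]b');

print_r($vowels[0]); // e, u, a
print_r($others[0]); // r, g, l, r
var_dump($punct[0], $wordChars, $atEnd, $bracket);
```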
Here is a more complex expression: \(?0\d{2}[) -]?\d{8}
This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678.
A brief analysis: first comes an escaped \(, which may appear 0 or 1 times (?); then a 0, followed by 2 digits (\d{2}); then one of ), -, or a space, which may appear 0 or 1 times; and finally 8 digits (\d{8}).
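To check the analysis, the phone-number expression can be run against the three formats mentioned. This is a sketch under two assumptions: the character class is written as [) -] (closing parenthesis, space, or hyphen, as the analysis describes), and the pattern is wrapped in / delimiters.

```php
<?php
// Optional "(", a 0, two digits, an optional ")", space or "-", then eight digits.
$pattern = '/\(?0\d{2}[) -]?\d{8}/';

$samples = ['(010)88886666', '022-22334455', '02912345678'];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($pattern, $sample);
}

print_r($results);
```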
7. Branch (|)
The vertical bar character separates alternatives in a pattern. For example, the pattern gilbert|sullivan matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is allowed (it matches the empty string). The matching process tries each alternative in turn from left to right and uses the first one that succeeds. If the alternatives are inside a subgroup (defined below), "succeeds" means matching both that alternative in the subpattern and the rest of the main pattern.
Looking back at the earlier example, \(?0\d{2}[) -]?\d{8} also matches "incorrect" formats such as 010)12345678 or (022-87654321. Branches can solve this problem, as follows:
\({1}0\d{2}\){1}[- ]?\d{8}|0\d{2}[- ]?\d{8} This expression matches a phone number with a 3-digit area code, where the area code may or may not be enclosed in parentheses, and the area code and the local number may be separated by a hyphen or a space, or not separated at all.
When using branches, pay attention to the order of the alternatives: they are tried from left to right, and the first one that matches is used.
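The sketch below compares the single-branch phone pattern with the two-branch version, and then shows why the order of alternatives matters. Both phone patterns are anchored with ^ and $ (and the branches wrapped in a non-capturing group) so that the whole input has to match; that anchoring is an addition for the demonstration, since without it the second branch could still match the digits inside a malformed string.

```php
<?php
$loose  = '/^\(?0\d{2}[) -]?\d{8}$/';
$strict = '/^(?:\({1}0\d{2}\){1}[- ]?\d{8}|0\d{2}[- ]?\d{8})$/';

$samples = ['(010)88886666', '022-22334455', '02912345678', '010)12345678', '(022-87654321'];

$report = [];
foreach ($samples as $sample) {
    // For each sample: [result with the loose pattern, result with the branch pattern].
    $report[$sample] = [preg_match($loose, $sample), preg_match($strict, $sample)];
}
print_r($report);

// Order of alternatives: the first branch that succeeds is used.
preg_match('/a|ab/', 'ab', $shortFirst); // the branch "a" succeeds first
preg_match('/ab|a/', 'ab', $longFirst);  // the branch "ab" succeeds first
var_dump($shortFirst[0], $longFirst[0]);
```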
8. Internal option settings
The same regular expression may produce different matches under different pattern modifiers. Modifiers can also be set from inside the pattern, using the syntax (?modifier).
For example, (?im) sets case-insensitive, multiline matching. Options can also be unset by placing a hyphen before the letter: (?im-sx) sets PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED. If a letter appears both before and after the hyphen, the option is unset.
Below is one simple example; for more, see the "Internal option setting" and "Pattern modifiers" pages of the PHP manual.
Example: /ab(?i)c/ matches only "abc" and "abC".
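The effect of an option set in the middle of a pattern can be checked with a few inputs. This is a minimal sketch; the list of sample strings and the comparison with the trailing i modifier are additions for illustration.

```php
<?php
// (?i) takes effect from the point where it appears, so only the c is case-insensitive.
$partial = '/ab(?i)c/';

$results = [];
foreach (['abc', 'abC', 'ABC', 'Abc'] as $sample) {
    $results[$sample] = preg_match($partial, $sample);
}
print_r($results);

// Setting the option at the very start behaves like the trailing i modifier.
$inlineAtStart   = preg_match('/(?i)abc/', 'ABC');
$trailingModifier = preg_match('/abc/i', 'ABC');
var_dump($inlineAtStart, $trailingModifier);
```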
9. Subgroups (subpatterns)
Subgroups are delimited by parentheses, and they can be nested.
Example:
String: "the red king" Regular expression: ((red|white) (king|queen)) Match result: Array("red king", "red king", "red", "king") Explanation: element 0 is the match for the entire pattern, and the following three elements are the matches for the three subgroups, with indexes 1, 2, and 3.
Often a subgroup is needed for grouping but not for capturing. Placing ?: immediately after the opening parenthesis makes the subgroup non-capturing, and it is not counted when numbering the subsequent capturing subgroups. For example:
String: "the red king" Regular expression: ((?:red|white) (king|queen)) Match result: Array("red king", "red king", "king")
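Both variants can be run through preg_match to see how the result array changes. This is a sketch with the patterns wrapped in / delimiters; the variable names are made up.

```php
<?php
$subject = 'the red king';

// Three capturing subgroups: the outer group, the colour, and the title.
preg_match('/((red|white) (king|queen))/', $subject, $capturing);

// ?: makes the colour group non-capturing, so the title moves up to index 2.
preg_match('/((?:red|white) (king|queen))/', $subject, $nonCapturing);

print_r($capturing);    // red king, red king, red, king
print_r($nonCapturing); // red king, red king, king
```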
As a convenient shorthand, if option settings are needed at the start of a non-capturing subgroup, the option letters can be placed between the ? and the :, for example:
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
These two forms are the same pattern. Because alternatives are tried from left to right and options are not reset until the end of the subpattern, an option setting in one alternative carries over to the alternatives after it, so the patterns above match "SUNDAY" as well as "Saturday".
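A quick way to confirm that the two spellings behave alike is to run both against the same inputs. This is only a sketch; the / delimiters and the choice of test strings are assumptions.

```php
<?php
$shorthand = '/(?i:saturday|sunday)/';
$longhand  = '/(?:(?i)saturday|sunday)/';

$results = [];
foreach (['SUNDAY', 'Saturday'] as $day) {
    // Each entry holds [result with the shorthand, result with the longhand].
    $results[$day] = [preg_match($shorthand, $day), preg_match($longhand, $day)];
}

print_r($results);
```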
Next, here is an expression that matches an IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
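The IP address expression can be tested as sketched below. To keep the code readable, the repeated octet part is stored in a hypothetical $octet variable with non-capturing groups, and the pattern is anchored with \A and \z so that the whole string must be an address; both choices are assumptions added for the demonstration.

```php
<?php
// One octet: 200-249, 250-255, or 0-199 (with an optional leading 0 or 1).
$octet = '(?:2[0-4]\d|25[0-5]|[01]?\d\d?)';

// Three octets each followed by a dot, then a final octet.
$pattern = '/\A(?:' . $octet . '\.){3}' . $octet . '\z/';

$results = [];
foreach (['192.168.1.1', '255.255.255.255', '256.1.1.1', '1.2.3'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
}

print_r($results);
```

For real input validation, PHP's built-in filter_var() with FILTER_VALIDATE_IP is usually a simpler option than a hand-written pattern.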
Conclusion
The sections above cover the regular expression syntax most commonly used in PHP. Some syntax was not covered in detail, or not covered at all, such as pattern modifiers, back references, assertions, and recursive patterns; see the PHP manual for those.
Tip: for the same task, regular expression functions generally run slower than string functions, so if the task is simple, use string functions. However, when a task can be done with a single regular expression, replacing it with multiple string functions is not the right choice. (From the book "PHP and MySQL Web Development".)
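As a small illustration of the tip, the sketch below uses a string function for a fixed-substring check and a single regular expression for a task that would otherwise take several string calls. The sample strings are invented, and no timing is measured here.

```php
<?php
$haystack = 'regular expressions in PHP';

// Simple task: looking for a fixed substring needs no regex at all.
$viaString = strpos($haystack, 'PHP') !== false;

// The regex version gives the same answer but involves the regex engine.
$viaRegex = preg_match('/PHP/', $haystack) === 1;

// A task that suits one regex: collapse every run of whitespace into a single space.
$collapsed = preg_replace('/\s+/', ' ', "a  b\t\tc\n d");

var_dump($viaString, $viaRegex, $collapsed);
```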
Resources
http://php.net/manual/zh/book.pcre.php
https://msdn.microsoft.com/zh-cn/library/d9eze55x%28v=vs.80%29.aspx
http://deerchao.net/tutorials/regex/regex.htm
http://tool.chinaz.com/regex/
http://www.regexlab.com/zh/regref.htm