Introduction to regular expressions
Applications often need to match (search) strings against a rule. For example, a user may be required to enter a QQ number with at least five digits. The tool used to describe such rules is a regular expression.
Simplest matching
The simplest match gives a literal character. Matching the character a against aabab yields three results: the first, second, and fourth characters of the string. Real tasks are usually more complicated. For example, the rule "a QQ number is a number with at least five digits" is written as:
- ^\d{5,}$
This regular expression requires the input to be a number of at least five digits. Here is how each part describes the rule:
- ^: matches the start of the string, so the match begins at the very start rather than somewhere inside a longer string.
- \d: matches a digit.
- {5,}: repeats the preceding item five or more times.
- $: matches the end of the string, so the match finishes at the very end.
Taken together, the expression matches a run of five or more consecutive digits that starts at the beginning of the string and ends at its end. A number with fewer than five digits, or a string such as a123456b that does not start and end with a digit, is invalid.
This example also shows that a regular expression is read from left to right.
Similarly, to match a mobile number, the regular expression is:
- ^1\d{10}$
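The two patterns above can be tried out with a short script. This is a minimal sketch in Python, chosen here only for illustration; the function names are hypothetical, and the re.ASCII flag is an added assumption so that \d means 0-9 only.

```python
import re

# Patterns from the text, with the backslashes written out.
# re.ASCII restricts \d to the digits 0-9 (an assumption of this sketch).
QQ_PATTERN = re.compile(r"^\d{5,}$", re.ASCII)       # five or more digits
MOBILE_PATTERN = re.compile(r"^1\d{10}$", re.ASCII)  # a 1, then ten more digits


def is_qq_number(text: str) -> bool:
    """Return True if the QQ number pattern matches text."""
    return QQ_PATTERN.search(text) is not None


def is_mobile_number(text: str) -> bool:
    """Return True if the mobile number pattern matches text."""
    return MOBILE_PATTERN.search(text) is not None


if __name__ == "__main__":
    for candidate in ["12345", "1234", "a123456b", "13800138000"]:
        print(candidate, is_qq_number(candidate), is_mobile_number(candidate))
```

Because of ^ and $, a123456b is rejected even though it contains six consecutive digits.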
Tip
Because it is often hard to predict exactly what a regular expression will match, it is a good idea to test expressions with a helper tool such as Match Tracer or RegExBuilder.
Metacharacters
In the example above, the symbols ^, \d, and $ each carry a specific matching meaning. They are called metacharacters. Common metacharacters are as follows:
Metacharacter | Description
. | Matches any character except a line break
\w | Matches a letter, digit, or underscore
\s | Matches any whitespace character
\d | Matches a digit
\b | Matches the start or end of a word
^ | Matches the start of the string
$ | Matches the end of the string
[x] | Matches the character x; for example, [abc] matches a, b, or c
\W | Matches any character that is not a letter, digit, underscore, or Chinese character
\S | The opposite of \s: matches any non-whitespace character
\D | The opposite of \d: matches any non-digit character
\B | The opposite of \b: matches a position that is not the start or end of a word
[^x] | Matches any character except x; for example, [^abc] matches any character other than a, b, or c
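A few of the metacharacters in the table can be seen in action in the sketch below. It uses Python's re module purely for illustration (an assumption; other engines may differ in detail, for example in which characters \w accepts), and the sample strings are made up.

```python
import re

text = "Order 66 shipped to unit_7B!"

# \d matches one digit at a time.
digits = re.findall(r"\d", text)

# \w+ matches runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)

# \W matches everything \w does not: here the spaces and the "!".
non_words = re.findall(r"\W", text)

# \b marks a word boundary, so "to" inside "tomato" is not matched.
whole_word = re.findall(r"\bto\b", "to tomato to")

# [aeiou] matches any one listed character; [^abc] matches anything else.
vowels = re.findall(r"[aeiou]", "regex")
not_abc = re.findall(r"[^abc]", "aabbccd")

if __name__ == "__main__":
    print(digits, words, non_words, whole_word, vowels, not_abc)
```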
Tip
- To match one of these metacharacters literally, escape it with a backslash (\), which is the escape character in regular expressions as well. For example, to match a literal . write \. because otherwise . is interpreted as "any character except a line break". To match a backslash itself, write \\.
- Consecutive digits or letters can be written as a range with the - symbol. For example, [a-z] matches all lowercase letters and [1-5] matches the digits 1 to 5.
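The effect of escaping and of ranges can be checked with a small sketch. Python's re module is assumed here for illustration, and the sample strings are invented. re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

sample = "a.c abc a-c"

# Unescaped, the dot matches any character between a and c.
loose = re.findall(r"a.c", sample)

# Escaped, the dot matches only a literal ".".
strict = re.findall(r"a\.c", sample)

# A literal backslash is written as \\ in the pattern.
backslashes = re.findall(r"\\", r"C:\temp\new")

# Ranges inside brackets.
lowercase = re.findall(r"[a-z]", "aB3c")
one_to_five = re.findall(r"[1-5]", "0123456")

# re.escape produces a pattern that matches its argument literally.
escaped = re.escape("1.5")

if __name__ == "__main__":
    print(loose, strict, len(backslashes), lowercase, one_to_five, escaped)
```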
Repetition
The power of a regular expression lies in its ability to include choices and loops in a pattern. Regular expressions express loops with repetition rules (quantifiers).
The common repetition rules are as follows:
Quantifier | Description
* | Repeats zero or more times
+ | Repeats one or more times
? | Repeats zero or one time
{n} | Repeats exactly n times
{n,} | Repeats n or more times
{n,m} | Repeats n to m times
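Each quantifier in the table can be exercised with a one-line check. The sketch below assumes Python's re module for illustration; matches_whole is a hypothetical helper that tests whether a pattern matches an entire string.

```python
import re


def matches_whole(pattern: str, text: str) -> bool:
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    print(matches_whole(r"ab*c", "ac"))         # * allows zero b's
    print(matches_whole(r"ab+c", "ac"))         # + needs at least one b
    print(matches_whole(r"colou?r", "color"))   # ? makes the u optional
    print(matches_whole(r"\d{3}", "123"))       # exactly three digits
    print(matches_whole(r"\d{2,}", "12345"))    # two or more digits
    print(matches_whole(r"\d{2,4}", "12345"))   # two to four digits only
```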
Branch
A branch lets you write several alternative rules; if any one of them is satisfied, the match succeeds. The alternatives are separated with the | symbol and are tried from left to right.
Tip
Because of how branches work, once one alternative matches, the alternatives after it are not tried. When one alternative contains another, pay attention to the order in which you write them.
The following is an example of using a branch.
A United States zip code is either five digits, or five digits followed by a hyphen and four more digits, such as 12345 or 54321-1234. To match all zip codes, the correct regular expression is:
- \d{5}-\d{4}|\d{5}
- // Incorrect
- \d{5}|\d{5}-\d{4}
The incorrect expression matches five-digit zip codes, but for a nine-digit zip code it matches only the first five digits and never the full code.
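The difference between the two orderings can be observed directly. This is a minimal sketch assuming Python's re module, which also tries alternatives from left to right; the variable names are illustrative.

```python
import re

CORRECT = re.compile(r"\d{5}-\d{4}|\d{5}")  # longer alternative first
WRONG = re.compile(r"\d{5}|\d{5}-\d{4}")    # shorter alternative first

nine_digit = "54321-1234"

# The correct order finds the whole zip code.
correct_match = CORRECT.search(nine_digit).group()

# The wrong order stops as soon as the first alternative succeeds.
wrong_match = WRONG.search(nine_digit).group()

if __name__ == "__main__":
    print(correct_match)
    print(wrong_match)
```

Both orderings return the same result for a plain five-digit code such as 12345; the order only matters when one alternative is a prefix of the other.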
Group
In a regular expression, part of a rule can be enclosed in parentheses to form a group, and the group can then be treated as a single unit, much like one metacharacter.
For example, to verify an IP address:
- (\d{1,3}\.){3}\d{1,3}
This is a simple but imperfect regular expression for matching IP addresses: besides valid addresses, it also matches impossible ones such as 322.197.578.888.
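The weakness of the simple pattern is easy to confirm. The sketch below assumes Python's re module for illustration; LOOSE_IP is a name invented for this example.

```python
import re

# A group (\d{1,3}\.) repeated three times, then a final run of digits.
LOOSE_IP = re.compile(r"(\d{1,3}\.){3}\d{1,3}", re.ASCII)

valid = LOOSE_IP.fullmatch("192.168.0.1") is not None
impossible = LOOSE_IP.fullmatch("322.197.578.888") is not None
too_short = LOOSE_IP.fullmatch("192.168.0") is not None

if __name__ == "__main__":
    # The second result shows that the pattern checks only the shape,
    # not whether each number is in range.
    print(valid, impossible, too_short)
```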
Of course, once this simple expression has matched, you can use ordinary arithmetic comparison in PHP to decide whether the IP address is valid, because regular expressions themselves provide no arithmetic comparison. To match only valid IP addresses with a regular expression alone, improve the expression as follows:
- ((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)
Rule description
The key to this rule is restricting each segment of the IP address to the range 0 to 255 and then repeating the segment four times. In:
- 25[0-5]|2[0-4]\d|[01]?\d\d?
the branches 25[0-5] and 2[0-4]\d first cover 250-255 and 200-249. [01]?\d\d? then covers 0-199, so together they cover 0-255.
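The two approaches described above, a strict pattern and a loose pattern followed by an arithmetic check, can be compared side by side. This is a sketch assuming Python's re module rather than PHP; all names in it are hypothetical.

```python
import re

# One segment: 250-255, 200-249, or 0-199 (leading zeros allowed).
OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d\d?)"
STRICT_IP = re.compile(rf"({OCTET}\.){{3}}{OCTET}", re.ASCII)
LOOSE_IP = re.compile(r"(\d{1,3}\.){3}\d{1,3}", re.ASCII)


def is_ip_by_regex(text: str) -> bool:
    """Validate using the strict pattern alone."""
    return STRICT_IP.fullmatch(text) is not None


def is_ip_by_arithmetic(text: str) -> bool:
    """Validate the shape with the loose pattern, then compare numbers."""
    if LOOSE_IP.fullmatch(text) is None:
        return False
    return all(int(part) <= 255 for part in text.split("."))


if __name__ == "__main__":
    for candidate in ["192.168.0.1", "255.255.255.255", "322.197.578.888"]:
        print(candidate, is_ip_by_regex(candidate), is_ip_by_arithmetic(candidate))
```

The arithmetic version is usually easier to read and to change, while the strict pattern is convenient when only a single regular expression can be supplied.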
Greed and laziness
By default, a regular expression matches as much content as possible while still satisfying the pattern. For example, matching a.*b against aabab matches the entire aabab rather than stopping at aab. This is greedy matching.
The opposite of greedy matching is lazy matching, which matches as little content as possible while still satisfying the pattern.
For the preceding example, the lazy rule is:
- a.*?b
Matching this expression against aabab produces two results: aab and ab.
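The aabab example can be reproduced in a few lines. The sketch assumes Python's re module, whose greedy and lazy quantifiers behave the same way for this case.

```python
import re

subject = "aabab"

# Greedy: .* grabs as much as it can, so there is one long match.
greedy = re.findall(r"a.*b", subject)

# Lazy: .*? stops at the first b it can, so there are two short matches.
lazy = re.findall(r"a.*?b", subject)

if __name__ == "__main__":
    print(greedy)
    print(lazy)
```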
Common lazy quantifiers are as follows:
Lazy quantifier | Description
*? | Repeats any number of times, but as few as possible
+? | Repeats one or more times, but as few as possible
?? | Repeats zero or one time, but as few as possible
{n,}? | Repeats n or more times, but as few as possible
{n,m}? | Repeats n to m times, but as few as possible
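The lazy quantifiers in the table can be compared with their greedy counterparts as follows. This is a sketch assuming Python's re module; the sample strings and the first_match helper are made up for the example.

```python
import re


def first_match(pattern: str, text: str) -> str:
    """Return the text of the first match of pattern in text."""
    return re.search(pattern, text).group()


digits = "12345"
html = "<b>bold</b>"

greedy_plus = first_match(r"\d+", digits)        # takes every digit
lazy_plus = first_match(r"\d+?", digits)         # takes just one
lazy_star = first_match(r"\d*?", digits)         # takes none at all
lazy_optional = first_match(r"\d??", digits)     # prefers zero repetitions
lazy_at_least = first_match(r"\d{2,}?", digits)  # stops at the minimum, 2
lazy_range = first_match(r"\d{2,4}?", digits)    # also stops at 2

# A typical use: match each tag separately instead of the whole string.
greedy_tags = re.findall(r"<.+>", html)
lazy_tags = re.findall(r"<.+?>", html)

if __name__ == "__main__":
    print(greedy_plus, lazy_plus, repr(lazy_star), repr(lazy_optional))
    print(lazy_at_least, lazy_range, greedy_tags, lazy_tags)
```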
Pattern modifier
A pattern modifier is written outside the regular expression itself, after the closing delimiter. It can be seen as a supplement to the regular expression.
Common pattern modifiers are as follows:
Pattern modifier | Description
i | Letters in the pattern match both uppercase and lowercase letters
m | The string is treated as multiple lines
s | The string is treated as a single line, and a line break is matched like any ordinary character
x | Whitespace in the pattern is ignored
e | preg_replace() substitutes the backreferences in the replacement string, evaluates the result as PHP code, and uses the value to replace the matched text (deprecated in PHP 5.5 and removed in PHP 7.0)
A | Forces the match to start at the beginning of the target string
D | The $ metacharacter matches only at the very end of the target string
U | Quantifiers become lazy by default, so the nearest possible string is matched
u | The pattern string is treated as UTF-8
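The effect of the i, m, s, and x modifiers can be illustrated with the flags that play the same role in Python's re module. This is only an analogy under that assumption: the modifier letters in the table belong to PHP, and the remaining modifiers (e, A, D, U, u) have no one-to-one flag in this sketch.

```python
import re

# i -> re.IGNORECASE: letters match regardless of case.
caseless = re.findall(r"php", "PHP php Php", re.IGNORECASE)

# m -> re.MULTILINE: ^ and $ also match at each line break.
single_line = re.findall(r"^\w+", "one\ntwo\nthree")
multi_line = re.findall(r"^\w+", "one\ntwo\nthree", re.MULTILINE)

# s -> re.DOTALL: the dot also matches a line break.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# x -> re.VERBOSE: whitespace and # comments in the pattern are ignored.
ZIP_VERBOSE = re.compile(
    r"""
    \d{5}        # five digits
    (-\d{4})?    # optionally a hyphen and four more digits
    """,
    re.VERBOSE,
)
verbose_ok = ZIP_VERBOSE.fullmatch("12345-6789") is not None

if __name__ == "__main__":
    print(caseless, single_line, multi_line, dot_default, dot_all, verbose_ok)
```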