The metacharacters [ and ] define a character set; the match must be exactly one of the characters in the set.
\w: any letter (upper- or lowercase), digit, or underscore (equivalent to [a-zA-Z0-9_]); it matches a single character only.
\W: any character that is not a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_]).
\s: any whitespace character (roughly equivalent to [ \f\n\r\t\v]).
\S: any non-whitespace character (roughly equivalent to [^ \f\n\r\t\v]).
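The character classes above can be tried out with a short Python sketch. The sample string and the variable names are made up for illustration; only the standard `re` module is used.

```python
import re

# Hypothetical sample string: word characters, spaces, '=' and a newline.
sample = "user_1 = 42\n"

word_chars = re.findall(r"\w", sample)      # letters, digits, underscore
non_word_chars = re.findall(r"\W", sample)  # everything \w does not match
whitespace = re.findall(r"\s", sample)      # the two spaces and the newline
vowels = re.findall(r"[aeiou]", sample)     # exactly one character from the set

print(word_chars)
print(non_word_chars)
print(whitespace)
print(vowels)
```

Each call returns single characters, which reflects the point that a class or a set matches one character at a time.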
https? matches both http and https. The ? here means that the character (or subexpression) immediately before it may be absent or may appear at most once.
{} Braces. The last form gives a minimum number of repetitions with no maximum. For example, {3,} means the preceding item must repeat at least three times, that is, three or more times.
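A minimal Python sketch of the optional ? and the open-ended {3,} quantifier follows. The candidate strings and the digit sample are assumptions chosen for illustration.

```python
import re

# ? makes the preceding "s" optional: zero or one occurrence.
candidates = ["http", "https", "httpss"]
optional_s = [c for c in candidates if re.fullmatch(r"https?", c)]

# {3,} requires at least three repetitions and sets no upper limit.
digit_runs = re.findall(r"\d{3,}", "7 42 123 98765")

print(optional_s)
print(digit_runs)
```

`re.fullmatch` is used so that the whole candidate must fit the pattern; with `re.search`, "httpss" would still contain a match.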
Lazy matching: *? is the lazy version of *. A greedy quantifier matches as much text as possible; its lazy version matches as little as possible.
Common greedy metacharacters and their lazy versions
Greedy    Lazy
*         *?
+         +?
{n,}      {n,}?
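The difference between a greedy quantifier and its lazy version shows up as soon as a text contains more than one possible end point. A small Python sketch, with a made-up sample string:

```python
import re

# Hypothetical sample with two separate <b>...</b> pairs.
text = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>, so both pairs end up in one match.
greedy = re.findall(r"<b>.*</b>", text)

# Lazy: .*? stops at the first </b>, so each pair is matched separately.
lazy = re.findall(r"<b>.*?</b>", text)

print(greedy)
print(lazy)
```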
Word boundaries
For example, cat matches cat, but it also matches the cat in category. To match only the whole word cat, write \bcat\b (\b stands for a word boundary). To match a position that is not a word boundary, use \B.
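The effect of \b and \B can be seen by comparing match positions. This is a sketch in Python; the sample text (including the word concat) is an assumption added for illustration.

```python
import re

text = "cat category concat"  # hypothetical sample

# Start offsets of every match for each pattern.
plain = [m.start() for m in re.finditer(r"cat", text)]
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
not_at_word_start = [m.start() for m in re.finditer(r"\Bcat", text)]

print(plain)              # every occurrence of the letters c-a-t
print(whole_word)         # only the standalone word
print(not_at_word_start)  # only where "cat" does not begin a word
```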
.* matches any character, zero or more times (by default . does not match a newline).
A pattern that matches JavaScript comment lines:
(?m)^\s*//.*$
Explanation:
(?m) turns on multiline mode, so ^ and $ match at the start and end of each line
^...$ anchor the match to the start and end of a line
\s* matches zero or more whitespace characters
// matches the comment marker
.* matches any characters after the comment marker
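The comment-line pattern can be checked against a few lines of JavaScript. The snippet below is a Python sketch; the JavaScript source held in `source` is invented for the example.

```python
import re

# Hypothetical JavaScript source: two comment lines and one trailing comment.
source = (
    "var a = 1;\n"
    "// first comment\n"
    "  // indented comment\n"
    "var b = 2; // trailing\n"
)

# (?m) lets ^ and $ match at the start and end of every line.
comment_lines = re.findall(r"(?m)^\s*//.*$", source)

for line in comment_lines:
    print(repr(line))
```

The trailing comment on the last line is not reported, because the pattern only allows whitespace between the start of the line and the // marker.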
A subexpression is enclosed in ( and ) and is treated as a single element. Consider a regular expression that matches IP addresses. The straightforward pattern is:
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Using a subexpression, it simplifies to:
(\d{1,3}\.){3}\d{1,3}
The problem is that 999.999.999.999 also matches, which is wrong. Each number in a valid IP address must take one of the following forms:
One or two digits: any digits will do (\d{1,2})
Three digits starting with 1: 1\d{2}
Three digits starting with 2 whose second digit is 0 to 4: 2[0-4]\d
Three digits starting with 25 whose third digit is 0 to 5: 25[0-5]
Combining these rules gives:
(((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))\.){3}((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))
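The combined IP pattern can be assembled from one reusable piece and tested in Python. This is a sketch: the names `OCTET`, `IP_PATTERN`, and `is_valid_ip` are hypothetical, and `fullmatch` is an added assumption so that the whole string has to be an address.

```python
import re

# One number of the address: 0-99, 100-199, 200-249, or 250-255.
OCTET = r"((\d{1,2})|(1\d{2})|(2[0-4]\d)|(25[0-5]))"

# Three "number plus dot" groups followed by a final number.
IP_PATTERN = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)

# The simple version from the text, kept for comparison.
NAIVE_PATTERN = re.compile(r"(\d{1,3}\.){3}\d{1,3}")


def is_valid_ip(text):
    """Return True if the whole string fits the combined pattern."""
    return IP_PATTERN.fullmatch(text) is not None


for candidate in ["192.168.1.255", "10.0.0.1", "999.999.999.999", "256.1.1.1"]:
    print(candidate, is_valid_ip(candidate))
```

Without anchoring, a search with this pattern can stop early inside a longer number, because the one-or-two-digit alternative is tried first; matching the full string avoids that.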
Backreferences
Suppose we want to find words that are repeated in a piece of text. How do we express that?
The text is:
This is is a text, and I want to know know if there is any repeated words in it.
The regular expression is:
[ ]+(\w+)[ ]+\1
This finds the expected results:
is
know
How does it work? [ ]+ matches one or more spaces, \w+ matches one or more alphanumeric characters, and the second [ ]+ matches the spaces that follow. Note that \w+ is enclosed in parentheses, which makes it a subexpression. This subexpression is not there for repetition; no repeated matching is involved. It only marks off part of the pattern so that it can be referred to later. The last part of the pattern, \1, is a backreference to the subexpression defined earlier: when (\w+) matches the word is, \1 also matches is; when (\w+) matches know, \1 also matches know.
So what exactly does \1 stand for? It stands for the first subexpression in the pattern; \2 stands for the second, \3 for the third, and so on. In the example above, [ ]+(\w+)[ ]+\1 therefore matches any word that appears twice in a row.
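The repeated-word search can be reproduced with a few lines of Python. The sample sentence is an assumption: it is the sentence from the text with the words is and know doubled so that the pattern has something to find.

```python
import re

# Assumed sample: "is" and "know" each appear twice in a row.
text = ("This is is a text, and I want to know know "
        "if there is any repeated words in it.")

# (\w+) captures a word; \1 requires the same word again after the spaces.
repeated = re.findall(r"[ ]+(\w+)[ ]+\1", text)

print(repeated)
```

`re.findall` returns the content of the single capture group, so the output lists each repeated word once.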
========================================================== ==============================================
Basic metacharacters
.         Matches any single character
|         Logical OR operator
[]        Matches one character from a character set
[^]       Matches one character that is not in the character set
-         Defines a range (for example, A-Z)
\         Escapes the next character
Quantifier metacharacters
*         Matches zero or more occurrences of the previous character (subexpression)
*?        Lazy version of *
+         Matches one or more occurrences of the previous character (subexpression)
+?        Lazy version of +
?         Matches zero or one occurrence of the previous character (subexpression)
{n}       Matches exactly n occurrences of the previous character (subexpression)
{m,n}     Matches at least m and at most n occurrences of the previous character (subexpression)
{n,}      Matches n or more occurrences of the previous character (subexpression)
{n,}?     Lazy version of {n,}
Position metacharacters
^         Matches the start of a string
\A        Matches the start of a string
$         Matches the end of a string
\Z        Matches the end of a string
\<        Matches the start of a word
\>        Matches the end of a word
\b        Matches a word boundary (start or end)
\B        Opposite of \b
Special metacharacters
[\b]      Backspace character
\c        Matches a control character
\d        Matches any digit
\D        Opposite of \d
\f        Form feed
\n        Line feed
\r        Carriage return
\s        Matches a whitespace character
\S        Opposite of \s
\t        Tab
\v        Vertical tab
\w        Matches any letter, digit, or underscore
\W        Opposite of \w
\x        Matches a hexadecimal number
\0        Matches an octal number
Backreferences and lookaround
()        Defines a subexpression
\1        Matches the first subexpression; \2 matches the second, and so on
?=        Lookahead
?<=       Lookbehind
?!        Negative lookahead
?<!       Negative lookbehind
?()       Condition (if then)
?()|      Condition (if then else)
Case conversion
\E        Ends an \L or \U conversion
\l        Converts the next character to lowercase
\L        Converts the following characters to lowercase until \E is reached
\u        Converts the next character to uppercase
\U        Converts the following characters to uppercase until \E is reached
Matching mode
(?m)      Multiline matching mode
Lookahead
Syntactically, a lookahead (forward lookup) is a subexpression that begins with ?=; the text to look for follows the =. The lookahead must match, but what it matches is not included in the result.
For example, to find out whether a URL uses http or https, you can use a lookahead:
http://www.111cn.net
Regular expression: .+(?=:)
Result: http
If .+(:) is used instead, the result is http:
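The two URL patterns can be compared directly in Python. This sketch uses the URL from the text; the variable names are arbitrary.

```python
import re

url = "http://www.111cn.net"

# Lookahead: the colon must follow, but it is not part of the match.
with_lookahead = re.search(r".+(?=:)", url).group()

# Plain group: the colon is consumed and appears in the match.
with_group = re.search(r".+(:)", url).group()

print(with_lookahead)
print(with_group)
```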
Lookbehind
A lookbehind (backward lookup) checks for text that appears before the match without consuming it. The operator is ?<=
The text is ABC0: $12.56
Regular expression: (?<=\$)[0-9.]+
With the lookbehind the result is 12.56; without it (that is, \$[0-9.]+) the result is $12.56.
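A short Python sketch of the price example, run with and without the lookbehind. The variable names are arbitrary.

```python
import re

text = "ABC0: $12.56"

# The $ must precede the match but is left out of the result.
with_lookbehind = re.findall(r"(?<=\$)[0-9.]+", text)

# Matching the $ directly makes it part of the result.
without_lookbehind = re.findall(r"\$[0-9.]+", text)

print(with_lookbehind)
print(without_lookbehind)
```

Python's `re` module requires a lookbehind to have a fixed width, which a single escaped $ satisfies.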
Combining lookahead and lookbehind
For example, take the following text:
<head>
<title>Ben Forta's HomePage</title>
</head>
Extracting the content between <title> and </title> without including the tags themselves is awkward otherwise. With a lookbehind and a lookahead, a single regular expression is enough:
Regular expression: (?<=<title>).*?(?=</title>)
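The title extraction can be sketched in Python as follows. The markup is the sample from the text written with lowercase tags, which is an assumption about the original casing.

```python
import re

html = "<head>\n<title>Ben Forta's HomePage</title>\n</head>"

# The lookbehind and lookahead pin the match between the two tags
# without including either tag in the result.
title = re.search(r"(?<=<title>).*?(?=</title>)", html).group()

print(title)
```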
The lookahead and lookbehind described so far are the positive forms. Negative lookahead and negative lookbehind also exist:
Operator   Description
(?=)       Positive lookahead
(?!)       Negative lookahead
(?<=)      Positive lookbehind
(?<!)      Negative lookbehind
Negative lookbehind
The text is: I paid $30 for 100 apples.
Regular expression: \b(?<!\$)\d+\b
This finds a number that is not preceded by the $ symbol. The result is 100.
Negative lookahead works in the same way.
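The negative lookbehind example can be verified in Python; the positive lookbehind is included for contrast. This is a sketch with arbitrary variable names.

```python
import re

text = "I paid $30 for 100 apples."

# Numbers that are NOT directly preceded by a dollar sign.
quantities = re.findall(r"\b(?<!\$)\d+\b", text)

# For contrast: numbers that ARE directly preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+\b", text)

print(quantities)
print(prices)
```

The \b at the start matters: without it, the pattern could match the 0 in $30, since that digit is preceded by 3 rather than by $.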
Embedding conditions in regular expressions
Conditions are powerful but not frequently used. A condition is introduced with ?. There are two kinds of embedded condition:
Conditions based on a backreference
Conditions based on a lookahead or lookbehind
First, the backreference condition. Suppose you need to find all <img> tags in a piece of text and, whenever an <img> tag is a link (enclosed between <A> and </A>), match the whole link as well.
The syntax for this condition is (?(backreference)true-regex), where ? marks a condition, the backreference in parentheses is the subexpression being tested, and true-regex is an expression that is applied only when the backreference exists. For example:
<!-- Nav Bar -->
<td>
<a href='/home'></a>
<a href='/search'></a>
<a href='/help'></a>
</td>
The regular expression is:
(<[Aa]\s+[^>]+>\s*)?<[Ii][Mm][Gg]\s+[^>]+>(?(1)\s*</[Aa]>)
Explanation:
(<[Aa]\s+[^>]+>\s*)? matches an optional <a> or <A> tag.
<[Ii][Mm][Gg]\s+[^>]+> matches an <img> tag and any attributes it has.
(?(1)\s*</[Aa]>) is the backreference condition. ?(1) means: if the first backreference (here, the <A> tag) exists, continue matching with \s*</[Aa]>; in other words, the closing </A> tag is matched only when the opening <A> tag was matched. When (1) exists, \s*</[Aa]> matches any whitespace followed by the closing </A> tag.
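Python's `re` module supports the (?(1)...) conditional, so the pattern can be tried as written. The markup below is a hypothetical sample: the image file names are invented, and one image is deliberately left outside any link.

```python
import re

# Hypothetical navigation markup: two linked images and one bare image.
html = (
    '<td>'
    '<a href="/home"><img src="/images/home.gif"></a> '
    '<img src="/images/spacer.gif"> '
    '<a href="/search"><img src="/images/search.gif"></a>'
    '</td>'
)

pattern = re.compile(
    r"(<[Aa]\s+[^>]+>\s*)?"      # group 1: optional opening <a ...> tag
    r"<[Ii][Mm][Gg]\s+[^>]+>"    # the <img ...> tag itself
    r"(?(1)\s*</[Aa]>)"          # only if group 1 matched: closing </a>
)

matches = [m.group(0) for m in pattern.finditer(html)]

for found in matches:
    print(found)
```

Linked images are returned together with their surrounding link tags, while the bare image is returned on its own, because the condition adds nothing when group 1 did not take part in the match.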