My real understanding of regular expressions (RE) began with this article. RE is a deep subject, so I have sorted out and reviewed what I learned.
A regular expression is a piece of code that records rules about text.
As stated above, an RE describes a rule for how characters are arranged. It has two elements:
1. Form of expression:
What is expressed consists of subjects and the way they are combined.
A) Subject: the characters themselves, whether a given token in the regular expression stands for a single character or for a whole class of characters. In short, the subjects are what you can see.
B) Combination: that is, position. The same characters in different positions or in a different order represent a different string.
The content at this level can therefore be read directly from the result of an RE match.
. | Any character except the line feed "\n" |
\b | The start or end of a word, i.e. a word boundary. \b matches a position where one of the characters before and after it is \w and the other is not (the edge of the string counts as not \w). A "word" in RE is a run of one or more consecutive \w |
\d | One digit (0, 1, 2 ... 9) |
\s | Any whitespace character: space, tab, line break, Chinese full-width space |
\w | A letter, digit, underscore, or Chinese character |
^ | Start of the string |
$ | End of the string |
+ | Repeats the preceding item 1 or more times |
* | Repeats the preceding item any number of times, i.e. 0 or more times |
? | Repeats the preceding item 0 or 1 time |
{n} | Repeats the preceding item n times |
{n,} | Repeats the preceding item n or more times |
{n,m} | Repeats the preceding item n to m times |
\ | Escape: cancels the special meaning of a symbol so that the symbol itself is matched: \ becomes \\, * becomes \*, . becomes \., ( becomes \(, ) becomes \) |
[aeiou] | Matches one character; the candidates are a, e, i, o, u |
[0-9] | Matches one digit; the candidates are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 |
Antonyms: the upper-case form means the opposite of the lower-case form
\W | Any character that is not a letter, digit, underscore, or Chinese character |
\S | Any non-whitespace character |
\D | Any non-digit character |
\B | A position that is not the start or end of a word |
[^x] | Any character except x |
[^aeiou] | Any character except a, e, i, o, u |
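The two tables above can be tried out with a short sketch. Python's built-in `re` module is an assumption here, since the article does not tie itself to one engine, and all sample strings are made up for illustration.

```python
import re

text = "Room 42, floor 7: call_me at 9am"  # made-up sample text

# \d+ : one or more consecutive digits
digits = re.findall(r"\d+", text)

# \b\w+\b : whole words (runs of letters, digits, underscores)
words = re.findall(r"\b\w+\b", text)

# [aeiou] : one character out of the listed candidates
no_vowels = re.sub(r"[aeiou]", "", "regular")

# \D+ is the opposite of \d+: runs of non-digit characters
non_digits = re.findall(r"\D+", "a1b22c")

# In Python 3, \s and \w are Unicode-aware for str patterns, so the
# full-width space (U+3000) counts as whitespace and Chinese characters as \w
parts = re.split(r"\s+", "正则\u3000表达式")
is_word = re.fullmatch(r"\w+", "正则_1") is not None

# ^ and $ anchor the pattern to the start and end of the string;
# \. is an escaped, literal dot
is_price = re.search(r"^\d+\.\d{2}$", "19.99") is not None

print(digits, words, no_vowels, non_digits, parts, is_word, is_price)
```

Other engines may treat `\w` and `\s` as ASCII-only, so the Chinese-character behaviour is worth checking in whatever engine is actually used.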
2. Means of expression:
This level is about how to describe a rule sensibly, accurately and concisely. It covers techniques such as branches, groups, backreferences, zero-width assertions, greedy and lazy matching, recursive matching, and so on.
Some may object that the items in the previous line also show up directly in the match result. My point is the means of expression: using these techniques makes an expression more concise and more accurate. In fact, some rules cannot be expressed without them; backreferences are a clear example.
Branch condition |
Use "|" to separate alternative rules.
Eg. > Domestic (Chinese) fixed-line telephone: 0\d{2}-\d{8}|0\d{3}-\d{7}
A three-digit or four-digit area code, separated from the number by "-".
> Chinese fixed-line telephone with a three- or four-digit area code, where the area code may be wrapped in parentheses and may be followed by a hyphen (-), a space, or nothing: \(?0\d{2}\)?[- ]?\d{8}|\(?0\d{3}\)?[- ]?\d{7}
Bug: 01234567890 is accepted although nothing tells whether its area code has three or four digits, and 012-34567890 is accepted as well.
> (\d{11})|^((\d{7,8})|(\d{4}|\d{3})-(\d{7,8})|(\d{4}|\d{3})-(\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1})|(\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1}))$
Supports mobile numbers, 3- or 4-digit area codes, 7- or 8-digit direct-dial numbers, and 1- to 4-digit extension numbers.
> U.S. postal code: \d{5}-\d{4}|\d{5}
Nine digits with a "-" after the first five, or just five digits. (Mind the order of the branches: with \d{5}|\d{5}-\d{4}, only the first five digits of a nine-digit code are matched, because branches are tried from left to right and the remaining ones are skipped as soon as one succeeds.) |
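The effect of branch order can be checked with a minimal sketch. It assumes Python's `re` module, which the article does not name, and the variable names and sample values are hypothetical.

```python
import re

# The order of branches matters: the engine tries them left to right
zip_long_first = re.compile(r"\d{5}-\d{4}|\d{5}")
zip_short_first = re.compile(r"\d{5}|\d{5}-\d{4}")

code = "12345-6789"  # made-up nine-digit ZIP code
long_first = zip_long_first.search(code).group()
short_first = zip_short_first.search(code).group()

# The first telephone pattern from the text, checked against whole strings
phone = re.compile(r"0\d{2}-\d{8}|0\d{3}-\d{7}")
samples = ["010-12345678", "0371-1234567", "010-1234567"]
accepted = [s for s in samples if phone.fullmatch(s)]

print(long_first, short_first, accepted)
```

With the longer branch first, the whole nine-digit code is returned; with the shorter branch first, the match stops after five digits.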
Group |
Use () to mark a subexpression (group), for example to repeat a group of characters.
Eg. > IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) |
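As a rough illustration of a repeated group, the IP address pattern can be assembled from one sub-pattern. This is a sketch assuming Python's `re` module; the helper name `octet` and the candidate strings are made up.

```python
import re

# One number from 0 to 255: 200-249, 250-255, or 0-199
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"

# ((octet)\.){3}(octet): the group "number + dot" is repeated three times
ip = re.compile(r"(" + octet + r"\.){3}" + octet)

candidates = ["192.168.0.1", "255.255.255.0", "256.1.1.1", "1.2.3"]
valid = [c for c in candidates if ip.fullmatch(c)]

print(valid)
```

`fullmatch` is used on purpose: without anchoring, a plain search would still find the substring 56.1.1.1 inside 256.1.1.1.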
Backreference |
After a subexpression (group) has matched, refer to it by its number. By default groups are numbered from 1, and \1 stands for the text matched by group 1. Group 0 is the match of the entire regular expression.
Eg. > Repeated words, such as "go go": \b(\w+)\b\s+\1\b
You can also name a group yourself: (?<GroupName>expr) or (?'GroupName'expr). [Group numbers are assigned in two passes: 1. the unnamed groups; 2. the named groups.]
Eg. > IP address: ((?<IP>2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}\k<IP>
Note that \k<IP> stands for the text the group last matched, not for its pattern, so this expression only accepts addresses whose fourth number repeats the third, such as 10.0.1.1. |
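A small sketch of numbered and named backreferences follows. It assumes Python's `re` module, where named groups are written `(?P<name>...)` and referenced with `(?P=name)` rather than with the `(?<name>...)` and `\k<name>` forms shown above. The sample sentence is made up.

```python
import re

sentence = "let us go go home home now"  # made-up sample text

# \1 must repeat exactly the text captured by group 1
repeated = [m.group(0) for m in re.finditer(r"\b(\w+)\b\s+\1\b", sentence)]

# Python spells named groups (?P<name>...) and refers back with (?P=name)
named = [m.group("word") for m in
         re.finditer(r"\b(?P<word>\w+)\b\s+(?P=word)\b", sentence)]

# A backreference repeats the captured text, not the pattern
same_text = re.fullmatch(r"(\d+)\.\1", "10.10") is not None
same_pattern_only = re.fullmatch(r"(\d+)\.\1", "10.20") is not None

print(repeated, named, same_text, same_pattern_only)
```

The last two checks show why a backreference cannot stand in for "another number of the same shape": 10.20 is rejected because 20 is not the text that the group captured.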
Zero-width assertion |
"Zero-width" means the construct does not consume any character of the string being matched; "assertion" means the match is abandoned if the condition does not hold. A positive assertion makes sure that certain characters appear next to the matched string.
(?=exp) Zero-width positive lookahead: the expression exp must match right after the position of the assertion, i.e. the matched string is followed by exp.
Eg. > The part before "ing" of words that end in "ing": \b\w+(?=ing\b)
In "Rolling in the deep" it matches "Roll".
(?<=exp) Zero-width positive lookbehind: the expression exp must match right before the position of the assertion, i.e. the matched string is preceded by exp.
Eg. > The part after "re" of words that start with "re": (?<=\bre)\w+\b
In "reading a book" it matches "ading".
> Digits with whitespace on both sides (the whitespace itself is not included): (?<=\s)\d+(?=\s) |
Negative zero-width assertion |
Makes sure that certain characters do not appear next to the matched string.
(?!exp) Zero-width negative lookahead: the expression exp must not match right after this position, i.e. the matched string cannot be followed by exp.
Eg. > Words containing a q that is not followed by the letter u: \b\w*q(?!u)\w*\b
(?<!exp) Zero-width negative lookbehind: the expression exp must not match right before this position, i.e. the matched string cannot be preceded by exp.
Eg. > Seven digits that are not preceded by a lower-case letter: (?<![a-z])\d{7}
> The content inside a simple HTML tag without attributes (this one combines a positive lookbehind and a positive lookahead with a backreference): (?<=<(\w+)>).*(?=<\/\1>) |
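The positive and negative assertions can be compared side by side in a short sketch. It assumes Python's `re` module, which the article does not name, and the sample strings other than the two quoted in the text are made up. One difference matters here: Python's `re` only accepts fixed-width lookbehinds, so the HTML pattern is replaced by a capturing-group substitute, which is not the article's own expression.

```python
import re

# (?=exp) positive lookahead: "ing" must follow, but is not part of the match
stems = re.findall(r"\b\w+(?=ing\b)", "Rolling in the deep")

# (?<=exp) positive lookbehind: "re" must precede, but is not part of the match
tails = re.findall(r"(?<=\bre)\w+\b", "reading a book")

# Both combined: digits with whitespace on each side
numbers = re.findall(r"(?<=\s)\d+(?=\s)", "id 12 345 b6 7")

# (?!exp) negative lookahead: a q that is not followed by u
q_words = re.findall(r"\b\w*q(?!u)\w*\b", "Iraq quit qatar")

# (?<!exp) negative lookbehind: seven digits not preceded by a lower-case letter
seven = re.findall(r"(?<![a-z])\d{7}", "a1234567 B7654321 :1112223")

# Python's re rejects the variable-width lookbehind (?<=<(\w+)>), so the
# content is captured in group 2 instead (a substitute technique)
m = re.search(r"<(\w+)>(.*)</\1>", "<li>item one</li>")
inner = m.group(2) if m else None

print(stems, tails, numbers, q_words, seven, inner)
```

In every case the asserted text is checked but left out of the result, which is what "zero-width" means in practice.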
Comment |
(?#comment)
Eg. > IP address: ((?<IP>2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199))\.){3}\k<IP> |
Greedy |
When an RE contains a quantifier that accepts repetition, it normally matches as many characters as possible.
Eg. a.*b matches the longest string that starts with a and ends with b. Against aabab it matches the entire string. |
Lazy |
Matches as few characters as possible. Appending ? to a quantifier switches it to lazy mode.
Eg. a.*?b matches aab and ab in aabab. |
*? | Repeats 0 or more times, but as few times as possible |
+? | Repeats 1 or more times, but as few times as possible |
?? | Repeats 0 or 1 time, but as few times as possible |
{n,m}? | Repeats n to m times, but as few times as possible |
{n,}? | Repeats n or more times, but as few times as possible |
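The difference between greedy and lazy quantifiers is easy to see when both run on the same input. The sketch below assumes Python's `re` module; the digit string is a made-up sample.

```python
import re

s = "aabab"

# Greedy: .* takes as much as it can, so one match covers the whole string
greedy = re.findall(r"a.*b", s)

# Lazy: .*? takes as little as it can, so two shorter matches are found
lazy = re.findall(r"a.*?b", s)

# The same switch works for counted quantifiers such as {n,m}
digits_greedy = re.findall(r"\d{2,4}", "12345")
digits_lazy = re.findall(r"\d{2,4}?", "12345")

print(greedy, lazy, digits_greedy, digits_lazy)
```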
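First match |
The match that starts earliest has priority over both greedy and lazy matching. This is why the lazy a.*?b above finds aab in aabab before ab: the match starting at the first a wins, even though a shorter match starts one character later. |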
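The priority of the earliest match can be observed from a match's start position. This is a minimal sketch assuming Python's `re` module; the pattern `o+?k` and the word "look" are made-up illustrations, not taken from the article.

```python
import re

# A lazy o+? would prefer the short "ok" that starts at index 2,
# but a match starting at index 1 exists, and the earliest start wins
m = re.search(r"o+?k", "look")
first = (m.start(), m.group())

print(first)
```

Laziness only decides how far a match extends once its starting point is fixed; it never moves the starting point to the right.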
The reasoning above may not be entirely rigorous; I only wanted to explain how I understand this topic.