Greedy quantifiers
X?         X, once or not at all
X*         X, zero or more times
X+         X, one or more times
X{n}       X, exactly n times
X{n,}      X, at least n times
X{n,m}     X, at least n but not more than m times

Reluctant quantifiers
X??        X, once or not at all
X*?        X, zero or more times
X+?        X, one or more times
X{n}?      X, exactly n times
X{n,}?     X, at least n times
X{n,m}?    X, at least n but not more than m times

Possessive quantifiers
X?+        X, once or not at all
X*+        X, zero or more times
X++        X, one or more times
X{n}+      X, exactly n times
X{n,}+     X, at least n times
X{n,m}+    X, at least n but not more than m times
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String t = "X123xxxxxx123";

        // Greedy: ".*" takes as much as it can, then gives characters back.
        Pattern p = Pattern.compile(".*123");
        Matcher m = p.matcher(t);
        System.out.println("============= Greedy mode =============");
        while (m.find()) {
            System.out.println("Start " + m.start());
            System.out.println(m.group());
            System.out.println("End " + m.end());
        }

        // Lazy (reluctant): ".*?" takes as little as it can, then extends.
        Pattern p1 = Pattern.compile(".*?123");
        Matcher m1 = p1.matcher(t);
        System.out.println("============= Lazy mode =============");
        while (m1.find()) {
            System.out.println("Start " + m1.start());
            System.out.println(m1.group());
            System.out.println("End " + m1.end());
        }

        // Possessive: ".*+" takes as much as it can and never gives back.
        Pattern p2 = Pattern.compile(".*+123");
        Matcher m2 = p2.matcher(t);
        System.out.println("============= Possessive mode =============");
        while (m2.find()) {   // the original looped on m1 here by mistake
            System.out.println("Start " + m2.start());
            System.out.println(m2.group());
            System.out.println("End " + m2.end());
        }
    }
}
The output is as follows:
============= Greedy mode =============
Start 0
X123xxxxxx123
End 13
============= Lazy mode =============
Start 0
X123
End 4
Start 4
xxxxxx123
End 13
============= Possessive mode =============
Discussion:
Greedy mode: the quantifier first reads the entire string. If the rest of the pattern then fails to match, it gives back the rightmost character and tries again, repeating until the pattern matches or there is nothing left to give back. Its goal is to consume as many characters as possible, so it returns as soon as the first (longest) match is found.
In this example, the ".*" in the regular expression ".*123" first consumes the entire string, so "123" cannot match. The matcher then gives back one character at a time until "123" appears at the right-hand end; at that point the match succeeds and the search stops. Put another way, although ".*" could consume the whole string, it gives up the three characters "123" so that the entire expression can match. Greedy mode is greedy, but it still knows when to yield.
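To make the "give back" step visible, here is a minimal sketch that wraps the two parts of the pattern in capture groups. The class name GreedyGiveBack and the helper capturedByStar are illustrative names, not part of the original example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyGiveBack {
    // Returns what the greedy ".*" is left holding once the whole
    // pattern "(.*)(123)" has matched, or null if there is no match.
    static String capturedByStar(String input) {
        Matcher m = Pattern.compile("(.*)(123)").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // ".*" first takes all 13 characters, then gives back three.
        System.out.println(capturedByStar("X123xxxxxx123")); // X123xxxxxx
    }
}
```

Group 1 ends up with ten characters rather than thirteen, which is exactly the three characters handed back to "123".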
Lazy mode: the quantifier starts from the left of the string and first tries to match without consuming any characters. If that fails, it consumes one more character and tries again, and so on. When a match is found it is returned, and matching then resumes from that point until the end of the string.
In this example, the ".*?" in the regular expression ".*?123" first tries to match while consuming zero characters, fails, and has to consume one character before trying again. With one character consumed, "X123" satisfies the pattern and is returned. Matching then continues until "xxxxxx123" is found and returned. After that nothing is left to read, and matching ends.
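Note that the second result in this example is
Start 4
xxxxxx123
End 13
rather than just the final "x123". This shows that even a lazy quantifier does whatever the overall match requires: scanning left to right from index 4, it is not too lazy to read the whole run of x characters so that the following 123 can match.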
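The same capture-group trick shows how far the lazy quantifier had to extend on each call to find(). This is a small sketch; LazyExpand and capturedByLazyStar are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyExpand {
    // Collects what the lazy ".*?" consumed in each successive match
    // of "(.*?)123".
    static List<String> capturedByLazyStar(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile("(.*?)123").matcher(input);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        // First match: one character. Second match: six characters,
        // because the attempt starts at index 4 and must reach "123".
        System.out.println(capturedByLazyStar("X123xxxxxx123")); // [X, xxxxxx]
    }
}
```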
Possessive mode: it is similar to greedy mode. The difference is that it never gives characters back.
In this example, the ".*+" in the regular expression ".*+123" first consumes the entire string. Because it does not give anything back, "123" has nothing left to match, so the match fails and matching ends.
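A possessive quantifier is not doomed to fail; it fails here only because "." can also match the characters that "123" needs. The sketch below contrasts ".*+123" with a possessive pattern whose character class cannot swallow the digits. The pattern "[Xx]++123" and the names PossessiveDemo and finds are illustrative additions, not taken from the original example.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // True if the regex matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String t = "X123xxxxxx123";
        // ".*+" swallows everything and never backtracks, so "123" fails.
        System.out.println(finds(".*+123", t));     // false
        // "[Xx]++" stops at the first digit by itself, so nothing needs
        // to be given back and "123" can still match.
        System.out.println(finds("[Xx]++123", t));  // true
    }
}
```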
Conclusion: a lazy quantifier is lazy by nature and a greedy quantifier is greedy by nature, but when other parts of the pattern sit next to them, the lazy one is not so lazy and the greedy one is not so greedy.
Another point:
Start 0
X123
End 4
Start 4
xxxxxx123
End 13
Each match's end is the next match's start. This means the end index is exclusive: the character at the end position has not been consumed (if it had been, the next start would be 5).
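One way to check the exclusive end index is to compare each match with the substring between start() and end(). This is a minimal sketch; EndIsExclusive and groupEqualsSubstring are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndIsExclusive {
    // True if at least one match exists and every match equals
    // input.substring(start, end), i.e. end() is exclusive.
    static boolean groupEqualsSubstring(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean sawMatch = false;
        while (m.find()) {
            sawMatch = true;
            String slice = input.substring(m.start(), m.end());
            if (!slice.equals(m.group())) {
                return false;
            }
        }
        return sawMatch;
    }

    public static void main(String[] args) {
        System.out.println(groupEqualsSubstring(".*?123", "X123xxxxxx123")); // true
    }
}
```

String.substring also treats its second argument as exclusive, which is why the two line up without any adjustment.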
Pattern traps:
When an expression contains several non-greedy quantifiers, or several sub-expressions with an unbounded number of repetitions, it may hide an efficiency trap. Sometimes matching becomes so slow that you start to doubt whether regular expressions are practical at all.
How efficiency traps arise:
"If matching too few times would make the whole expression fail, then, much like greedy mode, non-greedy mode will match just enough more to let the whole expression succeed."
The specific matching process is as follows:
- The non-greedy part first matches the minimum number of times, and then the engine tries to match the "expression on the right".
- If the expression on the right matches, the whole match is finished. If it fails, the non-greedy part matches one more time and the engine tries the "expression on the right" again.
- If the expression on the right fails again, the non-greedy part matches one more time again, and the "expression on the right" is tried once more.
- Continuing in this way, the final result is that the non-greedy part matches as few times as possible while still letting the whole expression succeed, or the match fails in the end.
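The steps above can be sketched by hand for the simple case of "\w+?" followed by a literal string. This is only a simulation under that assumption, not how java.util.regex is implemented; the names LazySimulation, lazyLength and regexLength are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazySimulation {
    // Same character set as \w in Java's default (non-Unicode) mode.
    static boolean isWord(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    // Simulates "\w+?" followed by the literal `right`, starting at `from`.
    // Returns how many characters the lazy part takes, or -1 on failure.
    static int lazyLength(String text, int from, String right) {
        int taken = 1; // "+?" starts from its minimum: one character
        while (from + taken <= text.length()
                && isWord(text.charAt(from + taken - 1))) {
            if (text.startsWith(right, from + taken)) {
                return taken;   // expression on the right matched
            }
            taken++;            // it failed: take one more and retry
        }
        return -1;              // ran out of word characters
    }

    // The same question answered by the real engine, for comparison.
    static int regexLength(String text, String right) {
        Matcher m = Pattern.compile("(\\w+?)" + Pattern.quote(right)).matcher(text);
        return m.lookingAt() ? m.group(1).length() : -1;
    }

    public static void main(String[] args) {
        System.out.println(lazyLength("abcz", 0, "z")); // 3
        System.out.println(regexLength("abcz", "z"));   // 3
    }
}
```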
When an expression contains several non-greedy quantifiers, take "d(\w+?)d(\w+?)z" as an example. For the "\w+?" in the first pair of parentheses, the "d(\w+?)z" to its right is its "expression on the right". For the "\w+?" in the second pair of parentheses, the "z" to its right is its "expression on the right".
When "z" fails to match, the second "\w+?" matches one more character and "z" is tried again. If "z" still cannot match no matter how far the second "\w+?" is extended, all the way to the end of the text, then "d(\w+?)z" has failed, that is, the "expression on the right" of the first "\w+?" has failed. The first "\w+?" then matches one more character and "d(\w+?)z" is tried again. This process repeats until the first "\w+?" cannot be extended any further; if "d(\w+?)z" has still not matched, the whole expression fails.
In fact, a greedy quantifier will also "give up" characters it has already matched so that the whole expression can succeed, so greedy matching behaves similarly. When an expression contains many parts with an unbounded number of repetitions, every greedy or non-greedy part has to try decreasing or increasing its repetition count, which easily produces deeply nested loops and a long matching time. This article calls it a "trap" because the efficiency problem is often hard to notice.
Example: when "d(\w+?)d(\w+?)d(\w+?)z" is matched against "ddddddddddd...", it takes a long time to determine that the match fails.
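A rough way to see the trap is to time the failing match on inputs of growing length. The sketch below does that; the lengths 40, 80 and 160 are arbitrary choices, and the timings depend on the machine and JVM, so treat the numbers as indicative only. TrapTiming, ds and findsIn are illustrative names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class TrapTiming {
    static final Pattern TRAP = Pattern.compile("d(\\w+?)d(\\w+?)d(\\w+?)z");

    // Builds a string of n 'd' characters.
    static String ds(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'd');
        return new String(chars);
    }

    // The input contains no 'z', so this always ends up false,
    // but only after every combination of lengths has been tried.
    static boolean findsIn(int length) {
        return TRAP.matcher(ds(length)).find();
    }

    public static void main(String[] args) {
        for (int n : new int[] {40, 80, 160}) {
            long t0 = System.nanoTime();
            boolean found = findsIn(n);
            long micros = (System.nanoTime() - t0) / 1_000;
            System.out.println("length " + n + ": found=" + found
                    + ", " + micros + " us");
        }
    }
}
```

Because the three lazy groups are nested, the number of attempts grows much faster than the input length, so doubling the length should raise the time by far more than a factor of two.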
Avoiding efficiency traps:
The principle is to avoid trial matching inside multiple nested loops. This does not mean that non-greedy matching is bad; it means that when you use non-greedy matching you must avoid too many looped attempts.
Case 1: An expression with only one non-greedy or greedy quantifier has no efficiency trap. That is, to match text like "<TD>content</TD>", the expressions "<TD>([^<]|<(?!/TD>))*</TD>", "<TD>((?!</TD>).)*</TD>" and "<TD>.*?</TD>" are equally efficient.
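As a quick check that the three expressions select the same text, the sketch below runs each of them against one sample string. The sample input and the names TdPatterns and firstMatch are assumptions made for the illustration. Note that the results can differ on multi-line input, because "." does not match a line terminator by default while "[^<]" does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TdPatterns {
    static final String[] REGEXES = {
        "<TD>([^<]|<(?!/TD>))*</TD>",
        "<TD>((?!</TD>).)*</TD>",
        "<TD>.*?</TD>"
    };

    // Returns the first match of the regex in the input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<TD>a<b>c</TD><TD>d</TD>";
        for (String regex : REGEXES) {
            // Each line prints <TD>a<b>c</TD>
            System.out.println(firstMatch(regex, html));
        }
    }
}
```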
Case 2: If an expression contains several parts with an unbounded number of repetitions, prevent match attempts that are not necessary.
Take the expression "<script language='(.*?)'>(.*?)</script>" as an example. Suppose the first part has matched "<script language='vbscript'>" and the following "(.*?)</script>" then fails. The first ".*?" will be extended and the match tried again. Given what the expression is really meant to do, letting the first ".*?" extend into "vbscript'>" can never be correct, so this kind of attempt is unnecessary.
Therefore, for an expression delimited by a boundary, do not let the part with an unbounded number of repetitions cross that boundary. In the expression above, the first ".*?" should be rewritten as "[^']*". The second ".*?" has no part with an unbounded number of repetitions to its right, so this non-greedy match has no efficiency trap. The expression that matches a script block is therefore better written as: "<script language='([^']*)'>(.*?)</script>".
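The difference between the two expressions also shows up in what they accept, not only in speed. The sketch below compares them on a well-formed tag and on a deliberately odd input; both sample strings and the names ScriptPattern, LOOSE, TIGHT and groups are made up for the illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptPattern {
    static final Pattern LOOSE =
            Pattern.compile("<script language='(.*?)'>(.*?)</script>");
    static final Pattern TIGHT =
            Pattern.compile("<script language='([^']*)'>(.*?)</script>");

    // Returns {group 1, group 2} of the first match, or null if none.
    static String[] groups(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String ok = "<script language='vbscript'>MsgBox 1</script>";
        System.out.println(Arrays.toString(groups(LOOSE, ok))); // [vbscript, MsgBox 1]
        System.out.println(Arrays.toString(groups(TIGHT, ok))); // [vbscript, MsgBox 1]

        // An extra attribute after the closing quote: the loose pattern
        // lets group 1 run past the quote until it finds some later "'>",
        // while the tight pattern refuses to cross the boundary.
        String odd = "<script language='vb' defer>x</script> '>y</script>";
        System.out.println(Arrays.toString(groups(LOOSE, odd)));
        System.out.println(Arrays.toString(groups(TIGHT, odd))); // null
    }
}
```

With "[^']*" the first group can never contain a quote, so the engine has no reason to retry it with longer and longer values when the rest of the pattern fails.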