1. Recursive matching of expressions
Sometimes we need a regular expression to analyze how parentheses pair up in a formula. For example, the expression "\([^)]*\)" or "\(.*?\)" can match a pair of parentheses. However, if the parentheses contain another pair, as in "(())", these expressions no longer match correctly: the result is "(()". Similarly, HTML allows nested tags, such as one "<font>...</font>" element inside another. This section discusses how to match nested pairs of parentheses or tags.
Matching nesting of unknown depth:
Some regular expression engines provide special support for this kind of nesting and, as long as stack space allows, can match nesting of any unknown depth. Examples are Perl, PHP, and Greta. In PHP and Greta, "(?R)" in an expression stands for the nested part.
The expression for matching nested parentheses of unknown depth is written as follows: "\(([^()]|(?R))*\)".
Matching nesting of finite depth:
With a regular expression engine that does not support nesting, only a finite number of nested layers can be matched. The approach is as follows:
Step 1: Write the expressions that do not support nesting: "\([^()]*\)" and "<font>((?!</?font>).)*</font>". When applied to nested text, these two expressions match only the innermost layer.
Step 2: Write the expression that matches one nested layer: "\(([^()]|\([^()]*\))*\)". When the text is nested more than one layer deep, this expression matches only the innermost two layers. It also matches non-nested text, or the innermost layer of nested text.
The expression that matches the "<font>" tag with one nested layer is: "<font>((?!</?font>).|(<font>((?!</?font>).)*</font>))*</font>". When the text is nested more than one layer deep, this expression likewise matches only the innermost two layers.
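The behaviour described in Steps 1 and 2 can be checked with a short Python sketch. The sample strings and the helper name `first_match` are assumptions made for illustration; the four expressions are the ones given above.

```python
import re

# Step 1 expressions: nesting is not supported.
PAREN_FLAT = r"\([^()]*\)"
FONT_FLAT = r"<font>((?!</?font>).)*</font>"

# Step 2 expressions: one nested layer is allowed.
PAREN_ONE = r"\(([^()]|\([^()]*\))*\)"
FONT_ONE = r"<font>((?!</?font>).|(<font>((?!</?font>).)*</font>))*</font>"


def first_match(pattern, text):
    """Return the text of the first match, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Hypothetical sample inputs, each nested three layers deep.
parens = "(a(b(c)))"
fonts = "<font>1<font>2<font>3</font></font></font>"

print(first_match(PAREN_FLAT, parens))  # innermost layer only
print(first_match(PAREN_ONE, parens))   # innermost two layers
print(first_match(FONT_FLAT, fonts))    # innermost layer only
print(first_match(FONT_ONE, fonts))     # innermost two layers
```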
Step 3: Find the relationship between the expression that matches n nested layers and the expression that matches n-1 nested layers. For example, the expression that matches n nested layers has the form:
[opening mark] ( [expression matching anything other than the opening and closing marks] | [expression matching n-1 nested layers] )* [closing mark]
Let's look back at the expressions written earlier for matching one nested layer, split into the parts of this form:
Parentheses:
    \(
    (
        [^()]
      |
        \([^()]*\)
    )*
    \)

<font> tag:
    <font>
    (
        (?!</?font>).
      |
        (<font>((?!</?font>).)*</font>)
    )*
    </font>
PHP and Greta make this simple, because the expression matching the n-1 nested layers is written as "(?R)":
    \(
    (
        [^()]
      |
        (?R)
    )*
    \)
Step 4: By applying this substitution repeatedly, you can write an expression that matches a finite number (n) of nested layers. Although an expression written this way looks long, matching is still very efficient once the expression has been compiled.
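The substitution in Steps 3 and 4 can be automated. The following is a minimal Python sketch, not the article's own code: the function name `nested_pattern` and its parameters are assumptions, and it uses non-capturing groups "(?:...)" where the expressions above use plain parentheses.

```python
import re


def nested_pattern(open_mark, close_mark, filler, depth):
    """Build an expression that allows up to `depth` nested layers
    inside the outermost pair (hypothetical helper).

    open_mark, close_mark: expressions for the opening and closing marks.
    filler: expression for one character that is neither mark.
    """
    # Step 1: the expression that supports no nesting at all.
    pattern = f"{open_mark}(?:{filler})*{close_mark}"
    for _ in range(depth):
        # Step 3: place the (n-1)-layer expression inside the n-layer form.
        pattern = f"{open_mark}(?:{filler}|{pattern})*{close_mark}"
    return pattern


# Parentheses with up to two nested layers inside the outer pair.
parens = re.compile(nested_pattern(r"\(", r"\)", r"[^()]", 2))
print(parens.search("f(a(b)c(d(e)))").group(0))

# <font> tags with one nested layer inside the outer pair.
font = re.compile(nested_pattern("<font>", "</font>", r"(?!</?font>).", 1))
print(font.search("<font>a<font>b</font>c</font>").group(0))
```

With `depth` set to 0 the helper returns the Step 1 expression, and each extra level wraps the previous result once more, so the expression grows linearly with the depth.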
2. Efficiency of non-greedy matching
Many people have probably had the same experience as I have: when we want to match text such as "<TD>content</TD>" or "[b]bold[/b]", we use lookahead (forward pre-search) to write an expression such as "<TD>([^<]|<(?!/TD>))*</TD>" or "<TD>((?!</TD>).)*</TD>".
Then we discover non-greedy matching and find that an expression with the same function can be written far more simply: "<TD>.*?</TD>". From then on we tend to use the simple non-greedy ".*?" wherever we can. For complex expressions in particular, those written with ".*?" are indeed much more concise.
However, when an expression contains several non-greedy matches, or several parts that match an unknown number of times, it may hide an efficiency trap. Sometimes matching becomes so slow that you begin to doubt whether regular expressions are practical at all.
How the efficiency trap arises:
The basic article on this site describes non-greedy matching as follows: "If matching fewer times would cause the whole expression to fail, then, much like the greedy pattern, the non-greedy pattern matches the minimum additional amount needed for the whole expression to succeed."
The specific matching process is as follows:
- The "non-greedy part" first matches the minimum number of times, and then the engine tries to match the "expression on the right".
- If the expression on the right matches, matching of the whole expression ends. If it fails, the "non-greedy part" matches one more time, and the engine tries the "expression on the right" again.
- If the expression on the right fails again, the "non-greedy part" again matches one more time, and the engine tries the "expression on the right" once more.
- This continues until either the "non-greedy part" has let the whole expression succeed with as few matches as possible, or the match finally fails.
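The process in the list above can be sketched as a small Python loop. This is an illustration under simplifying assumptions, not how a real engine is implemented: the "non-greedy part" is taken to be ".*?" with "." matching any character, the "expression on the right" is a literal string, and the name `lazy_expand` is hypothetical.

```python
import re


def lazy_expand(text, start, right):
    """Emulate ".*?" followed by the literal `right`, beginning at `start`.

    Returns (length, attempts): how many characters the non-greedy part
    ended up matching (None if the match failed) and how many times the
    expression on the right was tried.
    """
    attempts = 0
    length = 0  # the non-greedy part first matches the minimum: nothing
    while start + length <= len(text):
        attempts += 1
        if text.startswith(right, start + length):
            return length, attempts  # the right side matched: done
        length += 1  # the right side failed: match one more character
    return None, attempts  # ran out of text: the whole match fails


print(lazy_expand("content</TD>", 0, "</TD>"))
print(lazy_expand("content", 0, "</TD>"))

# Cross-check the successful case against the real engine.
print(re.match(r"(.*?)</TD>", "content</TD>").group(1))
```

With a single non-greedy part, the number of attempts grows only linearly with the length of the text.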
When an expression contains several non-greedy matches, take "d(\w+?)d(\w+?)z" as an example. For the "\w+?" in the first pair of parentheses, the "d(\w+?)z" to its right is its "expression on the right". For the "\w+?" in the second pair of parentheses, the "z" to its right is its "expression on the right".
When "z" fails to match, the second "\w+?" matches one more time and the engine tries "z" again. If "z" still cannot match however many times the second "\w+?" is extended, all the way to the end of the text, then "d(\w+?)z" has failed; that is, the "expression on the right" of the first "\w+?" has failed. The first "\w+?" then matches one more time, and "d(\w+?)z" is tried again from the start. This process repeats until "d(\w+?)z" cannot match however many times the first "\w+?" is extended, at which point the whole expression fails.
In fact, to let the whole expression succeed, a greedy match will likewise "give back" characters it has already matched, so greedy matching behaves in a similar way. When an expression contains several parts that match an unknown number of times, every greedy or non-greedy part has to try decreasing or increasing its number of matches so that the whole expression can succeed. These attempts form nested loops, and matching takes a long time. This article calls it a "trap" because the efficiency problem is often hard to notice.
Example: when "d(\w+?)d(\w+?)d(\w+?)z" is matched against "ddddddddddd...", it takes a long time to determine that the match fails.
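To make the nested loops visible, the following Python sketch hand-simulates the backtracking for this one expression and counts how often the final "z" is tested. It is a simplified model under stated assumptions: only the attempt starting at position 0 is counted (a real engine retries from later positions too), "\w" is approximated by `str.isalnum()` plus the underscore, and the names `attempt` and `count_z_tests` are hypothetical.

```python
import re


def is_word(ch):
    """Rough stand-in for \\w (assumption: alphanumeric or underscore)."""
    return ch.isalnum() or ch == "_"


def attempt(text, pos, groups_left, tally):
    """Try to match "d(\\w+?)" groups_left times, then "z", from pos."""
    if groups_left == 0:
        tally[0] += 1  # one more test of the final "z"
        return text.startswith("z", pos)
    if not text.startswith("d", pos):
        return False
    end = pos + 2  # \w+? starts with exactly one character after the "d"
    while end <= len(text) and is_word(text[end - 1]):
        if attempt(text, end, groups_left - 1, tally):
            return True
        end += 1  # the right side failed: extend \w+? by one character
    return False


def count_z_tests(text, groups=3):
    """Return (matched, number of times "z" was tested)."""
    tally = [0]
    matched = attempt(text, 0, groups, tally)
    return matched, tally[0]


for n in (10, 20, 40):
    print(n, count_z_tests("d" * n))

# The real engine agrees that the text does not match.
print(re.search(r"d(\w+?)d(\w+?)d(\w+?)z", "d" * 20))
```

In this model each additional "(\w+?)" adds another level of looping, so with three of them the count grows roughly with the cube of the text length.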
Avoiding the efficiency trap:
The principle for avoiding the trap is to avoid "attempted matches" inside "multiple nested loops". This does not mean that non-greedy matching is bad, only that when you use it you must guard against too many "looped attempts".
Case 1: An expression with only one non-greedy or greedy part has no efficiency trap. In other words, for matching text such as "<TD>content</TD>", the expressions "<TD>([^<]|<(?!/TD>))*</TD>", "<TD>((?!</TD>).)*</TD>", and "<TD>.*?</TD>" are equally efficient.
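As a quick check that the three expressions in Case 1 are interchangeable, the Python sketch below runs them over the same text. The sample text and the helper name `all_matches` are assumptions; the sketch compares results only and does not measure speed.

```python
import re

# Hypothetical sample text containing two cells.
TEXT = "x<TD>content</TD>y<TD>more</TD>"

PATTERNS = [
    r"<TD>([^<]|<(?!/TD>))*</TD>",  # character class plus lookahead
    r"<TD>((?!</TD>).)*</TD>",      # lookahead before every character
    r"<TD>.*?</TD>",                # non-greedy match
]


def all_matches(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]


results = [all_matches(p, TEXT) for p in PATTERNS]
for pattern, found in zip(PATTERNS, results):
    print(pattern, found)
```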
Case 2: If an expression contains several parts that match an unknown number of times, unnecessary attempted matches should be prevented.
Take the expression "<script language='(.*?)'>(.*?)</script>" as an example. If the first part of the expression has matched "<script language='vbscript'>" and the following "(.*?)</script>" then fails, the first ".*?" will match more characters and the engine will try again. Given the real purpose of the expression, letting the first ".*?" grow until it matches "vbscript'>" is wrong, so this attempt is unnecessary.
Therefore, for a part of an expression that is delimited by a boundary, do not let the part that matches an unknown number of times cross that boundary. In the expression above, the first ".*?" should be rewritten as "[^']*". The second ".*?" has no part with an unknown number of matches to its right, so it carries no efficiency trap. The expression that matches a script block is therefore better written as: "<script language='([^']*)'>(.*?)</script>".
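Crossing the boundary can change the result as well as the speed. The Python sketch below compares the two expressions on a hypothetical input in which the first tag carries an extra attribute, so that "'>" does not follow the language value directly. The input and the names `LOOSE` and `STRICT` are assumptions made for illustration.

```python
import re

LOOSE = r"<script language='(.*?)'>(.*?)</script>"
STRICT = r"<script language='([^']*)'>(.*?)</script>"

# Hypothetical input: the first tag has an extra "defer" attribute.
text = (
    "<script language='vbscript' defer>alert()</script>"
    "<script language='js'>x</script>"
)

loose = re.search(LOOSE, text)
strict = re.search(STRICT, text)

# The first ".*?" keeps growing past the closing quote until it finds
# "'>" in the second tag, so group 1 swallows most of the text.
print(loose.groups())

# "[^']*" cannot cross the quote, so the first tag is rejected quickly
# and the second tag is matched cleanly.
print(strict.groups())
```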