Regular Expressions: Recursive Matching in Depth

Source: Internet
Author: User
Tags: expression, engine
Introduction
This article discusses further topics in the use of regular expressions. It is an extension of the basics article on this site. Before reading it, we recommend that you read the regular expression reference document.
1. Recursive matching of expressions
Sometimes we need to use regular expressions to analyze matching parentheses in a formula. For example, the expression "\([^)]*\)" or "\(.*?\)" can match a pair of parentheses. However, if the parentheses contain another layer of parentheses, such as "(())", this approach does not match correctly: the result is "(()". Similarly, HTML supports nested tags, such as "<font><font></font></font>". This section discusses how to match nested pairs of parentheses or pairs of tags.
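The failure described above can be reproduced with a minimal Python sketch. Python is used here only for illustration (the article itself refers to Perl, PHP, and GRETA), and the sample text "(())" is an assumed example.

```python
import re

# Assumed sample text: one pair of parentheses nested inside another.
text = "(())"

# Both simple patterns stop at the first ")" they can reach,
# so the outer pair is never closed correctly.
class_result = re.search(r"\([^)]*\)", text).group()
lazy_result = re.search(r"\(.*?\)", text).group()

print(class_result)  # (()
print(lazy_result)   # (()
```

Neither pattern keeps track of how many parentheses are open, which is why nesting needs either engine support or a longer expression.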
Matching nesting of unknown depth:
Some regular expression engines, such as Perl, PHP, and GRETA, provide special support for this kind of nesting and, stack space permitting, can match nesting of any unknown depth. In PHP and GRETA, the expression uses "(?R)" to denote the nested part.
The expression for matching nested parentheses of unknown depth is written as follows: "\(([^()]|(?R))*\)".
[Perl and PHP example code]
Matching nesting of finite depth:
For a regular expression engine that does not support recursion, you can only match nesting up to a finite depth. The approach is as follows:
Step 1: Write an expression that does not support nesting: "\([^()]*\)", "<font>((?!</?font>).)*</font>". When these two expressions are applied to nested text, they match only the innermost layer.
Step 2: Write an expression that matches one level of nesting: "\(([^()]|\([^()]*\))*\)". When the text is nested more than one level deep, this expression matches only the innermost two layers. This expression also matches non-nested text, or the innermost layer of nested text.
To match one level of nested "<font>" tags, the expression is: "<font>((?!</?font>).|(<font>((?!</?font>).)*</font>))*</font>". Likewise, when the text is nested more than one level deep, this expression matches only the innermost two layers.
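As an illustration of Steps 1 and 2, the following Python sketch applies the two "<font>" expressions to nested text. The sample strings are assumptions made up for this example.

```python
import re

# Step 1: no nesting supported. Step 2: one level of nesting supported.
STEP1 = r"<font>((?!</?font>).)*</font>"
STEP2 = r"<font>((?!</?font>).|(<font>((?!</?font>).)*</font>))*</font>"

# Assumed sample texts with two and three layers of tags.
two_layers = "<font>a<font>b</font>c</font>"
three_layers = "<font><font><font>x</font></font></font>"

# Step 1 finds only the innermost layer of the two-layer text.
innermost = re.search(STEP1, two_layers).group()

# Step 2 matches the whole two-layer text ...
whole = re.fullmatch(STEP2, two_layers) is not None

# ... but only the innermost two layers of the three-layer text.
inner_two = re.search(STEP2, three_layers).group()

print(innermost)  # <font>b</font>
print(whole)      # True
print(inner_two)  # <font><font>x</font></font>
```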
Step 3: Find the relationship between the expression that matches n levels of nesting and the expression that matches (n-1) levels. For example, an expression that can match n levels of nesting has the form:
[opening marker] ( [expression matching anything other than the opening and closing markers] | [expression matching (n-1) levels] )* [closing marker]
Let's look back at the expressions written earlier for matching one level of nesting:


\(
(
[^()]
|
\([^()]*\)
)*
\)

<font>
(
(?!</?font>).
|
(<font>((?!</?font>).)*</font>)
)*
</font>

In PHP and GRETA this is simpler, because the expression matching the (n-1) levels is written as "(?R)":

\(
(
[^()]
|
(?R)
)*
\)

Step 4: Following this pattern, write an expression that matches a finite number (n) of levels. Although an expression written this way looks long, matching efficiency is still very high once the expression is compiled.
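Writing the n-level expression by hand is tedious, so it can be generated by applying the Step 3 relationship repeatedly. The sketch below does this in Python, whose built-in re module does not support "(?R)". The function name nested_parens_pattern is hypothetical, and non-capturing groups "(?:...)" are used in place of the article's plain groups to keep the generated expression free of unused captures.

```python
import re

def nested_parens_pattern(levels: int) -> str:
    """Build an expression that matches parentheses nested up to `levels` deep.

    levels=0 is the Step 1 expression (no nesting inside the pair).
    """
    pattern = r"\([^()]*\)"
    for _ in range(levels):
        # Step 3: [open] ( [non-marker character] | [expression for n-1] )* [close]
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern

one_level = nested_parens_pattern(1)
two_levels = nested_parens_pattern(2)

# One level of nesting is enough for "(a(b)c)".
print(re.fullmatch(one_level, "(a(b)c)") is not None)   # True

# "((()))" is nested two levels deep: the one-level expression finds
# only the innermost two layers, the two-level expression finds it all.
print(re.search(one_level, "((()))").group())           # (())
print(re.fullmatch(two_levels, "((()))") is not None)   # True
```

The generated expression grows with each level, so the level count should be chosen as the deepest nesting the input is expected to contain.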
2. The efficiency of non-greedy matching
You may have had the same experience as I did: when we want to match text like "<td>content</td>" or "<b>bold</b>", we write one of the following expressions based on the forward lookahead feature: "<td>([^<]|<(?!/td>))*</td>" or "<td>((?!</td>).)*</td>".
Once we discover non-greedy matching, the same thing can be written as "<td>.*?</td>". From then on, we try to use the simple non-greedy ".*?" wherever possible. Especially in complex expressions, those written with ".*?" are indeed much more concise.
However, when an expression contains several non-greedy matches, or several parts with an unknown number of repetitions, it may hide an efficiency trap. Sometimes matching is unbelievably slow, and you may even begin to doubt whether regular expressions are practical.
Efficiency traps:
The basics article on this site describes non-greedy matching as follows: "When matching fewer characters would cause the entire expression to fail, the non-greedy pattern, much like the greedy pattern, matches the minimum additional amount needed to make the entire expression succeed."
The specific matching process is as follows:

The "non-greedy part" first matches the minimum number of times and then tries to match the "expression on the right". If the expression on the right matches, the entire match ends. If the expression on the right fails, the "non-greedy part" matches one more time and then tries the "expression on the right" again.
If the expression on the right fails again, the "non-greedy part" matches one more time and the "expression on the right" is tried again.
Continuing in this way, the final result is that the "non-greedy part" lets the entire expression match with as few repetitions as possible, or else the match finally fails.
When an expression contains several non-greedy matches, take the expression "d(\w+?)d(\w+?)z" as an example. For the "\w+?" in the first pair of parentheses, the "d(\w+?)z" to its right is its "expression on the right"; for the "\w+?" in the second pair of parentheses, the "z" to its right is its "expression on the right".
When "z" fails to match, the second "\w+?" matches one more time and then tries "z" again. If "z" still cannot match no matter how many times the second "\w+?" is extended, all the way to the end of the text, then "d(\w+?)z" has failed, which means the "expression on the right" of the first "\w+?" has failed. At this point the first "\w+?" matches one more time, and "d(\w+?)z" is tried again. This process repeats until "d(\w+?)z" cannot match no matter how many times the first "\w+?" is extended; only then does the entire expression fail.
In fact, greedy matching also "gives back" characters it has already matched when that is needed for the entire expression to succeed, so greedy matching behaves similarly. When an expression contains many parts with an unknown number of repetitions, each greedy or non-greedy part must try decreasing or increasing its repetition count in order for the entire expression to match, so it is easy to form a large number of nested loop attempts, resulting in a long matching time. This article calls it a "trap" because this efficiency problem is often hard to notice.
Example: when "d(\w+?)d(\w+?)d(\w+?)z" is matched against "ddddddddddd...", it takes a long time to determine that the match fails.
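The growth in matching time can be observed with a small Python sketch. The helper name time_failed_match and the chosen text lengths are assumptions for illustration; actual timings depend on the engine and the machine, so no expected timings are shown.

```python
import re
import time

# Expression from the example: three lazy parts, each followed by more pattern.
TRAP = re.compile(r"d(\w+?)d(\w+?)d(\w+?)z")

def time_failed_match(length):
    """Search a run of `length` 'd' characters (no 'z') and time the failure."""
    text = "d" * length
    start = time.perf_counter()
    match = TRAP.search(text)
    elapsed = time.perf_counter() - start
    return match, elapsed

if __name__ == "__main__":
    # Doubling the length multiplies the work by much more than two,
    # because every combination of the three repetition counts is tried.
    for length in (20, 40, 80):
        match, elapsed = time_failed_match(length)
        print(length, match, f"{elapsed:.4f}s")
```

The lengths are kept small on purpose: with a backtracking engine, much longer inputs can take a very long time before the failure is reported.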
Efficiency trap avoidance:
The principle for avoiding the efficiency trap is to avoid "multiple nested loops" of "match attempts". This does not mean that non-greedy matching is bad; it means that when you use it, you should take care to avoid too many "loop attempts".
Case 1: An expression with only one non-greedy or greedy part has no efficiency trap. That is, to match text like "<td>content</td>", the expressions "<td>([^<]|<(?!/td>))*</td>", "<td>((?!</td>).)*</td>", and "<td>.*?</td>" are equally efficient.
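The following Python sketch checks that the three expressions from Case 1 select the same text. It does not measure efficiency, and the sample row is an assumed example.

```python
import re

# The three equivalent ways of matching one table cell.
CELL_PATTERNS = [
    r"<td>([^<]|<(?!/td>))*</td>",
    r"<td>((?!</td>).)*</td>",
    r"<td>.*?</td>",
]

# Assumed sample text: two cells, the first containing an inner tag.
row = "<tr><td>a <b>bold</b> cell</td><td>next</td></tr>"

# Collect the whole-match text of every cell found by each expression.
results = [
    [m.group(0) for m in re.finditer(pattern, row)]
    for pattern in CELL_PATTERNS
]

for cells in results:
    print(cells)  # ['<td>a <b>bold</b> cell</td>', '<td>next</td>']
```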
Case 2: If an expression contains several parts with an unknown number of repetitions, avoid unnecessary match attempts.
Take the expression "<script language='(.*?)'>(.*?)</script>" as an example. If the first part of the expression has matched "<script language='vbscript'>" and the following "(.*?)</script>" then fails, the first ".*?" will match more characters and try again. Given the real purpose of the expression, letting the first ".*?" extend to cover "vbscript'>" is wrong, so this attempt is entirely unnecessary.
Therefore, for a part delimited by boundary characters, do not let the part with an unknown number of repetitions cross its boundary. In the preceding expression, the first ".*?" should be rewritten as "[^']*". The second ".*?" has no part with an unknown number of repetitions to its right, so this non-greedy match has no efficiency trap. The expression for matching a script block is therefore better written as: "<script language='([^']*)'>(.*?)</script>".
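To see what "crossing the boundary" means in practice, the Python sketch below compares the two script-block expressions. The sample inputs are assumptions, and the second one is deliberately malformed so that the first ".*?" is forced to extend past the closing quote.

```python
import re

LOOSE = r"<script language='(.*?)'>(.*?)</script>"
STRICT = r"<script language='([^']*)'>(.*?)</script>"

# On well-formed input, both expressions capture the same groups.
good = "<script language='vbscript'>MsgBox 1</script>"
loose_good = re.search(LOOSE, good).groups()
strict_good = re.search(STRICT, good).groups()
print(loose_good)   # ('vbscript', 'MsgBox 1')
print(strict_good)  # ('vbscript', 'MsgBox 1')

# Assumed malformed input: the attribute value is not followed by ">".
odd = "<script language='vbscript' defer>a='b'>c</script>"

# The loose expression keeps extending the first ".*?" across the quote
# until "'>" is found further on, producing a meaningless capture.
loose_odd = re.search(LOOSE, odd).group(1)
print(loose_odd)    # vbscript' defer>a='b

# The strict expression cannot cross the quote, so it simply fails.
strict_odd = re.search(STRICT, odd)
print(strict_odd)   # None
```

Stopping at the quote removes a whole dimension of retries, because the first part can no longer be extended once its boundary has been reached.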
