Applying Regular Expressions: An Exploration of Lookbehind

Source: Internet
Author: User
1 Origin of the problem

A few days ago I came across the following question on the CSDN forum.
I want to extract the strings between <font color="#008000"> and </font> in the text below.

1. The string between <font color="#008000"> and </font> is not fixed; it is generated randomly.
2. The number of <font color="#008000"> ... </font> pairs is not fixed either; it is also generated randomly.

<font color="#008000">**Here is an unfixed string 1**</font>
<font color="#008000">**Here is an unfixed string 2**</font>
<font color="#008000">**Here is an unfixed string 3**</font>
A friend suggested the regex "(?<=<font[\s\S]*?>)([\s\S]*?)(?=</font>)". Let's look at what it matches.
// requires: using System.Text.RegularExpressions;
string test = @"<font color=""#008000"">**Here is an unfixed string 1**</font>
<font color=""#008000"">**Here is an unfixed string 2**</font>
<font color=""#008000"">**Here is an unfixed string 3**</font>";
MatchCollection mc = Regex.Matches(test, @"(?<=<font[\s\S]*?>)([\s\S]*?)(?=</font>)");
foreach (Match m in mc)
{
    richTextBox2.Text += m.Value + "\n---------------\n";
}
/*--------Output--------
**Here is an unfixed string 1**
---------------
<font color="#008000">**Here is an unfixed string 2**
---------------
<font color="#008000">**Here is an unfixed string 3**
---------------
*/
Why do we get this result instead of the one we expected?
/*--------Output--------
**Here is an unfixed string 1**
---------------
**Here is an unfixed string 2**
---------------
**Here is an unfixed string 3**
---------------
*/

The answer involves how lookbehind matches, plus some details of how greedy and non-greedy quantifiers behave inside it. We start by discussing the matching details of lookbehind and then come back to this problem.

2 How lookbehind matches

The basics of lookaround and its matching principle were introduced in the earlier blog post "Regex Basics: Lookaround", but that post was put together in a hurry and did not go into the matching details. This article discusses lookbehind only.
The basic definitions of lookbehind were given in that post and are simply quoted here.

Expression          Description
(?<=Expression)     Positive lookbehind: the text to the left of the current position must match Expression
(?<!Expression)     Negative lookbehind: the text to the left of the current position must not match Expression


For positive lookbehind (?<=Expression), when the subexpression Expression matches successfully, (?<=Expression) succeeds and reports a successful match at the current position.

For negative lookbehind (?<!Expression), when the subexpression Expression matches successfully, (?<!Expression) fails; when Expression fails to match, (?<!Expression) succeeds and reports a successful match at the current position.
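
As a small illustration of the two definitions above, here is a minimal Java sketch. The class name, the helper `findAll`, and the sample sentence are made up for this example; only the `(?<=...)` and `(?<!...)` syntax comes from the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindBasics {
    // Collects every match of regex in input (hypothetical helper).
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "cost $42 and 17 items";
        // Positive lookbehind: digits whose left neighbor is "$".
        System.out.println(findAll("(?<=\\$)\\d+", input));    // [42]
        // Negative lookbehind: whole numbers that are not preceded by "$".
        System.out.println(findAll("(?<!\\$)\\b\\d+", input)); // [17]
    }
}
```

In both cases the "$" itself is never part of the match: lookbehind only asserts something about the text to the left of the position.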

2.1 Analysis of lookbehind matching behavior
2.1.1 Lookbehind support
Relatively few languages support lookbehind. At the time of writing, for example, the popular scripting language JavaScript did not support it (lookbehind was added to the language later, in ECMAScript 2018). In my view, the lack of lookbehind was the biggest limitation on using regular expressions in JavaScript: input validation that is easy to do with lookbehind had to be implemented through various workarounds.

Requirement: verify that the input consists of letters, digits, and underscores, and that an underscore does not appear at the beginning or the end.

If lookbehind is supported, "^(?!_)[a-zA-Z0-9_]+(?<!_)$" does the job directly. Without lookbehind, JavaScript needs a workaround along the lines of "^[a-zA-Z0-9]([a-zA-Z0-9_]*[a-zA-Z0-9])?$" (the trailing "?" keeps single-character input valid). This is only a simple example. Real applications are much more complex, and because nested quantifiers must be avoided to stay out of efficiency traps, such workarounds are hard to write; some cases even have to be split into several regular expressions.
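
To make the comparison concrete, the following Java sketch runs both patterns side by side on a few inputs. The class name, method names, and sample inputs are assumptions made for this illustration, and the check only shows agreement on these samples, not a proof of equivalence.

```java
import java.util.regex.Pattern;

public class UnderscoreRule {
    // Version that relies on lookahead and lookbehind.
    static final Pattern WITH_LOOKAROUND =
            Pattern.compile("^(?!_)[a-zA-Z0-9_]+(?<!_)$");
    // Workaround that avoids lookaround entirely.
    static final Pattern WITHOUT_LOOKAROUND =
            Pattern.compile("^[a-zA-Z0-9]([a-zA-Z0-9_]*[a-zA-Z0-9])?$");

    static boolean validWithLookaround(String s) {
        return WITH_LOOKAROUND.matcher(s).find();
    }

    static boolean validWithoutLookaround(String s) {
        return WITHOUT_LOOKAROUND.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"abc_1", "a", "a_b", "_abc", "abc_", "__", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\": "
                    + validWithLookaround(s) + " / " + validWithoutLookaround(s));
        }
    }
}
```

On these samples both versions accept "abc_1", "a", and "a_b" and reject the rest.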

Other popular languages, such as Java, do support lookbehind, but at the time of writing only for subexpressions whose length is bounded: a quantifier such as "?" is accepted, while unbounded quantifiers such as "*", "+", and "{m,}" are not.

Source string: <div>a test</div>
Requirement: get the content of the div tag, excluding the div tag itself
Java implementation:

import java.util.regex.*;

String test = "<div>a test</div>";
String reg = "(?<=<div>)[^<]+(?=</div>)";
Matcher m = Pattern.compile(reg).matcher(test);
while (m.find())
{
    System.out.println(m.group());
}
/*--------Output--------
a test
*/

However, if the source string changes and the tag gains an attribute, as in "<div id="test1">a test</div>", lookbehind can no longer be used in Java unless the attribute content in the tag is fixed.

Why do many popular languages either not support lookbehind at all or support only fixed-length subexpressions? To answer that, first analyze how lookbehind matches.
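
One common way around this limitation, sketched here as an alternative rather than as something the original post proposes, is to consume the opening tag normally and read the content from a capture group. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivContent {
    // The opening tag is matched (not asserted), so its length may vary freely.
    private static final Pattern DIV = Pattern.compile("<div[^>]*>([^<]+)</div>");

    // Returns the text inside the first div, or null if there is none.
    static String innerText(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(innerText("<div>a test</div>"));              // a test
        System.out.println(innerText("<div id=\"test1\">a test</div>")); // a test
    }
}
```

The trade-off is that the tag becomes part of the overall match, so the wanted text has to be read from group 1 instead of group 0.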

2.1.2 How lookbehind matches in Java

Java does not support variable-length lookbehind, so only lookbehind with a fixed-length subexpression is analyzed here.
Source string: <div>a test</div>
Regular expression: (?<=<div>)[^<]+(?=</div>)

It should be made clear that, whatever the regular expression is, matching always starts at position 0 of the string.
First, "(?<=<div>)" takes control and attempts a match starting at position 0. Because the length of "<div>" is fixed at 5, the engine looks 5 characters to the left of the current position. At position 0 there are no characters to the left, so the attempt fails.
The engine's transmission advances to the right and tries position 1, which fails in the same way, and so on until position 5. There, 5 characters are found to the left, so the condition is met and control passes to the subexpression "<div>" inside "(?<=<div>)". "<div>" starts matching rightward from position 0. Because a regex matches character by character, control passes to the "<" in "<div>", which tries the "<" in the string and succeeds; next "d" tries the "d" in the string and succeeds. In the same way, "<div>" matches the "<div>" between position 0 and position 5. At this point "(?<=<div>)" succeeds, and the position it reports is position 5.
For the rest of the matching process, refer to "Regex Basics: Lookaround" and "Regex Basics: How the NFA Engine Matches".
What happens with the quantifier "?"? Look at the example below.
Source string: cba
Regular expression: (?<=(c?b))a
String test = "cba";
String reg = "(?<=(c?b))a";
Matcher m = Pattern.compile(reg).matcher(test);
while (m.find())
{
    System.out.println(m.group());
    System.out.println(m.group(1));
}
/*--------Output--------
a
b
*/

As you can see, "c?" did not take part in the match. Here "?" does not act greedily; it only provides branching. There are two branches: one needs to look back one character from the current position, and the other needs to look back two characters. The engine tries both cases from the current position, giving priority to the branch that looks back fewer characters. If that branch matches, the other branch is not tried; the other branch is tried only if the first one fails.
String test = "dcba";
String reg = "(?<=(dc?b))a";
Matcher m = Pattern.compile(reg).matcher(test);
while (m.find())
{
    System.out.println(m.group());
    System.out.println(m.group(1));
}
/*--------Output--------
a
dcb
*/
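
To see that this differs from ordinary greedy matching, the sketch below compares the same group "(c?b)" outside and inside a lookbehind in Java. The class name and the helper `group1` are made up for this illustration; the expected values follow the behavior described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalInLookbehind {
    // Returns capture group 1 of the first match, or null (hypothetical helper).
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Ordinary matching: "c?" is greedy and takes the "c".
        System.out.println(group1("(c?b)a", "cba"));      // cb
        // Inside lookbehind: the shorter branch is tried first and succeeds.
        System.out.println(group1("(?<=(c?b))a", "cba")); // b
        // The longer branch is used only when the shorter one cannot match.
        System.out.println(group1("(?<=(dc?b))a", "dcba")); // dcb
    }
}
```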

Although there are two branches, the number of characters to look back is predictable, so supporting "?" is not complicated. But what if other variable-length quantifiers were supported?
2.1.3 How lookbehind matches in .NET
.NET supports variable-length quantifiers in lookbehind, and with them the matching process becomes complicated. First, look at how the fixed-length case matches.
string test = "<div>a test</div>";
Regex reg = new Regex(@"(?<=<div>)[^<]+(?=</div>)");
Match m = reg.Match(test);
if (m.Success)
{
    richTextBox2.Text += m.Value + "\n";
}
/*--------Output--------
a test
*/

The result shows that when the lookbehind subexpression has a fixed length, .NET should behave the same way as Java. What about variable-length quantifiers?
string test = "cba";
Regex reg = new Regex(@"(?<=(c?b))a");
Match m = reg.Match(test);
if (m.Success)
{
    richTextBox2.Text += m.Value + "\n";
    richTextBox2.Text += m.Groups[1].Value + "\n";
}
/*--------Output--------
a
cb
*/

As you can see, "?" behaves greedily here. This raises a question: does the engine still try leftward from the current position, or does it try rightward from the beginning of the string?
string test = "<ddd<cccba";
Regex reg = new Regex(@"(?<=(<.*?b))a");
Match m = reg.Match(test);
if (m.Success)
{
    richTextBox2.Text += m.Value + "\n";
    richTextBox2.Text += m.Groups[1].Value + "\n";
}
/*--------Output--------
a
<cccb
*/

The result shows that with a variable-length quantifier in lookbehind, the engine still tries leftward from the current position; otherwise the content of Groups[1] would be "<ddd<cccb" rather than "<cccb".
That was the non-greedy case. Now look at the greedy case.
string test = "e<ddd<cccba";
Regex reg = new Regex(@"(?<=(<.*b))a");
Match m = reg.Match(test);
if (m.Success)
{
    richTextBox2.Text += m.Value + "\n";
    richTextBox2.Text += m.Groups[1].Value + "\n";
}
/*--------Output--------
a
<ddd<cccb
*/

As you can see, in greedy mode, even though a match has already succeeded at the "<" before "c", the engine keeps trying because the quantifier is greedy. It continues all the way to the start of the string and takes the longest successful match as the result.
2.2 Matching process
Now return to the matching process of lookbehind.
Source string: <div id="test1">a test</div>
Regular expression: (?<=<div[^>]*>)[^<]+(?=</div>)


First, "(?<=<div[^>]*>)" takes control and attempts a match starting at position 0. Because the length of "<div[^>]*>" is not fixed, the engine has to search leftward from the current position character by character. It is possible that the engine applies an optimization and first computes the minimum length to look back: "<div[^>]*>" needs at least 5 characters, so the engine would step 5 characters to the left of the current position before it starts trying. This depends on how each language's regex engine is implemented; my guess is that the minimum length is computed first. Either way, the attempt at position 0 fails because there are no characters to the left.

The engine's transmission advances to the right and tries position 1, which fails in the same way, and so on until position 5. There, 5 characters are available to the left, so the condition is met and control passes to the subexpression "<div[^>]*>" inside "(?<=<div[^>]*>)". "<div[^>]*>" starts matching rightward from position 0. Because a regex matches character by character, control passes to "<", which tries the "<" in the string and succeeds; next "d" tries "d" and succeeds. In the same way, "<div[^>]*" matches the "<div " between position 0 and position 5, and when "[^>]*" matches the space after "<div" it records a state that can be backtracked to. Control then passes to ">", but no characters are left to match, so ">" fails. The engine backtracks: "[^>]*" gives up the space it matched and ">" tries it, which also fails. No backtracking states remain, so this round of attempts fails.

The transmission advances again. Attempts from position 6 onward fail in the same way until position 16. With position 16 as the current position, "(?<=<div[^>]*>)" takes control, steps 5 characters to the left, finds the condition satisfied, records the backtracking state, and passes control to the subexpression "<div[^>]*>". The subexpression starts matching rightward from position 11: the "<" in "<div[^>]*>" tries the character "s" and fails. The engine moves one position further left: at position 10, "<" tries the character "e" and fails. The same happens until position 0, where "<div[^>]*>" matches rightward and successfully matches "<div id="test1">". At that point "(?<=<div[^>]*>)" succeeds and control passes to "[^<]+", which continues the match until the whole expression succeeds.

The matching process for a regular expression of the form "(?<=SubExp1)SubExp2" can be summarized as follows:

1. Advance from position 0 to the right until a position x is found that satisfies the minimum length requirement of "(?<=SubExp1)";
2. From position x, step left to the position y that satisfies the minimum length requirement of "SubExp1";
3. "SubExp1" attempts a match rightward starting from position y; if the attempt fails, y moves one position to the left and the attempt is repeated;
4. If "SubExp1" is fixed-length or non-greedy, stop trying as soon as one successful match is found;
5. If "SubExp1" is greedy, try all possibilities and take the longest successful match as the result;
6. After "(?<=SubExp1)" succeeds, control passes to the following subexpression, which continues the match.

One point to note: when the subexpression "SubExp1" inside the lookbehind matches successfully, the position where its match starts is unpredictable, but the position where its match ends must be position x.
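
The steps above can be sketched as a small Java simulation. This models the procedure as the article describes it; it is not how the .NET or Java engines are actually implemented. The class name, the method `lookbehindAt`, and the parameters `minLen` and `greedy` are hypothetical, and the caller is assumed to supply the minimum length of SubExp1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindModel {
    /**
     * Simulates "(?<=subExp1)" at position x according to the steps above.
     * Returns the text subExp1 would match, or null if the lookbehind fails.
     * minLen (the minimum length of subExp1) is passed in by the caller.
     */
    static String lookbehindAt(String input, int x, String subExp1,
                               int minLen, boolean greedy) {
        Matcher m = Pattern.compile(subExp1).matcher(input);
        String result = null;
        // Step 2: start at y = x - minLen, then keep moving y to the left.
        for (int y = x - minLen; y >= 0; y--) {
            // Step 3: region(y, x) plus matches() forces a match that
            // starts at y and ends exactly at x.
            if (m.region(y, x).matches()) {
                result = m.group();
                if (!greedy) {
                    break; // Step 4: the first success wins.
                }
                // Step 5: keep trying; a later, longer success replaces this one.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lookbehindAt("<ddd<cccba", 9, "<.*?b", 2, false));  // <cccb
        System.out.println(lookbehindAt("e<ddd<cccba", 10, "<.*b", 2, true));  // <ddd<cccb
        String div = "<div id=\"test1\">a test</div>";
        System.out.println(lookbehindAt(div, 5, "<div[^>]*>", 5, false));      // null
        System.out.println(lookbehindAt(div, 16, "<div[^>]*>", 5, false));     // <div id="test1">
    }
}
```

Because the end of the region is pinned to x, every successful match ends at x while its start y varies, which is exactly the point made in the note above.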

3 Problem analysis and summary

3.1 Problem analysis
Now look back at the original question.
string test = @"<font color=""#008000"">**Here is an unfixed string 1**</font>
<font color=""#008000"">**Here is an unfixed string 2**</font>
<font color=""#008000"">**Here is an unfixed string 3**</font>";
MatchCollection mc = Regex.Matches(test, @"(?<=<font[\s\S]*?>)([\s\S]*?)(?=</font>)");
foreach (Match m in mc)
{
    richTextBox2.Text += m.Value + "\n---------------\n";
}
/*--------Output--------
**Here is an unfixed string 1**
---------------
<font color="#008000">**Here is an unfixed string 2**
---------------
<font color="#008000">**Here is an unfixed string 3**
---------------
*/

What is really confusing here is what the lookbehind itself matched. To illustrate the problem better, change the regex so that the lookbehind subexpression is captured in a group.

string test = @"<font color=""#008000"">**Here is an unfixed string 1**</font>
<font color=""#008000"">**Here is an unfixed string 2**</font>
<font color=""#008000"">**Here is an unfixed string 3**</font>";
MatchCollection mc = Regex.Matches(test, @"(?<=(<font[\s\S]*?>))([\s\S]*?)(?=</font>)");
for (int i = 0; i < mc.Count; i++)
{
    richTextBox2.Text += "Round " + (i + 1) + " successful match result:\n";
    richTextBox2.Text += "Group[0]:" + mc[i].Value + "\n";
    richTextBox2.Text += "Group[1]:" + mc[i].Groups[1].Value + "\n---------------\n";
}
/*--------Output--------
Round 1 successful match result:
Group[0]:**Here is an unfixed string 1**
Group[1]:<font color="#008000">
---------------
Round 2 successful match result:
Group[0]:
<font color="#008000">**Here is an unfixed string 2**
Group[1]:<font color="#008000">**Here is an unfixed string 1**</font>
---------------
Round 3 successful match result:
Group[0]:
<font color="#008000">**Here is an unfixed string 3**
Group[1]:<font color="#008000">**Here is an unfixed string 2**</font>
---------------
*/

The result of the first round should need no explanation.
The first successful match ends at the position just before the first "</font>", and the attempts for the second round start from there.
First, "(?<=<font[\s\S]*?>)" takes control and steps 6 characters to the left before it starts trying. "<" fails to match, so the engine keeps stepping left until position 0, where "<font" can match. However, for "<font[\s\S]*?>" to succeed, its match must end at the current position, which is the position just before the first "</font>", so ">" fails and the whole expression fails at this position.
The engine's transmission advances to the right until the position just after the first "</font>". There "<font[\s\S]*?>" matches successfully: the match starts at position 0 and ends after the first "</font>", and the matched content is "<font color="#008000">**Here is an unfixed string 1**</font>", of which "[\s\S]*?" matched " color="#008000">**Here is an unfixed string 1**</font". The following subexpressions continue to match until the second round succeeds.
The third round proceeds in essentially the same way as the second. Because non-greedy mode is used, "<font[\s\S]*?>" stops as soon as it has matched "<font color="#008000">**Here is an unfixed string 2**</font>" and makes no further attempts to the left.
Next, look at the result with a greedy quantifier.
string test = @"<font color=""#008000"">**Here is an unfixed string 1**</font>
<font color=""#008000"">**Here is an unfixed string 2**</font>
<font color=""#008000"">**Here is an unfixed string 3**</font>";
MatchCollection mc = Regex.Matches(test, @"(?<=(<font[\s\S]*>))([\s\S]*?)(?=</font>)");
for (int i = 0; i < mc.Count; i++)
{
    richTextBox2.Text += "Round " + (i + 1) + " match result:\n";
    richTextBox2.Text += "Group[0]:" + mc[i].Value + "\n";
    richTextBox2.Text += "Group[1]:" + mc[i].Groups[1].Value + "\n---------------\n";
}
/*--------Output--------
Round 1 match result:
Group[0]:**Here is an unfixed string 1**
Group[1]:<font color="#008000">
---------------
Round 2 match result:
Group[0]:
<font color="#008000">**Here is an unfixed string 2**
Group[1]:<font color="#008000">**Here is an unfixed string 1**</font>
---------------
Round 3 match result:
Group[0]:
<font color="#008000">**Here is an unfixed string 3**
Group[1]:<font color="#008000">**Here is an unfixed string 1**</font>
<font color="#008000">**Here is an unfixed string 2**</font>
---------------
*/
Only one character differs. The match results of the whole expression do not change, but the matching process is very different.
So what should be done to get the following result?
/*--------Output--------
**Here is an unfixed string 1**
---------------
**Here is an unfixed string 2**
---------------
**Here is an unfixed string 3**
---------------
*/

This can be achieved by narrowing the range of characters that the quantified subexpressions are allowed to match.
string test = @"<font color=""#008000"">**Here is an unfixed string 1**</font>
<font color=""#008000"">**Here is an unfixed string 2**</font>
<font color=""#008000"">**Here is an unfixed string 3**</font>";
MatchCollection mc = Regex.Matches(test, @"(?is)(?<=(<font[^>]*>))(?:(?!</?font\b).)*(?=</font>)");
for (int i = 0; i < mc.Count; i++)
{
    richTextBox2.Text += "Round " + (i + 1) + " match result:\n";
    richTextBox2.Text += "Group[0]:" + mc[i].Value + "\n";
    richTextBox2.Text += "Group[1]:" + mc[i].Groups[1].Value + "\n---------------\n";
}
/*--------Output--------
Round 1 match result:
Group[0]:**Here is an unfixed string 1**
Group[1]:<font color="#008000">
---------------
Round 2 match result:
Group[0]:**Here is an unfixed string 2**
Group[1]:<font color="#008000">
---------------
Round 3 match result:
Group[0]:**Here is an unfixed string 3**
Group[1]:<font color="#008000">
---------------
*/
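
For engines that restrict lookbehind, the same narrowing idea can be combined with a capture group. The Java sketch below is an adaptation, not the original author's code: it keeps "[^>]*" and the "(?:(?!</?font\b).)*" construct from the regex above but consumes the tags instead of asserting them. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FontText {
    // (?is): case-insensitive, and "." also matches line terminators.
    private static final Pattern FONT = Pattern.compile(
            "(?is)<font[^>]*>((?:(?!</?font\\b).)*)</font>");

    // Returns the text inside every font element, in document order.
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = FONT.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String test =
                "<font color=\"#008000\">**Here is an unfixed string 1**</font>\n"
              + "<font color=\"#008000\">**Here is an unfixed string 2**</font>\n"
              + "<font color=\"#008000\">**Here is an unfixed string 3**</font>";
        for (String s : extract(test)) {
            System.out.println(s);
        }
    }
}
```

Run on the sample text, this prints the three inner strings, one per line.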

3.2 Summary of lookbehind usage
The analysis above shows that when variable-length quantifiers are used in lookbehind, the matching process is very complex and the cost is high. This may be why most languages currently either do not support lookbehind or do not support variable-length quantifiers inside it.
Some points to note when applying regular expressions:
1. Do not casually use variable-length quantifiers in lookbehind unless it is really necessary;
2. In any scenario, not just lookbehind, do not casually apply quantifiers to subexpressions that match a very wide range of characters; be especially careful with the dot "." and with "[\s\S]".
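
The second point also applies outside lookbehind, as this small Java sketch shows. The sample string, class name, and helper `firstMatch` are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BroadQuantifier {
    // Returns the first match of regex in input, or null (hypothetical helper).
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>x</b> and <b>y</b>";
        // "." matches almost anything, so the greedy ".*" runs to the last </b>.
        System.out.println(firstMatch("<b>.*</b>", html));    // <b>x</b> and <b>y</b>
        // "[^<]*" cannot cross a tag, so the match stays inside one element.
        System.out.println(firstMatch("<b>[^<]*</b>", html)); // <b>x</b>
    }
}
```

Narrowing the character class both gives the intended result and leaves the engine far fewer positions to try and backtrack over.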
