Regular expressions: an in-depth look at the matching principle of lookbehind

Source: Internet
Author: User

Description: Some of this content still needs further study and revision, but work has been too busy lately and I have not had the time. If you have not studied this topic, you can skip this article; if you want to study it, do not let my line of thinking constrain you; and if you have studied it thoroughly, corrections are welcome.

1 The problem

A few days ago I came across the following problem on the CSDN forum:
var str = "8912341253789";
The duplicate digits in this string need to be removed, giving the result 89123457.
The first thing to note is that this requirement is not well suited to a regular expression; at the very least, a regex is not the best way to implement it.
The problem itself is not the focus of this article. What this article discusses is another question, about the regex matching principle, that the solution to this problem leads to.
Let's first look at the solution that was given for the problem itself.

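The code is as follows:

string str = "8912341253789";
// Group 1 is the text up to the repeated digit; group 2 is the digit that repeats.
Regex reg = new Regex(@"((\d)\d*?)\2");
// Each pass drops the later occurrence of a repeated digit; loop until nothing changes.
while (str != (str = reg.Replace(str, "$1"))) { }
richTextBox2.Text = str;
/*--------Output--------
89123457
*/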
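Since the requirement was described above as a poor fit for a regex, here is a minimal non-regex sketch for comparison. It assumes the goal is simply to keep the first occurrence of each character in order. The names DedupeSketch and RemoveDuplicateChars are made up for this illustration and do not come from the forum thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class DedupeSketch
{
    // Keeps the first occurrence of every character and preserves the order.
    public static string RemoveDuplicateChars(string input)
    {
        var seen = new HashSet<char>();
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // HashSet<char>.Add returns false when the character was already present.
            if (seen.Add(c))
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveDuplicateChars("8912341253789")); // prints 89123457
    }
}
```

This makes a single pass over the string, whereas the regex version has to rescan the string once per removed digit.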

Based on this, a friend asked another question: why does the following have no effect?
"(?<=(?<value>\d).*?)\k<value>"
That question involves some details of lookbehind, namely its matching principle and matching process. Two earlier blog posts, "Regex Basics: Lookaround" and "Regex Applications: Exploring Lookbehind", also covered this, but not in enough depth. This article takes the complex scenario of a lookbehind combined with a backreference and uses it to examine lookbehind in depth.
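The friend's observation is easy to reproduce. The sketch below assumes the pattern was used with Regex.Replace and an empty replacement, which the original thread does not state; the helper name RemoveMatches is hypothetical. A greedy variant is included only for comparison; the analysis in the rest of this article explains why the two behave differently.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupLookbehindDemo
{
    // Removes every match of the pattern from the input.
    public static string RemoveMatches(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }

    static void Main()
    {
        string str = "8912341253789";

        // Lazy ".*?" inside the lookbehind: no match is found, so nothing is removed.
        Console.WriteLine(RemoveMatches(str, @"(?<=(?<value>\d).*?)\k<value>"));
        // prints 8912341253789

        // Greedy ".*" inside the lookbehind: only later copies of the first digit are removed.
        Console.WriteLine(RemoveMatches(str, @"(?<=(?<value>\d).*)\k<value>"));
        // prints 891234125379
    }
}
```

Neither variant removes all duplicates, which is why the loop with "((\d)\d*?)\2" was used instead.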
First, simplify and abstract the problem. The expression above uses a named capture group and a backreference to that named group, which adds some complexity. Rewrite it with an ordinary capture group, and replace the overly broad "." with "\d", as follows:
"(?<=(\d)\d*?)\1"
The strings to be matched are abstracted as well; the following two typical strings are used.
Source string one: 878
Source string two: 9878
Together with the regular expression above, there are four regular expressions of similar form.
Regular expression one: (?<=(\d)\d*)\1
Regular expression two: (?<=(\d)\d*?)\1
Regular expression three: (?<=(\d))\d*\1
Regular expression four: (?<=(\d))\d*?\1
First, look at the matching results:
The code is as follows:

string[] source = new string[] { "878", "9878" };
List<Regex> regs = new List<Regex>();
regs.Add(new Regex(@"(?<=(\d)\d*)\1"));   // regular expression one
regs.Add(new Regex(@"(?<=(\d)\d*?)\1"));  // regular expression two
regs.Add(new Regex(@"(?<=(\d))\d*\1"));   // regular expression three
regs.Add(new Regex(@"(?<=(\d))\d*?\1"));  // regular expression four
foreach (string s in source)
{
    foreach (Regex r in regs)
    {
        richTextBox2.Text += "Source string: " + s.PadRight(8, ' ');
        richTextBox2.Text += "Regular expression: " + r.ToString().PadRight(18, ' ');
        richTextBox2.Text += "Match result: " + r.Match(s).Value + "\n------------------------\n";
    }
    richTextBox2.Text += "------------------------\n";
}

/*--------Output--------
Source string: 878     Regular expression: (?<=(\d)\d*)\1    Match result: 8
------------------------
Source string: 878     Regular expression: (?<=(\d)\d*?)\1   Match result:
------------------------
Source string: 878     Regular expression: (?<=(\d))\d*\1    Match result: 78
------------------------
Source string: 878     Regular expression: (?<=(\d))\d*?\1   Match result: 78
------------------------
------------------------
Source string: 9878    Regular expression: (?<=(\d)\d*)\1    Match result:
------------------------
Source string: 9878    Regular expression: (?<=(\d)\d*?)\1   Match result:
------------------------
Source string: 9878    Regular expression: (?<=(\d))\d*\1    Match result: 78
------------------------
Source string: 9878    Regular expression: (?<=(\d))\d*?\1   Match result: 78
------------------------
------------------------
*/
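The output above shows only the matched text. As a supplement, the console sketch below also prints where each match starts and what capture group 1 holds, which makes the later analysis easier to follow. The class name MatchDetailsDemo, the helper Describe and its output format are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchDetailsDemo
{
    // Describes the first match: start index, matched text,
    // and the text and index of capture group 1.
    public static string Describe(string source, string pattern)
    {
        Match m = Regex.Match(source, pattern);
        if (!m.Success)
        {
            return "no match";
        }
        return "index=" + m.Index + " value=" + m.Value
             + " group1=" + m.Groups[1].Value + "@" + m.Groups[1].Index;
    }

    static void Main()
    {
        string[] sources = { "878", "9878" };
        string[] patterns =
        {
            @"(?<=(\d)\d*)\1",
            @"(?<=(\d)\d*?)\1",
            @"(?<=(\d))\d*\1",
            @"(?<=(\d))\d*?\1"
        };
        foreach (string s in sources)
        {
            foreach (string p in patterns)
            {
                Console.WriteLine(s.PadRight(6) + p.PadRight(18) + Describe(s, p));
            }
        }
    }
}
```

For source string one and regular expression one, for example, Describe returns "index=2 value=8 group1=8@0": the match is the second "8", while group 1 captured the first "8" at index 0.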
These results may surprise many people. When I first ran into this problem I was confused too; I set it aside for two days, then something clicked and I worked out the crux of the problem, which is discussed below.
Before that, two notes:
1. The topic discussed below has little to do with the problem mentioned at the start of this article. That problem only serves to introduce the topic and is itself outside the scope of the discussion; this article is mainly a purely theoretical discussion.
2. This article is intended for readers who already have some regex background. If the matching results above and the matching process puzzle you, that does not matter; but if you are not clear about the meaning of the regex metacharacters and syntax used above, start with the basics first.
2 The lookbehind matching principle in depth

Regular expression one: (?<=(\d)\d*)\1
Regular expression two: (?<=(\d)\d*?)\1
Regular expression three: (?<=(\d))\d*\1
Regular expression four: (?<=(\d))\d*?\1

The regular expressions above can all be abstracted into the form "(?<=SubExp1)SubExp2". For the analysis of how the lookbehind matches, they fall into three categories according to the characteristics of "SubExp1":

1. The subexpression "SubExp1" in the lookbehind has a fixed length. Regular expressions three and four belong to this category. This category also includes the "?" quantifier, but only that one quantifier.
2. The subexpression "SubExp1" in the lookbehind does not have a fixed length and contains lazy quantifiers such as "*?", "+?" and "{m,}?"; this is usually called non-greedy mode. Regular expression two belongs to this category.
3. The subexpression "SubExp1" in the lookbehind does not have a fixed length and contains greedy quantifiers such as "*", "+" and "{m,}"; this is usually called greedy mode. Regular expression one belongs to this category.

The matching process for each of these three categories is analyzed below.

2.1 Analysis of the matching process for fixed-length subexpressions
2.1.1 Source string one + regular expression three

Source string one: 878
Regular expression three: (?<=(\d))\d*\1
First, a match is attempted at position 0. "(?<=(\d))" gets control. Its length is fixed at exactly one character, so it looks one character to the left of position 0, which fails. "(?<=(\d))" fails, and so the first round of matching fails.
The regex engine then advances and attempts a match at position 1. Control passes to "(?<=(\d))", which looks one character to the left and hands control to "(\d)", which in turn hands it to "\d". "\d" tries to match to the right and successfully matches "8". At this point "(?<=(\d))" has matched; the match result is position 1, and capture group 1 contains "8". Control passes to "\d*". Because "\d*" is greedy, it first tries to match the "7" and "8" after position 1; this succeeds, the backtracking states are recorded, and control passes to "\1". Because capture group 1 captured "8", "\1" must match an "8" to succeed, but the end of the string has already been reached, so the match fails. "\d*" backtracks and gives up the final "8", then hands control to "\1" again, and "\1" matches the final "8". The entire expression has now matched successfully. Because "(?<=(\d))" matches only a position and consumes no characters, the overall match result is "78", where "\d*" matched "7" and "\1" matched "8".
2.1.2 Source string two + regular expression three

Source string two: 9878
Regular expression three: (?<=(\d))\d*\1
The matching process for this combination is essentially the same as in section 2.1.1, with just one more round of matching attempts, so it is not repeated here.
2.1.3 Source string one + regular expression four
Source string one: 878
Regular expression four: (?<=(\d))\d*?\1
First, a match is attempted at position 0. "(?<=(\d))" gets control. Its length is fixed at exactly one character, so it looks one character to the left of position 0, which fails. "(?<=(\d))" fails, and so the first round of matching fails.
The regex engine then advances and attempts a match at position 1. Control passes to "(?<=(\d))", which looks one character to the left and hands control to "(\d)", which in turn hands it to "\d". "\d" tries to match to the right and successfully matches "8". At this point "(?<=(\d))" has matched; the match result is position 1, and capture group 1 contains "8". Control passes to "\d*?". Because "\d*?" is non-greedy, it first tries to skip matching, records a backtracking state, and passes control to "\1". Because capture group 1 captured "8", "\1" must match an "8" to succeed, but the character after position 1 is "7", so the match fails. "\d*?" backtracks and matches the "7" after position 1, then hands control to "\1" again, and "\1" matches the final "8". The entire expression has now matched successfully. Because "(?<=(\d))" matches only a position and consumes no characters, the overall match result is "78", where "\d*?" matched "7" and "\1" matched the final "8".
This is essentially the same as the matching process in section 2.1.1; only the matching and backtracking of "\d*" and "\d*?" differ.
2.1.4 Source string two + regular expression four
Source string two: 9878
Regular expression four: (?<=(\d))\d*?\1
The matching process for this combination is essentially the same as in section 2.1.3, so it is not repeated here.
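On the two source strings used in this article, regular expressions three and four give the same result, so the difference between "\d*" and "\d*?" after the lookbehind is not visible. The sketch below uses a longer input, "87878", which is a hypothetical string chosen for this illustration, to make the difference show up. The class and helper names are also assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVersusLazyTailDemo
{
    // Returns the text of the first match, or null when there is no match.
    public static string FirstMatch(string source, string pattern)
    {
        Match m = Regex.Match(source, pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string source = "87878"; // hypothetical input, not from the article

        // Greedy tail: "\d*" takes everything, then backtracks until "\1" can match the last "8".
        Console.WriteLine(FirstMatch(source, @"(?<=(\d))\d*\1"));  // prints 7878

        // Lazy tail: "\d*?" takes as little as possible, so "\1" matches the nearest "8".
        Console.WriteLine(FirstMatch(source, @"(?<=(\d))\d*?\1")); // prints 78
    }
}
```

In both cases the fixed-length lookbehind succeeds at position 1 with "8" in capture group 1; only the part after the lookbehind behaves differently.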
2.2 Analysis of the matching process for non-greedy subexpressions
2.2.1 Source string one + regular expression two
Source string one: 878
Regular expression two: (?<=(\d)\d*?)\1
First, a match is attempted at position 0. "(?<=(\d)\d*?)" gets control. Its length is not fixed, but it needs at least one character, so it looks one character to the left of position 0, which fails. "(?<=(\d)\d*?)" fails, and so the first round of matching fails.
The regex engine then advances and attempts a match at position 1. Control passes to "(?<=(\d)\d*?)", which looks one character to the left and hands control to "(\d)", which in turn hands it to "\d". "\d" tries to match to the right and successfully matches "8". Control passes to "\d*?"; because "\d*?" is non-greedy, it first tries to skip matching, that is, to match nothing, and records a backtracking state. At this point "(\d)\d*?" has matched, so "(?<=(\d)\d*?)" has matched, and the match result is position 1. Because the subexpression "(\d)\d*?" is non-greedy, as soon as one successful match is found, control is handed over and all backtracking states are discarded. Because capture group 1 captured "8", "\1" must match an "8" to succeed, but the character after position 1 is "7" and there are no states left to backtrack to, so the entire expression fails at position 1.
The regex engine advances again and attempts a match at position 2. Control passes to "(?<=(\d)\d*?)", which looks one character to the left and hands control to "(\d)", which in turn hands it to "\d". "\d" tries to match to the right and successfully matches "7". Control passes to "\d*?"; because "\d*?" is non-greedy, it first tries to skip matching, that is, to match nothing, and records a backtracking state. At this point "(\d)\d*?" has matched, so "(?<=(\d)\d*?)" has matched, and the match result is position 2. Because the subexpression "(\d)\d*?" is non-greedy, as soon as one successful match is found, control is handed over and all backtracking states are discarded. Because capture group 1 captured "7", "\1" must match a "7" to succeed, but the character after position 2 is "8" and there are no states left to backtrack to, so the entire expression fails at position 2.
At position 3 the process is the same, except that "\1" has no character left to match, so the entire expression fails.
Every position in the string has now been tried and every attempt has failed, so the entire expression fails to match and there is no valid match result.
2.2.2 Source string two + regular expression two
Source string two: 9878
Regular expression two: (?<=(\d)\d*?)\1
The matching process for this combination is essentially the same as in section 2.2.1, so it is not repeated here.
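To see that the failure really comes from the backtracking states being discarded when the lookbehind hands over control, it can help to compare with the same subexpression used outside a lookbehind. The following is only a rough analogy, sketched under that assumption: outside the lookbehind the lazy "\d*?" extends to the right instead of to the left, but the saved state can be revisited. The class and helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class AtomicLookbehindDemo
{
    // Returns the text of the first match, or null when there is no match.
    public static string FirstMatch(string source, string pattern)
    {
        Match m = Regex.Match(source, pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // Outside a lookbehind: when "\1" fails, the engine backtracks into "\d*?",
        // lets it take the "7", and "\1" then matches the final "8".
        Console.WriteLine(FirstMatch("878", @"(\d)\d*?\1") ?? "(no match)");      // prints 878

        // Inside the lookbehind: the saved state of "\d*?" is gone once the lookbehind
        // has succeeded, so a failing "\1" cannot make it try a longer match.
        Console.WriteLine(FirstMatch("878", @"(?<=(\d)\d*?)\1") ?? "(no match)"); // prints (no match)
    }
}
```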
2.3 Analysis of the matching process for greedy subexpressions
2.3.1 Source string one + regular expression one
Source string one: 878
Regular expression one: (?<=(\d)\d*)\1
First, a match is attempted at position 0. "(?<=(\d)\d*)" gets control. Its length is not fixed, but it needs at least one character, so it looks one character to the left of position 0, which fails. "(?<=(\d)\d*)" fails, and so the first round of matching fails.
The regex engine then advances and attempts a match at position 1. Control passes to "(?<=(\d)\d*)", which looks one character to the left and hands control to "(\d)", which in turn hands it to "\d". "\d" tries to match to the right and successfully matches "8". Control passes to "\d*"; because "\d*" is greedy, it first tries to match and records a backtracking state, but there are no characters available to match, so the attempt fails, it backtracks, matches nothing, and the backtracking state is discarded. At this point "(\d)\d*" has matched, with the content "8", so "(?<=(\d)\d*)" has a successful match at position 1. Because the subexpression here is greedy, after "(\d)\d*" obtains a successful match it still has to check whether a longer match exists, and control is handed over only when the longest match has been found. Looking further left, there are no more characters, so "8" is already the longest match; control is handed over and all backtracking states are discarded. Because capture group 1 captured "8", "\1" must match an "8" to succeed, but the character after position 1 is "7" and there are no states left to backtrack to, so the entire expression fails at position 1.
The regex engine advances again and attempts a match at position 2. Control passes to "(?<=(\d)\d*)", which looks one character to the left and hands control to "(\d)", which in turn hands it to "\d". "\d" tries to match to the right and successfully matches "7". Control passes to "\d*"; because "\d*" is greedy, it first tries to match and records a backtracking state, but there are no characters available to match, so the attempt fails, it backtracks, matches nothing, and the backtracking state is discarded. At this point "(\d)\d*" has matched, with the content "7", so "(?<=(\d)\d*)" has a successful match at position 2. Because the subexpression here is greedy, it still has to check whether a longer match exists. It looks further left and tries to match to the right starting from position 0: "\d" gets control and matches the "8" at position 0, then control passes to "\d*", which is greedy, records a backtracking state, and matches the "7" at position 1. "(\d)\d*" has now found another successful match, "87", in which capture group 1 is "8". Looking further left, there are no more characters, so "87" is the longest match; control is handed over and all backtracking states are discarded. Because capture group 1 captured "8", "\1" successfully matches the "8" after position 2, and the entire expression now matches successfully.
The demo code uses Match, which takes only the first match. If Matches were used, the regex would have to try every position. For this combination the reasoning is the same: at position 3, "\1" has no character to match, so the match necessarily fails.
This completes the matching for this combination. There is one successful match: the result is "8", starting at position 2, that is, the second "8" in the string.
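The remark about Matches can be checked with a short sketch. It lists every match of regular expression one as "index:value". The second input, "87878", is a hypothetical string added for this illustration, and the names MatchesDemo and AllMatches are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MatchesDemo
{
    // Lists every match as "index:value", separated by spaces.
    public static string AllMatches(string source, string pattern)
    {
        var parts = new List<string>();
        foreach (Match m in Regex.Matches(source, pattern))
        {
            parts.Add(m.Index + ":" + m.Value);
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        string regexOne = @"(?<=(\d)\d*)\1";

        // Only the second "8" matches; position 3 has nothing left for "\1".
        Console.WriteLine(AllMatches("878", regexOne));   // prints 2:8

        // With a greedy lookbehind, group 1 is always the first digit of the string,
        // so every later "8" matches.
        Console.WriteLine(AllMatches("87878", regexOne)); // prints 2:8 4:8
    }
}
```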
2.3.2 Source string two + regular expression one
Source string two: 9878
Regular expression one: (?<=(\d)\d*)\1
First, a match is attempted at position 0. "(?<=(\d)\d*)" gets control. Its length is not fixed, but it needs at least one character, so it looks one character to the left of position 0, which fails. "(?<=(\d)\d*)" fails, and so the first round of matching fails.
The regex engine then advances and attempts a match at position 1. This round is similar to position 1 in section 2.3.1, except that "(\d)\d*" matches "9" and capture group 1 is "9", so "\1" fails to match and the entire expression fails at position 1.
The regex engine advances and attempts a match at position 2. This round is similar to position 2 in section 2.3.1. First "(\d)\d*" finds a successful match with the content "8", and capture group 1 is also "8". It then tries further to the left and finds another successful match with the content "98", in which capture group 1 is "9". Looking further left there are no more characters, so "98" is the longest match; "(?<=(\d)\d*)" matches successfully and the match result is position 2. Because capture group 1 is "9", "\1" fails to match at position 2, and the entire expression fails at position 2.
The regex engine advances and attempts a match at position 3. This round is similar to the previous round at position 2. First "(\d)\d*" finds the successful match "7"; it tries further to the left and finds the successful match "87"; it tries further to the left again and finds the successful match "987". This is the longest match, so control is handed over and all backtracking states are discarded. Capture group 1 is "9", so "\1" fails to match at position 3, and the entire expression fails at position 3.
Finally, at position 4 the match necessarily fails because "\1" has no character to match.
All positions in the source string have now been tried, every attempt has failed, and no successful match was found.

2.4 Summary

The matching analyses above look complicated, but they come down to the following points.
1. When the subexpression in the lookbehind has a fixed length, it either matches or fails; there is nothing more to it.
2. When the subexpression in the lookbehind is non-greedy, control is handed over as soon as one successful match is found, and all backtracking states are discarded.
3. When the subexpression in the lookbehind is greedy, control is handed over only once the longest successful match has been found, and all backtracking states are discarded.
In other words, for the regular expression "(?<=SubExp1)SubExp2", once "(?<=SubExp1)" hands over control, the position it matched is fixed, what "SubExp1" matched is fixed, and no backtracking states remain.
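Points 2 and 3 can be observed directly by running the lookbehind on its own, without the trailing "\1", and recording what capture group 1 holds at each position where the lookbehind succeeds. This is only an observation sketch; the names LookbehindCaptureDemo and CapturesByPosition and the "position:group1" output format are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookbehindCaptureDemo
{
    // Returns "position:group1" for every position where the lookbehind alone succeeds.
    // The matches themselves are empty because a lookbehind consumes no characters.
    public static string CapturesByPosition(string source, string lookbehind)
    {
        var parts = new List<string>();
        foreach (Match m in Regex.Matches(source, lookbehind))
        {
            parts.Add(m.Index + ":" + m.Groups[1].Value);
        }
        return string.Join(" ", parts);
    }

    static void Main()
    {
        // Non-greedy: the first successful match wins, so group 1 is the digit
        // immediately to the left of each position.
        Console.WriteLine(CapturesByPosition("9878", @"(?<=(\d)\d*?)")); // prints 1:9 2:8 3:7 4:8

        // Greedy: the longest successful match wins, so group 1 is always the
        // first digit of the string.
        Console.WriteLine(CapturesByPosition("9878", @"(?<=(\d)\d*)"));  // prints 1:9 2:9 3:9 4:9
    }
}
```

Once the capture is fixed in this way, a following "\1" can only compare against that one digit, which is why regular expressions one and two find nothing in "9878".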
3 Summary of the lookbehind matching principle
The following summarizes the matching process of the regular expression "(?<=SubExp1)SubExp2". The figure below is a schematic of lookbehind matching.

Fig. 3-1 Schematic of lookbehind matching

The matching process of "(?<=SubExp1)SubExp2" can be divided into two processes, a main matching process and a child matching process. The main matching process is shown in the figure below.

Fig. 3-2 Main matching flowchart

Main matching process:

1. Attempts start at position 0 and move to the right. Until a position is found that satisfies the minimum length requirement of "(?<=SubExp1)", the match necessarily fails. Continue until such a position x is found.
2. From position x, look to the left for the position y that satisfies the minimum length requirement of "SubExp1".
3. "SubExp1" tries to match to the right starting from position y; this enters an independent child matching process.
4. If, after the attempt at position y, "SubExp1" still needs another round of matching, look one position further left for y', that is, y-1, and enter the independent child matching process again. Repeat until no further round is needed. If the child matching succeeded, go to step 5; if it ultimately failed, report that the entire expression failed to match.
5. After "(?<=SubExp1)" matches successfully, control passes to the following subexpression "SubExp2", which continues the attempt until the entire expression matches or fails; report the success or failure of the entire expression at position x.
6. If necessary, move on to the next position x' and start a new round of matching attempts.
The child matching process is shown in the figure below.

Fig. 3-3 Child matching flowchart

Child matching process:

1. Once a child match is entered, the source string is determined: it is the substring between position y and position x. The regular expression at this point effectively becomes "^SubExp1$", because in this round a successful match must start at y and end at x.
2. When the subexpression has a fixed length, it either matches or fails; the result is returned and no further round is needed.
3. When the subexpression does not have a fixed length, distinguish between non-greedy mode and greedy mode.
4. In non-greedy mode: if the match fails, report failure and require another round; if the match succeeds, discard all backtracking states, report success, and do not attempt another round.
5. In greedy mode: if the match fails, report failure and require another round; if the match succeeds, discard all backtracking states, report success, record the content of this successful match, and require another round, until the longest match is obtained.
In a given round of matching, position x is fixed. For the subexpression "SubExp1" in the lookbehind, the start position of its match is unknown until the final result is reported, and it has to be determined through one or more rounds of child matching; the end position of the match, however, is always position x.
Of course, this applies only to one particular round. When that round fails, the regex engine advances, sets x = x + 1, and starts the next round of matching attempts, until the entire expression reports success or failure.
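The main and child matching processes described above can be sketched as a small simulation. This is a model of the description in this article, not of the regex engine's internals, and it rests on several simplifying assumptions: the caller states whether "SubExp1" is greedy instead of the code deriving it from the pattern, "SubExp1" is assumed to use capture group 1 only, and the backreference "\1" in "SubExp2" is handled by substituting the captured text textually. All names (LookbehindModelSketch, ChildMatch, Simulate) are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindModelSketch
{
    // Child matching: try "^SubExp1$" on the substring between y and x,
    // for y = x, x-1, ..., 0. Non-greedy: stop at the first y that succeeds.
    // Greedy: keep going left and keep the longest success.
    // Returns the text captured by group 1, or null if no y works.
    public static string ChildMatch(string source, int x, string subExp1, bool greedy)
    {
        var anchored = new Regex(@"\A(?:" + subExp1 + @")\z");
        string captured = null;
        for (int y = x; y >= 0; y--)
        {
            Match m = anchored.Match(source.Substring(y, x - y));
            if (!m.Success) continue;     // this round failed: another round is needed
            captured = m.Groups[1].Value; // record this successful round
            if (!greedy) break;           // non-greedy: the first success is final
        }
        return captured;
    }

    // Main matching: for each position x, run the child matching,
    // then let SubExp2 try to match starting at x.
    // Returns the matched text, or null when no position works.
    public static string Simulate(string source, string subExp1, string subExp2, bool greedy)
    {
        for (int x = 0; x <= source.Length; x++)
        {
            string group1 = ChildMatch(source, x, subExp1, greedy);
            if (group1 == null) continue; // the lookbehind failed at position x

            // Simplification: the capture is fixed now, so "\1" becomes literal text.
            string tail = subExp2.Replace(@"\1", Regex.Escape(group1));
            Match m = Regex.Match(source.Substring(x), @"\A(?:" + tail + ")");
            if (m.Success) return m.Value;
        }
        return null;
    }

    static void Main()
    {
        Console.WriteLine(Simulate("878", @"(\d)\d*", @"\1", true) ?? "(no match)");    // prints 8
        Console.WriteLine(Simulate("878", @"(\d)\d*?", @"\1", false) ?? "(no match)");  // prints (no match)
        Console.WriteLine(Simulate("878", @"(\d)", @"\d*\1", true) ?? "(no match)");    // prints 78
        Console.WriteLine(Simulate("9878", @"(\d)\d*", @"\1", true) ?? "(no match)");   // prints (no match)
        Console.WriteLine(Simulate("9878", @"(\d)", @"\d*?\1", false) ?? "(no match)"); // prints 78
    }
}
```

For the two source strings and four regular expressions in this article, the simulation gives the same results as the output listed in section 1. It does not cover the mixed case where "SubExp1" contains both greedy and non-greedy parts.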
That more or less completes the discussion of the lookbehind matching principle. There are of course more complex cases, for example when "SubExp1" contains both greedy and non-greedy subexpressions, but however complex the expression, it follows the matching principle above. Once that principle is understood, lookbehind holds no more secrets.
