Regular expression matching principles: lookbehind in depth

Source: Internet
Author: User

Note: Some of the content here still needs further study and correction. I have been too busy with work recently to finish that, so if you have not studied the topic yourself, feel free to skip this article rather than be misled by my ideas. If you have studied the details, corrections are welcome.

1. The problem

A few days ago I came across the following problem on the CSDN forum:
var str = "8912341253789";
Remove the repeated digits from the string so that the result is 89123457.
First, it should be noted that this requirement is not a good fit for regular expressions; at the very least, a regular expression is not the best way to implement it.
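Since a regular expression is not the best fit for this requirement, a plain loop is one alternative. The following is only a minimal sketch; the names DedupSketch and RemoveRepeatedChars are made up for illustration and do not come from the forum thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class DedupSketch
{
    // Keeps the first occurrence of each character, in original order.
    public static string RemoveRepeatedChars(string input)
    {
        var seen = new HashSet<char>();
        var sb = new StringBuilder();
        foreach (char c in input)
        {
            // HashSet<char>.Add returns false when the character was seen before.
            if (seen.Add(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveRepeatedChars("8912341253789")); // 89123457
    }
}
```

This makes a single pass over the string, whereas the regex solution re-runs Replace until the string stops changing.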
That problem is not the focus of this article. Instead, this article discusses a question about regular expression matching principles that arose from a solution to it.
Here is the solution that was offered for the problem.

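string str = "8912341253789";
// Group 2 captures a digit; \2 finds its next occurrence, which is dropped by keeping only group 1.
Regex reg = new Regex(@"((\d)\d*?)\2");
// Repeat the replacement until the string no longer changes.
while (str != (str = reg.Replace(str, "$1"))) { }
richTextBox2.Text = str;
/* -------- Output --------
89123457
*/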

A friend then asked why the following regular expression does not work.
"(?<=(?<value>\d).*?)\k<value>"
Answering that requires a deeper look at lookbehind, including how a lookbehind is matched and in what order. Two earlier blog posts introduced lookaround, but not in enough depth. This article discusses lookbehind combined with backreferences in more complex scenarios.
First, let's simplify and abstract the problem. The regular expression above uses a named capture group and a backreference to it, which adds complexity, so it is rewritten with an ordinary capture group, and "." is replaced with "\d", as shown below.
"(?<=(\d)\d*?)\1"
The strings to be matched are abstracted into the following two typical strings.
Source string 1: 878
Source string 2: 9878
Similarly, the regular expression is extended into the following four forms.
Regular expression 1: (?<=(\d)\d*)\1
Regular expression 2: (?<=(\d)\d*?)\1
Regular expression 3: (?<=(\d))\d*\1
Regular expression 4: (?<=(\d))\d*?\1
Let's look at the matching results first:
string[] source = new string[] { "878", "9878" };
List<Regex> regs = new List<Regex>();
regs.Add(new Regex(@"(?<=(\d)\d*)\1"));
regs.Add(new Regex(@"(?<=(\d)\d*?)\1"));
regs.Add(new Regex(@"(?<=(\d))\d*\1"));
regs.Add(new Regex(@"(?<=(\d))\d*?\1"));
// Try every regular expression against every source string and print the first match.
foreach (string s in source)
{
    foreach (Regex r in regs)
    {
        richTextBox2.Text += "Source string: " + s.PadRight(8, ' ');
        richTextBox2.Text += "Regular expression: " + r.ToString().PadRight(18, ' ');
        richTextBox2.Text += "Match result: " + r.Match(s).Value + "\n------------------------\n";
    }
    richTextBox2.Text += "------------------------\n";
}

/* -------- Output --------
Source string: 878     Regular expression: (?<=(\d)\d*)\1    Match result: 8
------------------------
Source string: 878     Regular expression: (?<=(\d)\d*?)\1   Match result:
------------------------
Source string: 878     Regular expression: (?<=(\d))\d*\1    Match result: 78
------------------------
Source string: 878     Regular expression: (?<=(\d))\d*?\1   Match result: 78
------------------------
------------------------
Source string: 9878    Regular expression: (?<=(\d)\d*)\1    Match result:
------------------------
Source string: 9878    Regular expression: (?<=(\d)\d*?)\1   Match result:
------------------------
Source string: 9878    Regular expression: (?<=(\d))\d*\1    Match result: 78
------------------------
Source string: 9878    Regular expression: (?<=(\d))\d*?\1   Match result: 78
------------------------
------------------------
*/
These results may surprise many people. I was also confused when I first ran into this problem; only after two days did a chance insight reveal the key points, which are discussed below.
Before that, two remarks:
1. The discussion below has little to do with the original de-duplication problem. That problem only served to introduce the topic; it is not itself within the scope of the discussion, and this article is mostly theoretical.
2. This article is aimed at readers with some regular expression background. If you are confused by the matching results and the matching process of the regular expressions above, the following sections should help; if you are not yet clear about the meaning of the metacharacters and syntax used in them, please start with the basics first.
2. In-depth study of lookbehind matching principles

Regular expression 1: (?<=(\d)\d*)\1
Regular expression 2: (?<=(\d)\d*?)\1
Regular expression 3: (?<=(\d))\d*\1
Regular expression 4: (?<=(\d))\d*?\1

The regular expressions above can be abstracted as "(?<=SubExp1)SubExp2". For the analysis of lookbehind, they fall into three categories according to the characteristics of SubExp1:

1. The subexpression "SubExp1" in the lookbehind has a fixed length. Regular expressions 3 and 4 belong to this category. This category also covers a subexpression that uses the "?" quantifier, but only that quantifier.
2. The subexpression "SubExp1" in the lookbehind has no fixed length and contains lazy quantifiers such as "*?", "+?", "{m,}?", that is, non-greedy mode. Regular expression 2 belongs to this category.
3. The subexpression "SubExp1" in the lookbehind has no fixed length and contains greedy quantifiers such as "*", "+", "{m,}", that is, greedy mode. Regular expression 1 belongs to this category.

The following is an analysis of the matching process for these three types of regular expressions.

2.1 Matching process for fixed-length subexpressions
2.1.1 Matching process of source string 1 + regular expression 3

Source string 1: 878
Regular expression 3: (?<=(\d))\d*\1
First, a match is attempted at position 0. "(?<=(\d))" gets control; its length is fixed at exactly one digit, so the engine looks one character to the left of position 0, finds nothing, and "(?<=(\d))" fails, so the first round of matching fails.
The regex engine's transmission then advances and a match is attempted at position 1. Control is handed to "(?<=(\d))", which steps one character to the left and hands control to "(\d)" and then to "\d". "\d" matches to the right and matches "8", so "(?<=(\d))" succeeds: the match position is position 1 and capture group 1 holds "8". Control is handed to "\d*". Because "\d*" is greedy, it first matches the "7" and "8" after position 1, recording backtracking states, and then hands control to "\1". Since capture group 1 holds "8", "\1" must match "8" to succeed, but the end of the string has been reached, so it fails. "\d*" backtracks and gives up the last character "8"; control passes to "\1" again, which now matches the final "8", and the whole expression matches. Because "(?<=(\d))" only matches a position and consumes no characters, the result of the whole expression is "78", where "\d*" matched "7" and "\1" matched "8".
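The outcome of this walkthrough can be checked by inspecting the Match object. This is a small sketch rather than part of the original demo; the class name FixedLengthLookbehindDemo and the helper Describe are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class FixedLengthLookbehindDemo
{
    // Returns "index|value|group1" for the first match, or "no match".
    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success
            ? m.Index + "|" + m.Value + "|" + m.Groups[1].Value
            : "no match";
    }

    static void Main()
    {
        // The lookbehind consumes nothing, so the match starts at the
        // position where it succeeded; group 1 holds the digit before it.
        Console.WriteLine(Describe(@"(?<=(\d))\d*\1", "878"));  // 1|78|8
        Console.WriteLine(Describe(@"(?<=(\d))\d*\1", "9878")); // 2|78|8
    }
}
```

The reported index is the position where the lookbehind succeeded, and group 1 is the digit immediately to its left, which is not part of the match value.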
2.1.2 Matching process of source string 2 + regular expression 3

Source string 2: 9878
Regular expression 3: (?<=(\d))\d*\1
The matching process of this combination is similar to that in section 2.1.1, except that there is one more round of matching attempts.
2.1.3 Matching process of source string 1 + regular expression 4
Source string 1: 878
Regular expression 4: (?<=(\d))\d*?\1
First, a match is attempted at position 0. "(?<=(\d))" gets control; its length is fixed at exactly one digit, so the engine looks one character to the left of position 0, finds nothing, and "(?<=(\d))" fails, so the first round of matching fails.
The regex engine's transmission then advances and a match is attempted at position 1. Control is handed to "(?<=(\d))", which steps one character to the left and hands control to "(\d)" and then to "\d". "\d" matches to the right and matches "8", so "(?<=(\d))" succeeds: the match position is position 1 and capture group 1 holds "8". Control is handed to "\d*?". Because "\d*?" is non-greedy, it first tries to match nothing, records a backtracking state, and hands control to "\1". Since capture group 1 holds "8", "\1" must match "8" to succeed, but the character after position 1 is "7", so it fails. "\d*?" backtracks and matches the "7" after position 1, then hands control to "\1" again, which matches the final "8", and the whole expression matches. Because "(?<=(\d))" only matches a position and consumes no characters, the result of the whole expression is "78", where "\d*?" matched "7" and "\1" matched the final "8".
This is essentially the same as the matching process in section 2.1.1, except that "\d*" and "\d*?" differ in how they match and backtrack.
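With the two source strings used here, regular expressions 3 and 4 return the same text, so the difference between "\d*" and "\d*?" is not visible in the output. A longer input makes it visible. The input "87878" below is a hypothetical string chosen for illustration, and the class and method names are made up as well.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazyDemo
{
    // Returns the text of the first match (empty string if there is none).
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        string input = "87878"; // hypothetical input, not from the article

        // Greedy: \d* takes everything, then backs off until \1 can match the last "8".
        Console.WriteLine(FirstMatch(@"(?<=(\d))\d*\1", input));  // 7878

        // Lazy: \d*? starts empty and grows until \1 can match the nearest "8".
        Console.WriteLine(FirstMatch(@"(?<=(\d))\d*?\1", input)); // 78
    }
}
```

In both cases the lookbehind is outside the part that backtracks, so backtracking happens only in "\d*" or "\d*?".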
2.1.4 Matching process of source string 2 + regular expression 4
Source string 2: 9878
Regular expression 4: (?<=(\d))\d*?\1
The matching process of this combination is similar to that in section 2.1.3.
2.2 Matching process for non-greedy subexpressions
2.2.1 Matching process of source string 1 + regular expression 2
Source string 1: 878
Regular expression 2: (?<=(\d)\d*?)\1
First, a match is attempted at position 0. "(?<=(\d)\d*?)" gets control; its length is not fixed but is at least one character, so the engine looks one character to the left of position 0, finds nothing, and "(?<=(\d)\d*?)" fails, so the first round of matching fails.
The regex engine's transmission then advances and a match is attempted at position 1. Control is handed to "(?<=(\d)\d*?)", which steps one character to the left and hands control to "(\d)" and then to "\d". "\d" matches to the right and matches "8", then hands control to "\d*?". Because "\d*?" is non-greedy, it first tries to match nothing and records a backtracking state. At this point "(\d)\d*?" has matched, so "(?<=(\d)\d*?)" succeeds with position 1 as the match position. Because the subexpression "(\d)\d*?" is non-greedy, the lookbehind hands over control as soon as it has one successful match and discards all backtracking states. Since capture group 1 holds "8", "\1" must match "8" to succeed, but the character after position 1 is "7" and no backtracking states remain, so the whole expression fails at position 1.
The regex engine's transmission advances again and a match is attempted at position 2. Control is handed to "(?<=(\d)\d*?)", which steps one character to the left and hands control to "(\d)" and then to "\d". "\d" matches to the right and matches "7", then hands control to "\d*?". Because "\d*?" is non-greedy, it first tries to match nothing and records a backtracking state. At this point "(\d)\d*?" has matched, so "(?<=(\d)\d*?)" succeeds with position 2 as the match position. Because the subexpression "(\d)\d*?" is non-greedy, the lookbehind hands over control as soon as it has one successful match and discards all backtracking states. Since capture group 1 holds "7", "\1" must match "7" to succeed, but the character after position 2 is "8" and no backtracking states remain, so the whole expression fails at position 2.
The matching process at position 3 is similar; in the end "\1" has no character left to match, so the whole expression fails there as well.
At this point every position in the string has been tried without success, so the whole expression fails and no match is found.
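The effect of discarding the backtracking states can be seen by comparing regular expression 2 with the same subexpression written without a lookbehind. The comparison pattern "(\d)\d*?\1" and the names below are additions for illustration, not part of the original demo.

```csharp
using System;
using System.Text.RegularExpressions;

static class AtomicLookbehindDemo
{
    // Returns the text of the first match, or null when nothing matches.
    public static string FirstMatchOrNull(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // Inside the lookbehind, \d*? cannot be re-entered after \1 fails.
        Console.WriteLine(FirstMatchOrNull(@"(?<=(\d)\d*?)\1", "878") ?? "(no match)"); // (no match)

        // Outside a lookbehind, \d*? can backtrack and grow, so \1 finds the second "8".
        Console.WriteLine(FirstMatchOrNull(@"(\d)\d*?\1", "878") ?? "(no match)");       // 878
    }
}
```

The second pattern consumes the characters it matches, so it is not a drop-in replacement; it only shows that the failure of regular expression 2 comes from the lookbehind giving up its backtracking states.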
2.2.2 Matching process of source string 2 + regular expression 2
Source string 2: 9878
Regular expression 2: (?<=(\d)\d*?)\1
The matching process of this combination is similar to that in section 2.2.1.
2.3 Matching process for greedy subexpressions
2.3.1 Matching process of source string 1 + regular expression 1
Source string 1: 878
Regular expression 1: (?<=(\d)\d*)\1
First, a match is attempted at position 0. "(?<=(\d)\d*)" gets control; its length is not fixed but is at least one character, so the engine looks one character to the left of position 0, finds nothing, and "(?<=(\d)\d*)" fails, so the first round of matching fails.
The regex engine's transmission then advances and a match is attempted at position 1. Control is handed to "(?<=(\d)\d*)", which steps one character to the left and hands control to "(\d)" and then to "\d". "\d" matches to the right and matches "8", then hands control to "\d*". Because "\d*" is greedy, it tries to match first and records a backtracking state, but no characters are available to it, so that attempt fails; it backtracks, matches nothing, and the backtracking state is discarded. At this point "(\d)\d*" has matched "8", so "(?<=(\d)\d*)" has a successful match at position 1. Because the subexpression is greedy, after "(\d)\d*" finds a successful match the lookbehind must check whether a longer match exists, and it hands over control only once the longest match is found. It looks further left, but there are no more characters, so "8" is the longest match. Control is now handed over and all backtracking states are discarded. Since capture group 1 holds "8", "\1" must match "8" to succeed, but the character after position 1 is "7" and no backtracking states remain, so the whole expression fails at position 1.
The regex engine's transmission advances again and a match is attempted at position 2. Control is handed to "(?<=(\d)\d*)", which steps one character to the left and hands control to "(\d)" and then to "\d". "\d" matches to the right and matches "7", then hands control to "\d*". Because "\d*" is greedy, it tries to match first and records a backtracking state, but no characters are available to it, so that attempt fails; it backtracks, matches nothing, and the backtracking state is discarded. At this point "(\d)\d*" has matched "7", so "(?<=(\d)\d*)" has a successful match at position 2. Because the subexpression is greedy, the lookbehind must check whether a longer match exists before handing over control. It looks further left and tries to match to the right starting from position 0: "\d" gets control and matches the "8" at position 0, then hands control to "\d*", which is greedy, so it tries to match first, records a backtracking state, and matches the "7" at position 1. "(\d)\d*" has thus found another successful match, "87", with capture group 1 holding "8". It looks further left again, but there are no more characters, so "87" is the longest match. Control is now handed over and all backtracking states are discarded. Since capture group 1 holds "8", "\1" matches the "8" after position 2, and the whole expression matches.
The demo uses Match, which returns only the first match. If Matches were used, the regular expression would go on to try the remaining positions. The same reasoning applies to them: at position 3 the match must fail because "\1" has no character left to match.
This completes the matching of this combination: there is one successful match, the matched text is "8", and the match starts at position 2, that is, the matched text is the second "8".
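The remark about Matches can be checked directly. The sketch below lists every match of regular expression 1 in "878" together with its start index and the content of capture group 1; the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GreedyLookbehindMatchesDemo
{
    // Returns one "index:value:group1" entry per match.
    public static List<string> DescribeAll(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Index + ":" + m.Value + ":" + m.Groups[1].Value);
        }
        return result;
    }

    static void Main()
    {
        List<string> all = DescribeAll(@"(?<=(\d)\d*)\1", "878");
        Console.WriteLine(all.Count);             // 1
        Console.WriteLine(string.Join(", ", all)); // 2:8:8
    }
}
```

Only the second "8" is reported, and capture group 1 holds the first "8", which agrees with the walkthrough for position 2.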
2.3.2 Matching process of source string 2 + regular expression 1
Source string 2: 9878
Regular expression 1: (?<=(\d)\d*)\1
First, a match is attempted at position 0. "(?<=(\d)\d*)" gets control; its length is not fixed but is at least one character, so the engine looks one character to the left of position 0, finds nothing, and "(?<=(\d)\d*)" fails, so the first round of matching fails.
The regex engine's transmission then advances and a match is attempted at position 1. This round is similar to the round at position 1 in section 2.3.1, except that "(\d)\d*" matches "9" and capture group 1 holds "9". "\1" therefore fails to match, and the whole expression fails at position 1.
The regex engine's transmission advances and a match is attempted at position 2. This round is similar to the round at position 2 in section 2.3.1. First, "(\d)\d*" finds a successful match "8", with capture group 1 also holding "8". It then tries further to the left and finds another successful match "98", with capture group 1 now holding "9". There are no more characters to the left, so "98" is the longest match and "(?<=(\d)\d*)" succeeds at position 2. Because capture group 1 now holds "9", "\1" fails to match at position 2, and the whole expression fails at position 2.
The regex engine's transmission advances and a match is attempted at position 3. This round is similar to the previous round at position 2. First, "(\d)\d*" finds a successful match "7"; it continues to the left and finds another successful match "87"; it tries further left and finds another successful match "987", which is the longest match. Control is handed over and all backtracking states are discarded. Capture group 1 now holds "9", so "\1" fails to match at position 3, and the whole expression fails at position 3.
At position 4, the match must fail because "\1" has no character left to match.
At this point matching has been attempted at every position in the source string; the whole expression fails and no match is found.

2.4 Summary

The analysis above may seem complicated, but it comes down to the following points.
1. When the subexpression in a lookbehind has a fixed length, it simply either matches or fails; there is nothing more to say.
2. When the subexpression in a lookbehind is non-greedy, the lookbehind hands over control and discards all backtracking states as soon as one successful match is found.
3. When the subexpression in a lookbehind is greedy, the lookbehind hands over control and discards all backtracking states only after the longest successful match has been found.
That is, for the regular expression "(?<=SubExp1)SubExp2", once "(?<=SubExp1)" hands over control, the matched position is fixed, the content matched by "SubExp1" is fixed, and no backtracking states remain.
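These points can be observed by running the lookbehind on its own and recording what capture group 1 holds at each position where it succeeds. This is a sketch for illustration; using the lookbehind without "SubExp2" and the helper name CapturedAtEachPosition are assumptions that are not part of the original demo.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class LookbehindCaptureDemo
{
    // Concatenates the content of group 1 for every position
    // at which the (zero-width) lookbehind succeeds.
    public static string CapturedAtEachPosition(string lookbehind, string input)
    {
        var sb = new StringBuilder();
        foreach (Match m in Regex.Matches(input, lookbehind))
        {
            sb.Append(m.Groups[1].Value);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Greedy: the longest match always starts at the first digit.
        Console.WriteLine(CapturedAtEachPosition(@"(?<=(\d)\d*)", "9878"));  // 9999

        // Non-greedy: the first success is the single digit just before the position.
        Console.WriteLine(CapturedAtEachPosition(@"(?<=(\d)\d*?)", "9878")); // 9878
    }
}
```

With the greedy subexpression, group 1 is "9" at every position, which is why "\1" can never match in "9878"; with the non-greedy one, group 1 is always the preceding digit, so "\1" would need the same digit twice in a row.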
3. Overview of lookbehind matching principles
Let's summarize the matching process of the regular expression "(?<=SubExp1)SubExp2". Figure 3-1 illustrates the matching principle of lookbehind.

Figure 3-1 Lookbehind matching principle

The matching process of the regular expression "(?<=SubExp1)SubExp2" can be divided into two flows: the main matching process and the sub-matching process. The main matching process is shown in Figure 3-2.

Figure 3-2 Main matching flowchart

Main matching process:

1. Matching is attempted from position 0 to the right. Before the minimum length required by "(?<=SubExp1)" is reached, the match must fail, until a position x is found that satisfies the minimum length requirement of "(?<=SubExp1)";
2. From position x, search to the left for a position y that satisfies the minimum length requirement of "SubExp1";
3. "SubExp1" attempts to match to the right starting from position y; at this point an independent sub-matching process is entered;
4. If the sub-match of "SubExp1" at position y requires a further round of sub-matching, search one position further left for y', that is, y-1, and enter the independent sub-matching process again. This loop continues until no further round is required. If the sub-matching succeeds, go to step 5; if it ultimately fails, report that the whole expression fails to match;
5. After "(?<=SubExp1)" succeeds, control is handed to the subexpression "SubExp2", and matching continues until the whole expression succeeds or fails, at which point the success or failure of the whole expression at position x is reported;
6. If necessary, move to the next position x' and start a new round of matching.
The sub-matching process is shown in Figure 3-3.

Figure 3-3 Sub-matching flowchart

Sub-matching process:

1. On entering the sub-match, the source string is fixed as the substring between position y and position x, and the regular expression becomes "^SubExp1$". In this round, a successful match must start at y and end at x;
2. If the subexpression has a fixed length, the match simply succeeds or fails, and the result is returned with no need for a further round;
3. If the subexpression has no fixed length, non-greedy mode and greedy mode must be distinguished;
4. In non-greedy mode, if the match fails, failure is reported and a further round is required; if the match succeeds, all backtracking states are discarded, success is reported, and no further round is needed;
5. In greedy mode, if the match fails, failure is reported and a further round is required; if the match succeeds, all backtracking states are discarded, success is reported, and the matched content is recorded, but further rounds are still required until the longest match is obtained.
Within a given round of matching, position x is fixed. Until the subexpression "SubExp1" in the lookbehind reports its final result, the start position of its match is unpredictable and can be determined only after one or more rounds of sub-matching, but the end position of its match is always position x.
Of course, this holds only for a given round of matching. When that round fails, the regex engine's transmission advances, so that x = x + 1, and the next round of matching is attempted, until the whole expression reports success or failure.
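The two flows above can be sketched in code for the special case where "SubExp2" is the backreference "\1". This is a simulation of the process as described in this article, written for illustration only; it is not how the .NET engine is implemented. The names SubMatch and FindPosition and the explicit greedy flag are assumptions introduced here.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindModelSketch
{
    // Sub-matching process: run "^SubExp1$" on the substring between y and x
    // for y = x, x-1, ..., 0. Non-greedy mode stops at the first success;
    // greedy mode keeps going and retains the longest success.
    // Returns the content of capture group 1, or null if every round fails.
    public static string SubMatch(string input, int x, string subExp1, bool greedy)
    {
        Regex anchored = new Regex("^(?:" + subExp1 + ")$");
        string captured = null;
        for (int y = x; y >= 0; y--)
        {
            Match m = anchored.Match(input.Substring(y, x - y));
            if (!m.Success) continue;      // failed round: look one position further left
            captured = m.Groups[1].Value;  // record this successful round
            if (!greedy) break;            // non-greedy: the first success is final
        }
        return captured;
    }

    // Main matching process for "(?<=SubExp1)\1": returns the first position x
    // where the lookbehind succeeds and the captured text follows, or -1.
    public static int FindPosition(string input, string subExp1, bool greedy)
    {
        for (int x = 0; x <= input.Length; x++)
        {
            string captured = SubMatch(input, x, subExp1, greedy);
            if (captured == null) continue; // lookbehind failed at x
            if (input.Substring(x).StartsWith(captured, StringComparison.Ordinal))
            {
                return x;                   // "\1" matched at x
            }
        }
        return -1;
    }

    static void Main()
    {
        Console.WriteLine(FindPosition("878", @"(\d)\d*", true));    // 2
        Console.WriteLine(FindPosition("878", @"(\d)\d*?", false));  // -1
        Console.WriteLine(FindPosition("9878", @"(\d)\d*", true));   // -1
        Console.WriteLine(FindPosition("9878", @"(\d)\d*?", false)); // -1
    }
}
```

For the two source strings, the simulation gives the same positions as the real results for regular expressions 1 and 2 shown earlier, which is consistent with the model but does not prove that the engine works this way internally.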
At this point the matching principle of lookbehind has essentially been covered. There are of course more complex cases, for example "SubExp1" containing both greedy and non-greedy subexpressions, but however complex the case, it still follows the matching principle above. Once you understand that principle, lookbehind holds no more secrets.
