Regex Basics: Greedy and Non-Greedy Modes

Source: Internet
Author: User

1 Overview

Greedy and non-greedy modes affect the matching behavior of a subexpression modified by a quantifier. In greedy mode the subexpression matches as much as possible, provided the whole expression can still match; in non-greedy mode it matches as little as possible, under the same proviso. Non-greedy mode is supported only by some NFA engines.

Quantifiers that are greedy, also called match-priority quantifiers, include:

'{m,n}', '{m,}', '?', '*' and '+'.

In some languages that use an NFA engine, appending '?' to a match-priority quantifier turns it into a non-greedy quantifier, also called an ignore-priority quantifier. These include:

'{m,n}?', '{m,}?', '??', '*?' and '+?'.

In terms of regex syntax, a subexpression modified by a match-priority quantifier uses greedy mode, as in '(Expression)+'; a subexpression modified by an ignore-priority quantifier uses non-greedy mode, as in '(Expression)+?'.
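As a rough illustration of the two quantifier families, the Python sketch below applies each greedy quantifier and its non-greedy counterpart to the same text. Python's re module is used here only as one example of a backtracking engine that supports both forms; the sample text "aaaa" and the variable names are assumptions made for the sketch.

```python
import re

text = "aaaa"

# Each greedy quantifier paired with its non-greedy counterpart.
pairs = [
    ("a{2,3}", "a{2,3}?"),
    ("a{2,}", "a{2,}?"),
    ("a?", "a??"),
    ("a*", "a*?"),
    ("a+", "a+?"),
]

results = {}
for greedy, lazy in pairs:
    # re.match anchors the attempt at position 0 of the text.
    results[greedy] = re.match(greedy, text).group()
    results[lazy] = re.match(lazy, text).group()
    print(greedy, repr(results[greedy]), "|", lazy, repr(results[lazy]))
```

With nothing after the quantifier to force a longer match, each non-greedy form stops at its minimum count, which is why 'a??' and 'a*?' match the empty string.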

The name "greedy mode" is fairly consistent across documents, but non-greedy mode is variously called lazy mode or reluctant mode. The name does not matter much as long as the principle and usage are understood. I am used to the terms greedy and non-greedy, so this article uses them.

2 Matching principles of greedy and non-greedy modes

Greedy and non-greedy modes can be understood from two angles: application and matching principle. To really master them, however, you need to understand them from the matching principle.

First, from the application angle, let's answer the question "what are greedy and non-greedy modes?"

2.1 Greedy and non-greedy modes from the application angle

2.1.1 What greedy and non-greedy modes are

Consider an example first.

Example:

Source string: aa<div>test1</div>bb<div>test2</div>cc

Regular expression one: <div>.*</div>

Match result one: <div>test1</div>bb<div>test2</div>

Regular expression two: <div>.*?</div>

Match result two: <div>test1</div> (this refers to a single match, so <div>test2</div> is not included)
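The two results can be reproduced with a short Python sketch. Python's re module is only one possible engine; the variable names are chosen for illustration.

```python
import re

source = "aa<div>test1</div>bb<div>test2</div>cc"

# Greedy: '.*' takes as much as it can while '</div>' can still match.
greedy = re.search(r"<div>.*</div>", source).group()

# Non-greedy: '.*?' stops at the first place where '</div>' can match.
lazy = re.search(r"<div>.*?</div>", source).group()

# Repeating the non-greedy search yields one result per pair of tags.
all_lazy = re.findall(r"<div>.*?</div>", source)

print(greedy)    # <div>test1</div>bb<div>test2</div>
print(lazy)      # <div>test1</div>
print(all_lazy)  # ['<div>test1</div>', '<div>test2</div>']
```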

Based on the example above, let's analyze from the matching behavior what greedy and non-greedy modes are.

Regular expression one uses greedy mode. The whole expression could already match successfully on reaching the first "</div>", but because the mode is greedy, it still tries to match further to the right to see whether a longer string can be matched. After the second "</div>" is reached, no substring further to the right can be matched, so the match ends with the result "<div>test1</div>bb<div>test2</div>". The actual matching process does not work this way, of course; the matching principle is described in detail later.

From the application angle alone, greedy mode can be thought of as matching as much as possible, on the premise that the whole expression matches successfully. This is the "greed": put plainly, take as much as you can of whatever you want, and stop only when nothing wanted is left.

Regular expression two uses non-greedy mode. The whole expression matches successfully on reaching the first "</div>", and because the mode is non-greedy, the match ends there without trying further to the right. The match result is "<div>test1</div>".

From the application angle alone, non-greedy mode can be thought of as matching as little as possible, on the premise that the whole expression matches successfully. This is the "non-greed": put plainly, take one as soon as you find something you want, and do not care whether there is more to take.

2.1.2 About the premise

The analysis above keeps mentioning one premise: "the whole expression matches successfully". Why emphasize it? Consider the following example.

Regular expression three: <div>.*</div>bb

Match result three: <div>test1</div>bb

The quantifier modifying '.' is still the match-priority quantifier '*', so this is greedy mode. The leading '<div>.*</div>' can still match "<div>test1</div>bb<div>test2</div>", but the 'bb' that follows then fails to match, so '<div>.*</div>' must give up the matched "bb<div>test2</div>" to let the whole expression succeed. The whole expression matches "<div>test1</div>bb", of which '<div>.*</div>' matches "<div>test1</div>". So greedy mode affects the matching behavior of the subexpression only on the premise that the whole expression matches successfully. If the whole expression fails to match, greedy mode affects only the matching process, and there is no match result for it to affect.

Non-greedy mode has the same issue. Consider the following example.

Regular expression four: <div>.*?</div>cc

Match result four: <div>test1</div>bb<div>test2</div>cc
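Expressions three and four can be checked the same way. In the Python sketch below, the capturing parentheses around the quantified part are an addition made only so that the text matched by that part can be inspected; they are not part of the expressions above.

```python
import re

source = "aa<div>test1</div>bb<div>test2</div>cc"

# Expression three: greedy, but the trailing 'bb' forces '.*' to give text back.
three = re.search(r"(<div>.*</div>)bb", source)

# Expression four: non-greedy, but the trailing 'cc' forces '.*?' to keep going.
four = re.search(r"(<div>.*?</div>)cc", source)

print(three.group())   # <div>test1</div>bb
print(three.group(1))  # <div>test1</div>
print(four.group())    # <div>test1</div>bb<div>test2</div>cc
print(four.group(1))   # <div>test1</div>bb<div>test2</div>
```

In both cases the quantifier's preference is overridden by the requirement that the whole expression match.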

This is non-greedy mode. The leading '<div>.*?</div>' still matches "<div>test1</div>" first, but at that point 'cc' cannot match, so '<div>.*?</div>' is required to keep trying to the right until it has matched "<div>test1</div>bb<div>test2</div>". Only then can the following 'cc' match and the whole expression succeed. The matched content is "<div>test1</div>bb<div>test2</div>cc", of which '<div>.*?</div>' matches "<div>test1</div>bb<div>test2</div>". So non-greedy mode, too, affects the matching behavior of the subexpression only on the premise that the whole expression matches successfully; if the whole expression fails to match, non-greedy mode cannot affect what the subexpression matches.

2.1.3 Greedy or non-greedy: the choice in application

The analysis from the application angle gives a basic understanding of the characteristics of greedy and non-greedy modes. In practice, whether to choose greedy or non-greedy mode should be decided by the requirement.

For some simple requirements, such as getting the div tag from the source string "aa<div>test1</div>bb", greedy and non-greedy modes both give the desired result, and which one is used matters little.

In the example in 2.1.1, however, a practical application usually needs only one pair of div tags at a time, that is, the content matched in non-greedy mode; the content matched in greedy mode is usually not what is wanted.

Then why does greedy mode exist at all? It is hard to give a satisfying answer from the application angle; greedy and non-greedy modes need to be analyzed from the angle of the matching principle.

2.2 Greedy and non-greedy modes from the angle of the matching principle

To really understand what greedy mode is, what non-greedy mode is, when each should be used, and how efficient each is, the application angle is not enough; you need to fully understand how matching works in both modes.

2.2.1 Starting from the basic matching principle

For the basic matching principle of NFA engines, see: Regex Basics: NFA Engine Matching Principles.

This article focuses on the matching principles involved in greedy and non-greedy modes. First, a simple matching process in greedy mode.

Source string: "Regex"

Regular expression: ".*"

Figure 2-1

Note: To show the matching process clearly, the characters in the figure are spaced widely apart; the actual source string is "Regex", including the double quotes. The same applies below.

Here is the matching process. First, the opening '"' of the expression takes control and matches the '"' at position 0. The match succeeds and control passes to '.*'.

After '.*' takes control, because '*' is a match-priority quantifier, it prefers to attempt a match whenever matching is optional. It tries "R" at position 1 and succeeds, continues to the right with "e" at position 2 and succeeds, and keeps going until it has matched the closing '"' of the string. At that point the end of the string has been reached, so '.*' stops matching and passes control to the final '"' of the expression.

After the final '"' takes control, it fails to match because the position is already the end of the string. The engine looks back for a state to backtrack to and returns control to '.*', which gives up one character, namely the '"' at the end of the string. Control passes again to the final '"' of the expression, which matches the '"' at the end of the string. The match succeeds.

At this point the whole expression has matched successfully. '.*' matched Regex (without the quotes), and one backtrack was performed during the match.

Next, a simple matching process in non-greedy mode.

Source string: "Regex"

Regular expression: ".*?"

Figure 2-2

First, the opening '"' of the expression takes control and matches the '"' at position 0. The match succeeds and control passes to '.*?'.

After '.*?' takes control, because '*?' is an ignore-priority quantifier, it prefers not to match whenever matching is optional. Since '*' is equivalent to '{0,}', an ignore-priority '*?' may match nothing at all. So at position 1 it first skips the match, that is, matches nothing, and passes control to the final '"' of the expression.

After the final '"' takes control, it attempts to match at position 1. It tries "R" and fails, so the engine looks back for a state to backtrack to and returns control to '.*?', which consumes one character, the "R" at position 1, and then passes control to the final '"' again.

The final '"' now attempts to match at position 2. It tries "e" and fails, and the engine backtracks again. This process repeats until '.*?' has matched "x", after which control passes to the final '"' of the expression.

The final '"' attempts to match at position 6, matches the '"' at the end of the string, and succeeds.

At this point the whole expression has matched successfully. '.*?' matched Regex (without the quotes), and five backtracks were performed during the match.

2.2.2 Greedy or non-greedy: the choice for matching efficiency

The analysis of the matching principle shows that, when the match succeeds, greedy mode backtracks less. Each backtrack requires handing control back, giving up matched content or consuming unmatched content, and trying again, which greatly reduces matching efficiency. So in this case greedy mode has the advantage over non-greedy mode in matching efficiency.

But the example in 2.2.1 is only a simple case. A reader might wonder at this point: is greedy mode always more efficient than non-greedy mode? The answer is no.

Example:

Requirement: get the substring between two double quotes; the substring itself must not contain a double quote.

Regular expression one: ".*"

Regular expression two: ".*?"

Situation one: when greedy mode matches a lot of unwanted content, it may backtrack more than non-greedy mode. For example, the source string is: The word "Regex" means regular expression.

Situation two: greedy mode cannot meet the requirement. For example, the source string is: The phrase "regular expression" is called "Regex" for short.

In situation one, with the greedy regular expression one, '.*' matches all the way to the end of the string and passes control to the final '"', which fails to match, and backtracking begins. The extra content matched, " means regular expression.", is far longer than the content that actually needs to be matched, so regular expression one is less efficient here than the non-greedy regular expression two.

In situation two, regular expression one matches: "regular expression" is called "Regex". It does not even meet the requirement, so there is no point in comparing efficiency.
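The results in the two situations can be confirmed with a short Python sketch; the variable names are chosen for illustration. It shows only what is matched, not how much backtracking each expression performs.

```python
import re

case_one = 'The word "Regex" means regular expression.'
case_two = 'The phrase "regular expression" is called "Regex" for short.'

greedy = re.compile(r'".*"')   # regular expression one
lazy = re.compile(r'".*?"')    # regular expression two

# Situation one: both give the same result; only the amount of work differs.
print(greedy.findall(case_one))  # ['"Regex"']
print(lazy.findall(case_one))    # ['"Regex"']

# Situation two: greedy mode runs from the first quote to the last one.
print(greedy.findall(case_two))  # ['"regular expression" is called "Regex"']
print(lazy.findall(case_two))    # ['"regular expression"', '"Regex"']
```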

Both situations are common. Does that mean non-greedy mode is the only way to meet the requirement and keep efficiency in mind? Of course not. Depending on the actual situation, changing the subexpression that the match-priority quantifier modifies can both meet the requirement and improve matching efficiency.

Source string: "Regex"

Regular expression three: "[^"]*"

Here is the matching process for regular expression three.

Figure 2-3

First, the opening '"' of the expression takes control and matches the '"' at position 0. The match succeeds and control passes to '[^"]*'.

After '[^"]*' takes control, because '*' is a match-priority quantifier, it prefers to attempt a match whenever matching is optional. It tries "R" at position 1 and succeeds, continues to the right with "e" at position 2 and succeeds, and keeps going until it has matched "x". It then tries the closing '"' of the string, which fails because '[^"]' excludes that character, and passes control to the final '"' of the expression.

After the final '"' takes control, it matches the '"' at the end of the string, and the match succeeds.

At this point the whole expression has matched successfully. '[^"]*' matched Regex (without the quotes), and no backtracking was performed during the match.
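The backtrack counts from the three walkthroughs (one for '.*', five for '.*?', none for '[^"]*') can be reproduced with a toy model. The Python sketch below is not a regex engine: it hard-codes the shape of the expression (an opening quote, a quantified item, a closing quote), and the function name count_backtracks and its parameters are hypothetical. It counts one backtrack each time the quantifier has to revise its previous choice.

```python
def count_backtracks(text, any_char=True, lazy=False):
    """Toy model of matching  " X* "  (or " X*? ") at position 0.

    X is '.' when any_char is True, otherwise the negated class [^"].
    Returns (matched_text, backtracks); matched_text is None on failure.
    """
    if not text.startswith('"'):
        return None, 0
    first = 1
    # Furthest position the quantified item could reach on its own.
    limit = first
    while limit < len(text) and (any_char or text[limit] != '"'):
        limit += 1
    # End positions for the quantified part, in the order the engine tries them:
    # shortest first for non-greedy, longest first for greedy.
    ends = range(first, limit + 1) if lazy else range(limit, first - 1, -1)
    backtracks = 0
    for attempt, end in enumerate(ends):
        if attempt > 0:
            backtracks += 1  # the previous choice failed, so a saved state is resumed
        if end < len(text) and text[end] == '"':  # the closing quote matches here
            return text[: end + 1], backtracks
    return None, backtracks


print(count_backtracks('"Regex"'))                  # ('"Regex"', 1)
print(count_backtracks('"Regex"', lazy=True))       # ('"Regex"', 5)
print(count_backtracks('"Regex"', any_char=False))  # ('"Regex"', 0)
```

A real engine does more bookkeeping than this, but the ordering of attempts is the part that the greedy and non-greedy walkthroughs differ on.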

The subexpression modified by the quantifier has been changed from the wide-ranging '.' to the negated character class '[^"]'. The quantifier is still greedy, and the change solves both the requirement and the efficiency problem. Moreover, since this matching process never backtracks, there is no need to record backtracking states, so an atomic group can be used to optimize the expression further.

Regular expression four: "(?>[^"]*)"

Atomic groups are not supported by every language. .NET supports them, and so does Java's java.util.regex; in Java the same effect can also be written more simply with a possessive quantifier: "[^"]*+"

3 Greedy or non-greedy mode: more on matching efficiency

Generally speaking, when the greedy and non-greedy quantifiers modify the same subexpression, as in '.*' and '.*?', their application scenarios usually differ, so their efficiency is generally not comparable.

When the quantified subexpression is changed to meet the requirement, for example replacing '.*' with '[^"]*', the modified subexpressions differ, so there is no direct comparison either. But where the same subexpression meets the requirement in both modes, as with '[^"]*' and '[^"]*?', greedy mode usually matches more efficiently.

It is also a fact that whatever non-greedy mode can achieve, greedy mode can achieve as well by optimizing the quantified subexpression, whereas some optimizations available in greedy mode may not be achievable in non-greedy mode.

Greedy mode has another advantage: when the match fails, greedy mode can report the failure faster, which improves matching efficiency. The following sections look at the matching efficiency of greedy and non-greedy modes more comprehensively.

3.1 Improving efficiency: the evolution process

With the matching principles of greedy and non-greedy modes understood, let's go through the evolution of regex efficiency improvements once more.

Requirement: get the substring between two double quotes; the substring itself must not contain a double quote.

Source string: The phrase "regular expression" is called "Regex" for short.

Regular expression one: ".*"

Regular expression one matches: "regular expression" is called "Regex". This does not meet the requirement.

So regular expression two is proposed: ".*?"

First, the opening '"' takes control and attempts to match starting from position 0, succeeding at position 11, and control passes to '.*?'. From there the matching process is the same as the non-greedy process in 2.2.1: '.*?' backtracks once for every character it ends up matching, both for "regular expression" and for "Regex".

How can the efficiency lost to backtracking be eliminated? Use a subexpression with a narrower range, in greedy mode. This gives regular expression three: "[^"]*"

First, the opening '"' takes control and attempts to match starting from position 0, succeeding at position 11, and control passes to '[^"]*'. From there the matching process is the same as the negated character class process in section 2.2.2. '[^"]*' matches the content between the quotes, and no backtracking is performed during the match.

3.2 Improving efficiency: reporting failure faster

The discussion above covers the evolution when the match succeeds. When a regular expression fails to match, reporting the failure as fast as possible also improves matching efficiency, and this is probably the point most easily overlooked when designing a regex. If the source string is very large or the regular expression is complex, how quickly failure can be reported directly affects matching efficiency.

The following sections construct regular expressions that fail to match and analyze the matching process.

In the analyses below, the source string is always: The phrase "regular expression" is called "Regex" for short.

3.2.1 Failure process in non-greedy mode

Figure 3-1

Construct a non-greedy regular expression that fails to match: ".*?"@

Because of the final '@', this regular expression is bound to fail. Here is the matching process.

First, the opening '"' takes control and attempts to match starting from position 0. It fails until it succeeds at the point marked A in the figure, and control passes to '.*?'.

After '.*?' takes control, it attempts to match from the position after A. Because the mode is non-greedy, it first skips the match, passes control to the final '"', and records a backtracking state. The final '"' attempts to match from the position after A and fails on the character "r", so the engine looks for a backtracking state and returns control to '.*?', which matches "r". This process repeats until '.*?' has matched the character "n" before B, the final '"' matches the '"' at B, and control passes to '@'. '@' tries to match the following space and fails, so the engine looks for a backtracking state and returns control to '.*?', which continues matching to the right. This matching process keeps repeating until '.*?' has matched to the end of the string and passes control to the final '"'. Because the position is already the end of the string, the match fails, the whole expression is reported as failing at position 11, and one round of matching attempts ends.

The transmission of the regex engine then bumps along, and the next round of attempts begins. The subsequent rounds are basically similar to the first; see Figure 3-1.

The matching process shows that when a non-greedy expression fails, almost every step involves backtracking, which has a very large impact on matching efficiency.

3.2.2 Failure process in greedy mode: a wide-ranging subexpression

Figure 3-2

PS: The analysis diagrams above are modeled on the diagrams in the relevant chapter of "Mastering Regular Expressions".

Construct a greedy regular expression that fails to match: ".*"@

The quantified subexpression is '.', which has a wide matching range. Because of the final '@', this regular expression is also bound to fail. Here is the matching process.

First, the opening '"' takes control and attempts to match starting from position 0. It fails until it succeeds at the point marked A in the figure, and control passes to '.*'.

After '.*' takes control, it attempts to match from the position after A. Because the mode is greedy, it prefers to match and continues all the way to the end of the string, then passes control to the final '"'. Since the position is already the end of the string, the final '"' fails, and the engine looks for a backtracking state and returns control to '.*', which gives up the last matched character, ".". This process repeats until '.*' has given up a '"', which the final '"' then matches, and control passes to '@'. '@' tries to match the space that follows, at D, and fails, so the engine looks for a backtracking state and returns control to '.*', which gives up more of the matched text. This matching process keeps repeating until '.*' has given up all of its matched text, back to I, and passes control to the final '"'. The final '"' fails, and because no backtracking states remain, the whole expression is reported as failing at position 11, and one round of matching attempts ends.

The transmission of the regex engine then bumps along, and the next round of attempts begins. The subsequent rounds are basically similar to the first; see Figure 3-2.

The matching process shows that the failure process of a greedy expression with a wide-ranging subexpression is, overall, no different from that of non-greedy mode. The final number of backtracks is basically the same as in non-greedy mode, and the impact on matching efficiency is still large.

3.2.3 Failure process in greedy mode: an improved subexpression

Figure 3-3

Construct a greedy regular expression that fails to match: "[^"]*"@

The quantified subexpression is changed to the negated character class '[^"]', which has a narrower matching range. Because of the final '@', this regular expression is also bound to fail. Here is the matching process.

First, the opening '"' takes control and attempts to match starting from position 0. It fails until it succeeds at the point marked A in the figure, and control passes to '[^"]*'.

After '[^"]*' takes control, it attempts to match from the position after A. Because the mode is greedy, it prefers to match and continues up to B, then passes control to the final '"'. The final '"' matches the next character, '"', and control passes to '@'. '@' tries to match the following space and fails, so the engine looks for a backtracking state and returns control to '[^"]*', which gives up matched text. This process repeats until '[^"]*' has given up all of its matched text, back to C, and passes control to the final '"'. The final '"' fails, and because no backtracking states remain, the whole expression is reported as failing at position 11, and one round of matching attempts ends.

The transmission of the regex engine then bumps along, and the next round of attempts begins. The subsequent rounds are basically similar to the first; see Figure 3-3.

The matching process shows that with the negated character class, the failure process of the greedy expression is shorter: fewer backtracks are needed in each round, which effectively improves matching efficiency.

3.2.4 Failure process in greedy mode: atomic grouping

The analysis in 3.2.3 shows that because '[^"]*' uses a negated character class, the characters matched between A and B in Figure 3-3 can never be '"'. The backtracking between B and C is therefore redundant; in other words, the backtracking states between the two need not be recorded at all. An atomic group in .NET, or a possessive quantifier in Java, achieves this.

Figure 3-4

First, the opening '"' takes control and attempts to match starting from position 0. It fails until it succeeds at the point marked A in the figure, and control passes to '(?>[^"]*)'.

After '(?>[^"]*)' takes control, it attempts to match from the position after A. Because the mode is greedy, it prefers to match and continues up to B, then passes control to the final '"'; no backtracking states are recorded along the way. The final '"' matches the next character, '"', and control passes to '@'. '@' tries to match the following space and fails. The engine looks for a backtracking state, finds none, and reports the whole expression as failing at position 11. One round of matching attempts ends.

The transmission of the regex engine then bumps along, and the next round of attempts begins. The subsequent rounds are basically similar to the first; see Figure 3-4.
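The four failure walkthroughs in section 3.2 can be compared with a toy model. The Python sketch below is not a regex engine and does not reproduce the figures: it hard-codes the shape of the failing expressions (opening quote, quantified item, closing quote, '@'), and the function name failing_backtracks and its parameters are hypothetical. It counts one backtrack each time the quantifier has to revise its previous choice, summed over every starting position the transmission tries.

```python
def failing_backtracks(text, any_char, lazy=False, atomic=False):
    """Toy model of trying  " X* " @  from every start position.

    X is '.' when any_char is True, otherwise the negated class [^"].
    Returns the total number of backtracks over all rounds.
    """
    total = 0
    for start, ch in enumerate(text):  # the transmission bumps along
        if ch != '"':
            continue  # the opening quote fails here, nothing to backtrack
        first = start + 1
        limit = first  # furthest position the quantified item can reach
        while limit < len(text) and (any_char or text[limit] != '"'):
            limit += 1
        if atomic:
            ends = [limit]  # no saved states to come back to
        elif lazy:
            ends = range(first, limit + 1)  # shortest first
        else:
            ends = range(limit, first - 1, -1)  # longest first
        for attempt, end in enumerate(ends):
            if attempt > 0:
                total += 1  # a saved state is resumed
            if text[end:end + 2] == '"@':
                return total  # the whole expression matched
    return total


source = 'The phrase "regular expression" is called "Regex" for short.'
counts = {
    '".*?"@': failing_backtracks(source, any_char=True, lazy=True),
    '".*"@': failing_backtracks(source, any_char=True),
    '"[^"]*"@': failing_backtracks(source, any_char=False),
    '"(?>[^"]*)"@': failing_backtracks(source, any_char=False, atomic=True),
}
for pattern, n in counts.items():
    print(pattern, n)
```

Under this simplified counting, the two '.' variants need the same number of backtracks before failure is reported, the negated character class needs fewer, and the atomic variant needs none.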

The matching process shows that with an atomic group, the failure process of the greedy expression involves no backtracking at all, which maximizes matching efficiency.

3.3 Converting non-greedy mode to greedy mode

With a wide-ranging subexpression, greedy mode may match more than non-greedy mode does. By optimizing the subexpression, however, greedy mode can be made to match exactly what non-greedy mode matches.

For example, a practical application might match the content of an img tag.

Example:

Requirement: get the image address in an img tag; the value after src= is always enclosed in double quotes.

SOURCE string:

Regular expression One:

In the match result, the content of capture group 1 is the image address. This example uses non-greedy mode in three places. According to the analysis in the previous sections, the latter two can be converted to greedy mode with negated character classes.

Regular Expression II: ]*>

Note: the character '>' may also appear inside an attribute between 'src=' and the closing '>' of the tag, but that is an extreme case and is not discussed here.

The latter two non-greedy subexpressions can be converted to greedy mode with negated character classes to improve matching efficiency. The non-greedy subexpression before 'src=' cannot use a negated character class, because what must be excluded is a sequence of characters rather than one or a few single characters. That does not mean nothing can be done: a lookahead achieves the same effect.

Regular expression three: ]*>

'(?!src=).' denotes a character such that, starting from it, the text to the right is not the character sequence "src=". '(?:(?!src=).)*' denotes zero or more such characters. This achieves the exclusion of a character sequence, with the same effect as a negated character class, except that a negated character class excludes one or more single characters while this construct excludes one or more ordered character sequences.
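A minimal Python sketch of the lookahead technique follows. The sample tag and all three patterns are assumptions made up for illustration, not expressions taken from this article; they only show a non-greedy form, a form using negated character classes, and a form that adds the '(?:(?!src=).)*' construct.

```python
import re

# Hypothetical sample input, made up for this sketch.
html = '<p>intro</p><img alt="logo" src="/images/logo.png" width="80">'

# Hypothetical patterns: non-greedy everywhere, then negated classes,
# then a lookahead that excludes the sequence "src=" before the attribute.
lazy = re.compile(r'<img\b.*?src="(.*?)".*?>')
negated = re.compile(r'<img\b.*?src="([^"]*)"[^>]*>')
lookahead = re.compile(r'<img\b(?:(?!src=).)*src="([^"]*)"[^>]*>')

for name, pattern in [("lazy", lazy), ("negated", negated), ("lookahead", lookahead)]:
    print(name, pattern.search(html).group(1))  # each prints /images/logo.png
```

On well-formed input like this, all three forms capture the same address; the difference lies in how much backtracking each needs, especially when no match exists.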

However, excluding a character sequence with a lookahead requires an extra check at every character, so whether it improves or reduces efficiency relative to non-greedy mode depends on the actual situation. For simple regular expressions or simple source strings, non-greedy mode is generally more efficient; for large source strings or complex regular expressions, greedy mode is generally more efficient.

For example, for the requirement above of getting the image address in an img tag, regular expression two is basically sufficient. Complex applications such as balancing groups need greedy mode combined with a lookahead.

Take a balancing group that matches nested div tags as an example:

Regex reg = new Regex(@"(?isx)      # Matching options: ignore case, '.' matches any character, ignore pattern whitespace
    <div[^>]*>                       # Start tag '<div ...>'
    (?>                              # Group construct, limits the scope of the quantifier '*'
        <div[^>]*> (?<Open>)         # Named capture group: on a start tag, push onto the stack, Open count plus 1
    |                                # Alternation
        </div> (?<-Open>)            # Balancing group: on an end tag, pop from the stack, Open count minus 1
    |                                # Alternation
        (?:(?!</?div\b).)*           # Any characters not followed by the start of a start or end tag
    )*                               # The alternatives above occur zero or more times
    (?(Open)(?!))                    # If 'Open' is still on the stack, the tags are unpaired, so match nothing
    </div>                           # End tag '</div>'
    ");
