Greedy and Non-Greedy Modes in Regular Expressions (Overview)


1 Overview
Greedy and non-greedy modes affect the matching behavior of subexpressions modified by quantifiers. In greedy mode, the subexpression matches as much as possible, provided that the entire expression can still match successfully; in non-greedy mode, it matches as little as possible under the same condition. Non-greedy mode is supported only by some NFA engines.

Greedy quantifiers, also called match-first quantifiers, include:

"{m,n}", "{m,}", "?", "*" and "+".

In some languages that use an NFA engine, appending "?" to a greedy quantifier turns it into a non-greedy quantifier, also called an ignore-first quantifier. These include:

"{m,n}?", "{m,}?", "??", "*?" and "+?".

From the perspective of the regular expression, a subexpression modified by a greedy quantifier uses greedy mode, as in "(Expression)+"; a subexpression modified by a non-greedy quantifier uses non-greedy mode, as in "(Expression)+?".

The name "greedy mode" is used fairly consistently across documents, but non-greedy mode is also called lazy mode or reluctant mode. The name does not matter; once you understand the principles and usage, you can apply them freely. I am used to the terms greedy and non-greedy, so this article uses them.

2. Matching principles of greedy and non-greedy modes
Greedy and non-greedy modes can be understood from either the application perspective or the principle perspective, but to truly master them you need to understand the matching principles.

Let's first answer, from the application perspective, the question "What are greedy and non-greedy modes?"

2.1 Greedy and non-greedy modes from the application perspective
2.1.1 What are greedy and non-greedy modes?
Let's look at an example.

Example:

Source string: aa<div>test1</div>bb<div>test2</div>cc

Regular expression 1: <div>.*</div>

Match result 1: <div>test1</div>bb<div>test2</div>

Regular expression 2: <div>.*?</div>

Match result 2: <div>test1</div> (this refers to a single match, so <div>test2</div> is not included)

Based on this example, we can analyze the matching behavior from the application perspective to see what greedy and non-greedy modes are.

Regular expression 1 uses greedy mode. When the first "</div>" is matched, the entire expression can already match successfully. However, because greedy mode is used, matching continues to the right to see whether a longer substring can also match. After the second "</div>" is matched, no longer substring to the right can match, so the match ends and the result is "<div>test1</div>bb<div>test2</div>". This is not how the engine actually proceeds; the matching principle is described in detail later.

From the application perspective, greedy mode matches as much as possible, provided that the entire expression matches successfully; this is the so-called "greed". Put simply, it takes as much as there is to take and stops only when there is nothing more it can take.

Regular expression 2 uses non-greedy mode. When the first "</div>" is matched, the entire expression matches successfully. Because non-greedy mode is used, the match ends there and no further attempt is made to the right. The result is "<div>test1</div>".

From the application perspective, non-greedy mode matches as little as possible, provided that the entire expression matches successfully; this is the so-called "non-greed". In layman's terms, it takes the first thing that satisfies it and does not look for more.
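The two results above can be reproduced with a short sketch. It uses Python's re module, which is my choice and not something the article prescribes, and it assumes the source string contains no extra spaces. The variable names are illustrative only.

```python
import re

source = "aa<div>test1</div>bb<div>test2</div>cc"

# Greedy: ".*" runs as far as it can, so the match ends at the last </div>.
greedy = re.search(r"<div>.*</div>", source).group()

# Non-greedy: ".*?" stops at the first </div> that lets the whole pattern match.
lazy = re.search(r"<div>.*?</div>", source).group()

# With non-greedy mode, repeated matching yields one div element per match.
lazy_all = re.findall(r"<div>.*?</div>", source)

print(greedy)    # <div>test1</div>bb<div>test2</div>
print(lazy)      # <div>test1</div>
print(lazy_all)  # ['<div>test1</div>', '<div>test2</div>']
```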

2.1.2 Prerequisites
In the analysis above, one prerequisite was mentioned repeatedly: the entire expression must match successfully. Why emphasize this? Consider the following example.

Regular expression 3: <div>.*</div>bb

Match result 3: <div>test1</div>bb

The quantifier modifying "." is still the greedy quantifier "*", so this is greedy mode. The leading "<div>.*</div>" could still match "<div>test1</div>bb<div>test2</div>", but then the following "bb" could not match. For the entire expression to succeed, "<div>.*</div>" must give back the "bb<div>test2</div>" it had matched. The entire expression then matches "<div>test1</div>bb", and "<div>.*</div>" matches "<div>test1</div>". So greedy mode affects the matching behavior of the subexpression only under the premise that the entire expression matches successfully. If the entire expression fails to match, greedy mode affects only the matching process, and there is no match result to speak of.

Non-greedy mode has the same issue. Consider the following example.

Regular expression 4: <div>.*?</div>cc

Match result 4: <div>test1</div>bb<div>test2</div>cc

Non-greedy mode is used here. The leading "<div>.*?</div>" still matches "<div>test1</div>" at first, but at that point "cc" cannot match. "<div>.*?</div>" must therefore keep extending to the right until it has matched "<div>test1</div>bb<div>test2</div>", after which "cc" matches and the entire expression succeeds. The entire expression matches "<div>test1</div>bb<div>test2</div>cc", and "<div>.*?</div>" matches "<div>test1</div>bb<div>test2</div>". So non-greedy mode, too, affects the matching behavior of the subexpression only under the premise that the entire expression matches successfully. If the entire expression fails to match, non-greedy mode has no effect on what the subexpression matches.
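A small sketch of expressions 3 and 4, again in Python's re module (my choice of tool). The capture groups are my addition so that the text matched by the "<div>...</div>" part can be inspected separately from the whole match; the source string is assumed to contain no extra spaces.

```python
import re

source = "aa<div>test1</div>bb<div>test2</div>cc"

# Expression 3: greedy ".*" must give text back so that the trailing "bb" can match.
m3 = re.search(r"(<div>.*</div>)bb", source)

# Expression 4: non-greedy ".*?" must take more text so that the trailing "cc" can match.
m4 = re.search(r"(<div>.*?</div>)cc", source)

print(m3.group(0))  # <div>test1</div>bb
print(m3.group(1))  # <div>test1</div>
print(m4.group(0))  # <div>test1</div>bb<div>test2</div>cc
print(m4.group(1))  # <div>test1</div>bb<div>test2</div>
```

In both cases the quantifier's preference is overridden by the need for the whole expression to succeed.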

2.1.3 Greedy or non-greedy: choosing by application
The analysis above gives a basic picture of greedy and non-greedy modes. Which one to choose in practice depends on the requirements.

For some simple requirements, for example when the source string is "aa<div>test1</div>bb", both modes give the expected result, so it matters little which one you use.

In the example in 2.1.1, however, a real application usually wants one div tag per match, which is what the non-greedy pattern matches; what the greedy pattern matches is generally not what we need.

Then why does greedy mode exist? It is hard to give a satisfactory answer from the application perspective, so we need to analyze the two modes from the perspective of matching principles.

2.2 Greedy and non-greedy modes from the perspective of matching principles
To really know what greedy and non-greedy modes are, when to use each, and how efficient each is, the application perspective is not enough; you need to fully understand their matching principles.

2.2.1 Starting from the basic matching principle
For the basic matching principles of NFA engines, see "NFA Engine Matching Principles".

This article covers the matching principles involved in greedy and non-greedy modes. First, a simple greedy matching process.

Source string: "Regex"

Regular expression: ".*"


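[Figure: matching process of the greedy pattern ".*" against the source string "Regex"; image not available]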

Note: To show the matching process clearly, the characters in the figure above are spaced widely apart; the actual source string is "Regex", including both double-quote characters. The same applies below.

Let's walk through the matching process. First, the opening double quote of the pattern gets control, matches the double quote at position 0 successfully, and hands control to ".*".

After ".*" gets control, because "*" is a greedy quantifier, it prefers to match whenever it has the choice between matching and not matching. It tries the "R" at position 1 and succeeds, continues to the right and matches the "e" at position 2, and so on, until it has matched the double quote at the end of the string. Having reached the end of the string, ".*" stops matching and hands control to the closing double quote of the pattern.

When the closing double quote gets control, it is already at the end of the string, so the match fails. The engine looks back for a saved state to backtrack to and returns control to ".*". ".*" gives back one character, the double quote at the end of the string, and hands control to the closing double quote of the pattern again, which now matches the double quote at the end of the string successfully.

The entire regular expression now matches successfully; ".*" has matched the text Regex (without the double quotes), and one backtrack was performed during the matching process.

Next, a simple non-greedy matching process.

Source string: "Regex"

Regular expression: ".*?"




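[Figure: matching process of the non-greedy pattern ".*?" against the source string "Regex"; image not available]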

Let's walk through the non-greedy matching process. First, the opening double quote of the pattern gets control, matches the double quote at position 0 successfully, and hands control to ".*?".

".*?" gets control. Because "*?" is a non-greedy quantifier, it prefers not to match whenever it has the choice between matching and not matching. Since "*" is equivalent to "{0,}", it is allowed to match nothing at all, so at position 1 it first declines to match, records a backtracking state, and hands control to the closing double quote of the pattern.

The closing double quote gets control and tries to match at position 1, against the "R". The match fails, so the engine looks back for a saved state and returns control to ".*?", which consumes one character, the "R" at position 1, and then hands control to the closing double quote again.

The closing double quote gets control and tries to match at position 2, against the "e". The match fails and the engine backtracks again. This process repeats until ".*?" has matched "x", after which control passes to the closing double quote once more.

The closing double quote gets control and tries to match at position 6, against the double quote at the end of the string, and succeeds.

The entire regular expression now matches successfully; ".*?" has matched the text Regex, and five backtracks were performed during the matching process.
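The two walkthroughs above can be mimicked with a toy model. This is only a sketch and not how a real engine is implemented: count_backtracks is a hypothetical helper that hard-codes the pattern shape (double quote, then ".*" or ".*?", then double quote), anchors the attempt at position 0, and ignores the fact that "." normally does not match a newline.

```python
def count_backtracks(text, greedy):
    """Toy model of one attempt of  " .* "  (greedy) or  " .*? "  (lazy) at position 0.

    Returns (text matched by the quantified dot, number of backtracks).
    """
    if not text.startswith('"'):
        return None, 0
    limit = len(text) - 1            # the dot can consume at most this many characters
    taken = limit if greedy else 0   # greedy takes everything first, lazy takes nothing
    step = -1 if greedy else 1       # greedy gives characters back, lazy takes one more
    backtracks = 0
    while True:
        pos = 1 + taken              # position where the closing quote is tried
        if pos < len(text) and text[pos] == '"':
            return text[1:pos], backtracks
        taken += step
        if taken < 0 or taken > limit:
            return None, backtracks  # no saved state left: the attempt fails
        backtracks += 1              # control returns to the quantifier


print(count_backtracks('"Regex"', greedy=True))   # ('Regex', 1)
print(count_backtracks('"Regex"', greedy=False))  # ('Regex', 5)
```

The counts agree with the walkthroughs: one backtrack in greedy mode and five in non-greedy mode.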

2.2.2 Greedy or non-greedy: choosing by matching efficiency
The analysis above shows that, when the match succeeds, greedy mode backtracks less in this example. Each backtrack means handing control back, giving up matched content or consuming previously skipped content, and retrying the match, all of which lowers matching efficiency. Here greedy mode is therefore more efficient than non-greedy mode.

However, the example in 2.2.1 is only a simple case. You may now be wondering: is greedy mode always more efficient than non-greedy mode? The answer is no.

Example:

Requirement: obtain the substring between two double quotes; the substring itself must not contain a double quote.

Regular expression 1: ".*"

Regular expression 2: ".*?"

Case 1: when greedy mode overshoots and matches a lot of unwanted content, it may backtrack more than non-greedy mode. For example, when the source string is: The word "Regex" means regular expression.

Case 2: greedy mode cannot meet the requirement at all. For example, when the source string is: The phrase "regular expression" is called "Regex" for short.

In case 1, regular expression 1 uses greedy mode: ".*" matches all the way to the end of the string and then hands control to the closing double quote. That match fails, so backtracking begins. Because the overshoot, " means regular expression.", is much longer than the content to be matched, regular expression 1 is less efficient here than regular expression 2, which uses non-greedy mode.

In case 2, regular expression 1 matches everything from the first double quote to the last one, that is, "regular expression" is called "Regex", which does not meet the requirement, so efficiency is beside the point.
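Both cases can be checked with a few lines of Python (the re module is my choice of tool). The capture groups are added only so that the content between the quotes can be printed on its own.

```python
import re

case1 = 'The word "Regex" means regular expression.'
case2 = 'The phrase "regular expression" is called "Regex" for short.'

greedy = re.compile(r'"(.*)"')
lazy = re.compile(r'"(.*?)"')

# Case 1: both modes give the same result; they differ only in the work done.
case1_greedy = greedy.search(case1).group(1)
case1_lazy = lazy.search(case1).group(1)

# Case 2: greedy mode runs to the last quote and violates the requirement.
case2_greedy = greedy.search(case2).group(1)
case2_lazy = lazy.findall(case2)

print(case1_greedy)  # Regex
print(case1_lazy)    # Regex
print(case2_greedy)  # regular expression" is called "Regex
print(case2_lazy)    # ['regular expression', 'Regex']
```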

Both situations are common. Does that mean only non-greedy mode can meet the requirement while keeping efficiency acceptable? Of course not. Depending on the actual situation, changing the subexpression that the greedy quantifier modifies can both meet the requirement and improve matching efficiency.

Source string: "Regex"

Regular expression 3: "[^"]*"

Let's look at the matching process of regular expression 3.


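[Figure: matching process of the greedy pattern "[^"]*" against the source string "Regex"; image not available]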

First, the opening double quote of the pattern gets control, matches the double quote at position 0 successfully, and hands control to "[^"]*".

After "[^"]*" gets control, because "*" is a greedy quantifier, it prefers to match whenever it has the choice. It tries the "R" at position 1 and succeeds, continues to the right and matches the "e" at position 2, and so on through "x". It then tries the double quote at the end of the string, which the negated character class cannot match, so "[^"]*" stops and hands control to the closing double quote of the pattern.

The closing double quote gets control, matches the double quote at the end of the string, and succeeds.

The entire regular expression now matches successfully. "[^"]*" has matched the text Regex, and no backtracking was performed during the matching process.

Changing the subexpression modified by the quantifier from the wide-ranging "." to the negated character class "[^"]", while still using greedy mode, solves both the requirement and the efficiency problem. And because this matching process never backtracks, there is no need to record backtracking states, so the expression can be optimized further with an atomic group.

Regular expression 4: "(?>[^"]*)"

Not all languages support atomic groups. .NET does; Java does as well, and additionally offers a more concise possessive quantifier: "[^"]*+".
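As an illustration outside .NET and Java, Python's re module accepts both the atomic group and the possessive quantifier from version 3.11 onward. The sketch below is guarded by a version check so that it still runs on older interpreters, where it simply falls back to the plain greedy form; the capture groups and variable names are my own additions.

```python
import re
import sys

text = 'The phrase "regular expression" is called "Regex" for short.'

# Plain greedy negated character class (regular expression 3).
plain = re.findall(r'"([^"]*)"', text)

# Atomic groups and possessive quantifiers were added to Python's re in 3.11.
if sys.version_info >= (3, 11):
    atomic = re.findall(r'"((?>[^"]*))"', text)   # regular expression 4
    possessive = re.findall(r'"([^"]*+)"', text)  # possessive form
else:
    atomic = possessive = plain  # assumption: older interpreter, fall back

print(plain)       # ['regular expression', 'Regex']
print(atomic)      # ['regular expression', 'Regex']
print(possessive)  # ['regular expression', 'Regex']
```

All three forms return the same matches; they differ only in whether backtracking states are kept for the quantified part.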

3. Greedy or non-greedy mode: more on matching efficiency
In general, when the subexpression modified by the quantifier is the same in greedy and non-greedy mode, as with ".*" and ".*?", the two are used in different scenarios, so their efficiency is usually not comparable.

When the subexpression is changed in order to meet the requirement, for example from ".*" to "[^"]*", the two expressions are also not directly comparable, because the modified subexpressions differ. But when the same subexpression meets the requirement in either mode, as with "[^"]*" and "[^"]*?", greedy mode is usually more efficient.

Moreover, whatever non-greedy mode can achieve, greedy mode can usually achieve as well by optimizing the subexpression that the quantifier modifies, whereas some optimizations available to greedy mode are not necessarily available to non-greedy mode.

Another advantage of greedy mode is that, when a match fails, it can report the failure sooner, which improves matching efficiency. The rest of this chapter examines the matching efficiency of greedy and non-greedy modes in more detail.

3.1 Improving efficiency: the evolution process
With the basic matching principles of greedy and non-greedy modes understood, let's review how a regular expression evolves toward better efficiency.

Requirement: obtain the substring between two double quotes; the substring itself must not contain a double quote.

Source string: The phrase "regular expression" is called "Regex" for short.

Regular expression 1: ".*"

Regular expression 1 matches everything from the first double quote to the last one, that is, "regular expression" is called "Regex", which does not meet the requirement.

This leads to regular expression 2: ".*?"

First, the opening double quote gets control and tries to match starting at position 0, failing at each position until it matches at position 11, and then hands control to ".*?". From there the process is the same as the non-greedy process in 2.2.1: ".*?" matches the first quoted substring, regular expression, with one backtrack for each of its 18 characters.

How can the efficiency lost to backtracking be eliminated? Use a subexpression with a narrower matching range together with greedy mode, which leads to regular expression 3: "[^"]*"

First, the opening double quote gets control and tries to match starting at position 0, failing at each position until it matches at position 11, and then hands control to "[^"]*". From there the process is the same as that of regular expression 3 in 2.2.2: "[^"]*" matches the first quoted substring, regular expression, and no backtracking is performed.

3.2 Improving efficiency: reporting failure sooner
The discussion above concerns successful matches. When a regular expression fails to match, reporting the failure as early as possible also improves efficiency, and this is perhaps the point most easily overlooked when designing regular expressions. When the source string is large or the regular expression is complex, how quickly a failure is reported directly affects matching efficiency.

Below we construct regular expressions that fail to match and analyze their matching processes.

In the following analyses the source string is always: The phrase "regular expression" is called "Regex" for short.

3.2.1 Analysis of a failing match in non-greedy mode

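[Figure: failing match of the non-greedy pattern ".*?"@ against the source string, with positions labeled A, B, ...; image not available]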

Construct a non-greedy regular expression that fails to match: ".*?"@

Because of the final "@", this regular expression is bound to fail. Let's look at the matching process.

First, the opening double quote gets control and tries to match starting at position 0. It fails at each position until it matches at position A, then hands control to ".*?".

".*?" gets control and starts at the position after A. Because it is non-greedy, it first declines to match, records a backtracking state, and hands control to the closing double quote. The closing double quote tries to match at the position after A, against the character "r", and fails; the engine finds the saved state and returns control to ".*?", which matches the character "r". This repeats until ".*?" has matched the "n" before B and the closing double quote matches the double quote at B, handing control to "@". "@" tries the space that follows and fails, so the engine backtracks and returns control to ".*?", which consumes one more character. The same process repeats until ".*?" has matched to the end of the string and hands control to the closing double quote, which fails because it is already at the end of the string. With no backtracking states left, the engine reports that the attempt starting at position 11 has failed, and this round of attempts ends.

The engine's transmission then advances to the next starting position and the next round of attempts begins. The subsequent rounds are similar to the first; see the figure for this section.

As the process shows, almost every step of a failing non-greedy match involves a backtrack, which greatly affects matching efficiency.

3.2.2 Analysis of a failing match in greedy mode: a wide-ranging subexpression


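[Figure: failing match of the greedy pattern ".*"@ against the source string, with positions labeled A, C, D, ..., I; image not available]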

PS: The illustrations for these analyses are modeled on the corresponding illustrations in the book "Mastering Regular Expressions".

Construct a greedy regular expression that fails to match: ".*"@

Here the subexpression modified by the quantifier is ".", which has a wide matching range. Because of the final "@", this regular expression is bound to fail. Let's look at the matching process.

First, the opening double quote gets control and tries to match starting at position 0. It fails at each position until it matches at position A, then hands control to ".*".

".*" gets control and starts at the position after A. Because it is greedy, it prefers to match and runs all the way to the end of the string, then hands control to the closing double quote. Being at the end of the string, the closing double quote fails; the engine finds a saved state and returns control to ".*", which gives back the last matched character, ".". This repeats until the closing double quote matches the double quote at C and hands control to "@". "@" tries the space at D and fails, so the engine backtracks again, control returns to ".*", and ".*" gives back more of the matched text. This continues until ".*" has given back everything it matched, down to I, and hands control to the closing double quote, which fails. With no backtracking states left, the engine reports that the attempt starting at position 11 has failed, and this round of attempts ends.

The engine's transmission then advances to the next starting position and the next round of attempts begins. The subsequent rounds are similar to the first; see the figure for this section.

As the process shows, a failing greedy match with a wide-ranging subexpression behaves much like a failing non-greedy match: the total number of backtracks is basically the same, and the impact on matching efficiency is still significant.

3.2.3 Analysis of a failing match in greedy mode: an improved subexpression

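[Figure: failing match of the greedy pattern "[^"]*"@ against the source string, with positions labeled A, B, C; image not available]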

Construct a greedy regular expression that fails to match: "[^"]*"@

Here the subexpression modified by the quantifier is changed to the negated character class "[^"]", which has a narrow matching range. Because of the final "@", this regular expression is bound to fail. Let's look at the matching process.

First, the opening double quote gets control and tries to match starting at position 0. It fails at each position until it matches at position A, then hands control to "[^"]*".

"[^"]*" gets control and starts at the position after A. Because it is greedy, it prefers to match and runs up to B, then hands control to the closing double quote. The closing double quote matches the double quote that follows, succeeds, and hands control to "@". "@" tries the following space and fails; the engine finds a saved state and returns control to "[^"]*", which gives back matched text. This repeats until "[^"]*" has given back everything it matched, down to C, and hands control to the closing double quote, which fails. With no backtracking states left, the engine reports that the attempt starting at position 11 has failed, and this round of attempts ends.

The engine's transmission then advances to the next starting position and the next round of attempts begins. The subsequent rounds are similar to the first; see the figure for this section.

As the process shows, when a greedy pattern with a negated character class fails to match, the number of backtracks is greatly reduced overall, which effectively improves matching efficiency.
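The three failure analyses can be compared with a toy backtrack counter. This is a sketch under strong assumptions, not a real engine: the helper names are hypothetical, the pattern shape (double quote, quantified atom, double quote, "@") is hard-coded, every start position in the string is tried in turn, and one backtrack is counted each time control returns to the quantifier.

```python
def backtracks_in_attempt(text, start, atom, greedy):
    """Backtracks in one attempt of  " atom* " @  anchored at text[start]."""
    if text[start] != '"':
        return 0                        # the opening quote fails at once
    # How many characters after the opening quote could the atom consume?
    limit = 0
    while start + 1 + limit < len(text) and atom(text[start + 1 + limit]):
        limit += 1
    taken = limit if greedy else 0      # greedy takes all first, lazy takes none
    step = -1 if greedy else 1
    backtracks = 0
    while True:
        pos = start + 1 + taken         # closing quote tried here, "@" right after
        if text[pos:pos + 2] == '"@':
            return backtracks           # overall success (never happens for this text)
        taken += step
        if taken < 0 or taken > limit:
            return backtracks           # no saved state left: report failure
        backtracks += 1


def total_backtracks(text, atom, greedy):
    """Sum over all start positions, as the transmission advances through the text."""
    return sum(backtracks_in_attempt(text, i, atom, greedy) for i in range(len(text)))


def any_char(ch):
    return ch != "\n"                   # models "."


def not_quote(ch):
    return ch != '"'                    # models [^"]


text = 'The phrase "regular expression" is called "Regex" for short.'
lazy_dot = total_backtracks(text, any_char, greedy=False)      # ".*?"@
greedy_dot = total_backtracks(text, any_char, greedy=True)     # ".*"@
greedy_class = total_backtracks(text, not_quote, greedy=True)  # "[^"]*"@
print(lazy_dot, greedy_dot, greedy_class)  # 105 105 45
```

In this model the two "." variants backtrack equally often, while the negated character class backtracks far less, which is the same ordering the analyses above arrive at. The absolute numbers belong to the toy model only.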

3.2.4 Analysis of a failing match in greedy mode: atomic grouping
According to the analysis in 3.2.3, because "[^"]*" uses a negated character class, none of the characters matched between A and B in the figure in 3.2.3 can be a double quote, so the backtracking between B and C is wasted effort. In other words, there is no need to keep the backtracking states between B and C at all. In .NET this can be achieved with an atomic group, and in Java with a possessive quantifier.


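[Figure: failing match of the atomic-group pattern "(?>[^"]*)"@ against the source string, with positions labeled A, B; image not available]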

First, the opening double quote gets control and tries to match starting at position 0. It fails at each position until it matches at position A, then hands control to "(?>[^"]*)".

"(?>[^"]*)" gets control and starts at the position after A. Because it is greedy, it prefers to match and runs up to B, then hands control to the closing double quote. No backtracking states are kept from this stretch of matching. The closing double quote matches the double quote that follows, succeeds, and hands control to "@". "@" tries the following space and fails; the engine looks for a backtracking state, finds none, and reports that the attempt starting at position 11 has failed. This round of attempts ends.

The engine's transmission then advances to the next starting position and the next round of attempts begins. The subsequent rounds are similar to the first; see the figure for this section.

As the process shows, a failing match with a greedy pattern inside an atomic group involves no backtracking at all, which maximizes matching efficiency.

3.3 Converting non-greedy mode to greedy mode
When the subexpression has a wide matching range, greedy and non-greedy modes match different content. However, by optimizing the subexpression, whatever non-greedy mode can match, greedy mode can be made to match as well.

For example, in actual application, match the content of the img tag.

Example:

Requirement: obtain the image address in an img tag; the value after "src=" is always enclosed in double quotes.

Source string:

Regular Expression 1:

In the match result, the content of capture group 1 is the image address. Every quantifier in the expression above uses non-greedy mode. Following the analysis in the previous sections, the last two non-greedy quantifiers can be converted to greedy mode by using negated character classes.

Regular Expression 2: ] *>

Note: the character ">" may also appear in an attribute between the src="..." value and the closing ">" of the tag, but this is an extreme case and is not discussed here.

The last two non-greedy quantifiers can be converted to greedy mode with negated character classes, which improves matching efficiency. The non-greedy quantifier before "src=", however, cannot be converted this way, because what must be excluded is the character sequence "src=" rather than a single character or a few alternative characters. There is still a way, though: a lookahead achieves the same effect.

Regular Expression 3: ] *>

"(?!src=)." denotes a character such that the text starting at it is not the character sequence "src="; "(?:(?!src=).)*" denotes zero or more such characters. This excludes a character sequence, with the same effect as a negated character class, except that a negated character class excludes one or more individual characters, whereas this lookahead construct excludes one or more ordered character sequences.

However, excluding a character sequence with a negative lookahead requires an extra check at every character, so whether it is faster or slower than non-greedy mode has to be analyzed case by case. For simple regular expressions or short source strings, non-greedy mode is generally more efficient; for large source strings or complex regular expressions, greedy mode is generally more efficient.
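To make the conversion concrete, here is a hypothetical sketch in Python's re module. The source string and both patterns below are my own illustrations and are not the article's original expressions; they only show the technique of replacing lazy quantifiers with negated character classes and with the "(?:(?!src=).)*" construct.

```python
import re

# Hypothetical source string, for illustration only.
html = '<p>text</p><img alt="logo" src="/images/logo.png" width="10"><br>'

# Non-greedy version: every unbounded quantifier is lazy.
lazy = re.compile(r'<img\b.*?src="(.*?)".*?>')

# Greedy version: negated classes replace the last two lazy quantifiers, and
# (?:(?!src=).)* consumes characters only while "src=" does not start there.
greedy = re.compile(r'<img\b(?:(?!src=).)*src="([^"]*)"[^>]*>')

lazy_url = lazy.search(html).group(1)
greedy_url = greedy.search(html).group(1)

print(lazy_url)    # /images/logo.png
print(greedy_url)  # /images/logo.png
```

Both versions capture the same address in group 1; which one is faster depends on the input, as discussed above.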

For example, for the requirement above, obtaining the image address in an img tag, non-greedy mode is basically sufficient. For complex applications, such as balancing groups, greedy mode combined with lookaround is needed.

Take a balancing group that matches nested div tags as an example:

Regex reg = new Regex(@"(?isx)      # matching mode: ignore case, '.' matches any character
    <div[^>]*>                       # start tag '<div...>'
    (?>                              # grouping construct, limits the scope of the quantifier '*'
        <div[^>]*> (?<Open>)         # named capture group: on a start tag, push and add 1 to the Open count
      |                              # alternation
        </div> (?<-Open>)            # balancing group: on an end tag, pop and subtract 1 from the Open count
      |                              # alternation
        (?:(?!</?div\b).)*           # any characters not followed on the right by a start or end tag
    )*                               # the above substrings occur 0 or more times
    (?(Open)(?!))                    # if any Open remains, the tags are unpaired, so match nothing
    </div>                           # end tag '</div>'
");

"(?:(?!</?div\b).)*" here uses greedy mode combined with lookahead. Although a check is made at every character, the check is character-based and fast. With non-greedy mode, the alternation "|" would have to be evaluated each time, and alternation is costly for matching efficiency, far more so than the per-character check. Another reason is that greedy mode can be combined with an atomic group to improve efficiency, whereas using an atomic group with non-greedy mode is pointless.

4. Greedy and non-greedy: final review
4.1 Reviewing the matching principle with an example
Let's return to the greedy regular expression in 2.1.1, which was analyzed there from the application perspective. Having discussed the matching principle, we can see that the actual matching process is not that simple. It is described below from the perspective of the matching principle.


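[Figure: matching process of the greedy pattern <div>.*</div> against the source string from 2.1.1, with positions labeled B, ..., E; image not available]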

First, "<" gets control. It tries to match at position 0, against the character "a", and fails, ending the first round of attempts. The second round starts at position 1 and also fails. The third round starts at position 2, where "<" matches the character "<" and hands control to "d".

"d" tries to match the character "d", succeeds, and hands control to "i". This repeats until ">" matches the character ">" and hands control to ".*".

".*" is greedy. Starting from the character "t" after B, it matches all the way to E, the end of the string, and then hands control to "<".

"<" tries to match at the end of the string and fails. The engine looks back for a saved state and returns control to ".*", which gives up one character, "c", and hands control to "<" again; the match fails and the engine backtracks again. This repeats until ".*" has given up the matched character "<", that is, the whole substring "</div>cc". "<" then matches the character "<" and hands control to "/".

Then "/", "d", "i", "v" and ">" each match their corresponding characters, and at that point the entire regular expression has matched successfully.

4.2 Greedy and non-greedy: quantifiers
4.2.1 Non-greedy mode of interval quantifiers
The non-greedy quantifier discussed so far has always been "*?"; the others have not been covered. Most people who have worked with regular expressions can understand "+?", but the non-greedy form of an interval quantifier, such as "{m,n}?", is something many have either never seen or never understood. The main reason is that it has very few application scenarios, so it is overlooked.

The first thing to be clear about is that "{m,n}" is a greedy quantifier. Although it has an upper limit, until that limit is reached it still matches as much as possible. "{m,n}?" is the corresponding non-greedy quantifier: whenever it has the choice between matching and not matching, it matches as little as possible.

The following example shows where this non-greedy mode is useful.

Example (see "Limiting character length and minimal matching"):

Requirement: within a span of at most 100 characters, how do I match from the first match up to the first occurrence of "abc"?

Writing csdn.{1,100}abc gives the longest match (I need the shortest match within 1 to 100 characters).

For example, for csdnfddabckjdsfjabc, the matching result should be: csdnfddabc

Regular expression: csdn.{1,100}?abc
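A minimal Python check of this example, using the string and the two patterns given above. Python's re module is used here only as one engine that supports "{m,n}?"; the variable names are illustrative.

```python
import re

text = "csdnfddabckjdsfjabc"

# Greedy interval quantifier: takes as many characters as it can,
# so the match runs to the LAST "abc" within the limit.
greedy = re.search(r"csdn.{1,100}abc", text).group()

# Non-greedy interval quantifier: takes as few characters as it can,
# so the match stops at the FIRST "abc".
lazy = re.search(r"csdn.{1,100}?abc", text).group()

print(greedy)
print(lazy)
```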

Some people may not quite follow this example at first, but think about it: "*" is equivalent to "{0,}" and "+" is equivalent to "{1,}", so "*?" is simply "{0,}?". Generalized, that is "{m,}?", where the upper limit is infinite. If the upper limit is a fixed value, it becomes "{m,n}?". Seen this way, it should be easy to understand.
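The equivalences can be checked with a small Python sketch. The test string and the angle-bracket patterns are assumptions made for illustration; each pair holds the shorthand form and its interval form.

```python
import re

text = "<a><b>"  # illustrative input, not from the original article

# (shorthand, interval form) pairs that should behave identically.
pairs = [
    (r"<.*>",  r"<.{0,}>"),    # greedy, lower limit 0
    (r"<.+>",  r"<.{1,}>"),    # greedy, lower limit 1
    (r"<.*?>", r"<.{0,}?>"),   # non-greedy, lower limit 0
    (r"<.+?>", r"<.{1,}?>"),   # non-greedy, lower limit 1
]

results = [
    (re.search(short, text).group(), re.search(interval, text).group())
    for short, interval in pairs
]

for (short, interval), (r1, r2) in zip(pairs, results):
    print(short, interval, r1, r2, r1 == r2)
```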

"{m}" is not one of the match-priority quantifiers. Likewise, "{m}?", although some languages accept it, is not an ignore-priority quantifier. The two achieve the same effect: the match can succeed only after the modified subexpression has matched exactly m times, and no backtracking state is recorded. The question of match priority versus ignore priority therefore does not arise, and these forms are not covered in this article. There is little point in discussing them further; it is enough to know their matching behavior.

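As a quick check of the point about "{m}" and "{m}?", the sketch below assumes Python's re module, which is one engine that accepts "{m}?". Other engines may reject or treat it differently. The input strings are illustrative.

```python
import re

# With a fixed count there is nothing to choose between: both forms
# must consume exactly two "a" characters.
fixed = re.search(r"a{2}", "aaaa").group()
fixed_lazy = re.search(r"a{2}?", "aaaa").group()

# The same holds when more of the pattern follows the quantifier.
fixed_ctx = re.search(r"a{2}b", "aaab").group()
fixed_lazy_ctx = re.search(r"a{2}?b", "aaab").group()

print(fixed, fixed_lazy)
print(fixed_ctx, fixed_lazy_ctx)
```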
4.2.2 The lower limit of ignore-priority quantifiers
The lower limit of match-priority quantifiers is easy to understand. "?" is equivalent to "{0,1}": the subexpression it modifies matches at least 0 times and at most 1 time. "*" is equivalent to "{0,}": the subexpression it modifies matches at least 0 times, with no upper limit. "+" is equivalent to "{1,}": the subexpression it modifies matches at least 1 time, with no upper limit.

The lower limit of ignore-priority quantifiers, by contrast, is easy to overlook.

"??" is also an ignore-priority quantifier, so the subexpression it modifies uses non-greedy mode. That subexpression matches at least 0 times and at most 1 time. During matching it follows the non-greedy principle: it first skips the match, that is, matches 0 times, and records a backtracking state. It attempts a match only when one is required for the overall expression to succeed.

The subexpression modified by "*?" matches at least 0 times, with no upper limit. The subexpression modified by "+?" matches at least 1 time, with no upper limit. Although "+?" is non-greedy, it must first match once during the matching process; only after that does it start preferring to skip further matches.

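The lower limits described above can be observed directly. This is a small sketch using Python's re module with illustrative inputs. On its own, each non-greedy quantifier stops at its lower limit, and it matches more only when the rest of the pattern requires it.

```python
import re

s = "aaa"

# Standing alone, each quantifier stops at its lower limit.
zero_or_one = re.match(r"a??", s).group()   # lower limit 0
zero_or_more = re.match(r"a*?", s).group()  # lower limit 0
one_or_more = re.match(r"a+?", s).group()   # lower limit 1: must take one "a"

# When the rest of the pattern needs it, the quantifier matches more.
forced_one = re.match(r"a??b", "ab").group()
forced_many = re.match(r"a*?b", "aaab").group()

print(repr(zero_or_one), repr(zero_or_more), repr(one_or_more))
print(forced_one, forced_many)
```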
4.3 Summary of greedy and non-greedy modes
Greedy and non-greedy in terms of syntax

Subexpressions modified by match-priority quantifiers use greedy mode; subexpressions modified by ignore-priority quantifiers use non-greedy mode.

Match-priority quantifiers include: "{m,n}", "{m,}", "?", "*", and "+".

Ignore-priority quantifiers include: "{m,n}?", "{m,}?", "??", "*?", and "+?".

Greedy and non-greedy from the application perspective

The greedy and non-greedy modes affect the matching behavior of subexpressions modified by quantifiers. Greedy mode matches as much as possible, provided the entire expression still matches successfully; non-greedy mode matches as little as possible under the same condition. Non-greedy mode is supported only by some NFA engines.

Greedy and non-greedy from the perspective of matching principles

When a greedy pattern and a non-greedy pattern produce the same match result, the greedy pattern is usually more efficient.

Any non-greedy pattern can be converted to a greedy pattern by changing the subexpression that the quantifier modifies.

Greedy mode can be combined with atomic groups (also called solidified groups) to improve matching efficiency; non-greedy mode cannot.
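As an illustration of converting a non-greedy pattern to a greedy one by changing the quantified subexpression, here is a minimal Python sketch. The quoted-string input and both patterns are assumptions chosen for the example, not taken from the article.

```python
import re

text = 'say "hello" and "bye"'  # illustrative input

# Non-greedy: "." can match anything, so "?" is needed to stop early.
lazy = re.findall(r'".*?"', text)

# Greedy rewrite: the subexpression excludes the closing quote itself,
# so the greedy quantifier cannot run past it.
greedy = re.findall(r'"[^"]*"', text)

print(lazy)
print(greedy)
```

The two patterns return the same matches on this input. The greedy version restricts what the subexpression may match instead of relying on "?" to stop early.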
