Greedy, reluctant, and possessive quantifiers in Java regular expressions

Last Update:2015-08-13 Source: Internet

Author: User

Tags character classes


In Java regular expressions, quantifiers let you specify how many times a pattern must occur for a match. For convenience, the three parts of the Pattern API specification that describe greedy, reluctant, and possessive quantifiers are presented together below. At first glance, the quantifiers X?, X?? and X?+ appear to do exactly the same thing, since they all allow X to match zero or one time. There are, however, subtle differences between them.

Greedy	Reluctant	Possessive	Meaning
`X?`	`X??`	`X?+`	Match X zero or one time
`X*`	`X*?`	`X*+`	Match X zero or more times
`X+`	`X+?`	`X++`	Match X one or more times
`X{n}`	`X{n}?`	`X{n}+`	Match X exactly n times
`X{n,}`	`X{n,}?`	`X{n,}+`	Match X at least n times
`X{n,m}`	`X{n,m}?`	`X{n,m}+`	Match X at least n times, but not more than m times

Before you begin, prepare a piece of code that you can test repeatedly:

    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexDemo {
        public static void main(String[] args) {
            Scanner sc = new Scanner(System.in);
            // Loop forever: read a regex, then a string to search.
            while (true) {
                System.out.print("\nRegex: ");
                Pattern pattern = Pattern.compile(sc.nextLine());
                System.out.print("String to search: ");
                Matcher matcher = pattern.matcher(sc.nextLine());
                boolean found = false;
                // Report every match, including zero-length ones.
                while (matcher.find()) {
                    System.out.println("  Found the text \"" + matcher.group()
                            + "\" starting at index " + matcher.start()
                            + " and ending at index " + matcher.end() + ".");
                    found = true;
                }
                if (!found) {
                    System.out.println("No match found.");
                }
            }
        }
    }

Let's start with greedy quantifiers by building three regular expressions: the letter a followed by ?, *, or +. First, see what happens when these expressions are tested against an empty input string:

    Regex: a?
    String to search:
      Found the text "" starting at index 0 and ending at index 0.

    Regex: a*
    String to search:
      Found the text "" starting at index 0 and ending at index 0.

    Regex: a+
    String to search:
    No match found.

Zero-length matches

In the example above, the first two matches succeed because the expressions a? and a* both allow the letter a to occur zero times. Unlike the examples seen so far, the start and end indices are both 0. The empty input string has no length, so the test simply matches nothing at index 0. Matches of this kind are called zero-length matches. A zero-length match can occur in several cases: in an empty input string, at the beginning of an input string, after the last character of an input string, or between any two characters of an input string. Zero-length matches are easy to spot because they always start and end at the same index.
Let's look at a few more examples of zero-length matches. Change the input string to the single letter "a" and you will notice something interesting:

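    Regex: a?
    String to search: a
      Found the text "a" starting at index 0 and ending at index 1.
      Found the text "" starting at index 1 and ending at index 1.

    Regex: a*
    String to search: a
      Found the text "a" starting at index 0 and ending at index 1.
      Found the text "" starting at index 1 and ending at index 1.

    Regex: a+
    String to search: a
      Found the text "a" starting at index 0 and ending at index 1.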

All three quantifiers found the letter "a", but the first two also found a zero-length match at index 1, that is, after the last character of the input string. Recall that the matcher sees the character "a" as sitting in the cell between index 0 and index 1, and that the test program loops until it can find no more matches. Depending on the quantifier used, the presence of "nothing" at the index after the last character may or may not trigger a match.
Now change the input string to the letter "a" five times in a row and you get the following result:

    Regex: a?
    String to search: aaaaa
      Found the text "a" starting at index 0 and ending at index 1.
      Found the text "a" starting at index 1 and ending at index 2.
      Found the text "a" starting at index 2 and ending at index 3.
      Found the text "a" starting at index 3 and ending at index 4.
      Found the text "a" starting at index 4 and ending at index 5.
      Found the text "" starting at index 5 and ending at index 5.

    Regex: a*
    String to search: aaaaa
      Found the text "aaaaa" starting at index 0 and ending at index 5.
      Found the text "" starting at index 5 and ending at index 5.

    Regex: a+
    String to search: aaaaa
      Found the text "aaaaa" starting at index 0 and ending at index 5.

The expression a? finds an individual match for each character, since it matches when "a" occurs zero or one time. The expression a* finds two separate matches: all of the letter "a"s in the first match, then the zero-length match after the last character, at index 5. Finally, a+ matches all occurrences of the letter "a", ignoring the presence of "nothing" at the last index.
At this point, you may wonder what happens when the first two quantifiers encounter a letter other than "a". For example, what happens when they encounter the letter "b", as in "ababaaaab"?
Let's find out:

    Regex: a?
    String to search: ababaaaab
      Found the text "a" starting at index 0 and ending at index 1.
      Found the text "" starting at index 1 and ending at index 1.
      Found the text "a" starting at index 2 and ending at index 3.
      Found the text "" starting at index 3 and ending at index 3.
      Found the text "a" starting at index 4 and ending at index 5.
      Found the text "a" starting at index 5 and ending at index 6.
      Found the text "a" starting at index 6 and ending at index 7.
      Found the text "a" starting at index 7 and ending at index 8.
      Found the text "" starting at index 8 and ending at index 8.
      Found the text "" starting at index 9 and ending at index 9.

    Regex: a*
    String to search: ababaaaab
      Found the text "a" starting at index 0 and ending at index 1.
      Found the text "" starting at index 1 and ending at index 1.
      Found the text "a" starting at index 2 and ending at index 3.
      Found the text "" starting at index 3 and ending at index 3.
      Found the text "aaaa" starting at index 4 and ending at index 8.
      Found the text "" starting at index 8 and ending at index 8.
      Found the text "" starting at index 9 and ending at index 9.

    Regex: a+
    String to search: ababaaaab
      Found the text "a" starting at index 0 and ending at index 1.
      Found the text "a" starting at index 2 and ending at index 3.
      Found the text "aaaa" starting at index 4 and ending at index 8.

Even though the letter "b" appears in cells 1, 3, and 8, the output reports a zero-length match at those locations. The regular expression a? is not specifically looking for the letter "b"; it is merely looking for the presence (or absence) of the letter "a". If the quantifier allows "a" to match zero times, anything in the input string that is not an "a" shows up as a zero-length match. The remaining "a"s are matched according to the rules discussed in the previous examples.
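The same observations can be checked programmatically rather than by typing into the test program. The following is a minimal sketch, not part of the original test program: the class name MatchListingDemo and the helper matches are hypothetical names chosen for illustration. It collects every match as text@start-end, so a zero-length match shows up as an entry with nothing before the @ sign.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original article.
class MatchListingDemo {

    // Returns every match of regex in input, formatted as "text@start-end".
    // A zero-length match has empty text, for example "@1-1".
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            System.out.println(regex + " -> " + matches(regex, input));
        }
    }
}
```

With a+ no zero-length entries should appear, while a? and a* should each report one at the positions of the "b" characters and one more after the last character.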
To match a pattern exactly n times, simply specify the number inside a pair of curly braces:

    Regex: a{3}
    String to search: aa
    No match found.

    Regex: a{3}
    String to search: aaa
      Found the text "aaa" starting at index 0 and ending at index 3.

    Regex: a{3}
    String to search: aaaa
      Found the text "aaa" starting at index 0 and ending at index 3.

Here, the regular expression a{3} looks for three occurrences of the letter "a" in a row. The first test fails because the input string does not contain enough "a"s to match. The second test contains exactly three "a"s in the input string, which triggers a match. The third test also triggers a match, because there are exactly three "a"s at the beginning of the input string. Anything that follows is irrelevant to the first match. If the pattern appears again after that point, it triggers subsequent matches:

    Regex: a{3}
    String to search: aaaaaaaaa
      Found the text "aaa" starting at index 0 and ending at index 3.
      Found the text "aaa" starting at index 3 and ending at index 6.
      Found the text "aaa" starting at index 6 and ending at index 9.

To require a pattern to appear at least n times, add a comma (,) after the number:

    Regex: a{3,}
    String to search: aaaaaaaaa
      Found the text "aaaaaaaaa" starting at index 0 and ending at index 9.

With the same input string, this test finds only one match, because the nine "a"s in a row satisfy the requirement of "at least" three "a"s. Finally, to specify an upper limit on the number of occurrences, add a second number inside the braces:

    Regex: a{3,6}
    String to search: aaaaaaaaa
      Found the text "aaaaaa" starting at index 0 and ending at index 6.
      Found the text "aaa" starting at index 6 and ending at index 9.

Here, the first match is forced to stop at the upper limit of 6 characters. The second match contains whatever is left over, which happens to be three "a"s, the minimum number of characters allowed for this match. If the input string were one character shorter, there would be no second match, since only two "a"s would remain.
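The effect of the three brace forms on a run of nine "a"s can be summarized by looking only at the length of each successive match. The sketch below is illustrative: BoundedQuantifierDemo and matchLengths are hypothetical names, not part of the test program shown earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original article.
class BoundedQuantifierDemo {

    // Returns the lengths of successive matches joined by commas,
    // or an empty string when nothing matches.
    static String matchLengths(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(m.end() - m.start());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nine = "aaaaaaaaa";
        System.out.println("a{3}   -> " + matchLengths("a{3}", nine));   // exactly three
        System.out.println("a{3,}  -> " + matchLengths("a{3,}", nine));  // at least three
        System.out.println("a{3,6} -> " + matchLengths("a{3,6}", nine)); // three to six
        // One character shorter: only two "a"s remain after the first match.
        System.out.println("a{3,6} -> " + matchLengths("a{3,6}", "aaaaaaaa"));
    }
}
```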

Quantifiers with capturing groups and character classes

Until now, quantifiers have only been tested on input strings containing a single character. In fact, a quantifier attaches to only one character at a time, so the regular expression abc+ means "a, followed by b, followed by c one or more times". It does not mean "abc" one or more times. However, quantifiers can also be attached to character classes and capturing groups: [abc]+ means a or b or c, one or more times, while (abc)+ means the group "abc", one or more times.
To illustrate, let's require the group (dog) to appear three times in a row.

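    Regex: (dog){3}
    String to search: dogdogdogdogdogdog
      Found the text "dogdogdog" starting at index 0 and ending at index 9.
      Found the text "dogdogdog" starting at index 9 and ending at index 18.

    Regex: dog{3}
    String to search: dogdogdogdogdogdog
    No match found.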

The first example finds two matches, each made up of three consecutive "dog" groups, because the quantifier applies to the entire capturing group. Remove the parentheses, however, and the quantifier {3} applies only to the letter "g", so the match fails. Similarly, a quantifier can be applied to an entire character class:

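    Regex: [abc]{3}
    String to search: abccabaaaccbbbc
      Found the text "abc" starting at index 0 and ending at index 3.
      Found the text "cab" starting at index 3 and ending at index 6.
      Found the text "aaa" starting at index 6 and ending at index 9.
      Found the text "ccb" starting at index 9 and ending at index 12.
      Found the text "bbc" starting at index 12 and ending at index 15.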

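To make the scope of a quantifier concrete, the following sketch checks whether a whole string matches a pattern, using Pattern.matches from java.util.regex. The class name QuantifierScopeDemo and the helper matchesWhole are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original article.
class QuantifierScopeDemo {

    // True when the ENTIRE input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The quantifier binds to the single preceding character...
        System.out.println(matchesWhole("abc+", "abccc"));       // only "c" repeats
        System.out.println(matchesWhole("abc+", "abcabc"));      // "abc" as a unit does not repeat
        // ...unless a group or a character class widens its scope.
        System.out.println(matchesWhole("(abc)+", "abcabc"));    // the group repeats
        System.out.println(matchesWhole("[abc]+", "cabbac"));    // any of a, b, c repeats
        System.out.println(matchesWhole("dog{3}", "doggg"));     // "do" followed by three "g"
        System.out.println(matchesWhole("(dog){3}", "dogdogdog"));
    }
}
```

Whole-string matching is used here instead of find() so that each line answers a single yes-or-no question about what the quantifier applies to.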
Differences among greedy, reluctant, and possessive quantifiers

There are subtle differences among greedy, reluctant, and possessive quantifiers.
Greedy quantifiers are considered "greedy" because they force the matcher to read in, or eat, the entire input string before attempting the first match. If that first attempt (on the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it tries to match against is 1 or 0 characters.
Reluctant quantifiers take the opposite approach: they start at the beginning of the input string and then reluctantly eat one character at a time while looking for a match. The last thing they try is the entire input string.
Finally, possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.
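The three strategies can be mimicked by hand for the simple case of "anything, then a literal suffix". The sketch below is a deliberately simplified model and not how java.util.regex is implemented: it only handles patterns of the shape .*SUFFIX, .*?SUFFIX and .*+SUFFIX anchored at index 0, and it assumes the input contains no line terminators. The class and method names are hypothetical.

```java
// Simplified, hand-written model of the three strategies for a pattern of the
// form "anything" followed by a literal suffix. Hypothetical names; this is an
// illustration of the idea, not the real regex engine.
class BacktrackingSketch {

    // Greedy: consume the whole input first, then give back one character
    // at a time until the suffix fits. Returns the end index of the match
    // that starts at index 0, or -1 if there is none.
    static int greedyEnd(String input, String suffix) {
        for (int consumed = input.length(); consumed >= 0; consumed--) {
            if (input.startsWith(suffix, consumed)) {
                return consumed + suffix.length();
            }
        }
        return -1;
    }

    // Reluctant: consume nothing first, then one more character per attempt.
    static int reluctantEnd(String input, String suffix) {
        for (int consumed = 0; consumed <= input.length(); consumed++) {
            if (input.startsWith(suffix, consumed)) {
                return consumed + suffix.length();
            }
        }
        return -1;
    }

    // Possessive: consume the whole input and never give anything back.
    static int possessiveEnd(String input, String suffix) {
        int consumed = input.length();
        return input.startsWith(suffix, consumed) ? consumed + suffix.length() : -1;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println("greedy ends at     " + greedyEnd(input, "foo"));
        System.out.println("reluctant ends at  " + reluctantEnd(input, "foo"));
        System.out.println("possessive ends at " + possessiveEnd(input, "foo"));
    }
}
```

The only difference between the greedy and reluctant methods is the direction of the loop, and the possessive method is the greedy loop cut off after its first attempt.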
To illustrate, consider the input string xfooxxxxxxfoo.

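    Regex: .*foo
    String to search: xfooxxxxxxfoo
      Found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

    Regex: .*?foo
    String to search: xfooxxxxxxfoo
      Found the text "xfoo" starting at index 0 and ending at index 4.
      Found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

    Regex: .*+foo
    String to search: xfooxxxxxxfoo
    No match found.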

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed. So the matcher slowly backs off one letter at a time until the rightmost occurrence of "foo" has been given back, at which point the match succeeds and the search ends.
The second example, however, uses a reluctant quantifier, so it starts by consuming "nothing". Because "foo" does not appear at the beginning of the string, the matcher is forced to swallow the first letter (an "x"), which triggers the first match at 0 and 4. The test program continues the process until the input string is exhausted, and finds another match at 4 and 13.
The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+, leaving nothing to satisfy the "foo" at the end of the expression.
Use a possessive quantifier when you want to seize all of something without ever backing off; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.
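As a rough guide to when a possessive quantifier is safe, the sketch below compares the first match produced by each kind of quantifier. QuantifierKindsDemo and firstMatch are hypothetical names for this illustration. The last case, x*+foo, is an added example that does not appear in the text above: it succeeds because nothing that x*+ swallows could also be needed by the foo that follows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original article.
class QuantifierKindsDemo {

    // Returns the text of the first match, or null when there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println("greedy     .*foo  -> " + firstMatch(".*foo", input));
        System.out.println("reluctant  .*?foo -> " + firstMatch(".*?foo", input));
        System.out.println("possessive .*+foo -> " + firstMatch(".*+foo", input));
        // Possessive still works when the swallowed characters cannot
        // overlap with the rest of the pattern.
        System.out.println("possessive x*+foo -> " + firstMatch("x*+foo", "xxxfoo"));
    }
}
```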


