Some advanced rules in Regular Expressions

Source: Internet
Author: User
2. Some advanced rules in regular expressions

2.1 Greedy and non-greedy matching counts

When you use a special symbol that modifies the number of matches, several notations let the same expression match a varying number of times, such as "{m,n}", "{m,}", "?", "*", and "+". The actual number of matches depends on the string being matched. During matching, an expression with such an indefinite repeat count always matches as many times as possible. For example, for the text "dxxxdxxxd":

Expression: (d)(\w+)
Matching result: "\w+" matches all the characters after the first "d", that is, "xxxdxxxd".

Expression: (d)(\w+)(d)
Matching result: "\w+" matches all the characters between the first "d" and the last "d", that is, "xxxdxxx". Although "\w+" could also match the last "d", it gives up that final "d" so that the whole expression can match.

As this shows, "\w+" always matches as many characters as its rule allows. In the second example it leaves the last "d" unmatched, but only so that the whole expression can succeed. Expressions with "*" and "{m,n}" likewise match as much as possible, and an expression with "?" chooses to match whenever it could either match or not match. This matching principle is called "greedy" mode.
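The greedy behaviour described above can be checked with a short sketch. It assumes Python's re module, which the article itself does not use; the variable names are illustrative only.

```python
import re

text = "dxxxdxxxd"

# Greedy "\w+" takes every word character after the first "d".
greedy_all = re.search(r"(d)(\w+)", text).group(2)

# With a trailing "(d)", "\w+" gives back one character so that
# the final group still has a "d" left to match.
greedy_backoff = re.search(r"(d)(\w+)(d)", text).group(2)

print(greedy_all)      # xxxdxxxd
print(greedy_backoff)  # xxxdxxx
```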

Non-greedy mode:
Appending "?" to a repetition symbol makes the expression match as few times as possible, and makes an expression that could either match or not match prefer not to match. This matching principle is called "non-greedy" mode, also known as "reluctant" mode. If matching fewer characters would cause the whole expression to fail, then, as in greedy mode, non-greedy mode matches just enough more for the whole expression to succeed. For example, for the text "dxxxdxxxd":

Expression: (d)(\w+?)
Matching result: "\w+?" matches as few characters as possible after the first "d". As a result, "\w+?" matches only one "x".

Expression: (d)(\w+?)(d)
Matching result: For the whole expression to succeed, "\w+?" has to match "xxx" so that the trailing "d" can match. As a result, "\w+?" matches "xxx".
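The two non-greedy cases can be reproduced with the following sketch, again assuming Python's re module rather than any engine named by the article; the variable names are illustrative.

```python
import re

text = "dxxxdxxxd"

# Non-greedy "\w+?" stops after the minimum: a single "x".
lazy_min = re.search(r"(d)(\w+?)", text).group(2)

# The trailing "(d)" forces "\w+?" to extend until a "d" follows.
lazy_needed = re.search(r"(d)(\w+?)(d)", text).group(2)

print(lazy_min)     # x
print(lazy_needed)  # xxx
```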

More examples:
Example 1: When the expression "<td>(.*)</td>" is matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>", the match succeeds and the matched content is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>"; the "</td>" in the expression matches the last "</td>" in the string.
Example 2: By contrast, when the expression "<td>(.*?)</td>" is matched against the same string as in Example 1, only "<td><p>aa</p></td>" is obtained. Matching again finds the second one, "<td><p>bb</p></td>".
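A small sketch of Examples 1 and 2, assuming Python's re module and lowercase tags in the sample string; it collects every whole match so the greedy and non-greedy results can be compared side by side.

```python
import re

html = "<td><p>aa</p></td> <td><p>bb</p></td>"

# Greedy ".*" runs to the last "</td>", so there is a single match.
greedy_matches = [m.group(0) for m in re.finditer(r"<td>(.*)</td>", html)]

# Non-greedy ".*?" stops at the first "</td>", so each cell matches separately.
lazy_matches = [m.group(0) for m in re.finditer(r"<td>(.*?)</td>", html)]

print(greedy_matches)  # ['<td><p>aa</p></td> <td><p>bb</p></td>']
print(lazy_matches)    # ['<td><p>aa</p></td>', '<td><p>bb</p></td>']
```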

2.2 Backreferences \1, \2...

When an expression matches, the regex engine records the strings matched by each subexpression enclosed in parentheses "( )". When the result is retrieved, those strings can be obtained separately, as the earlier examples have already shown. In practice, when some boundary is used to locate the text but the desired content excludes that boundary, parentheses must be used to mark the desired range, as in the earlier "<td>(.*?)</td>".
In fact, the string matched by a parenthesized subexpression can be used not only after matching finishes but also during matching: a later part of the expression can refer back to the string that an earlier parenthesized subexpression has already matched. The reference is written as "\" followed by a number: "\1" refers to the string matched by the 1st pair of parentheses, "\2" to the string matched by the 2nd pair, and so on. If one pair of parentheses contains another, the outer pair is numbered first. In other words, pairs are numbered in the order in which their left parentheses "(" appear.
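The numbering rule for nested parentheses can be illustrated with a sketch in Python's re module (an assumption; the article names no engine here). The pattern "((a)(b))(c)" and the input "abc" are made up for this illustration.

```python
import re

# Groups are numbered by the position of their left parenthesis:
# 1 = ((a)(b)), 2 = (a), 3 = (b), 4 = (c)
m = re.match(r"((a)(b))(c)", "abc")
numbered_groups = m.groups()

print(numbered_groups)  # ('ab', 'a', 'b', 'c')
```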

Examples:
Example 1: When the expression "('|")(.*?)(\1)" is matched against "'hello', "world"", the match succeeds and the matched content is "'hello'". Matching again finds ""world"".
Example 2: When the expression "(\w)\1{4,}" is matched against "aa bbbb abcdefg ccccc 111121111 999999999", the match succeeds and the matched content is "ccccc". Matching again finds "999999999". This expression requires one character in the "\w" range to be repeated at least 5 times. Note the difference from "\w{5,}", which accepts any 5 or more word characters, not necessarily the same one.
Example 3: When the expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>" is matched against "<td id='td1' style="bgcolor:white"></td>", the match succeeds. If the opening "<td>" and the closing "</td>" are not paired, the match fails; if both are changed to another matching pair, the match still succeeds.
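The three backreference examples can be tried with the sketch below. It assumes Python's re module, and the mismatched "</tr>" input is an invented test string, not one from the article.

```python
import re

# Example 1: the closing quote must equal the opening quote (\1).
quoted = [m.group(0)
          for m in re.finditer(r"""('|")(.*?)(\1)""", """'hello', "world\"""")]

# Example 2: one word character followed by at least 4 copies of itself.
sample = "aa bbbb abcdefg ccccc 111121111 999999999"
repeats = [m.group(0) for m in re.finditer(r"(\w)\1{4,}", sample)]

# Example 3: the closing tag name must equal the opening tag name (\1),
# and each attribute value must close with the quote it opened with (\4).
tag = r"""<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>"""
paired = re.search(tag, """<td id='td1' style="bgcolor:white"></td>""") is not None
unpaired = re.search(tag, """<td id='td1' style="bgcolor:white"></tr>""") is not None

print(quoted)    # ["'hello'", '"world"']
print(repeats)   # ['ccccc', '999999999']
print(paired, unpaired)  # True False
```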

2.3 Pre-search (lookahead) and reverse pre-search (lookbehind): conditions that match no characters

In the previous chapter, I covered several special symbols with abstract meanings: "^", "$", and "\b". They have one thing in common: they match no characters themselves and only attach a condition to "the two ends of the string" or "the gap between characters". With that concept in mind, this section introduces another, more flexible way of attaching conditions to "the two ends" or "the gaps".

Forward pre-search: "(?=xxxxx)", "(?!xxxxx)"
Format: "(?=xxxxx)". In the string being matched, it attaches a condition to a "gap" or to one of the "two ends": the text to the right of the gap must be able to match the expression xxxxx. Because it serves only as a condition on that gap, it does not prevent the expressions that follow from actually matching the characters after the gap. This is similar to "\b", which matches no characters itself: "\b" merely inspects the characters before and after the gap and does not affect how the following expressions actually match.
Example 1: When the expression "Windows (?=NT|XP)" is matched against "Windows 98, Windows NT, Windows 2000", only the "Windows " in "Windows NT" is matched; the other occurrences of "Windows " are not.
Example 2: When the expression "(\w)((?=\1\1\1)(\1))+" is matched against the string "aaa ffffff 999999999", it matches the first 4 of the 6 "f" characters and the first 7 of the 9 "9" characters. The expression can be read as: for a letter or digit repeated 4 or more times, match everything before its last 2 characters. There are of course other ways to write this; it is written this way purely for demonstration.
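Both forward pre-search examples can be verified with a short sketch, assuming Python's re module; the variable names are illustrative.

```python
import re

# Example 1: "Windows " matches only where "NT" or "XP" follows,
# and the lookahead itself consumes nothing.
text = "Windows 98, Windows NT, Windows 2000"
m = re.search(r"Windows (?=NT|XP)", text)
matched, start = m.group(0), m.start()
following = text[m.end():m.end() + 2]  # still available to later patterns

# Example 2: keep consuming a repeated character only while
# three more copies of it still lie ahead.
runs = [r.group(0)
        for r in re.finditer(r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]

print(repr(matched), start, following)  # 'Windows ' 12 NT
print(runs)                             # ['ffff', '9999999']
```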

Format: "(?!xxxxx)". The text to the right of the gap must not match the expression xxxxx.
Example 3: When the expression "((?!\bstop\b).)+" is matched against "fdjka ljfdl stop fjdsla fdj", it matches from the beginning up to the position just before "stop"; if the string contains no "stop", the whole string is matched.
Example 4: When the expression "do(?!\w)" is matched against the string "done, do, dog", only the standalone "do" is matched. In this example, "(?!\w)" after "do" has the same effect as "\b".
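The negative forward pre-search examples can be sketched as follows, assuming Python's re module; the second input string without "stop" is made up for comparison.

```python
import re

pattern = r"((?!\bstop\b).)+"

# Example 3: consume characters one by one as long as the word
# "stop" does not begin at the current position.
before_stop = re.match(pattern, "fdjka ljfdl stop fjdsla fdj").group(0)
whole = re.match(pattern, "fdjka ljfdl fjdsla fdj").group(0)

# Example 4: "do" not followed by a word character; compare with "\b".
words = "done, do, dog"
with_lookahead = [m.start() for m in re.finditer(r"do(?!\w)", words)]
with_boundary = [m.start() for m in re.finditer(r"do\b", words)]

print(repr(before_stop))              # 'fdjka ljfdl '
print(repr(whole))                    # 'fdjka ljfdl fjdsla fdj'
print(with_lookahead, with_boundary)  # [6] [6]
```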

Reverse pre-search: "(?<=xxxxx)", "(?<!xxxxx)"
These two formats work like forward pre-search, except that the condition applies to the left of the gap rather than the right: the text on the left must match the specified expression for "(?<=xxxxx)", or must not match it for "(?<!xxxxx)". Like forward pre-search, they are conditions attached to the gap and match no characters themselves.
Example 5: When the expression "(?<=\d{4})\d+(?=\d{4})" is matched against "1234567890123456", it matches the middle 8 digits, excluding the first 4 and the last 4. Because JScript.RegExp does not support reverse pre-search, this example cannot be demonstrated here. Many other engines do support it, such as the java.util.regex package in Java 1.4 and above, the System.Text.RegularExpressions namespace in .NET, and the DEELX regex engine, which this site recommends as the simplest and easiest to use.
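Example 5 can be tried in an engine that supports reverse pre-search. The sketch below assumes Python's re module, which accepts lookbehind as long as the pattern inside it has a fixed width, as "\d{4}" does.

```python
import re

digits = "1234567890123456"

# At least 4 digits must sit to the left and 4 to the right,
# but neither condition consumes any characters.
middle = re.search(r"(?<=\d{4})\d+(?=\d{4})", digits).group(0)

print(middle)  # 56789012
```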
