Comprehensive Analysis of Linux Regular Expressions (4)

Source: Internet
Author: User
Tags character classes expression engine

This installment continues the discussion of subpatterns in Linux regular expressions. The previous article covered back references and, along the way, touched on quantifiers and on greedy versus ungreedy matching. Those topics are described in detail here.

Linux Regular Expression: named subpatterns

Some tools (such as Python) let you give a subpattern a name and then refer back to it by that name. In Python, regular expressions are used through function or method calls, so the syntax differs considerably from the examples here. If you are interested, check the documentation of your own tool to see whether named subpatterns are supported.
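As a small illustration of the idea, here is a sketch in Python, whose re module writes a named subpattern as (?P<name>...) and a back reference to it as (?P=name). The group name "ch" and the sample string are arbitrary choices made for this example.

```python
import re

# (?P<ch>[abc]) names the subpattern "ch"; (?P=ch) must then match
# exactly the same text that "ch" matched.
doubled = re.compile(r"(?P<ch>[abc])(?P=ch)")

m = doubled.search("xxabbcx")
print(m.group(0))     # the whole match
print(m.group("ch"))  # the text captured by the named subpattern
```

The named group can still be read by number (m.group(1)), so naming is a convenience layered on top of ordinary capturing.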
Repetition and quantifiers
In the previous section on back references we already met quantifiers. For example, the earlier pattern /([abc]){3}/ matches three consecutive characters, each of which must be one of "a", "b" or "c". In this pattern, {3} is a quantifier: it specifies how many times the preceding item is repeated.
Quantifiers can be placed after the following items:
● A single character (possibly an escaped one, such as \xhh)
● The "." metacharacter
● A character class written in square brackets
● A back reference
● A subpattern defined by parentheses (unless it is an assertion, which is introduced later)
The general form of a quantifier is two numbers separated by a comma and enclosed in curly braces: {min,max}. For example, /z{2,4}/ matches "zz", "zzz" or "zzzz". The maximum can be omitted while keeping the comma: /\d{3,}/ matches three or more digits, with no upper limit. If the comma is omitted as well, as in /\d{3}/, the pattern matches exactly three digits. When curly braces appear where a quantifier is not allowed, or their contents do not follow the syntax above, they stand for literal braces and have no special meaning. For example, {,6} is not a quantifier; it simply represents those four literal characters.
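The three forms can be checked with a short sketch in Python. The helper name whole() is made up for this example; it uses re.fullmatch so that a pattern has to cover the entire string rather than just part of it.

```python
import re

def whole(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# {2,4}: between two and four repetitions
z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if whole(r"z{2,4}", s)]
print(z_hits)

# {3,}: three or more repetitions, no upper limit
print(whole(r"\d{3,}", "12"), whole(r"\d{3,}", "1234567"))

# {3}: exactly three repetitions
print(whole(r"\d{3}", "123"), whole(r"\d{3}", "1234"))
```

One caveat worth checking in your own tool: the treatment of {,6} is engine-specific. Python's re module, for instance, reads it as {0,6} rather than as literal characters.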
For convenience, the three most common quantifiers have their single-character abbreviations. Their meanings are as follows:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
This is also the meaning of the above three metacharacters as quantifiers.
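The equivalences can be spot-checked by running each shorthand and its brace form against the same inputs. This is only a sketch over a handful of hand-picked sample strings, not a proof; the patterns and samples are chosen for illustration.

```python
import re

# Each pair: (shorthand form, explicit brace form)
pairs = [
    (r"ab*c", r"ab{0,}c"),
    (r"ab+c", r"ab{1,}c"),
    (r"ab?c", r"ab{0,1}c"),
]
samples = ["ac", "abc", "abbc", "abbbbc", "a", "bc"]

def matches(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

for short, brace in pairs:
    print(short, brace, matches(short) == matches(brace))
```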
When using quantifiers, especially those with no upper limit, take care not to create an infinite loop by repeating a subpattern that can match the empty string, for example /(a?)*/. Some regular expression tools report a compile-time error for this. Others accept the construct, but there is no guarantee that every tool handles it well.
"Greedy" and "ungreedy" quantifier matching
When using a pattern with quantifiers, we often find that the same pattern can match the same target string in more than one way. For example, /\d{0,1}\d/ matches one or two decimal digits. If the target string is "123", the pattern matches "1" when the quantifier takes its lower limit 0, and "12" when it takes its upper limit 1. Both results are valid matches. If we turn the pattern into a subpattern, /(\d{0,1}\d)/, will the captured result \1 be "1" or "12"?
The actual result is generally the latter, because by default most regular expression tools match according to the "greedy" principle. As the name suggests, greedy matching means that, within the limits of the quantifier and as long as the rest of the pattern can still match, the quantified item is repeated as many times as possible. A simple example makes this easier to follow.
Match the pattern /(\d{1,5})\d/ against the string "12345". The pattern means one to five digits followed by one more digit. The whole pattern can match when the quantifier takes any value from 1 to 4, so \1 could be "1", "12", "123" or "1234". Under greedy matching the quantifier takes the largest value that works, so the final result is "1234".
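The greedy behaviour is easy to observe directly. The sketch below uses Python's re module, which is greedy by default like most engines; the variable names are chosen only for this example.

```python
import re

# One to five digits captured, then one more digit.
m = re.match(r"(\d{1,5})\d", "12345")
captured = m.group(1)
print(captured)   # the quantifier takes as many digits as it can
print(m.group(0)) # the whole match

# Zero or one digit, then one more digit, against "123".
short = re.match(r"(\d{0,1}\d)", "123").group(1)
print(short)
```

In the first match the quantifier first takes all five digits, finds nothing left for the final \d, and gives one back, which is why the capture is four digits long.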
In most cases this is what we want, but not always. Suppose we want to extract the comments from C source code (in C, a comment is placed between /* and */). The obvious regular expression is /\/\*.*\*\//, but the result is quite different from what we need. When the engine reaches ".*", the "." can stand for any character, including the characters of the "*/" that is supposed to end the match. The quantifier therefore keeps consuming text past the next "*/", and the match ends only at the last "*/" in the text, which is clearly not the result we want.
To get the match we want in the example above, regular expressions provide ungreedy matching, the opposite of greedy: the quantifier takes the smallest number of repetitions that still lets the whole pattern match. A quantifier is made ungreedy by placing a question mark "?" after it. For the C comment example we write /\/\*.*?\*\//, adding a question mark after the quantifier "*", and get the desired result. Likewise, if the earlier pattern /(\d{1,5})\d/, matched against "12345", is rewritten in ungreedy form as /(\d{1,5}?)\d/, the value of \1 is "1".
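The difference between the two modes can be seen side by side in a short Python sketch. The sample C fragment is invented for this example, and it keeps both comments on one line because "." does not match a newline in Python unless the re.DOTALL flag is given.

```python
import re

source = "int a; /* first */ int b; /* second */ int c;"

# Greedy: ".*" runs on to the last "*/" in the text.
greedy = re.search(r"/\*.*\*/", source).group(0)
print(greedy)

# Ungreedy: ".*?" stops at the first "*/" it can.
lazy = re.findall(r"/\*.*?\*/", source)
print(lazy)

# The digit example in ungreedy form: the capture is as short as possible.
first = re.match(r"(\d{1,5}?)\d", "12345").group(1)
print(first)
```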
The explanation above is not entirely precise. A question mark after a quantifier actually reverses the current greediness setting of the regular expression. In tools that support the pattern modifier "U", you can use it to make the whole expression ungreedy by default, and a question mark after a quantifier in the pattern then switches that quantifier back to greedy.

Linux Regular Expression: once-only subpatterns

Another interesting topic related to quantifiers is the once-only subpattern. To understand the concept, you first need to understand how a regular expression containing quantifiers is matched. Here is an example.
Suppose we use the pattern /\d+foo/ to match the string "123456bar". The result is, of course, no match. But how does the regular expression engine get there? It first looks at \d+, which means one or more digits, and finds that the first character "1" of the target string fits. It then keeps repeating the match as the quantifier allows, until "123456" has been consumed and the character "b" can no longer match \d+. The engine then moves on to the rest of the pattern, "foo", and finds that it does not match the remaining "bar". This is where things get interesting. The engine backtracks into \d+ and reduces the number of repetitions by one to see whether the rest can then match. Now \d+ covers "12345", and the engine checks whether the remaining "6bar" matches "foo". It does not, so the repetition count is reduced by one again, and so on until the minimum is reached. If the target string still cannot be matched, the engine reports no match.
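The backtracking steps described above can be made visible with a hand-written trace. This is a deliberately simplified sketch, not how any real engine is implemented: the function name trace_digits_then_foo is hypothetical, it hard-codes the pattern \d+foo, and it only tries matching from the start of the string, whereas a real engine would typically also retry from later starting positions.

```python
def trace_digits_then_foo(text):
    """List the (digits taken, remainder, remainder matches 'foo') states tried."""
    # Greedy phase: take as many leading digits as possible.
    n = 0
    while n < len(text) and text[n].isdigit():
        n += 1

    steps = []
    # Backtracking phase: give back one digit at a time, down to the minimum of 1.
    for taken in range(n, 0, -1):
        rest = text[taken:]
        ok = rest.startswith("foo")
        steps.append((text[:taken], rest, ok))
        if ok:
            break
    return steps

for step in trace_digits_then_foo("123456bar"):
    print(step)
```

For "123456bar" the trace has six failed states, one for each possible length of the digit run, before the attempt is abandoned.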
Now we can turn to the once-only subpattern. A once-only subpattern is one that the engine does not backtrack into in the way just described. It is written with a question mark and a greater-than sign immediately after the opening parenthesis: (?> ... ). Rewritten with a once-only subpattern, the example above becomes:
/(?>\d+)foo/. In this case, when the engine finds that "bar" does not match, it gives up on this attempt immediately without going through the backtracking process described above.
Note that a once-only subpattern is a non-capturing subpattern, so its match cannot be used in a back reference.
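The effect of refusing to backtrack can be sketched in Python. Python's re module accepts (?>...) directly only from version 3.11, so this example assumes nothing about the version and instead uses a well-known emulation: a lookahead that captures the digits, followed by a back reference that consumes them, (?=(\d+))\1. Unlike a true once-only subpattern, the emulation does use a capturing group; that is a side effect of the workaround, not a property of once-only subpatterns.

```python
import re

# Ordinary quantifier: \d+ can give back a digit so the final \d matches.
plain = re.compile(r"\d+\d")

# Emulated once-only group: the digits are taken in one piece and never given back.
atomic_like = re.compile(r"(?=(\d+))\1\d")

print(plain.search("123").group(0))  # backtracking lets this succeed
print(atomic_like.search("123"))     # no digit is left over, so this fails

# The article's example, in emulated once-only form.
foo_atomic = re.compile(r"(?=(\d+))\1foo")
print(foo_atomic.search("123foo").group(0))
print(foo_atomic.search("123456bar"))
```

The pair plain / atomic_like shows that a once-only subpattern can change the result, not just the speed: the second pattern can never match, because the digit run always swallows every digit available.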
When a subpattern that contains an unbounded repeat is itself repeated without an upper limit, a once-only subpattern is the only way to keep your program from waiting a very long time. Take the pattern /(\D+|<\d+>)*[!?]/, which means any number of runs of non-digit characters or runs of digits enclosed in angle brackets, followed by an exclamation mark or a question mark. If you match it against a long string of "a" characters, such as "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", you will wait a long time before it finally reports no match. The string can be divided between the two repeats in a great many ways, and every combination of repetition counts for the quantifier inside the subpattern and the quantifier on the subpattern itself has to be tried, which greatly increases the amount of work. If you rewrite the pattern with a once-only subpattern as /((?>\D+)|<\d+>)*[!?]/, the result is returned quickly.
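The slowdown can be reproduced on a small scale in Python. This is a rough sketch under two assumptions: the subject is only 20 characters long so that the slow case still finishes quickly, and the once-only subpattern is emulated with the lookahead-plus-back-reference idiom (?=(\D+))\1, because (?>...) itself is accepted by Python's re module only from version 3.11. The helper name timed() is made up for this example, and the timings it prints will vary from machine to machine.

```python
import re
import time

# Nested unbounded repeats: the run of "a"s can be split in exponentially many ways.
slow = re.compile(r"(\D+|<\d+>)*[!?]")

# Emulated once-only subpattern: each run of non-digits is taken in one piece.
fast = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

def timed(pattern, text):
    """Return (search result, elapsed seconds) for one search."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start

subject = "a" * 20
slow_result, slow_seconds = timed(slow, subject)
fast_result, fast_seconds = timed(fast, subject)

print("plain:    ", slow_result, f"{slow_seconds:.4f}s")
print("once-only:", fast_result, f"{fast_seconds:.4f}s")
```

Both searches report no match. With the plain pattern, each extra character in the subject roughly doubles the work, so lengthening the string by even a few characters makes the gap much more visible.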

