Regular Expression Quick Start (2)
[Introduction] This article introduces subpatterns, back references, and quantifiers.
The previous article introduced the pattern modifiers and metacharacters of regular expressions. Readers may have found that part very simple, with few practical examples. This is mainly because existing regular expression material on the Internet already covers that part in detail and with many examples; if you are not yet comfortable with it, you can refer to that material. This series aims to cover as many advanced regular expression features as possible.
This article introduces subpatterns, back references, and quantifiers, with a focus on some extended uses of these concepts: for example, non-capturing subpatterns, and greedy versus ungreedy matching of quantifiers.
Subpatterns and back references
A regular expression can contain several subpatterns. A subpattern is delimited by parentheses and can be nested; this is the role of the two metacharacters "(" and ")". A subpattern serves the following purposes:
1. It localizes a set of alternatives.
For example, the pattern cat(aract|erpillar|) matches one of "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar", or the empty string.
2. It sets the subpattern up as a capturing subpattern (as in the example above). When the whole pattern matches, the part of the target string that matched the subpattern can be retrieved through a back reference. Capturing subpatterns are numbered by counting opening parentheses from left to right, starting from 1.
Note that subpatterns can be nested. For example, if the string "the red king" is matched against the pattern /the ((red|white) (king|queen))/, the captured substrings are "red king", "red", and "king", numbered 1, 2, and 3. You can refer to them as "\1", "\2", and "\3". "\1" contains both "\2" and "\3"; the numbers are determined by the order of the opening parentheses.
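Both roles of a subpattern can be tried out with a short sketch. Python's re module is used here purely for illustration (it takes the pattern as a string, without the /.../ delimiters used in this article), and the variable names are made up for the example.

```python
import re

# Role 1: the parentheses limit the alternation to the part after "cat".
grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

grouped_hits = [w for w in ("cat", "cataract", "caterpillar") if grouped.fullmatch(w)]
ungrouped_hits = [w for w in ("cat", "cataract", "erpillar", "") if ungrouped.fullmatch(w)]

# Role 2: groups are numbered by their opening parenthesis, left to right.
nested = re.search(r"the ((red|white) (king|queen))", "the red king")
captures = nested.groups()

print(grouped_hits)    # ['cat', 'cataract', 'caterpillar']
print(ungrouped_hits)  # ['cataract', 'erpillar', '']
print(captures)        # ('red king', 'red', 'king')
```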
In some older Linux/Unix tools, the parentheses of a subpattern must be escaped with backslashes, as in \(subpattern\). Most modern tools no longer require this, and the examples in this article are not escaped.
Non-capturing subpatterns
Using one pair of parentheses for both of the purposes above can cause problems. For example, in some tools the number of back references is limited (often to no more than 9), and you will frequently need subpatterns that do not have to be captured. In that case, put a question mark and a colon immediately after the opening parenthesis to mark the subpattern as non-capturing, as in /the ((?:red|white) (king|queen))/.
If "the white queen" is matched against this pattern, the captured substrings are "white queen" and "queen", numbered 1 and 2. Although "white" matches the subpattern "(?:red|white)", it is not captured.
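The effect on numbering can be checked side by side. This is a minimal sketch using Python's re module, with illustrative variable names; the same string is matched once with a capturing and once with a non-capturing inner group.

```python
import re

capturing = re.search(r"the ((red|white) (king|queen))", "the white queen")
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")

print(capturing.groups())      # ('white queen', 'white', 'queen')
print(non_capturing.groups())  # ('white queen', 'queen')
```

With the "?:" prefix, "queen" moves from number 3 to number 2 because the colour group no longer takes a number.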
We have already introduced the syntax that uses parentheses and a question mark to set a pattern modifier. As a convenience, if you need a pattern modifier at the start of a non-capturing subpattern, you can place it directly between the question mark and the colon. For example, the following two patterns are equivalent:
/(?i:saturday|sunday)/ and /(?:(?i)saturday|sunday)/.
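A modifier written between the question mark and the colon applies only inside that subpattern. The sketch below uses Python's re module, which accepts this combined form; as far as I know, recent Python versions reject a modifier placed in the middle of a pattern, so only the first spelling is shown. The leading "on " is an addition for this example, to make the limited scope visible.

```python
import re

# Only the alternatives inside (?i:...) ignore case; "on " stays case-sensitive.
day = re.compile(r"on (?i:saturday|sunday)")

upper_day = day.search("on SUNDAY") is not None
upper_prefix = day.search("ON sunday") is not None

print(upper_day)     # True
print(upper_prefix)  # False
```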
Back references
When we introduced the backslash, we mentioned that it is also used for back references. Outside a character class, a backslash followed by a decimal number greater than 0 is most likely a back reference. As the name suggests, it refers back to a subpattern that has already been captured; the number is the position of the corresponding opening parenthesis in the pattern. We saw an example when introducing subpatterns, where "\1", "\2", and "\3" stand for the content captured by the subpatterns opened by the first, second, and third parentheses.
Note that when the number after the backslash is less than 10, it is always taken as a back reference, and the reference may even appear before the opening parenthesis it refers to. No error is reported as long as the pattern as a whole contains that many capturing subpatterns. For example, take the pattern from the subpattern section, /the ((red|white) (king|queen))/, which captures "red king", "red", and "king" as 1, 2, and 3 from "the red king". If the pattern is changed to /\3, the ((red|white) (king|queen))/ and the target string to "king, the red king", the pattern is still accepted, because it contains three capturing subpatterns. How such a forward reference behaves during matching differs between tools, however: some reject it, and in others a reference to a subpattern that has not captured anything yet simply fails to match. The safe practice is to use a back reference only after the subpattern it refers to.
Note that the value of a back reference is the piece of the target string that the subpattern actually captured, not the subpattern itself. For example, /(sens|respons)e and \1ibility/ matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". When the referenced subpattern is followed by a quantifier and therefore matches several times, the back reference holds the value of the last repetition. For example, when /([abc]){3}/ matches the string "abc", the value of "\1" is the last matched character, "c".
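The two points about back reference values can be verified directly. This is a small sketch with Python's re module; the variable names are illustrative.

```python
import re

# \1 must repeat the text that group 1 captured, not just match the same subpattern.
same_stem = re.compile(r"(sens|respons)e and \1ibility")

sensibility = same_stem.fullmatch("sense and sensibility") is not None
responsibility = same_stem.fullmatch("response and responsibility") is not None
mixed = same_stem.fullmatch("sense and responsibility") is not None
print(sensibility, responsibility, mixed)  # True True False

# A repeated group keeps only the text of its last repetition.
repeated = re.fullmatch(r"([abc]){3}", "abc")
print(repeated.group(1))  # c
```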
Named subpatterns
Some tools (such as Python) let you give a capturing subpattern a name and then refer to it by that name in a back reference. In Python, regular expressions are used through function and method calls, and the syntax differs somewhat from the examples here. If you are interested, check the documentation of your own tool to see whether named subpatterns are supported.
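As a sketch of the Python syntax mentioned above: (?P<name>...) defines a named group and (?P=name) refers back to it. The group name "word" and the sample sentence are made up for this example, which looks for a word that is immediately repeated.

```python
import re

# (?P<word>...) names the group; (?P=word) must match the same text again.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

m = doubled.search("this is is a test")
print(m.group("word"))  # is
print(m.group(1))       # is
```

A named group still receives a number, so the same text is available under both the name and the position.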
Repetition and quantifiers
In the previous section on back references we already met the concept of quantifiers. For example, the earlier pattern /([abc]){3}/ stands for three consecutive characters, each of which must be one of "a", "b", or "c". In this pattern, {3} is a quantifier: it specifies how many times the preceding item is repeated.
Quantifiers can be placed after the following items:
● A single character (possibly escaped, such as \xhh)
● The "." metacharacter
● A character class in square brackets
● A back reference
● A subpattern in parentheses (unless it is an assertion, which we will introduce later)
The general quantifier is two numbers separated by a comma and enclosed in curly braces, in the form {min,max}. For example, /z{2,4}/ matches "zz", "zzz", or "zzzz". The maximum can be omitted, as in /\d{3,}/, which matches three or more digits with no upper limit. If both the maximum and the comma are omitted, as in /\d{3}/ (note that there is no comma), the pattern matches exactly three digits. When a curly brace appears where a quantifier is not allowed, or its contents do not follow the syntax above, it stands for the literal brace character and has no special meaning. For example, in tools that follow this syntax, {,6} is not a quantifier; it simply stands for those four characters.
For convenience, the three most common quantifiers have single-character abbreviations. Their meanings are as follows:
*    Equivalent to {0,}
+    Equivalent to {1,}
?    Equivalent to {0,1}
This is what these three metacharacters mean when they are used as quantifiers.
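The brace forms and their abbreviations can be compared with a short sketch. It uses Python's re module for illustration; the helper name full and the sample strings are assumptions of this example.

```python
import re

def full(pattern, text):
    """Return True when the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]
print(z_hits)                                            # ['zz', 'zzz', 'zzzz']
print(full(r"\d{3,}", "12345"), full(r"\d{3,}", "12"))   # True False
print(full(r"\d{3}", "123"), full(r"\d{3}", "1234"))     # True False

# Each abbreviation accepts exactly the same samples as its brace form.
samples = ["", "a", "aa", "aaa"]
pairs = ((r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}"))
agree = all(
    [full(short, s) for s in samples] == [full(braces, s) for s in samples]
    for short, braces in pairs
)
print(agree)  # True
```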
When using quantifiers, especially those with no upper limit, take care not to build an infinite loop, that is, an unlimited repeat of a subpattern that can match the empty string, for example /(a?)*/. Some regular expression tools report a compile-time error for this. Others accept the construct, but you cannot rely on every tool handling it well.
"Greedy" and "ungreedy" matching quantifiers"
When using a pattern with quantifiers, we often find that the same pattern can match the same target string in more than one way. For example, /\d{0,1}\d/ can match one or two decimal digits. If the target string is "123", the pattern matches "1" when the quantifier takes its lower limit 0, and "12" when it takes its upper limit 1. Both are valid matches. If we wrap the pattern in a capturing subpattern, /(\d{0,1}\d)/, will the value of "\1" be "1" or "12"?
In practice the result is generally the latter, because by default most regular expression tools match according to the "greedy" principle. Greedy matching means that, within the limits of the quantifier, the engine repeats the item as many times as possible, as long as the rest of the pattern can still match. To make this easier to understand, consider the simple example below.
Match the pattern /(\d{1,5})\d/ against the string "12345". The pattern stands for one to five digits followed by one more digit. The whole pattern matches when the quantifier takes any value from 1 to 4, so the value of "\1" could be "1", "12", "123", or "1234". Under greedy matching the quantifier takes the largest value that still allows a match, so the final value of "\1" is "1234".
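The greedy result for one to five digits followed by one more digit can be confirmed with Python's re module, used here only as a convenient engine with greedy defaults.

```python
import re

greedy = re.search(r"(\d{1,5})\d", "12345")

print(greedy.group(1))  # 1234
print(greedy.group(0))  # 12345
```

The group takes four digits rather than five because the final \d still needs one digit to match.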
In most cases this is what we want, but not always. Suppose we want to extract comments from C source code (in C, a comment is placed between "/*" and "*/"). The regular expression we try is /\*.*\*/ (here the leading and trailing slashes are literal characters of the pattern, not delimiters), but the result is not what we need. After the engine has matched "/*", the ".*" part can match any characters, including the "*/" that is supposed to end the comment. Because the quantifier is greedy, the match runs past the first "*/" and continues to the last "*/" in the text, which is clearly not the result we want.
To get the match we want in this example, regular expressions provide ungreedy matching, the opposite of greedy matching: the quantifier takes the smallest number of repetitions that still lets the whole pattern match. A quantifier is made ungreedy by putting a question mark "?" after it. For the C comment example, we write the regular expression as /\*.*?\*/, with a question mark after the quantifier "*", to get the desired result. In the earlier example, which matched /(\d{1,5})\d/ against "12345", the ungreedy version is /(\d{1,5}?)\d/, and the value of "\1" is "1".
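The difference between the greedy and ungreedy comment patterns shows up as soon as a line holds two comments. This is a minimal sketch with Python's re module; the sample C line is invented for the example. By default "." does not match a newline in Python, so comments that span lines would additionally need the re.DOTALL flag.

```python
import re

source = "int a; /* first */ int b; /* second */"

greedy = re.findall(r"/\*.*\*/", source)
lazy = re.findall(r"/\*.*?\*/", source)

print(greedy)  # ['/* first */ int b; /* second */']
print(lazy)    # ['/* first */', '/* second */']

# The ungreedy digit example: the group settles for a single digit.
lazy_digits = re.search(r"(\d{1,5}?)\d", "12345")
print(lazy_digits.group(1))  # 1
```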
Strictly speaking, the description above is not quite accurate: the question mark after a quantifier inverts the default greediness of the current regular expression. In tools that support it, you can use the pattern modifier "U" to make the whole regular expression ungreedy by default, and then a question mark after a quantifier switches that quantifier back to greedy.
Once-only subpatterns
Another interesting topic related to quantifiers is the once-only subpattern. To understand it, you first need to understand how a regular expression containing quantifiers is matched. Here is an example.
Suppose we use the pattern /\d+foo/ to match the string "123456bar". The match fails, of course, but how does the regular expression engine get there? It first processes \d+, which stands for one or more digits, and checks that the first character "1" of the target string fits. It then keeps repeating the item under the quantifier until "123456" has been matched by \d+. Next it meets the character "b", which cannot be matched by \d+, so it moves on to the rest of the pattern, "foo", which does not match the following "bar" either. This is where things get interesting: the engine backtracks into \d+ and reduces the number of repetitions by one to see whether the rest can then match. Now \d+ matches "12345", and the engine checks whether the remaining target string "6bar" matches the remaining pattern "foo". It does not, so the repetition count is reduced by one again, and so on until the minimum is reached. If the pattern still cannot match, the engine reports failure.
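The backtracking steps described above can be written out by hand. The function below is a teaching sketch, not how any real engine is implemented: it hard-codes the "one or more digits, then foo" pattern at a single start position and records each digit run it tries. The names trace_digits_then_foo and DIGITS are made up for the example.

```python
DIGITS = "0123456789"

def trace_digits_then_foo(subject, start=0):
    """Try digits-then-"foo" at one start position and list the digit runs tried."""
    # Greedy step: let the digit run take as many digits as it can.
    end = start
    while end < len(subject) and subject[end] in DIGITS:
        end += 1
    tried = []
    # Backtracking step: give back one digit at a time, keeping at least one.
    for stop in range(end, start, -1):
        tried.append(subject[start:stop])
        if subject.startswith("foo", stop):
            return True, tried
    return False, tried

print(trace_digits_then_foo("123456bar"))
# (False, ['123456', '12345', '1234', '123', '12', '1'])
print(trace_digits_then_foo("123foo"))
# (True, ['123'])
```

For "123456bar" all six candidate digit runs are tried before failure is reported, which is the work a once-only subpattern avoids.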
Now we can turn to once-only subpatterns. A once-only subpattern marks a part of the pattern for which the backtracking described above is not performed: once it has matched, it is not re-evaluated. It is written with a question mark and a greater-than sign immediately after the opening parenthesis, "(?>". The example above can be rewritten with a once-only subpattern as follows:
/(?>\d+)foo/. In this case, when the engine finds that "bar" does not match, it reports failure immediately, without going through the backtracking described above.
Note that a once-only subpattern is a non-capturing subpattern, so its match cannot be used in a back reference.
When a subpattern with an unlimited repeat is itself placed inside another unlimited repeat, a once-only subpattern is the way to keep your program from waiting a very long time. For example, the pattern /(\D+|<\d+>)*[!?]/ stands for an unlimited number of substrings, each either a run of non-digit characters or a run of digits enclosed in angle brackets, followed by an exclamation mark or a question mark. If you use it to match a long string of "a" characters such as "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", you will wait a long time before the engine finally reports that there is no match. The reason is that the string can be divided between the two repeats in a very large number of ways, and every combination of values for the quantifier inside the subpattern and the quantifier on the subpattern has to be tried, which greatly increases the amount of work. If you rewrite the pattern with a once-only subpattern as /((?>\D+)|<\d+>)*[!?]/, the runs of non-digits can no longer be broken up, and the result is returned quickly.
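The effect of a once-only subpattern can be observed in Python, with one caveat that is an assumption about your runtime: as far as I know, the standard re module accepts "(?>...)" only from Python 3.11 on, so the sketch guards those lines with a version check. The pattern with a trailing literal 6 is an extra example, added because it makes the lost backtracking visible in a single call.

```python
import re
import sys

# Assumption: atomic groups "(?>...)" need Python 3.11 or later.
HAS_ATOMIC = sys.version_info >= (3, 11)

# Plain \d+ gives back the final "6" so that the literal 6 can match.
plain = re.search(r"\d+6", "123456")
print(plain.group())  # 123456

atomic = None
fast = None
if HAS_ATOMIC:
    # The once-only group keeps all six digits, so nothing is left for "6".
    atomic = re.search(r"(?>\d+)6", "123456")
    # Rewritten pattern from the text: fails quickly on a long run of "a".
    fast = re.search(r"((?>\D+)|<\d+>)*[!?]", "a" * 35)

print(atomic, fast)  # None None
```

The slow, unprotected pattern is deliberately not run on the long string here, since that is exactly the case that can take a very long time.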
Regular Expression Quick Start (3)
The previous article introduced subpatterns, back references, and quantifiers. This article focuses on assertions in regular expressions.
Assertions
An assertion is a test performed at the current matching position in the target string. The test does not consume any of the target string; that is, it does not move the current matching position.
This may sound abstract, so let's look at a few simple examples.
The two most common assertions are the metacharacters "^" and "$", which test whether the current position is at the start or the end of a line.
Consider the pattern /^\d\d\d$/ matched against the target string "123". "\d\d\d" stands for three digits, which match the three characters of the target string. The "^" and "$" in the pattern state that these three characters start at the beginning of the line and stop at its end; they do not themselves correspond to any character in the target string.
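That the anchors consume nothing can be seen from the span of the match. A short sketch with Python's re module, using illustrative names:

```python
import re

three_digits = re.compile(r"^\d\d\d$")

m = three_digits.search("123")
print(m.span())  # (0, 3)

too_long = three_digits.search("1234")
prefixed = three_digits.search("x123")
print(too_long, prefixed)  # None None
```

The span covers exactly the three digits; "^" and "$" contribute positions, not characters.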
There are some other simple assertions: \b, \B, \A, \Z, \z, and \G. They all start with a backslash, a use of the backslash that we have already introduced. Their meanings are shown in the following table.
Assertion   Meaning
\b          Word boundary
\B          Not a word boundary
\A          Start of the target (independent of multiline mode)
\Z          End of the target, or the newline at its end (independent of multiline mode)
\z          End of the target (independent of multiline mode)
\G          First matching position in the target
Note that these assertions cannot appear inside a character class; if the same sequences appear there, they have a different meaning. For example, inside a character class \b stands for the backspace character (0x08).
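A few of these assertions can be tried in Python's re module, with the caveat that its set differs from the table: there, \Z matches only at the very end of the string, and \G is not available, so the sketch sticks to \b, \B, and \A. The sample strings are made up.

```python
import re

whole_word = re.findall(r"\bcat\b", "cat concat cats")
inside_word = re.search(r"\Bcat", "cat concat").start()
print(whole_word)   # ['cat']
print(inside_word)  # 7

# In multiline mode "^" matches at every line start, but \A only at the very start.
text = "one two\nthree four"
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
target_start = re.findall(r"\A\w+", text, re.MULTILINE)
print(line_starts)   # ['one', 'three']
print(target_start)  # ['one']

# Inside a character class, \b is the backspace character.
backspace = re.fullmatch(r"[\b]", "\x08") is not None
print(backspace)  # True
```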
The assertions described so far test only the current position. Assertions also support more complex test conditions, which are written as subpatterns. There are two kinds: lookahead assertions and lookbehind assertions.
Lookahead assertions
A lookahead assertion tests whether a condition holds looking forward from the current position in the target string. Lookahead assertions come in a positive and a negative form, which start with "(?=" and "(?!" respectively. For example, the pattern /\w+(?=;)/ matches a word that is followed by a semicolon, without including the semicolon in the match. Interestingly, the similar-looking pattern /(?!;)\w+/ does not mean "a word that is not preceded by a semicolon". It matches a word whether or not a semicolon comes before it, because the assertion looks forward at the characters that \w+ is about to match, and those are never a semicolon. To express that condition we need the lookbehind assertions described below.
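Both points about lookahead can be checked quickly. The sketch uses Python's re module and invented sample strings: a positive lookahead "(?=;)" after the word, and a negative lookahead "(?!;)" placed in front of the word, which turns out not to filter anything.

```python
import re

# Words followed by a semicolon; the semicolon is not part of the match.
before_semicolon = re.findall(r"\w+(?=;)", "foo; bar baz;")
print(before_semicolon)  # ['foo', 'baz']

# A lookahead in front of \w+ inspects the word itself, not what precedes it,
# so "foo" is found even though a semicolon comes right before it.
not_filtered = re.findall(r"(?!;)\w+", ";foo bar")
print(not_filtered)  # ['foo', 'bar']
```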
Lookbehind assertions
Lookbehind assertions start with "(?<=" for the positive form and "(?<!" for the negative form. For example, /(?<!foo)bar/ finds an occurrence of "bar" that is not preceded by "foo". In general, the subpattern used in a lookbehind assertion must match strings of a fixed length; otherwise a compile-time error occurs.
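A negative lookbehind and the fixed-length rule can both be demonstrated in Python's re module, which enforces the same restriction. The sample string and variable names are made up for this sketch.

```python
import re

# Start offsets of every "bar" that is not directly preceded by "foo".
starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bar xbar")]
print(starts)  # [7, 12]

# A lookbehind whose length can vary is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    fixed_width_required = False
except re.error:
    fixed_width_required = True
print(fixed_width_required)  # True
```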
Lookbehind assertions combined with once-only subpatterns are useful for matching the end of a text efficiently. Here is an example.
Suppose you use a simple pattern such as /abcd$/ to test whether a long text ends with "abcd". Because matching proceeds from left to right, the engine looks for each "a" in the text and then tries to match the rest of the pattern from there, which is clearly inefficient for a long text. If the pattern is changed to /^.*abcd$/, the "^.*" part first matches the entire text, and the engine then finds that the next item, "a", cannot match. The backtracking described earlier takes over: the engine shortens the match of ".*" one character at a time, searching for the rest of the pattern from right to left, which again takes many attempts. Now rewrite the pattern with a once-only subpattern and a lookbehind assertion as /^(?>.*)(?<=abcd)/. The once-only subpattern matches the entire text in one step, and the lookbehind assertion then tests whether the last four characters are "abcd", so a single test decides whether the whole pattern matches. For long texts this can improve processing time significantly.
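The combined pattern can be wrapped in a small helper. This is a sketch under two assumptions: atomic groups need Python 3.11 or later (hence the version check and a plain string-method fallback), and re.DOTALL is added so that ".*" also crosses newlines in a multi-line text. The function name ends_with_abcd is made up.

```python
import re
import sys

def ends_with_abcd(text):
    """Report whether text ends with "abcd" using a single lookbehind test."""
    if sys.version_info >= (3, 11):
        # The once-only group swallows the whole text; the lookbehind checks its tail.
        return re.search(r"^(?>.*)(?<=abcd)", text, re.DOTALL) is not None
    # Fallback for older Python versions without atomic groups.
    return text.endswith("abcd")

print(ends_with_abcd("xx\nyyabcd"))  # True
print(ends_with_abcd("abcdxx"))      # False
```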
A pattern can contain several assertions in succession, and assertions can also be nested. In addition, the subpattern that forms an assertion is not captured and cannot be used in a back reference.
An important use of assertions is as the condition of a conditional subpattern. So what is a conditional subpattern?
Conditional subpatterns
Regular expressions let you choose between different subpatterns depending on a condition. This is the conditional subpattern, which has the form (?(condition)yes-pattern) or (?(condition)yes-pattern|no-pattern). If the condition holds, yes-pattern is used; otherwise no-pattern is used (if one is given).
There are two main kinds of condition in a conditional subpattern: one tests whether a subpattern has captured anything, and the other is the result of an assertion.
If the text in the parentheses that hold the condition is a number, the condition is true when the capturing subpattern with that number has matched. Consider the example /( \( )? [^()]+ (?(1) \) )/x (the "x" pattern modifier means that whitespace outside character classes, and anything after a # symbol, is ignored).
The first part of this pattern, "( \( )?", matches an optional opening parenthesis "(". The second part, "[^()]+", matches one or more characters that are not parentheses. The last part, "(?(1) \) )", is a conditional subpattern: if subpattern 1 captured something, that is, if the optional opening parenthesis was present, then a closing parenthesis ")" must follow.
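Python's re module supports conditions that test a group number, so the optional-parenthesis pattern can be tried there. The sketch writes it in compact form, without the "x" modifier, and matches against the whole string so that unbalanced input is rejected; the test strings are invented.

```python
import re

# Group 1 is the optional "("; the conditional demands ")" only if group 1 matched.
balanced = re.compile(r"(\()?[^()]+(?(1)\))")

results = {text: balanced.fullmatch(text) is not None
           for text in ("abc", "(abc)", "(abc", "abc)")}

print(results)
# {'abc': True, '(abc)': True, '(abc': False, 'abc)': False}
```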
If the text in the parentheses that hold the condition is the letter "R", the condition is true when the pattern or subpattern has been called recursively. At the top level of the recursion, the condition is false. Recursion in regular expressions is introduced in a later section.
If the condition is neither a number nor the letter R, it must be an assertion. The assertion can be a positive or negative lookahead or lookbehind assertion. Consider the example below.
/(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )/x
To make this regular expression easier to read, we use the "x" pattern modifier, which lets us add spaces and line breaks to the pattern for layout without affecting how it is parsed.
The condition on the first line is a positive lookahead assertion that stands for an optional run of characters that are not lowercase letters, followed by a lowercase letter. In other words, it tests whether the target string contains at least one lowercase letter from the current position on. If so, the target is matched against the alternative before the "|", which tests for the format two digits, three lowercase letters, two digits, separated by hyphens. Otherwise it is matched against the alternative after the "|", which tests for three two-digit numbers separated by hyphens.
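Python's re module accepts only a group number or name as a condition, not an assertion, so the pattern above cannot be used there as written. The sketch below is a swapped-in equivalent rather than the same construct: it guards the two alternatives with opposite lookaheads, which gives the same yes/no choice. The name code and the sample strings are made up.

```python
import re

# Emulation: (?=...)yes-pattern | (?!...)no-pattern instead of (?(?=...)yes|no).
code = re.compile(
    r"""
    (?: (?= [^a-z]*[a-z] ) \d{2}-[a-z]{3}-\d{2}   # a lowercase letter lies ahead
      | (?! [^a-z]*[a-z] ) \d{2}-\d{2}-\d{2}      # no lowercase letter ahead
    )
    """,
    re.VERBOSE,
)

checks = {text: code.fullmatch(text) is not None
          for text in ("12-abc-34", "12-34-56", "12-ab-34")}

print(checks)
# {'12-abc-34': True, '12-34-56': True, '12-ab-34': False}
```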
Comments in Regular Expressions
To make a regular expression easier to read, you can add comments to it. A comment starts with an opening parenthesis, a question mark, and a pound sign, "(?#", and ends at the next closing parenthesis. Comments cannot be nested.
If the "x" pattern modifier is set, the text between a pound sign (#) outside a character class (that is, outside []) and the next newline is also treated as a comment.
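Both comment styles are available in Python's re module, where the "x" modifier corresponds to the re.VERBOSE flag. The telephone-style pattern and the comment texts below are invented for this sketch.

```python
import re

# Inline comments: everything from "(?#" to the next ")" is ignored.
inline = re.compile(r"\d{3}(?#exchange)-\d{4}(?#line number)")

# Verbose mode: whitespace and "#" comments outside character classes are ignored.
verbose = re.compile(
    r"""
    \d{3}   # exchange
    -
    \d{4}   # line number
    """,
    re.VERBOSE,
)

inline_ok = inline.fullmatch("555-1234") is not None
verbose_ok = verbose.fullmatch("555-1234") is not None
print(inline_ok, verbose_ok)  # True True
```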
Regular Expression Quick Start (4)
The previous article introduced assertions in regular expressions. This article introduces recursion in regular expressions and the use of regular expressions to modify target strings.
Recursion in Regular Expressions
Anyone who has worked with program code has come across paired parentheses of various kinds. These parentheses are often nested inside each other, and the nesting depth is not known in advance. How would you extract a piece of code enclosed in parentheses when it may itself contain other parentheses nested to an arbitrary depth?