Quick Start to Regular Expressions (II)
Introduction: This article mainly introduces subpatterns, back references, and quantifiers.
In the previous article we introduced the pattern modifiers and metacharacters of regular expressions. The attentive reader may have found that part very brief, with few practical examples. This is mainly because existing regular expression material on the Internet already covers that part in detail, with many examples; if you feel the previous article left gaps, you can refer to that material. This article tries to cover as many of the more advanced regular expression features as possible.
In this article we mainly introduce subpatterns, back references, and quantifiers, with emphasis on some extended uses of these concepts, such as non-capturing subpatterns and the greedy and ungreedy matching of quantifiers.
Subpatterns and back references
Regular expressions can contain multiple subpatterns, which are delimited by parentheses and can be nested. This is the role of the two metacharacters "(" and ")". A subpattern serves two purposes:
1. It localizes a set of alternatives.
For example, the pattern /cat(aract|erpillar|)/ matches "cat", "cataract", or "caterpillar". Without the parentheses, /cataract|erpillar|/ would match "cataract", "erpillar", or the empty string.
2. It sets the subpattern up as a capturing subpattern (as in the example above). When the whole pattern matches, the part of the target string that matched the subpattern can be retrieved through a back reference. Opening parentheses are counted from left to right (starting at 1) to obtain the number of each capturing subpattern.
Note that subpatterns can be nested. For example, if the string "the red king" is matched against the pattern /the ((red|white) (king|queen))/, the captured substrings are "red king", "red", and "king", numbered 1, 2, and 3, and they can be referenced as "\1", "\2", and "\3". "\1" contains "\2" and "\3"; their numbers are determined by the order of the opening parentheses.
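As a quick check of the numbering rules above, here is a minimal sketch using Python's built-in re module. Python is our choice for illustration only; the article's /.../ delimiters are not part of Python's syntax, and the variable names are ours.

```python
import re

# Alternatives localized by a subpattern: the empty third branch allows plain "cat".
cat = re.compile(r"cat(aract|erpillar|)")
cat_matches = [bool(cat.fullmatch(s)) for s in ("cat", "cataract", "caterpillar", "erpillar")]

# Nested capturing subpatterns are numbered by the order of their opening parentheses.
m = re.search(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()  # the contents of groups 1, 2 and 3, in that order

print(cat_matches)
print(captured)
```

Group 1 is the outer pair of parentheses, so it holds the text of groups 2 and 3 together with the space between them.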
In some older Linux/Unix tools, the parentheses of a subpattern have to be escaped with backslashes, like \(subpattern\). Most modern tools no longer require this, and the examples in this article do not escape them.
Non-capturing subpatterns
Having a single pair of parentheses perform both of the functions above is sometimes inconvenient. For example, the number of back references is limited (often to 9), and we frequently need a subpattern only for grouping, without capturing. In that case, put a question mark and a colon right after the opening parenthesis to indicate that the subpattern should not be captured, as in /the ((?:red|white) (king|queen))/.
If "the white queen" is the target string, the captured substrings are "white queen" and "queen", numbered 1 and 2. "white" matches the subpattern "(?:red|white)" but is not captured.
We described earlier how pattern modifiers can be written with parentheses and a question mark. For convenience, if a pattern modifier is needed at the start of a non-capturing subpattern, it can be placed directly between the question mark and the colon. For example, the following two patterns are equivalent:
/(?i:saturday|sunday)/ and /(?:(?i)saturday|sunday)/
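The following sketch illustrates non-capturing groups in Python's re module (our choice of tool, not the article's). Only the (?i:...) form is shown, because recent Python versions require a bare (?i) to appear at the very start of the pattern; the (?i:...) form itself assumes Python 3.6 or later. The trailing " morning" is added by us to show that the modifier stays local to the group.

```python
import re

# (?:...) groups the alternatives without taking a group number.
m = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
groups = m.groups()

# (?i:...) applies the case-insensitive modifier only inside the group.
day = re.compile(r"(?i:saturday|sunday) morning")
hits = [bool(day.fullmatch(s)) for s in ("SUNDAY morning", "sunday MORNING")]

print(groups)
print(hits)
```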
Back references
One of the uses of the backslash mentioned earlier is the back reference: outside a character class, a backslash followed by a decimal number greater than 0 is very likely a back reference. As the name implies, it refers back to a subpattern captured earlier in the pattern. The number is the position of the corresponding opening parenthesis in the pattern. We already saw back references in the introduction to subpatterns, where "\1", "\2", and "\3" stood for the contents captured by the first, second, and third parenthesized subpatterns.
Note that when the number after the backslash is less than 10, it is always taken as a back reference. Such a reference may therefore appear before the opening parenthesis it refers to without causing confusion; as long as the whole pattern contains that many capturing subpatterns, no error is reported. This may sound confusing, so let's look at an example. Recall the earlier case in which the string "the red king" is matched against /the ((red|white) (king|queen))/ and the captured substrings "red king", "red", and "king" are numbered 1, 2, and 3. Now change the string to "king,the red king" and the pattern to /\3,the ((red|white) (king|queen))/. In tools that accept forward references the pattern compiles, because the whole pattern does contain three capturing subpatterns. Whether it then matches depends on the tool: in many engines a reference to a subpattern that has not yet captured anything simply fails, so a forward reference is mainly useful inside a repetition, where the subpattern was set in an earlier iteration. Not every regular expression tool supports this usage at all, so the safe choice is to place a back reference after the opening parenthesis it refers to.
Another point to note is that the value of a back reference is the string fragment actually captured from the target string, not the subpattern itself. For example, /(sens|respons)e and \1ibility/ matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". When the referenced subpattern has a quantifier and is repeated several times, the value of the back reference is that of the last repetition. For example, when /([abc]){3}/ matches the string "abc", the value of "\1" is "c", the result of the last repetition.
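The two properties just described (a back reference holds the captured text, and a repeated group keeps only its last repetition) can be checked with a short sketch in Python's re module. The tool and the variable names are our choice for illustration.

```python
import re

# \1 must repeat the text that group 1 actually captured, not just match the same subpattern.
pattern = re.compile(r"(sens|respons)e and \1ibility")
texts = ["sense and sensibility", "response and responsibility", "sense and responsibility"]
results = [bool(pattern.fullmatch(t)) for t in texts]

# A repeated group keeps only the text of its last repetition.
last = re.fullmatch(r"([abc]){3}", "abc").group(1)

print(results)
print(last)
```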
Named subpatterns
Some tools, such as Python, allow a subpattern to be given a name, so that it can be referenced by name instead of by number. In Python, regular expressions are used through function and method calls, and the syntax differs somewhat from the examples given here. Interested readers can check the documentation of their own tools to see whether named subpatterns are supported.
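For readers who want to see what this looks like, here is a small sketch of Python's named-group syntax: (?P<name>...) defines a named subpattern and (?P=name) refers back to it. The group name "word" and the sample sentence are ours.

```python
import re

# Find a word that is immediately repeated, using a named group and a named back reference.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")
m = doubled.search("this is is a test")

word = m.group("word")       # access by name
same_as_number = m.group(1)  # a named group still has its ordinary number

print(word, same_as_number)
```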
Repetition and quantifiers
We already met quantifiers in the section on back references. In the earlier example, /([abc]){3}/ stands for three consecutive characters, each of which must be one of "a", "b", or "c". In this pattern, {3} is a quantifier: it specifies how many times an item must be repeated.
Quantifiers can be placed after the following items:
- a single character (possibly an escaped one, such as \xhh)
- the "." metacharacter
- a character class in square brackets
- a back reference
- a subpattern in parentheses (unless it is an assertion, which we will introduce later)
The general form of a quantifier is two comma-separated numbers enclosed in curly braces, {min,max}. For example, /z{2,4}/ matches "zz", "zzz", or "zzzz". The maximum can be omitted while keeping the comma: /\d{3,}/ matches three or more digits, with no upper limit. /\d{3}/ (note: no comma) matches exactly three digits. When a curly brace appears where a quantifier is not allowed, or does not follow the syntax above, it stands for a literal brace and has no special meaning. For example, {,6} is not a quantifier; it simply stands for those four characters.
For convenience, the three most commonly used quantifiers have single-character abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
This is how the three metacharacters above work as quantifiers.
When using quantifiers, especially those with no upper limit, take care not to construct an infinite loop, such as /(a?)*/. Some regular expression tools report a compile-time error for this; others accept it, but there is no guarantee that every tool handles such a construct well.
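The quantifier forms above can be tried out with a minimal sketch in Python's re module (our choice of tool; the helper name whole is hypothetical). One caveat: Python's re reads {,6} as {0,6} rather than as literal characters, so that case is left out here.

```python
import re

def whole(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# z{2,4}: between two and four repetitions, tested on "z" repeated 1 to 5 times.
z_results = [whole(r"z{2,4}", "z" * n) for n in range(1, 6)]

# \d{3,}: three or more digits; \d{3}: exactly three digits.
at_least_three = [whole(r"\d{3,}", s) for s in ("12", "123", "123456")]
exactly_three = [whole(r"\d{3}", s) for s in ("12", "123", "1234")]

# The single-character forms behave like their brace equivalents.
shorthand_agrees = all(
    whole(short, s) == whole(brace, s)
    for short, brace in ((r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}"))
    for s in ("", "a", "aa")
)

print(z_results, at_least_three, exactly_three, shorthand_agrees)
```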
Greedy and ungreedy quantifier matching
When a pattern contains quantifiers, the same target string can often match the same pattern in more than one way. For example, /\d{0,1}\d/ matches one or two decimal digits. If the target string is "123", the match is "1" when the quantifier takes its lower limit 0, and "12" when it takes its upper limit 1. Both results are valid. If we capture the match in a subpattern, /(\d{0,1}\d)/, is the value of \1 "1" or "12"?
In practice the result is usually the latter, because by default most regular expression tools match according to the "greedy" principle. As the word suggests, greedy matching means that, within the limits of the quantifier and as long as the rest of the pattern can still match, the item is repeated as many times as possible. To make this concrete, let's look at a simple example.
Match /(\d{1,5})\d/ against the string "12345". The pattern stands for one to five digits followed by one more digit. The whole pattern can match when the quantifier takes any value from 1 to 4, so \1 could be "1", "12", "123", or "1234". With greedy matching the quantifier takes the largest value that works, so the final value of \1 is "1234".
In most cases this is what we want, but not always. Suppose we want to extract comments from C source code (in C, a comment is placed between /* and */). If the regular expression is /\/\*.*\*\//, the result is quite different from what we need. After matching "/*", the engine reaches ".*". Since "." can stand for any character, including the characters of the "*/" that should end the comment, and the quantifier has no upper limit, the match runs past the next "*/" and continues to the last "*/" in the text, which is clearly not what we need.
To get the match we want, regular expressions provide ungreedy matching. In contrast to greedy matching, it always takes the smallest number of repetitions that still lets the whole pattern match. Ungreedy matching is written by adding a question mark "?" after the quantifier. For C comments we write /\/\*.*?\*\//; the question mark after the quantifier "*" produces the desired result. Likewise, if the earlier pattern /(\d{1,5})\d/ is rewritten in ungreedy form as /(\d{1,5}?)\d/ and matched against "12345", the value of \1 is "1".
The explanation above is slightly imprecise: what a question mark after a quantifier really does is invert the default greediness of the current regular expression. You can use the pattern modifier "U" to make the regular expression ungreedy by default, and then a question mark after a quantifier switches that quantifier back to greedy.
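Here is a short sketch of greedy versus ungreedy matching in Python's re module (our choice of tool). Python has no delimiters around the pattern, so "/" needs no escaping, and as far as we know it has no counterpart to the "U" modifier, so only the "?" suffix is shown. The sample C fragment is made up for the illustration.

```python
import re

# Greedy: the quantifier takes as many digits as it can while still leaving one for \d.
greedy = re.match(r"(\d{1,5})\d", "12345").group(1)
# Ungreedy: the quantifier takes as few digits as it can.
lazy = re.match(r"(\d{1,5}?)\d", "12345").group(1)

source = "/* first */ int x; /* second */"
# Greedy .* runs on to the last "*/" in the text.
greedy_comment = re.search(r"/\*.*\*/", source).group(0)
# Ungreedy .*? stops at the first "*/" after each "/*".
lazy_comments = re.findall(r"/\*.*?\*/", source)

print(greedy, lazy)
print(greedy_comment)
print(lazy_comments)
```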
Once-only subpatterns
Another interesting topic related to quantifiers is the once-only subpattern. To understand it, we first need to understand how a regular expression containing quantifiers is matched. Let's take an example.
Suppose we match the pattern /\d+foo/ against the string "123456bar". The result is, of course, no match. But how does the regular expression engine get there? It first examines \d+, which stands for one or more digits, and checks the target string at its first character "1", which fits. It then repeats the item as the quantifier allows, until "123456" has been matched by \d+ and it meets the character "b", which \d+ cannot match. It then looks at the rest of the pattern, "foo", which cannot match the rest of the target string, "bar". This is where it gets interesting: the engine backtracks into \d+, reducing the repeat count by one to see whether the rest can match. Now \d+ has matched "12345", and the engine checks whether the remaining target "6bar" matches the remaining pattern "foo". If not, the repeat count is reduced by one again, and so on until the quantifier's lower limit is reached. If there is still no match, the attempt fails and a "no match" result is returned.
Now we are ready for once-only subpatterns. A once-only subpattern is a subpattern that the engine does not backtrack into once it has matched. It is written with a question mark and a greater-than sign after the opening parenthesis, like (?>...). The example above can be rewritten with a once-only subpattern as follows:
/(?>\d+)foo/. When the engine finds that "bar" does not match, it reports the failure immediately, without the backtracking described above.
Note that a once-only subpattern is a non-capturing subpattern; its match cannot be used in a back reference.
When a subpattern with no upper repeat limit contains an item that also has no upper repeat limit, a once-only subpattern is the only way to keep the program from waiting a very long time. For example, if you match the pattern /(\D+|<\d+>)*[!?]/ against a long string of "a" characters, such as "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", you will wait a long time before the final "no match" result comes back. The pattern stands for any number of items, each either a run of non-digit characters or a run of digits enclosed in angle brackets, followed by an exclamation mark or a question mark. There are many ways to split the string between the inner repetition (\D+) and the outer repetition (*), and every combination is tried in turn, which makes the amount of work enormous. If the pattern is rewritten with a once-only subpattern as /((?>\D+)|<\d+>)*[!?]/, the result comes back quickly.
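To experiment with this behavior in Python, note that its re module accepts the (?>...) syntax only from version 3.11 on, as far as we are aware. The sketch below therefore uses a well-known substitute that also works on older versions: a lookahead captures the text and a back reference then consumes it, written (?=(...))\1. Because the engine does not backtrack into a lookahead that has succeeded, this behaves like a once-only subpattern. It is a swapped-in technique for illustration, not the syntax the article describes, and the variable names are ours.

```python
import re

# Ordinary pattern: \d+ gives back one digit so that the final \d can match.
normal = re.search(r"\d+\d", "123")

# Emulated once-only subpattern: \d+ keeps all three digits and nothing is
# given back, so the final \d has nothing left to match.
once_only = re.search(r"(?=(\d+))\1\d", "123")

# The slow pattern from the text, with \D+ made once-only in the same way.
# Group 2 is the helper capture required by the emulation.
safe = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")
no_match = safe.search("a" * 200)

print(normal.group(0), once_only, no_match)
```

We deliberately do not run the unprotected pattern on the 200-character string here, since that is exactly the case the text warns about.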
Quick Start to Regular Expressions (III)
In the previous article we introduced subpatterns, back references, and quantifiers. In this article we focus on assertions in regular expressions.
Assertions
An assertion is a test performed at the current matching position in the target string. The test does not consume any characters of the target string and does not move the current matching position.
That may sound abstract, so let's look at a few simple examples.
The two most common assertions are the metacharacters "^" and "$", which test whether the current position is at the start or the end of a line.
Take the pattern /^\d\d\d$/ and match it against the target string "123". \d\d\d stands for three digits and matches the three characters of the target string. The ^ and $ in the pattern assert that those three characters sit at the start and the end of the line; they do not themselves correspond to any character in the target string.
Other simple assertions are \b, \B, \A, \Z, \z, and \G. They all begin with a backslash, a usage of the backslash that we covered earlier. Their meanings are listed in the following table.
Assertion   Meaning
\b          word boundary
\B          not a word boundary
\A          start of the target (independent of multiline mode)
\Z          end of the target, or just before a newline at the end (independent of multiline mode)
\z          end of the target (independent of multiline mode)
\G          first matching position in the target
Note that these assertions cannot appear inside a character class. If they do, they take on a different meaning; for example, \b inside a character class stands for the backspace character, 0x08.
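The simple assertions can be tried out with the following sketch in Python's re module (our choice of tool). Python's re differs from the table in places: its \Z means the end of the target only, and it does not support \G, so the sketch sticks to ^, $, \b, \B, and \A. The sample strings are ours.

```python
import re

# ^ and $ pin the three digits to the start and the end; they consume nothing.
three_digits = re.compile(r"^\d\d\d$")
anchored = [bool(three_digits.search(s)) for s in ("123", "1234", "a123")]

# \b matches at a word boundary, \B at any position that is not one.
whole_word = re.findall(r"\bcat\b", "cat concat cats cat")
inside_word = re.findall(r"\Bcat", "cat concat cats")

# \A is the start of the target even in multiline mode, unlike ^.
text = "one\ntwo"
caret_hits = re.findall(r"^\w+", text, re.MULTILINE)
start_hits = re.findall(r"\A\w+", text, re.MULTILINE)

print(anchored, whole_word, inside_word, caret_hits, start_hits)
```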
The assertions described so far test only the current position itself; assertions also support more complex conditions. These more complex assertions are written as subpatterns and come in two kinds: lookahead assertions and lookbehind assertions.
Lookahead assertions
A lookahead assertion tests whether a condition holds for the text starting at the current position in the target string. Lookahead assertions are either positive or negative, written (?= and (?! respectively. For example, the pattern /\w+(?=;)/ stands for a run of word characters followed by a semicolon, where the semicolon is not included in the match. Interestingly, the similar-looking pattern /(?=;)\w+/ does not mean a run of word characters preceded by a semicolon. The assertion still looks forward from the current position, so it requires the next character to be a semicolon while \w+ requires that same character to be a word character, and the pattern can never match. Expressing "preceded by" requires the lookbehind assertions described next.
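A minimal sketch of lookahead in Python's re module (our choice of tool; the sample text is made up) shows both the positive and the negative form, and why putting the lookahead in front of \w+ does not mean "preceded by".

```python
import re

text = "alpha; beta gamma;"

# Positive lookahead: words followed by a semicolon; the semicolon is not part of the match.
before_semicolon = re.findall(r"\w+(?=;)", text)

# (?=;)\w+ asks for a ";" and a word character at the same position, which is impossible.
never = re.findall(r"(?=;)\w+", text)

# Negative lookahead: whole words that are not immediately followed by a semicolon.
not_before_semicolon = re.findall(r"\b\w+\b(?!;)", text)

print(before_semicolon, never, not_before_semicolon)
```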
Lookbehind assertions
Lookbehind assertions are written (?<= for the positive form and (?<! for the negative form. For example, /(?<!foo)bar/ finds an occurrence of "bar" that is not preceded by "foo". In general, the subpattern used in a lookbehind assertion must have a fixed length; otherwise a compile-time error is produced.
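The sketch below tries both lookbehind forms in Python's re module (our choice of tool; sample strings are ours). It also shows the fixed-length restriction as Python's re enforces it; other tools may relax or tighten this rule.

```python
import re

text = "foobar bazbar bar"
# Negative lookbehind: positions of "bar" not preceded by "foo".
matches = [m.start() for m in re.finditer(r"(?<!foo)bar", text)]

# Positive lookbehind: digits preceded by a dollar sign; the sign is not part of the match.
amounts = re.findall(r"(?<=\$)\d+", "cost: $42, qty 7")

# A variable-length lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_length_ok = True
except re.error:
    variable_length_ok = False

print(matches, amounts, variable_length_ok)
```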
Combining a lookbehind assertion with a once-only subpattern is an efficient way to match at the end of a text. Here is an example.
Suppose you use a simple pattern such as /abcd$/ on a long text to check whether it ends in "abcd". Since matching proceeds from left to right, the engine looks for each "a" in the text and tries to match the rest of the pattern there. If the long text contains many "a" characters, this is clearly inefficient. If the pattern is changed to /^.*abcd$/, the initial "^.*" first matches the whole text; the engine then finds that the following "a" cannot match and starts the backtracking described earlier, shortening the match of ".*" one character at a time and looking for the rest of the pattern from right to left, which again takes many attempts. Now rewrite the pattern with a once-only subpattern and a lookbehind assertion as /^(?>.*)(?<=abcd)/. The once-only subpattern matches the whole text in a single step, and the lookbehind assertion then checks whether the last four characters are "abcd". A single test decides at once whether the whole pattern matches. This approach can improve efficiency considerably when very long texts have to be matched.
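A rough Python counterpart of this pattern is sketched below. Since (?>...) is, to our knowledge, only available in Python's re from version 3.11, the sketch substitutes the lookahead-plus-back-reference idiom (?=(.*))\1, which likewise cannot give back what it has matched. That substitution and the sample strings are ours.

```python
import re

# (?=(.*))\1 stands in for the once-only subpattern (?>.*): the lookahead grabs
# the rest of the line, \1 consumes it, and none of it can be given back.
# The lookbehind then inspects the last four characters.
ends_with_abcd = re.compile(r"^(?=(.*))\1(?<=abcd)")

results = [bool(ends_with_abcd.search(s)) for s in ("xxxabcd", "abcdxxx", "abcd")]
print(results)
```

In everyday Python code a plain str.endswith("abcd") would do the same job; the regular expression is shown only to mirror the technique in the text.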
A pattern can contain several assertions in succession, and assertions can be nested. In addition, an assertion subpattern is itself non-capturing and is not counted for back references.
One important use of assertions is as the condition of a conditional subpattern. So what is a conditional subpattern?
Conditional subpatterns
Regular expressions allow a pattern to use different subpatterns depending on a condition. This is the conditional subpattern. It takes one of two forms: (?(condition)yes-pattern) or (?(condition)yes-pattern|no-pattern). If the condition is met, yes-pattern is used; otherwise no-pattern is used (if it is present).
There are two kinds of condition in a conditional subpattern: the result of an assertion, and whether a previous subpattern has captured anything.
If the content of the condition parentheses is a number, the condition is true when the subpattern with that number has matched. Consider the example /( \( )? [^()]+ (?(1) \) )/x (note that the "x" modifier causes whitespace outside character classes, and everything from a # sign to the end of the line, to be ignored).
The first part of the pattern, "( \( )?", matches an optional opening parenthesis "(". The second part, "[^()]+", matches one or more characters that are not parentheses. The last part, "(?(1) \) )", is a conditional subpattern: if \1 is set, that is, if the optional opening parenthesis was present, a closing parenthesis ")" must appear here.
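Python's re module supports this numbered form of conditional subpattern, so the example can be tried directly. The sketch below is a minimal illustration; re.VERBOSE plays the role of the "x" modifier, and the test strings are ours.

```python
import re

# Optional "(", then non-parenthesis characters, then ")" only if group 1 matched.
balanced = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

results = {s: bool(balanced.fullmatch(s)) for s in ("abc", "(abc)", "(abc", "abc)")}
print(results)
```

Both the bare and the fully parenthesized strings are accepted, while a lone opening or closing parenthesis is rejected.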
If the condition is the letter "R", it is true when the pattern or subpattern has been called recursively, and false at the top level. We will cover recursion in regular expressions later.
If the condition is neither a number nor the letter R, it must be an assertion. It can be a positive or negative lookahead or lookbehind assertion. Consider the following example.
/(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )/x
To make this regular expression easier to read, we deliberately used the x modifier, so that spaces and line breaks can be added to lay out the pattern without affecting how it is parsed.
The first line of the conditional subpattern is a positive lookahead assertion that stands for an optional run of characters that are not lowercase letters, followed by a lowercase letter. In other words, it tests whether the target string contains at least one lowercase letter. If so, the alternative before "|" is used, which checks whether the target has the form two digits, three lowercase letters, two digits, separated by "-". Otherwise the alternative after "|" is used, which checks whether the target consists of three two-digit decimal numbers separated by "-".
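Python's re module does not accept an assertion as a condition, as far as we know, so the example cannot be run there as written. The sketch below is a substitute of our own: the same decision is expressed as two alternatives, one guarded by the positive lookahead and the other by its negation. It illustrates the logic and is not the conditional syntax itself.

```python
import re

code = re.compile(
    r"""
    (?= [^a-z]* [a-z] )  \d{2}-[a-z]{3}-\d{2}   # a lowercase letter is present
    |
    (?! [^a-z]* [a-z] )  \d{2}-\d{2}-\d{2}      # no lowercase letter anywhere
    """,
    re.VERBOSE,
)

samples = ("12-abc-34", "12-34-56", "12-ab-34", "12-ABC-34")
results = [bool(code.fullmatch(s)) for s in samples]
print(results)
```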
Comments in regular expressions
To make a regular expression easier to read, comments can be added to it. A comment begins with an opening parenthesis, a question mark, and a hash sign, "(?#", and ends at the next closing parenthesis ")". Comments cannot be nested.
If the "x" modifier is set, everything from a hash sign (#) outside a character class (that is, outside []) to the next newline is also treated as a comment.
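Both comment styles are available in Python's re module, so they can be sketched as follows. The phone-number-like pattern is a made-up example, and re.VERBOSE again stands in for the "x" modifier.

```python
import re

# Inline comments with (?#...), usable without any modifier.
inline = re.compile(r"\d{3}(?#area code)-\d{4}(?#local number)")

# With the verbose modifier, "#" starts a comment that runs to the end of the line.
verbose = re.compile(
    r"""
    \d{3}   # area code
    -
    \d{4}   # local number
    [#]?    # inside a character class the hash sign is a literal character
    """,
    re.VERBOSE,
)

results = [bool(p.fullmatch("555-1234")) for p in (inline, verbose)]
with_hash = bool(verbose.fullmatch("555-1234#"))
print(results, with_hash)
```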
Quick Start to Regular Expressions (IV)
In the previous article we introduced assertions and related concepts in regular expressions. In this article we introduce recursion in regular expressions and the use of regular expressions to modify the target string.
Recursion in regular expressions
Readers who have done some programming will have seen pairs of parentheses that are nested inside one another, often to an unknown depth. If you want to extract a piece of code enclosed in parentheses from a program, and that piece may itself contain further pairs of parentheses nested to an arbitrary depth, how can this be done with regular expressions?