Regular Expression Details (II)
This article continues the previous article, "Regular Expression details (I)". It covers groups and backreferences, lookahead and lookbehind, conditional tests, word boundaries, alternation, and other constructs, with examples, and analyzes what the regex engine does internally during matching.
9. Word boundary
The metacharacter <\b> is also an "anchor" that matches a position. The match is zero-length.
Four positions count as a "word boundary":
1) the position before the first character of the string (if the first character is a "word character")
2) the position after the last character of the string (if the last character is a "word character")
3) between a "word character" and a "non-word character", where the non-word character follows the word character
4) between a "non-word character" and a "word character", where the word character follows the non-word character
A "word character" is a character that <\w> matches, and a "non-word character" is a character that <\W> matches. In most regex implementations, the word characters are usually <[a-zA-Z0-9_]>.
For example, <\b4\b> matches a 4 that stands alone rather than one that is part of a larger number. It does not match either 4 in "44".
In other words, <\b> matches the start and end positions of a sequence of letters and digits.
The inverse of the word boundary is <\B>. It matches at any position between two "word characters" or between two "non-word characters".
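As a quick check of these rules, the following sketch uses Python's re module (chosen here only for illustration; the article itself is flavor-neutral) to list the positions at which <\b> and <\B> match. The sample strings are made up.

```python
import re

# \b4\b matches a 4 that stands alone, not a 4 inside "44" or "a4".
standalone = [m.start() for m in re.finditer(r"\b4\b", "4 44 a4 4")]

# \b is zero-length: it matches positions, not characters.
boundaries = [m.start() for m in re.finditer(r"\b", "ab cd")]

# \B matches exactly the positions that are not word boundaries.
non_boundaries = [m.start() for m in re.finditer(r"\B", "ab cd")]

print(standalone)      # [0, 8]
print(boundaries)      # [0, 2, 3, 5]
print(non_boundaries)  # [1, 4]
```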
Inside the regex engine
Let's apply the regular expression <\bis\b> to the string "This island is beautiful". The engine processes the token <\b> first. Because <\b> is zero-length, the position before the first character "T" is examined. Because "T" is a "word character" and there is nothing before it, <\b> matches the word boundary. Then <i> fails to match the first character "T". The engine moves on through the string; inside "This" there is no word boundary, so <\b> fails until it reaches the position between the fourth character "s" and the fifth character, a space, where <\b> matches.
However, the space does not match <i>. Moving on to the sixth character "i", <\b> matches between the space and the "i", and <is> then matches the sixth and seventh characters. However, the eighth character "l" does not form a word boundary with the preceding "s", so the second <\b> fails and the match attempt fails. At the 13th character "i", <\b> matches because the "i" forms a word boundary with the preceding space, and <is> matches "is". The engine then tries the second <\b>. Because the 15th character, a space, and the "s" before it form a word boundary, the match succeeds. The engine is "eager" and returns the successful match immediately.
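The outcome of this walkthrough can be confirmed with a short sketch in Python's re module (an illustrative choice of flavor). Note that Python reports zero-based offsets, so the 13th character is at index 12.

```python
import re

text = "This island is beautiful"

# Without boundaries, the first "is" found is the one inside "This".
plain = re.search(r"is", text)

# With boundaries, the engine skips "This" and "island" and reports
# the standalone word "is".
bounded = re.search(r"\bis\b", text)

print(plain.span())    # (2, 4)
print(bounded.span())  # (12, 14)
```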
10. Alternation
The "|" in a regular expression denotes alternation (selection). You can use it to match one of several possible regular expressions.
If you want to search for the text "cat" or "dog", you can use <cat|dog>. If you want more options, just extend the list: <cat|dog|mouse|fish>.
Alternation has the lowest precedence of all regex operators; that is, it tells the engine to match either everything to the left of the "|" or everything to its right. You can use parentheses to limit the scope of the alternation. For example, <\b(cat|dog)\b> tells the regex engine to treat (cat|dog) as a single unit.
Mind the regex engine's eagerness
Because the regex engine is eager, it stops searching as soon as it finds a valid match. Therefore, under certain conditions, the order of the alternatives affects the result. Suppose you want to use a regular expression to search for a list of functions in a programming language: Get, GetValue, Set, or SetValue. An obvious solution is <Get|GetValue|Set|SetValue>. Let's see what happens when we search "SetValue".
<Get> and <GetValue> both fail, and then <Set> matches. Because a regex-directed engine is "eager", it returns the first successful match, "Set", rather than continuing to look for a better one.
Contrary to our expectation, the regular expression does not match the entire string. There are several possible solutions. First, given the engine's eagerness, we can change the order of the alternatives. With <GetValue|Get|SetValue|Set>, for example, the longest alternative is tried first. We can also combine the four options into two: <Get(Value)?|Set(Value)?>. Because the question mark is greedy, SetValue is always tried before Set.
A better solution is to use word boundaries: <\b(Get|GetValue|Set|SetValue)\b> or <\b(Get(Value)?|Set(Value)?)\b>. Furthermore, since all the choices share the same ending, we can optimize the regular expression to <\b(Get|Set)(Value)?\b>.
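The effect of alternative order can be seen in a small sketch, here using Python's re module as an illustrative regex-directed engine. Each pattern is one of those discussed above, applied to the string "SetValue".

```python
import re

text = "SetValue"

# Eager alternation: <Set> succeeds first, so only "Set" is returned.
naive = re.search(r"Get|GetValue|Set|SetValue", text).group()

# Longest alternatives first.
reordered = re.search(r"GetValue|Get|SetValue|Set", text).group()

# Greedy optional suffix.
optional = re.search(r"Get(Value)?|Set(Value)?", text).group()

# Word boundaries force the engine to backtrack into the longer alternative.
bounded = re.search(r"\b(Get|GetValue|Set|SetValue)\b", text).group()

# Common ending factored out.
factored = re.search(r"\b(Get|Set)(Value)?\b", text).group()

print(naive)      # Set
print(reordered)  # SetValue
print(optional)   # SetValue
print(bounded)    # SetValue
print(factored)   # SetValue
```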
11. Groups and backreferences
Place part of a regular expression in parentheses to group it. You can then apply a regex operator, such as a repetition operator, to the entire group.
Note that only parentheses "()" form a group. "[]" defines a character set, and "{}" defines a repetition count.
When "()" defines a group, the regex engine numbers the groups in sequence and stores what each one matched. To refer back to a matched group, use "\number": <\1> refers to the first group, <\2> to the second, and so on, so <\n> refers to the nth group. <\0> refers to the entire match. Let's look at an example.
Suppose you want to match the start tag and end tag of an HTML element, together with the text between them. For example, in <B>This is a test</B>, we need to match <B>, </B>, and the text in the middle. We can use the following regular expression: "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>"
First, "<" matches the first character of "<B>". Then [A-Z] matches "B"; [A-Z0-9]* matches zero or more alphanumeric characters, and [^>]* then matches zero or more characters other than ">". The ">" of the regular expression matches the ">" of "<B>". Next, the engine lazily matches the characters before the end tag until it encounters "</". The "\1" in the regular expression then refers to what the group "([A-Z][A-Z0-9]*)" matched earlier; in this example, the referenced tag name is "B". Therefore, the end tag to be matched is "</B>".
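A minimal sketch of this tag-matching expression, written for Python's re module (an illustrative choice). The second sample string is a made-up counterexample showing that the backreference insists on the same tag name.

```python
import re

# \1 must repeat whatever the first group (the tag name) matched.
tag = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")

matched = tag.search("<B>This is a test</B>")
mismatched = tag.search("<B>This is a test</I>")

print(matched.group())   # <B>This is a test</B>
print(matched.group(1))  # B
print(mismatched)        # None
```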
You can reference the same group multiple times. <([a-c])x\1x\1> matches "axaxa", "bxbxb", and "cxcxc". If the group referenced by number did not take part in the match, the referenced content is empty (the exact behavior varies between regex flavors).
A backreference cannot be used inside the group it refers to: <([abc]\1)> is an error. For the same reason, you cannot use <\0> inside the regular expression itself. It can only be used in replacement operations.
Backreferences cannot be used inside a character set. In <(a)[\1b]>, the <\1> is not a backreference; it may be interpreted as an octal escape.
Capturing groups slow the engine down because it has to store what each group matched. If you do not need the backreference, you can tell the engine not to store a group, for example <Get(?:Value)>. The "?:" after "(" tells the engine not to store the match of the group (Value) for backreferencing.
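The points above can be checked with a short sketch in Python's re module (an illustrative flavor; other engines may differ in the details, particularly in how they treat <\1> inside a character set).

```python
import re

# The same backreference may be used several times.
repeated = re.compile(r"([a-c])x\1x\1")
same_letter = repeated.fullmatch("bxbxb") is not None
mixed_letters = repeated.fullmatch("axbxa") is not None

# Inside a character set, \1 is not a backreference. Python reads it as
# the octal escape for the character with code 1.
in_class = re.compile(r"(a)[\1b]")
matches_b = in_class.fullmatch("ab") is not None
matches_a = in_class.fullmatch("aa") is not None
matches_ctrl = in_class.fullmatch("a\x01") is not None

# A non-capturing group matches the same text but stores nothing.
capturing = re.search(r"Get(Value)", "GetValue").groups()
non_capturing = re.search(r"Get(?:Value)", "GetValue").groups()

print(same_letter, mixed_letters)          # True False
print(matches_b, matches_a, matches_ctrl)  # True False True
print(capturing, non_capturing)            # ('Value',) ()
```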
(1) Repetition and backreferences
When a repetition operator is applied to a group, the stored backreference is overwritten on every iteration, and only the last match is kept. For example, <([abc]+)=\1> matches "cab=cab", but <([abc])+=\1> does not. In the second regex, when ([abc]) first matches "c", "\1" stands for "c"; ([abc]) then goes on to match "a" and "b". In the end "\1" stands for "b", so the regex matches "cab=b".
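The difference between repeating inside the group and repeating the group itself can be verified with a few lines of Python (an illustrative choice of engine):

```python
import re

# The quantifier is inside the group: \1 holds the whole run "cab".
inside = re.fullmatch(r"([abc]+)=\1", "cab=cab")

# The quantifier is on the group: \1 holds only the last iteration, "b".
outside_full = re.fullmatch(r"([abc])+=\1", "cab=cab")
outside_last = re.fullmatch(r"([abc])+=\1", "cab=b")

print(inside.group(1))        # cab
print(outside_full)           # None
print(outside_last.group(1))  # b
```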
Application: checking for duplicate words. When editing text, it is easy to type a word twice, such as "the the". You can use <\b(\w+)\s+\1\b> to detect duplicate words. To delete the second word, simply use the replacement function with "\1" as the replacement text.
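A minimal sketch of the duplicate-word check, using Python's re.sub as the replacement function. The sample sentence is made up for illustration.

```python
import re

duplicate = re.compile(r"\b(\w+)\s+\1\b")

sentence = "Paris in the the spring"

# Report the doubled word, then replace the whole match with one copy of it.
found = duplicate.search(sentence).group(1)
cleaned = duplicate.sub(r"\1", sentence)

print(found)    # the
print(cleaned)  # Paris in the spring
```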
(2) Named groups and references
In PHP and Python, use <(?P<name>group)> to name a group. In this syntax, the "name" in ?P<name> is the name of the group. You can refer back to the group with <(?P=name)>.
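For instance, the duplicate-word idea can be written with a named group in Python. The group name "word" and the sample text are arbitrary choices for this sketch.

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to it.
pattern = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = pattern.search("it is is fine")

print(m.group())        # is is
print(m.group("word"))  # is
```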
.NET named groups
The .NET framework also supports named groups. Unfortunately, Microsoft's programmers decided to invent their own syntax instead of following the Perl and Python conventions. At the time of writing, no other regex implementation supported the syntax invented by Microsoft.
The following is an example in .NET:
(?<First>group)(?'Second'group)
As you can see, .NET offers two syntaxes for creating a named group: one uses angle brackets "<>", the other single quotes "''". Angle brackets are more convenient inside strings, while single quotes are more useful in ASP code, because "<>" is used for HTML tags there.
To reference a named group, use \k<name> or \k'name'.
In a search-and-replace operation, use "${name}" to reference a named group.
12. Matching modes
The regex engines discussed in this tutorial support three matching modes:
</i> makes the regular expression case-insensitive.
</s> enables "single-line mode": the dot "." also matches newline characters.
</m> enables "multi-line mode": "^" and "$" also match at the positions just after and just before each newline character.
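In Python's re module, which is used here only as an illustration, the three modes correspond to the flags re.IGNORECASE, re.DOTALL, and re.MULTILINE. A brief sketch with made-up sample strings:

```python
import re

# /i: case-insensitive matching.
case_insensitive = re.search(r"test", "A TEST", re.IGNORECASE).group()

# /s: by default the dot stops at a newline; with DOTALL it does not.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL).group()

# /m: ^ matches only at the start of the string unless MULTILINE is set.
first_only = re.findall(r"^\w+", "one\ntwo\nthree")
every_line = re.findall(r"^\w+", "one\ntwo\nthree", re.MULTILINE)

print(case_insensitive)  # TEST
print(dot_default)       # None
print(repr(dot_all))     # 'a\nb'
print(first_only)        # ['one']
print(every_line)        # ['one', 'two', 'three']
```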
Turning modes on and off inside a regular expression
If you insert a modifier such as (?ism) into the regular expression, it takes effect only for the part of the regex to its right. (?-i) turns case insensitivity off again. You can test this quickly: <(?i)te(?-i)st> should match "TEst", but not "teST" or "TEST".
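Support for mid-pattern modifiers differs between flavors. Python's re module does not take a bare (?-i) in the middle of a pattern; it offers the scoped form (?i:...) instead (Python 3.6 and later). The sketch below is therefore a transliteration of the example into that form, not the same syntax.

```python
import re

# (?i:te) makes only "te" case-insensitive; "st" stays case-sensitive.
pattern = re.compile(r"(?i:te)st")

results = {s: pattern.fullmatch(s) is not None
           for s in ("TEst", "test", "teST", "TEST")}

print(results)
# {'TEst': True, 'test': True, 'teST': False, 'TEST': False}
```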
13. Atomic groups and preventing backtracking
In some special cases, backtracking makes the engine extremely inefficient.
Let's take an example: we want to match a string whose fields are separated by commas and whose 12th field starts with P.
A regular expression that comes to mind easily is <^(.*?,){11}P>. It works well under normal circumstances. In the extreme case where the 12th field does not start with P, however, catastrophic backtracking occurs. Suppose the string to be searched is "1,2,3,4,5,6,7,8,9,10,11,12,13". At first the regular expression matches successfully up to the 12th field. At that point it has consumed "1,2,3,4,5,6,7,8,9,10,11,", and <P> fails to match the next character, the "1" of "12". The engine therefore backtracks, and the regular expression has now consumed "1,2,3,4,5,6,7,8,9,10,11". The next token to try is the dot <.>, which can match the following comma. However, <,> then does not match the "1" of "12", so the dot has to be extended further. The attempt fails again and backtracking continues. As you can imagine, the number of such backtracking combinations is very large, so the engine may take extremely long or even crash.
There are several ways to prevent such massive backtracking:
A simple solution is to make the match as precise as possible: use a negated character set instead of the dot. For example, the regular expression <^([^,\r\n]*,){11}P> reduces the number of failed backtracking attempts to 11.
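The following sketch compares the two expressions in Python (an illustrative choice). It does not measure backtracking; it only shows that the negated character set is also the more accurate pattern, because the dot can swallow commas and so lets a later field satisfy <P>. The sample lines are made up.

```python
import re

lazy_dot = re.compile(r"^(.*?,){11}P")
precise = re.compile(r"^([^,\r\n]*,){11}P")

twelfth_is_p = "1,2,3,4,5,6,7,8,9,10,11,P12,13"
no_p_at_all = "1,2,3,4,5,6,7,8,9,10,11,12,13"
thirteenth_is_p = "1,2,3,4,5,6,7,8,9,10,11,12,P13"

def hits(pattern):
    """Return, for each sample line, whether the pattern finds a match."""
    return [pattern.search(line) is not None
            for line in (twelfth_is_p, no_p_at_all, thirteenth_is_p)]

# The dot version wrongly accepts the line whose 13th field starts with P,
# because one of its eleven repetitions can span two fields ("11,12,").
print(hits(lazy_dot))  # [True, False, True]
print(hits(precise))   # [True, False, False]
```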
Another solution is to use an atomic group.
The purpose of an atomic group is to make the regex engine fail faster, so it can effectively prevent massive backtracking. The syntax of an atomic group is <(?>regex)>. Everything inside (?>...) is treated as a single unit. Once the match fails, the engine backtracks to the part of the regular expression before the atomic group. The preceding example can be expressed with an atomic group as <^(?>(.*?,){11})P>. Once the 12th field fails to match <P>, the engine backtracks to the <^> in front of the atomic group.
14. Lookahead and lookbehind
Perl 5 introduced two powerful constructs: "lookahead" and "lookbehind", collectively called "lookaround". They are also called "zero-length assertions". Like anchors, they are zero-length (that is, they do not consume characters of the string). The difference is that lookaround actually matches characters, but then discards the match and returns only the result: match or no match. This is why they are called "assertions": they do not consume characters in the string, but only assert whether a match is possible.
Almost all regex implementations discussed in this article support both lookahead and lookbehind. The only exception is JavaScript, which at the time of writing supported only lookahead.
(1) Positive and negative lookahead
Suppose we want to find a q that is not followed by a u. That is, the q is either followed by nothing or followed by a character other than u. A solution using negative lookahead is <q(?!u)>. The syntax of negative lookahead is <(?!regex)>.
Positive lookahead is analogous: <(?=regex)>.
If the regex inside the lookahead contains a group, that group does create a backreference. The lookahead itself, however, does not create a backreference and is not counted in the backreference numbering. This is because the lookahead's own match is discarded and only the result is kept. If you want to keep what the lookahead matched as a backreference, use <(?=(regex))>.
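A short sketch of both kinds of lookahead in Python's re module (an illustrative flavor). The sample text "faq quota qi" is made up, and the last pattern shows a capturing group placed inside the lookahead.

```python
import re

text = "faq quota qi"

# Negative lookahead: q not followed by u.
q_without_u = [m.start() for m in re.finditer(r"q(?!u)", text)]

# Positive lookahead: q followed by u (the u itself is not consumed).
q_with_u = [m.start() for m in re.finditer(r"q(?=u)", text)]

# A group inside the lookahead keeps what the lookahead saw.
m = re.search(r"q(?=(u\w))", "quit")

print(q_without_u)  # [2, 10]
print(q_with_u)     # [4]
print(m.group())    # q   (the overall match is still just "q")
print(m.group(1))   # ui  (captured inside the lookahead)
```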
(2) Positive and negative lookbehind
Lookbehind has the same effect as lookahead, but works in the opposite direction.
The syntax of negative lookbehind is: <(?<!regex)>
The syntax of positive lookbehind is: <(?<=regex)>
As you can see, compared with lookahead there is an extra left angle bracket that indicates the direction.
Example: <(?<!a)b> matches a "b" that is not preceded by an "a".
Note that with lookahead, the regex inside the lookaround is matched starting at the current string position. With lookbehind, the engine first steps back from the current string position (by one character in the example above) and then tries to match the regex inside the lookaround.
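Both lookbehind forms can be tried out in Python, used here as an illustrative flavor. The sample text is made up so that it contains a "b" in each situation.

```python
import re

text = "cab bed abba"

# Negative lookbehind: b not preceded by a.
b_without_a = [m.start() for m in re.finditer(r"(?<!a)b", text)]

# Positive lookbehind: b preceded by a.
b_after_a = [m.start() for m in re.finditer(r"(?<=a)b", text)]

print(b_without_a)  # [4, 10]
print(b_after_a)    # [2, 9]
```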
(3) Inside the regex engine
Let's look at a simple example.
Apply the regular expression <q(?!u)> to the string "Iraq". The first token of the regex is <q>. As we know, the engine scans through the string until <q> matches. After the fourth character "q" has been matched, there is nothing (a void) after it. The next token is the lookahead. The engine notes that it has entered a lookahead. The next token is <u>, which does not match the void, so the regex inside the lookahead fails. Because this is a negative lookahead, the lookahead as a whole succeeds. Therefore, the match "q" is returned.
Now apply the same regular expression to "quit". <q> matches "q". The next token is the <u> inside the lookahead, which matches the second character "u" of the string. The engine advances to the next character, "i". However, the engine notes that the lookahead is now complete and that the regex inside it has matched. So the engine discards the matched text, which steps it back to the character "u".
Because the lookahead is negative, the successful match inside it causes the lookahead as a whole to fail. The engine therefore has to backtrack. Finally, because there is no other "q" for <q> to match, the entire match fails.
To make sure the lookahead mechanism is clear, let's apply <q(?=u)i> to "quit". <q> first matches "q". Then the lookahead matches "u" successfully; the matched part is discarded and only the result is returned. The engine steps back from the character "i" to "u". Because the lookahead succeeded, the engine goes on to the next token, <i>. But <i> does not match "u", so the match attempt fails. Because there is no other "q" in the string, the entire regular expression fails to match.
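The three walkthroughs above can be reproduced in Python (an illustrative flavor). The fourth pattern, <q(?=u)u>, is an extra case added for this sketch to show that the character tested by the lookahead is still available to the next token.

```python
import re

iraq = re.search(r"q(?!u)", "Iraq")         # q followed by nothing
quit_negative = re.search(r"q(?!u)", "quit")  # q followed by u
quit_then_i = re.search(r"q(?=u)i", "quit")   # <i> is tried against "u"
quit_then_u = re.search(r"q(?=u)u", "quit")   # <u> is tried against "u"

print(iraq.span())          # (3, 4)
print(quit_negative)        # None
print(quit_then_i)          # None
print(quit_then_u.group())  # qu
```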
(4) More on the engine's internals
Let's apply <(?<=a)b> to "thingamabob". The engine starts with the first token of the regex and the first character of the string. In this example, the engine steps back one character and checks whether it is an "a". The engine cannot step back, because there is no character before "t", so the lookbehind fails. The engine moves on to the next character, "h". Once again it temporarily steps back one character and checks whether it is an "a". It finds a "t", so the lookbehind fails again.
The lookbehind keeps failing until the regex reaches the "m" in the string, where the positive lookbehind matches. Because it is zero-length, the current position in the string is still "m". The next token is <b>, which fails to match "m". The next character is the second "a" in the string. The engine temporarily steps back one character and finds that <a> does not match "m".
The next character is the first "b" in the string. The engine temporarily steps back one character, finds that the lookbehind is satisfied, and <b> matches "b". The entire regular expression has therefore matched, and it returns the first "b" in the string.
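The position-by-position procedure described above can be imitated by hand. The function below is a simplified sketch written for this one pattern, not how a real engine is implemented; first_b_after_a is a hypothetical helper name. Its result is compared with Python's re module.

```python
import re

def first_b_after_a(text):
    """Hand-rolled equivalent of searching for (?<=a)b, one position at a time."""
    for pos, char in enumerate(text):
        # Lookbehind: temporarily step back one character and test for "a".
        # At position 0 there is nothing to step back to, so it fails.
        lookbehind_ok = pos > 0 and text[pos - 1] == "a"
        # Only if the zero-length assertion holds is the token <b> tried.
        if lookbehind_ok and char == "b":
            return pos
    return None

manual = first_b_after_a("thingamabob")
engine = re.search(r"(?<=a)b", "thingamabob").start()

print(manual, engine)  # 8 8
```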
(5) Applications of lookahead and lookbehind
Let's look at an example: finding a 6-character word that contains "cat".
First, we can solve the problem without lookaround, for example:
<cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat>
Easy enough! However, this method becomes clumsy when you need to find a word of 6 to 12 characters that contains "cat", "dog", or "mouse".
Now let's look at the lookahead solution. In this example, we have two basic requirements: first, the word must be 6 characters long; second, it must contain "cat".
The regular expression that meets the first requirement is <\b\w{6}\b>. The regular expression that meets the second requirement is <\b\w*cat\w*\b>.
By combining the two, we can get the following regular expression:
<(?=\b\w{6}\b)\b\w*cat\w*\b>
The detailed matching process is left to the reader. Note, however, that lookahead does not consume characters. So once a word has been found to satisfy the six-character condition, the engine continues matching the rest of the regex from the position where the lookahead started.
Finally, after some optimization, we get the following regular expression:
<\b(?=\w{6}\b)\w{0,3}cat\w*>
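The sketch below runs the three variants against a made-up word list in Python (an illustrative flavor). One assumption is labeled in the code: the alternation-only version is wrapped in <\b(?:...)\b>, because without boundaries it would also match six-character stretches inside longer words such as "concatenate".

```python
import re

text = "bobcat catnip cat concatenate locate scathe"

# Alternation only. The surrounding \b(?:...)\b is an addition for this
# sketch so that the pattern is limited to whole words.
brute_force = re.compile(r"\b(?:cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat)\b")

# Lookahead checks the length; the rest checks for "cat".
combined = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")

# Optimized form from the text.
optimized = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w*")

results = [p.findall(text) for p in (brute_force, combined, optimized)]

for words in results:
    print(words)  # ['bobcat', 'catnip', 'locate', 'scathe'] each time
```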
15. Conditionals in regular expressions
The syntax of a conditional is <(?ifthen|else)>. The "if" part can be a lookahead or lookbehind expression. With lookahead, the syntax becomes <(?(?=regex)then|else)>. The else part is optional.
If the if part is true, the regex engine tries to match the then part; otherwise, it tries to match the else part.
Note that lookahead does not actually consume any characters, so the then or else part starts matching at the position where the if test began.
16. Adding comments to regular expressions
The syntax for adding a comment to a regular expression is: <(?#comment)>
For example, here is a commented regular expression that matches a valid date:
(?#year)(19|20)\d\d[-/.](?#month)(0[1-9]|1[012])[-/.](?#day)(0[1-9]|[12][0-9]|3[01])
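Python's re module accepts the <(?#comment)> syntax, so the date expression can be tried directly. This sketch assumes a four-digit year written as (19|20)\d\d. The second pattern uses Python's re.VERBOSE flag, a flavor-specific alternative in which whitespace is ignored and "#" starts a comment.

```python
import re

inline = re.compile(
    r"(?#year)(19|20)\d\d[-/.](?#month)(0[1-9]|1[012])[-/.]"
    r"(?#day)(0[1-9]|[12][0-9]|3[01])"
)

verbose = re.compile(r"""
    (19|20)\d\d                 # year
    [-/.]
    (0[1-9]|1[012])             # month
    [-/.]
    (0[1-9]|[12][0-9]|3[01])    # day
""", re.VERBOSE)

samples = ["2024-02-29", "1999/12/31", "2024-13-01"]
inline_ok = [inline.fullmatch(s) is not None for s in samples]
verbose_ok = [verbose.fullmatch(s) is not None for s in samples]

print(inline_ok)   # [True, True, False]
print(verbose_ok)  # [True, True, False]
```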
This concludes our introduction to regular expressions. We hope it helps you!