Regular expressions in plain language (II)

Source: Internet
Author: User

http://dragon.cnblogs.com/archive/2006/05/09/394923.html

Preface:
This article is the sequel to "Regular expressions in plain language (I)". It covers groups and backreferences, lookahead and lookbehind, conditionals, word boundaries, alternation, and related constructs, with examples, and it analyzes how the regex engine works internally while it attempts a match.
This article is a translation of the tutorial Jan Goyvaerts wrote for RegexBuddy. Copyright belongs to the original author. Reprinting is welcome, but out of respect for the work of the original author and the translator, please cite the source. Thank you!


9. Word boundaries

The metacharacter <<\b>> is also an "anchor": it matches a position rather than a character. Such a match is a zero-length match.

There are 4 types of positions that are considered "word boundaries":

1) position before the first character of the string (if the first character of the string is a "word character")

2) position after the last character of the string (if the last character of the string is a "word character")

3) between a "word character" and a "non-word character", where "non-word character" immediately follows "word character"

4) between a "non-word character" and a "word character", where "word character" immediately follows "non-word character"

A "word character" is a character that can be matched by "\w", and a "non-word character" is one that can be matched by "\W". In most regular expression implementations, "word characters" usually include <<[a-zA-Z0-9_]>>.

For example, <<\b4\b>> matches a 4 that stands alone rather than one that is part of a larger number. This regular expression does not match either 4 in "44".

In other words, it can almost be said that <<\b>> matches the beginning and end of an "alphanumeric sequence".

The opposite of the word boundary is <<\B>>. It matches at any position between two "word characters" or between two "non-word characters".
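A minimal sketch of both anchors, assuming Python's standard re module, whose \b and \B follow the definitions above. The sample strings are made up for illustration.

```python
import re

# \b4\b matches a 4 that stands alone, not a 4 inside a longer token.
standalone = re.findall(r"\b4\b", "4 44 a4 4.")

# 4\B matches a 4 only when the position after it is NOT a word boundary,
# which here is the first 4 of "44" (index 2).
inside = [m.start() for m in re.finditer(r"4\B", "4 44")]

print(standalone)  # ['4', '4']
print(inside)      # [2]
```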

    • Deep inside the regular expression engine

Let's apply the regular expression <<\bis\b>> to the string "This island is beautiful". The engine processes the token <<\b>> first. Because <<\b>> is zero-length, the engine examines the position before the first character "T". "T" is a "word character" and nothing precedes it, so <<\b>> matches. The next token <<i>> then fails to match "T". The engine advances through the string. <<\b>> cannot match between two word characters, so the next candidate is the position between the fourth character "s" and the fifth character, a space. <<\b>> matches there, but <<i>> does not match the space. Moving on, <<\b>> matches between the space and the sixth character "i", and <<is>> matches the sixth and seventh characters. The eighth character "l" is a word character, however, so the second <<\b>> fails and this attempt is abandoned. At the 13th character "i", <<\b>> matches again because a space precedes it, and <<is>> matches "is". The engine then tries the second <<\b>>. The 15th character is a space, which forms a word boundary with the preceding "s", so the match succeeds. Being "eager", the engine returns this first successful match immediately.
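The walkthrough above can be checked with a short sketch, assuming Python's re module. Note that Python counts positions from 0, so the 13th and 14th characters appear as the span (12, 14).

```python
import re

text = "This island is beautiful"

# The "is" inside "This" and "island" is rejected by one of the two \b tokens;
# only the standalone word is returned.
match = re.search(r"\bis\b", text)

print(match.span())   # (12, 14)
print(match.group())  # is
```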

10. Alternation

The "|" in a regular expression represents alternation. With it you can match any one of several possible regular expressions.

If you want to search for the word "cat" or "dog", you can use <<cat|dog>>. If you want more options, simply extend the list: <<cat|dog|mouse|fish>>.

Alternation has the lowest precedence of all regex operators: it tells the engine to match either everything to the left of the "|" or everything to its right. Use parentheses to limit its scope. For example, in <<\b(cat|dog)\b>> the parentheses tell the engine to treat (cat|dog) as a single unit.

    • Remember that the regex engine is eager

The regex engine is eager: as soon as it finds a valid match, it stops searching. Under certain conditions, therefore, the order of the alternatives affects the result. Suppose you want to search for the names of four functions in a programming language: Get, GetValue, Set, and SetValue. The obvious solution is <<Get|GetValue|Set|SetValue>>. Let's see what happens when the text to search is "SetValue".

<<Get>> and <<GetValue>> both fail, and then <<Set>> succeeds. Because a regex-directed engine is "eager", it returns that first successful match, "Set", instead of continuing to look for a longer one.

Contrary to what we expected, the regular expression does not match the entire string. There are several possible workarounds. One is to take the engine's eagerness into account and reorder the alternatives, for example <<GetValue|Get|SetValue|Set>>, so that the longer alternatives are tried first. We can also merge the four alternatives into two: <<Get(Value)?|Set(Value)?>>. Because the question mark is greedy, "SetValue" is always tried before "Set".

A better solution is to use word boundaries: <<\b(Get|GetValue|Set|SetValue)\b>> or <<\b(Get(Value)?|Set(Value)?)\b>>. Further, since all the alternatives end the same way, we can optimize the regular expression to <<\b(Get|Set)(Value)?\b>>.
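The effect of eagerness and of each workaround can be seen side by side in the following sketch. It assumes Python's re module, which is a regex-directed engine of the kind described here; the variable names are only for illustration.

```python
import re

name = "SetValue"

# Eager engine: the first alternative that succeeds wins, so "Set" is returned.
eager = re.search(r"Get|GetValue|Set|SetValue", name).group()

# Workaround 1: list the longer alternatives first.
reordered = re.search(r"GetValue|Get|SetValue|Set", name).group()

# Workaround 2: a greedy optional group.
optional = re.search(r"Get(Value)?|Set(Value)?", name).group()

# Workaround 3: word boundaries force the engine to keep trying alternatives.
bounded = re.search(r"\b(Get|GetValue|Set|SetValue)\b", name).group()
optimized = re.search(r"\b(Get|Set)(Value)?\b", name).group()

print(eager, reordered, optional, bounded, optimized)
# Set SetValue SetValue SetValue SetValue
```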

11. Groups and backreferences

Placing part of a regular expression inside parentheses turns it into a group. You can then apply regex operators, such as repetition operators, to the entire group.

Note that only parentheses "()" form groups. "[]" defines a character class, and "{}" defines a repetition count.

When a group is defined with "()", the engine numbers the groups in order and stores the text each one matched. To refer back to the text matched by a group, use a backslash followed by the group number: <<\1>> refers to the first group, <<\2>> to the second, and so on, with <<\n>> referring to the nth group. <<\0>> refers to the entire match. Let's look at an example.

Suppose you want to match an HTML opening tag and its closing tag, together with the text between them. For example, in <B>This is a test</B> we want to match <B>, </B>, and the text in the middle. We can use the following regular expression: "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>"

First, "<" matches the first character "<" of "<B>". Then [A-Z] matches "B", [A-Z0-9]* matches zero or more letters or digits, and [^>]* matches zero or more characters other than ">". The ">" that follows matches the ">" of "<B>". Next, .*? lazily matches the characters before the closing tag, up to "</". The "\1" that follows is a reference to the text matched by the group "([A-Z][A-Z0-9]*)", which in this case is the tag name "B". The closing tag that must be matched is therefore "</B>".

You can refer to the same group more than once: <<([a-c])x\1x\1>> matches "axaxa", "bxbxb", and "cxcxc". If a numbered group has not taken part in the match, the backreference to it is simply empty.
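Both backreference examples can be tried with the sketch below, which assumes Python's re module. The mismatched input string is made up to show what the backreference rules out.

```python
import re

tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")

# \1 must repeat the tag name captured by group 1.
matched = tag_pair.search("<B>This is a test</B>")
# The closing tag name differs from the opening one, so \1 cannot match.
mismatched = tag_pair.search("<B>This is a test</I>")

# The same group may be referenced several times.
triple_ok = re.fullmatch(r"([a-c])x\1x\1", "bxbxb")
triple_bad = re.fullmatch(r"([a-c])x\1x\1", "axbxa")

print(matched.group(0))  # <B>This is a test</B>
print(matched.group(1))  # B
print(mismatched)        # None
```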

A backreference cannot be used inside the group it refers to: <<([abc]\1)>> is an error. For the same reason, <<\0>> cannot be used inside the regular expression itself; it can only be used in a replacement operation.

A backreference cannot be used inside a character class. In <<(a)[\1b]>>, the <<\1>> is not a backreference. Inside a character class, <<\1>> may be interpreted as an octal escape.

Backreferences slow the engine down because it has to store the text matched by each group. If you do not need the backreference, you can tell the engine not to store it. For example: <<Get(?:Value)>>. The "?:" right after the "(" tells the engine not to store what the group (Value) matched for later reference.
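A small sketch of the difference, assuming Python's re module. It only shows which groups are stored and makes no claim about speed.

```python
import re

capturing = re.search(r"Get(Value)", "GetValue")
non_capturing = re.search(r"Get(?:Value)", "GetValue")

# Both patterns match the same text ...
print(capturing.group(0), non_capturing.group(0))  # GetValue GetValue
# ... but only the capturing version stores a group for later reference.
print(capturing.groups())      # ('Value',)
print(non_capturing.groups())  # ()
```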

    • Repetition and backreferences

When a repetition operator is applied to a group, the stored backreference is overwritten on each iteration, so only the text of the last iteration is kept. For example, <<([abc]+)=\1>> matches "cab=cab", but <<([abc])+=\1>> does not. In the second expression, ([abc]) first matches "c", so "\1" stands for "c"; the group then goes on to match "a" and "b". In the end "\1" stands for "b", so the expression matches "cab=b".

Application: checking for doubled words. When editing text it is easy to type a word twice, such as "the the". Such doubled words can be detected with <<\b(\w+)\s+\1\b>>. To remove the second word, simply replace the match with "\1" using the replace function.
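The two repetition cases and the doubled-word cleanup can be sketched as follows, assuming Python's re module, where the replacement string r"\1" refers to the first group. The sample sentence is made up.

```python
import re

# Quantifier inside the group: \1 holds the whole run "cab".
whole_run = re.fullmatch(r"([abc]+)=\1", "cab=cab")

# Quantifier outside the group: \1 holds only the last iteration, "b".
last_only = re.fullmatch(r"([abc])+=\1", "cab=b")
not_whole = re.fullmatch(r"([abc])+=\1", "cab=cab")

# Doubled-word removal: replace "word word" with a single "word".
deduped = re.sub(r"\b(\w+)\s+\1\b", r"\1", "Paris in the the spring")

print(whole_run.group(1))  # cab
print(last_only.group(1))  # b
print(not_whole)           # None
print(deduped)             # Paris in the spring
```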

    • Naming and referencing of groups

In PHP and Python, you can name a group with <<(?P<name>group)>>. The syntax ?P<name> names the group, where name is the name you choose for it. You can refer back to the group with (?P=name).
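A minimal sketch of this syntax in Python's re module. The group name word and the sample sentence are made up; \g<word> is Python's way of referring to a named group in a replacement string.

```python
import re

# (?P<word>...) names the group, (?P=word) refers back to it inside the regex.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

found = doubled.search("it was the the best")
fixed = doubled.sub(r"\g<word>", "it was the the best")

print(found.group("word"))  # the
print(fixed)                # it was the best
```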

.NET named groups

The .NET framework also supports named groups. Unfortunately, Microsoft's programmers decided to invent their own syntax rather than follow the rules of Perl and Python. When this tutorial was written, no other regular expression implementation supported Microsoft's syntax.

Here is an example in .NET:

(?<first>group)(?'second'group)

As you can see, .NET provides two syntaxes for creating a named group: one uses angle brackets "<>", the other uses single quotation marks "''". Angle brackets are easier to use in strings, and single quotes are more useful in ASP code, because "<>" in ASP code is treated as an HTML tag.

To refer back to a named group, use \k<name> or \k'name'.

In a search-and-replace operation, you can refer to a named group with "${name}".

12. Regular expression matching modes

The regular expression engines discussed in this tutorial support three matching modes:

<</i>> makes the regular expression case insensitive.

<</s>> turns on "single-line mode", in which the dot "." also matches newline characters.

<</m>> turns on "multi-line mode", in which "^" and "$" also match just after and just before each newline character.

    • Turning a mode on or off inside a regular expression

If you insert a modifier such as (?ism) inside a regular expression, it applies only to the part of the expression to its right. (?-i) turns case insensitivity off. You can test this quickly: <<(?i)te(?-i)st>> should match "test" and "TEst", but not "teST" or "TEST".
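The three modes and a scoped mode switch can be sketched in Python as follows. This is an illustration under stated assumptions: Python spells the modes as the flags re.IGNORECASE, re.DOTALL and re.MULTILINE, and its re module does not offer a standalone <<(?-i)>> switch, so the sketch uses the scoped form <<(?i:te)st>> (Python 3.6 or later) to get the same effect as <<(?i)te(?-i)st>>. The sample text is made up.

```python
import re

text = "first line\nSecond line"

# /i: ignore case.
case_hit = re.search(r"second", text, re.IGNORECASE) is not None

# /s: the dot also matches the newline.
dot_default = re.search(r"first.*Second", text) is not None
dot_single_line = re.search(r"first.*Second", text, re.DOTALL) is not None

# /m: ^ also matches right after each newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Case insensitivity limited to "te"; "st" stays case sensitive.
scoped = re.compile(r"(?i:te)st")
results = {w: scoped.fullmatch(w) is not None
           for w in ["test", "TEst", "teST", "TEST"]}

print(case_hit, dot_default, dot_single_line)  # True False True
print(line_starts)                             # ['first', 'Second']
print(results)
```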

13. Atomic groups and preventing backtracking

In some special cases, backtracking can make the engine extremely inefficient.

Let's look at an example: we want to match a string whose fields are delimited by commas and whose 12th field starts with "P".

An obvious regular expression is <<^(.*?,){11}P>>. It works well under normal circumstances. However, when the 12th field does not start with "P", catastrophic backtracking occurs. Suppose the string to search is "1,2,3,4,5,6,7,8,9,10,11,12,13". At first the regular expression matches successfully all the way to the 12th field. At that point the group has consumed "1,2,3,4,5,6,7,8,9,10,11,", and the next token <<P>> does not match the "1" of "12". So the engine backtracks, giving up the last comma, and the consumed text becomes "1,2,3,4,5,6,7,8,9,10,11". Matching continues: the next token is the dot <<.>>, which can match the comma ",". The following <<,>>, however, does not match the "1" of "12". The match fails again, and the engine keeps backtracking. As you can imagine, the number of such backtracking combinations is enormous, and it can bring the engine down.

There are several ways to prevent this kind of runaway backtracking:

A simple solution is to make the match as precise as possible: use a negated character class instead of the dot. For example, with the regular expression <<^([^,\r\n]*,){11}P>>, the number of backtracking steps after a failure drops to 11.
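The sketch below compares the two expressions, assuming Python's re module. It does not measure backtracking; it only shows that the negated character class is also more precise. The strings named good and shifted are made-up variations of the example string.

```python
import re

lazy_dot = re.compile(r"^(.*?,){11}P")
negated = re.compile(r"^([^,\r\n]*,){11}P")

good = "1,2,3,4,5,6,7,8,9,10,11,P12,13"     # 12th field starts with P
bad = "1,2,3,4,5,6,7,8,9,10,11,12,13"       # no field starts with P
shifted = "1,2,3,4,5,6,7,8,9,10,11,12,P13"  # 13th field starts with P

for label, s in [("good", good), ("bad", bad), ("shifted", shifted)]:
    print(label, bool(lazy_dot.match(s)), bool(negated.match(s)))
# good True True
# bad False False
# shifted True False   (the lazy dot stretched one iteration over "11,12,")
```

Because the dot can also match commas, backtracking lets one repetition of the lazy version swallow two fields, which is why it accepts the shifted string. The negated class can never cross a comma.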

Another option is to use atomic groups.

The purpose of an atomic group is to make the engine fail faster, which effectively prevents runaway backtracking. The syntax of an atomic group is <<(?>regex)>>. Everything between (?> and ) is treated as a single token. Once the match fails after the group, the engine backtracks straight to the part of the regular expression in front of the atomic group, without trying other ways of matching inside it. The preceding example can be written with an atomic group as <<^(?>(.*?,){11})P>>. Once the 12th field fails to match, the engine backtracks to the <<^>> in front of the atomic group.
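Python accepts <<(?>...)>> directly only from version 3.11 on. As a sketch that also runs on older versions, the example below emulates the atomic group with a lookahead plus a backreference, <<(?=(regex))\1>>. This emulation is a substitute chosen for illustration, not the syntax described above; it relies on the engine not backtracking into a lookahead once it has succeeded. The test strings are made up.

```python
import re

# Emulated form of ^(?>(.*?,){11})P
# The lookahead captures the first way of matching eleven fields; \1 then
# consumes exactly that text, so no other split is ever tried.
atomic = re.compile(r"^(?=((?:.*?,){11}))\1P")

good = "1,2,3,4,5,6,7,8,9,10,11,P12,13"
bad = "1,2,3,4,5,6,7,8,9,10,11,12,13"
shifted = "1,2,3,4,5,6,7,8,9,10,11,12,P13"

print(bool(atomic.match(good)))     # True
print(bool(atomic.match(bad)))      # False, and it fails without retrying
print(bool(atomic.match(shifted)))  # False, the group cannot be re-split
```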

14. Lookahead and lookbehind

Perl 5 introduced two powerful regex constructs: "lookahead" and "lookbehind". They are also called "zero-length assertions". Like anchors, they are zero-length (zero-length means that the regular expression does not consume any of the string). The difference is that lookahead and lookbehind actually match characters, but then give up the match and return only the result: match or no match. This is why they are called "assertions". They do not consume characters in the string; they only assert whether a match is possible.

Almost all of the regular expression implementations discussed in this article support lookahead and lookbehind. When this tutorial was written, the only exception was JavaScript, which supported only lookahead.

    • Positive and negative lookahead

Recall an example mentioned earlier: finding a "q" that is not followed by a "u". That is, either nothing follows the "q", or the character that follows is not "u". A solution using negative lookahead is <<q(?!u)>>. The syntax of negative lookahead is <<(?!regex)>>.

Positive lookahead is similar to negative lookahead; its syntax is <<(?=regex)>>.

If the expression inside the lookahead contains a capturing group, that group still produces a backreference. The lookahead itself, however, does not produce a backreference and is not counted in the group numbering, because the text matched by the lookahead is discarded and only the result of the match is kept. If you want to keep what the lookahead matched as a backreference, use <<(?=(regex))>>.

    • Positive and negative lookbehind

Lookbehind has the same effect as lookahead, except that it works in the opposite direction.

The syntax of negative lookbehind is <<(?<!regex)>>.

The syntax of positive lookbehind is <<(?<=regex)>>.

Compared with lookahead, there is an extra left angle bracket, which indicates the direction.

Example: <<(?<!a)b>> matches a "b" that is not preceded by an "a".

Note that lookahead starts matching its inner regular expression at the current position in the string, whereas lookbehind steps back one character from the current position and then starts matching its inner regular expression.

    • Deep inside the regular expression engine

Let's look at a simple example.

We apply the regular expression <<q(?!u)>> to the string "Iraq". The first token of the regular expression is <<q>>. As we know, the engine scans through the string until <<q>> matches. That happens at the fourth character "q", which is followed by nothing (void). The next token is the lookahead. The engine notes that it is now inside a lookahead and moves on to the token inside it, <<u>>. <<u>> does not match the void, so the match inside the lookahead fails. Because the lookahead is negative, this failure means that the lookahead as a whole succeeds. The match "q" is therefore returned.

Now let's apply the same regular expression to "quit". <<q>> matches "q". The next token is the <<u>> inside the lookahead, which matches the second character "u" of the string. The engine advances to the next character "i". At this point, however, the engine notes that the inside of the lookahead has been processed completely and has matched. It then discards the text matched by the lookahead, which takes it back to the character "u".

Because the lookahead is negative, a successful match inside it means that the lookahead as a whole fails, so the engine has to backtrack. Finally, since there is no other "q" for <<q>> to match, the entire match fails.

To make sure the workings of lookahead are clear, let's apply <<q(?=u)i>> to "quit". <<q>> first matches "q". Then the lookahead successfully matches "u"; the matched text is discarded, and only the verdict that a match is possible is kept. The engine steps back from the character "i" to "u". Since the lookahead succeeded, the engine goes on to the next token <<i>>, only to find that <<i>> does not match "u". So this attempt fails. Because there is no other "q", the entire regular expression fails to match.
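The three walkthroughs above can be reproduced with a short sketch, assuming Python's re module. The last line is an extra check added for illustration: it shows that a positive lookahead leaves "u" out of the match.

```python
import re

negative_iraq = re.search(r"q(?!u)", "Iraq")     # q followed by nothing
negative_quit = re.search(r"q(?!u)", "quit")     # q followed by u: rejected
positive_then_i = re.search(r"q(?=u)i", "quit")  # after the lookahead, i meets "u"
positive_only = re.search(r"q(?=u)", "quit")     # u is checked but not consumed

print(negative_iraq.span())   # (3, 4)
print(negative_quit)          # None
print(positive_then_i)        # None
print(positive_only.group())  # q
```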

    • Further understanding of the internal mechanism of the regular expression engine

Let's apply <<(?<=a)b>> to "thingamabob". The engine starts with the lookbehind and the first character of the string. In this example, the lookbehind tells the engine to step back one character and check whether an "a" can be matched there. The engine cannot step back because there is no character before "t", so the lookbehind fails. The engine moves on to the next character "h". Again it temporarily steps back one character and checks whether an "a" can be matched. It finds a "t", so the lookbehind fails again.

The lookbehind keeps failing until the engine reaches the "m" in the string, where the positive lookbehind matches. Because the lookbehind is zero-length, the current position in the string is still "m". The next token is <<b>>, which fails to match "m". The next character is the second "a" in the string. The engine temporarily steps back one character and finds that <<a>> does not match "m".

The next character is the first "b" in the string. The engine temporarily steps back one character, finds that the lookbehind is satisfied, and <<b>> matches "b". The entire regular expression has therefore matched, and the result is the first "b" in the string.
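A short sketch of positive and negative lookbehind on the same string, assuming Python's re module. Positions are 0-based, so the first "b" is at index 8 and the second at index 10.

```python
import re

word = "thingamabob"

# Positive lookbehind: a "b" that is preceded by "a" (the first b).
after_a = re.search(r"(?<=a)b", word)

# Negative lookbehind: a "b" that is not preceded by "a" (the second b).
not_after_a = re.search(r"(?<!a)b", word)

print(after_a.span())      # (8, 9)
print(not_after_a.span())  # (10, 11)
```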

    • Applications of lookahead and lookbehind

Let's look at an example: find a six-letter word that contains "cat".

First, we can solve the problem without lookahead or lookbehind, for example:

<<cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat>>

Easy enough! But this approach becomes unwieldy when the requirement changes to finding a word of 6 to 12 letters that contains "cat", "dog", or "mouse".

Now let's look at the approach that uses lookahead. In this example there are two basic requirements to satisfy: first, the word must be six letters long; second, the word must contain "cat".

A regular expression that satisfies the first requirement is <<\b\w{6}\b>>. One that satisfies the second is <<\b\w*cat\w*\b>>.

Combining the two, we get the following regular expression:

<<(?=\b\w{6}\b)\b\w*cat\w*\b>>

Working through the matching process is left to the reader. Note, however, that lookahead does not consume characters, so once the lookahead has confirmed that the word has six letters, the engine continues matching the rest of the regular expression from the position where the lookahead started.
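The combined expression can be tried out with the sketch below, assuming Python's re module. The sample sentences are made up, and the second pattern is only one possible way to extend the idea to the 6-to-12-letter requirement mentioned earlier.

```python
import re

six_with_cat = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
words = six_with_cat.findall("bobcat catnip concatenate scatty cats")

# Hypothetical extension: 6 to 12 letters, containing cat, dog or mouse.
longer = re.compile(r"(?=\b\w{6,12}\b)\b\w*(?:cat|dog|mouse)\w*\b")
more_words = longer.findall("concatenate bulldog cats dog mousetrap")

print(words)       # ['bobcat', 'catnip', 'scatty']
print(more_words)  # ['concatenate', 'bulldog', 'mousetrap']
```

Only the lookahead changes between the two patterns; the length test and the content test stay independent of each other.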

Finally, with some optimization we get the following regular expression:

<<\b(?=\w{6}\b)\w{0,3}cat\w*>>

15. Conditionals in regular expressions

The syntax of a conditional is <<(?(if)then|else)>>. The "if" part can be a lookahead or lookbehind expression. With a lookahead, the syntax becomes <<(?(?=regex)then|else)>>, where the else part is optional.

If the if part is true, the regex engine tries to match the then part; otherwise it tries to match the else part.

Remember that lookahead does not consume any characters, so the then or else part that follows is matched starting from the position where the if test began.
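Python's re module supports conditionals only on whether a capturing group has matched, not on a lookahead. As a sketch under that limitation, the example below rewrites <<(?(?=if)then|else)>> as the alternation <<(?:(?=if)then|(?!if)else)>>, which behaves the same way: the then branch is tried only where the lookahead succeeds, the else branch only where it fails. The pattern itself (four digits if the text starts with a digit, lowercase letters otherwise) is made up for illustration.

```python
import re

# if: next character is a digit; then: exactly four digits; else: letters.
conditional = re.compile(r"^(?:(?=\d)\d{4}|(?!\d)[a-z]+)$")

samples = ["2024", "abc", "12ab", "20245"]
verdicts = {s: conditional.match(s) is not None for s in samples}

print(verdicts)
# {'2024': True, 'abc': True, '12ab': False, '20245': False}
```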

16. Adding comments to a regular expression

The syntax for adding a comment to a regular expression is <<(?#comment)>>.

Example: adding comments to a regular expression that matches a valid date:

(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])
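Python's re module also accepts the <<(?#comment)>> syntax, so the commented date expression can be run as it stands. A minimal sketch with made-up sample dates:

```python
import re

date = re.compile(
    r"(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])"
)

valid = date.fullmatch("2006-05-09")
invalid_month = date.fullmatch("2006-13-09")

# The comments are ignored; group 1 is the (19|20) part of the year.
print(valid.groups())  # ('20', '05', '09')
print(invalid_month)   # None
```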
