Regular Expressions: A Detailed Introduction (Part 2)


This article is the sequel to the previous article, the first part of this detailed introduction to regular expressions. It covers word boundaries, alternation, groups and backreferences, matching modes, atomic groups, lookahead and lookbehind, conditional tests, and comments, with examples, and it analyzes how the regex engine works internally while performing a match.
9. Word Boundaries

The metacharacter <<\b>> is also an "anchor" that matches a position. Such a match has zero length.

There are 4 types of locations that are considered "word boundaries":

1 The position before the first character of the string (if the first character is a "word character")

2 The position after the last character of the string (if the last character is a "word character")

3 The position between a "word character" and a "non-word character", where the "non-word character" immediately follows the "word character"

4 The position between a "non-word character" and a "word character", where the "word character" immediately follows the "non-word character"

A "word character" is a character that can be matched by "\w", and a "non-word character" is a character that can be matched by "\W". In most regular expression implementations, "word characters" usually include <<[a-zA-Z0-9_]>>.

For example, <<\b4\b>> matches a single 4 rather than a part of a larger number. This regular expression does not match either 4 in "44".

In other words, you can almost say that <<\b>> matches the start and end positions of an alphanumeric sequence.

The negation of the word boundary is <<\B>>. It matches at any position between two "word characters" or between two "non-word characters".
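The two anchors can be tried out with a short sketch. It assumes Python's built-in re module, and the sample strings are made up for illustration. Offsets are 0-based.

```python
import re

text = "4 44 a4 4.5"

# \b4\b only matches a 4 that has a word boundary on both sides.
standalone_starts = [m.start() for m in re.finditer(r"\b4\b", text)]
print(standalone_starts)  # [0, 8]

# \B is the opposite: here "is" must sit strictly inside a longer word.
inner_starts = [m.start() for m in re.finditer(r"\Bis\B", "This mist is")]
print(inner_starts)  # [6]
```

Only the 4 at the start and the 4 in "4.5" qualify, because "." is a non-word character. The "is" in "This" and the standalone "is" are rejected by <<\B>> because each touches a word boundary.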

Deep inside the regular expression engine

Let's apply the regular expression <<\bis\b>> to the string "This island is beautiful". The engine processes the token <<\b>> first. Since <<\b>> has zero length, the position before the first character "T" is examined. Because "T" is a "word character" and nothing precedes it, <<\b>> matches the word boundary. The next token <<i>> then fails to match the first character "T". The engine moves on through the string, and <<\b>> fails at each position until the one between the fourth character "s" and the fifth character, a space, where it matches.

However, <<i>> does not match the space. The engine continues to the sixth character "i": <<\b>> matches between the space and "i", and <<is>> then matches the sixth and seventh characters. But there is no word boundary between the seventh character "s" and the eighth character "l", so the second <<\b>> fails and the attempt fails again. At the 13th character "i", <<\b>> matches because the preceding space and "i" form a word boundary, and <<is>> matches "is". The engine then tries the second <<\b>>. Because "s" and the 15th character, a space, form a word boundary, the match succeeds. The engine is "eager" and returns this successful match immediately.
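The walkthrough above can be checked with a small sketch, again assuming Python's re module. Note that Python reports 0-based offsets, while the text counts characters from 1, so the 13th character sits at offset 12.

```python
import re

text = "This island is beautiful"

# Every position where \b succeeds (these are zero-length matches).
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 4, 5, 11, 12, 14, 15, 24]

# The first place where \b, "is" and \b all succeed in sequence.
match = re.search(r"\bis\b", text)
print(match.span())  # (12, 14)
```

The "is" inside "This" and the one at the start of "island" are skipped because each has a word boundary on one side only.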

10. Alternation

The "|" character in a regular expression represents alternation. With it you can match one of several possible regular expressions.

If you want to search for the words "cat" or "dog", you can use <<cat|dog>>. If you want more options, just extend the list: <<cat|dog|mouse|fish>>.

Alternation has the lowest precedence of all regex operators: it tells the engine to match either everything to the left of the "|" or everything to its right. You can use parentheses to limit the scope of the alternation, as in <<\b(cat|dog)\b>>, which tells the engine to treat (cat|dog) as a single unit.
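A minimal sketch of the precedence rule, assuming Python's re module and invented sample text. Without parentheses, each <<\b>> belongs to only one alternative.

```python
import re

text = "catalog bulldog"

# Read as (\bcat) | (dog\b): each anchor applies to one side only.
ungrouped = re.findall(r"\bcat|dog\b", text)
print(ungrouped)  # ['cat', 'dog']

# With parentheses, both anchors apply to whichever alternative matches.
grouped = re.findall(r"\b(cat|dog)\b", text)
print(grouped)  # []

whole_words = re.findall(r"\b(cat|dog)\b", "cat dogma dog")
print(whole_words)  # ['cat', 'dog']
```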

Pay attention to the "eager" nature of the regex engine

The regex engine is eager: as soon as it finds a valid match, it stops searching. Therefore, under certain conditions, the order of the alternatives affects the result. Suppose you want to use a regular expression to search for a list of functions in a programming language: Get, GetValue, Set or SetValue. An obvious solution is <<Get|GetValue|Set|SetValue>>. Let's look at the result when searching "SetValue".

<<Get>> and <<GetValue>> both fail, and then <<Set>> matches. Because a regex-directed engine is "eager", it returns the first successful match, "Set", rather than continuing to look for a better one.

Contrary to our expectations, the regular expression does not match the entire string. There are several possible solutions. One is to take the engine's eagerness into account and reorder the alternatives, for example <<GetValue|Get|SetValue|Set>>, so that the longest alternative is tried first. We can also combine the four alternatives into two: <<Get(Value)?|Set(Value)?>>. Because the question mark is greedy, "SetValue" is always tried before "Set".
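The three variants can be compared directly. This is a sketch assuming Python's re module, which is a regex-directed (backtracking) engine of the kind the text describes.

```python
import re

subject = "SetValue"

# Alternatives are tried left to right; the first one that works wins.
naive = re.search(r"Get|GetValue|Set|SetValue", subject).group()
print(naive)  # Set

# Longest alternatives first.
reordered = re.search(r"GetValue|Get|SetValue|Set", subject).group()
print(reordered)  # SetValue

# Greedy optional suffix.
optional = re.search(r"Get(Value)?|Set(Value)?", subject).group()
print(optional)  # SetValue
```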

11. Groups and backreferences

Put part of a regular expression in parentheses and it becomes a group. You can then apply regex operators, such as repetition, to the entire group.

Note that only parentheses "()" can be used to form a group. "[]" defines a character class, and "{}" defines a repetition count.

When a group is defined with "()", the regex engine numbers the groups sequentially and stores the text each one matched. A matched group can then be referenced with the "\number" notation: <<\1>> refers to the first group, <<\2>> to the second, and so on, with <<\n>> referring to the nth group. <<\0>> refers to the entire match. Let's look at an example.

Suppose you want to match the start and end tags of an HTML element, as well as the text between them. For example, in "<B>This is a test</B>" we want to match <B> and </B> as well as the text in between. We can use the following regular expression: "<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>"

First, "<" matches the first character "<" of "<B>". Then [A-Z] matches "B", [A-Z0-9]* matches zero or more alphanumeric characters, and [^>]* matches zero or more characters that are not ">". The ">" that follows in the regular expression matches the ">" of "<B>". The regex engine then lazily matches the characters before the end tag until it encounters "</". The "\1" in the regular expression then refers to the text matched by the group "([A-Z][A-Z0-9]*)", which in this case is the tag name "B". So the end tag that has to be matched is "</B>".
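A sketch of this tag-matching pattern in Python's re module. It assumes an uppercase tag name such as <B>, and the sample strings are illustrative.

```python
import re

tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")

matched = tag_pair.search("<B>This is a test</B>")
print(matched.group(0))  # <B>This is a test</B>
print(matched.group(1))  # B

# The closing tag must repeat the captured name, so this does not match.
mismatched = tag_pair.search("<B>bold</I>")
print(mismatched)  # None
```

This is only meant to illustrate the backreference. A pattern like this is not a general HTML parser.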

You can reference the same group more than once. <<([a-c])x\1x\1>> will match "axaxa", "bxbxb" and "cxcxc". If a group referenced by number did not take part in the match, the referenced content is simply empty.

A backreference cannot be used inside the group it refers to: <<([abc]\1)>> is an error. Likewise, <<\0>> cannot be used inside the regular expression to refer to the whole match; it can only be used in replacement operations.

A backreference cannot be used inside a character class. The <<\1>> in <<(a)[\1b]>> is not a backreference. Inside a character class, <<\1>> may be interpreted as an octal escape.

Backreferences slow the engine down because it has to store the matched groups. If you do not need a backreference, you can tell the engine not to store a group, for example <<Get(?:Value)>>. The "?:" right after the "(" tells the engine that the text matched by the group (Value) should not be stored for backreferences.
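The effect of "?:" is visible in the number of groups a compiled pattern reports. A minimal sketch, assuming Python's re module:

```python
import re

capturing = re.compile(r"Get(Value)")
non_capturing = re.compile(r"Get(?:Value)")

# .groups is the number of capturing groups in the pattern.
print(capturing.groups)      # 1
print(non_capturing.groups)  # 0

# Both patterns still match the same text.
whole = non_capturing.fullmatch("GetValue").group(0)
print(whole)  # GetValue
```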

(1) Repetition and backreferences

When a repetition operator is applied to a group, the stored backreference is overwritten on every iteration and only the last match is kept. For example, <<([abc]+)=\1>> will match "cab=cab", but <<([abc])+=\1>> will not. The reason is that ([abc]) first matches "c", so "\1" holds "c"; the group then goes on to match "a" and "b", so in the end "\1" holds "b". The second expression therefore matches "cab=b".
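The difference between repeating inside the group and repeating the group itself can be checked with a short sketch, assuming Python's re module:

```python
import re

inside = re.compile(r"([abc]+)=\1")   # the group holds the whole run
outside = re.compile(r"([abc])+=\1")  # the group holds only the last letter

inside_full = inside.fullmatch("cab=cab")
outside_full = outside.fullmatch("cab=cab")
outside_last = outside.fullmatch("cab=b")

print(inside_full.group(1))   # cab
print(outside_full)           # None
print(outside_last.group(1))  # b
```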

Application: checking for repeated words. When editing text it is easy to type a word twice, as in "the the". Such duplicates can be detected with <<\b(\w+)\s+\1\b>>. To remove the second word, simply replace the match with "\1" using a replacement function.
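A sketch of the duplicate-word cleanup, assuming Python's re.sub as the replacement function. The sample sentence is made up.

```python
import re

doubled_word = re.compile(r"\b(\w+)\s+\1\b")

sentence = "Paris in the the spring"

# Replace each "word word" pair with a single copy of the word.
cleaned = doubled_word.sub(r"\1", sentence)
print(cleaned)  # Paris in the spring

# The trailing \b stops "the theory" from being treated as a duplicate.
untouched = doubled_word.sub(r"\1", "the theory")
print(untouched)  # the theory
```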

(2) Naming and referencing groups

In PHP and Python you can name a group with <<(?P<name>group)>>. Here the syntax ?P<name> names the group, where name is the name you choose. You can refer to it with <<(?P=name)>>.
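In Python the named-group syntax looks like this. The group name "word" and the sample text are assumptions for illustration; the \g<word> form used in the replacement is Python's own way of referring to a named group in a replacement string.

```python
import re

doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

found = doubled.search("it is is fine")
print(found.group("word"))  # is
print(found.group(0))       # is is

# In a replacement string, Python refers to a named group with \g<name>.
fixed = doubled.sub(r"\g<word>", "it is is fine")
print(fixed)  # it is fine
```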

.NET named groups

The .NET framework also supports named groups. Unfortunately, Microsoft's programmers decided to invent their own syntax instead of following the Perl and Python conventions. When this tutorial was written, no other regex flavor supported the syntax Microsoft invented (several flavors have adopted it since).

Here is a .NET example:

(?<first>group)(?'second'group)

As you can see, .NET provides two syntaxes for creating named groups: angle brackets "<>" or single quotation marks "'". Angle brackets are easier to use in strings, while single quotes are more useful in ASP code, because "<>" is used for HTML tags there.

To refer to a named group, use \k<name> or \k'name'.

In search-and-replace operations, you can use "${name}" to refer to a named group.

12. Matching modes of regular expressions

The regular expression engines discussed in this tutorial support three matching modes:

<</i>> makes the regular expression case insensitive.

<</s>> turns on "single-line mode", in which the dot "." also matches newline characters.

<</m>> turns on "multi-line mode", in which "^" and "$" match at the positions just after and just before each newline character.
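Python does not use the /i, /s, /m suffix notation. As a sketch, the same three modes are passed as flags of the re module: re.IGNORECASE, re.DOTALL and re.MULTILINE.

```python
import re

# /i: case-insensitive matching
case_insensitive = re.search(r"test", "TEST", re.IGNORECASE)
print(case_insensitive.group())  # TEST

# /s: the dot also matches a newline
dot_default = re.search(r"a.b", "a\nb")
dot_single_line = re.search(r"a.b", "a\nb", re.DOTALL)
print(dot_default, dot_single_line is not None)  # None True

# /m: ^ and $ also match at line breaks
lines_default = re.findall(r"^\w+$", "one\ntwo")
lines_multiline = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)
print(lines_default, lines_multiline)  # [] ['one', 'two']
```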

Turning modes on or off inside a regular expression

If you insert a modifier such as (?ism) inside a regular expression, the modifier applies only to the part of the regular expression to its right. (?-i) turns case insensitivity off. You can test this quickly: <<(?i)te(?-i)st>> should match "test" and "TEst", but not "teST" or "TEST".
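Flavors differ in how they scope inline modifiers. As a hedged sketch, Python's re module (3.6 and later) expresses the same idea with a scoped group, (?i:...), rather than switching the mode on and off mid-pattern, so the pattern below is an adaptation of the one in the text and not the same syntax.

```python
import re

# Only "te" is matched case-insensitively; "st" stays case sensitive.
mixed_case = re.compile(r"(?i:te)st")

results = {s: mixed_case.fullmatch(s) is not None
           for s in ["test", "TEst", "teST", "TEST"]}
print(results)
# {'test': True, 'TEst': True, 'teST': False, 'TEST': False}
```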

13. Atomic groups and preventing backtracking

In some special cases, backtracking can make the engine extremely inefficient.

Let's look at an example: we want to match a string in which the fields are delimited by commas and the 12th field begins with "P".

An obvious regular expression is <<^(.*?,){11}P>>. It works well under normal circumstances. In the extreme case, however, catastrophic backtracking occurs when the 12th field does not begin with "P". Suppose the string to search is "1,2,3,4,5,6,7,8,9,10,11,12,13". First, the regular expression matches successfully up to the 12th field. At that point it has consumed "1,2,3,4,5,6,7,8,9,10,11,", and the next token <<P>> does not match the "1" of "12". So the engine backtracks to the state in which it had consumed "1,2,3,4,5,6,7,8,9,10,11". It then tries the next option: the dot <<.>> can match the following comma ",". However, <<,>> does not match the "1" of "12". The match fails and backtracking continues. As you can imagine, the number of such backtracking combinations is very large. This can bring the engine to a standstill or even crash it.

There are several ways to prevent this kind of massive backtracking:

A simple solution is to make the match as precise as possible: use a negated character class instead of the dot. For example, with the regular expression <<^([^,\r\n]*,){11}P>>, the number of backtracking steps on failure is reduced to 11.
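The two patterns can be compared in a small sketch, assuming Python's re module. The input strings are illustrative and short enough that both patterns finish quickly. Besides being slower on failure, the dot version can also give a wrong answer, because a lazy dot is still allowed to swallow a comma.

```python
import re

lazy_dot = re.compile(r"^(.*?,){11}P")
negated = re.compile(r"^([^,\r\n]*,){11}P")

twelfth_has_p = "1,2,3,4,5,6,7,8,9,10,11,P12,13"
thirteenth_has_p = "1,2,3,4,5,6,7,8,9,10,11,12,P13"

# Both accept a 12th field that starts with P.
both_accept = (lazy_dot.match(twelfth_has_p) is not None
               and negated.match(twelfth_has_p) is not None)
print(both_accept)  # True

# Only the dot version is fooled when P starts the 13th field:
# after backtracking, one repetition matches across two fields.
dot_fooled = lazy_dot.match(thirteenth_has_p) is not None
negated_fooled = negated.match(thirteenth_has_p) is not None
print(dot_fooled, negated_fooled)  # True False
```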

Another solution is to use atomic groups.

The purpose of an atomic group is to make the regex engine fail faster, so it can effectively prevent massive backtracking. The syntax of an atomic group is <<(?>regex)>>. Everything between (?> and ) is treated as a single token: once the engine has left the group, it discards the backtracking positions inside it. If a later token fails, the engine backtracks directly to the part of the regular expression before the atomic group. With an atomic group, the previous example becomes <<^(?>(.*?,){11})P>>. Once the 12th field fails to match, the engine goes straight back to the <<^>> in front of the atomic group.
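Python's re module accepts the (?>...) syntax only from version 3.11 onward. To keep the sketch runnable on older versions, it swaps in a commonly used emulation: a lookahead that captures the text, followed by a backreference to it, (?=(...))\1. This relies on lookahead being atomic in Python. It is an illustration of the behavior, not the syntax from the text.

```python
import re

plain = re.compile(r"^(.*?,){11}P")

# Emulated atomic group: the lookahead captures the first 11 fields once,
# and \1 then consumes exactly that text with nothing left to retry.
atomic_like = re.compile(r"^(?=((?:.*?,){11}))\1P")

twelfth_has_p = "1,2,3,4,5,6,7,8,9,10,11,P12,13"
thirteenth_has_p = "1,2,3,4,5,6,7,8,9,10,11,12,P13"

atomic_accepts_12th = atomic_like.match(twelfth_has_p) is not None
print(atomic_accepts_12th)  # True

# The plain version backtracks into the repetition and finds a match;
# the atomic-like version gives up as soon as P fails after field 11.
plain_backtracks = plain.match(thirteenth_has_p) is not None
atomic_backtracks = atomic_like.match(thirteenth_has_p) is not None
print(plain_backtracks, atomic_backtracks)  # True False
```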

14. Lookahead and lookbehind

Perl 5 introduced two powerful constructs: "lookahead" and "lookbehind", collectively called "lookaround". They are also known as "zero-length assertions". Like anchors, they have zero length (zero length means the construct does not consume any of the string). The difference is that lookaround actually matches characters, but then discards the match and returns only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string; they only assert whether a match is possible.

Almost all of the regular expression implementations discussed in this article support lookaround. The only exception is JavaScript, which supported only lookahead when this tutorial was written.

(1) Positive and negative lookahead

As mentioned earlier, suppose we want to find a "q" that is not followed by a "u". That is, either nothing follows the "q", or the following character is not "u". A solution using negative lookahead is <<q(?!u)>>. The syntax for negative lookahead is <<(?!regex)>>.

Positive lookahead is very similar: <<(?=regex)>>.

If the regex inside the lookahead contains a capturing group, that group creates a backreference as usual. The lookahead itself, however, does not create a backreference and is not counted in the backreference numbering. This is because the text matched by the lookahead is discarded, leaving only the verdict of match or no match. If you want to keep what the lookahead matched, use <<(?=(regex))>> to capture it as a backreference.
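A short sketch of negative lookahead and of capturing inside a lookahead, assuming Python's re module:

```python
import re

# Negative lookahead: a q that is not followed by u.
in_iraq = re.search(r"q(?!u)", "Iraq")
in_quit = re.search(r"q(?!u)", "quit")
print(in_iraq.group(), in_quit)  # q None

# A capturing group inside the lookahead keeps what the lookahead saw,
# even though the overall match is still only "q".
peek = re.search(r"q(?=(u))", "quit")
print(peek.group(0), peek.group(1))  # q u
```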

(2) Positive and negative lookbehind

Lookbehind has the same effect as lookahead, only in the opposite direction.

The syntax for negative lookbehind is: <<(?<!regex)>>

The syntax for positive lookbehind is: <<(?<=regex)>>

As you can see, compared with lookahead there is an extra left angle bracket indicating the direction.

Example: <<(?<!a)b>> matches a "b" that is not preceded by an "a".

Note that lookahead tries to match its regex starting at the current position in the string, whereas lookbehind first steps back one character from the current position and then tries to match its regex.
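Both lookbehind forms side by side, as a sketch assuming Python's re module and an invented sample string. Offsets are 0-based.

```python
import re

text = "cab bed"

# b NOT preceded by a: only the b in "bed" qualifies.
not_after_a = [m.start() for m in re.finditer(r"(?<!a)b", text)]
print(not_after_a)  # [4]

# b preceded by a: only the b in "cab" qualifies.
after_a = [m.start() for m in re.finditer(r"(?<=a)b", text)]
print(after_a)  # [2]
```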

(3) Deep inside the regular expression engine

Let's look at a simple example.

Apply the regular expression <<q(?!u)>> to the string "Iraq". The first token is <<q>>. As we know, the engine steps through the string until <<q>> matches. When the fourth character "q" is matched, nothing follows it (void). The next token is the lookahead, and the engine notes that it has entered a lookahead. The next token is <<u>>, which does not match the void, so the regex inside the lookahead fails. Because the lookahead is negative, this means the lookahead as a whole succeeds. So the match "q" is returned.

Now apply the same regular expression to "quit". <<q>> matches "q". The next token is the <<u>> inside the lookahead, which matches the second character "u" of the string. The engine advances to the next character, "i". However, the engine notices that the lookahead has now been completed, and that its regex matched. The engine discards the text matched inside the lookahead, which takes it back to the character "u".

Because the lookahead is negative, the successful match inside it causes the lookahead as a whole to fail, so the engine has to backtrack. Finally, since there is no other "q" for <<q>> to match, the whole match fails.

To make sure you clearly understand how lookahead works, let's apply <<q(?=u)i>> to "quit". <<q>> first matches "q". The lookahead then successfully matches "u"; the matched text is discarded and only the verdict "match" is kept. The engine steps back from the character "i" to "u". Because the lookahead succeeded, the engine continues with the next token, <<i>>. It finds that <<i>> does not match "u", so the attempt fails. Since there is no other "q" later in the string, the entire regular expression fails to match.
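That the lookahead consumes nothing can be seen from the match span. A minimal sketch, assuming Python's re module:

```python
import re

# The lookahead checks for "u" but the match itself is only "q".
only_q = re.search(r"q(?=u)", "quit")
print(only_q.span())  # (0, 1)

# After the lookahead the engine is still in front of "u", so "i" fails.
followed_by_i = re.search(r"q(?=u)i", "quit")
print(followed_by_i)  # None

# The "u" has to be matched again by a normal token to move past it.
rest = re.search(r"q(?=u)uit", "quit")
print(rest.group())  # quit
```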

(4) Further understanding of the internal mechanism of the regex engine

Let's apply <<(?<=a)b>> to "thingamabob". The engine starts with the lookbehind and the first character of the string. In this example the lookbehind tells the engine to step back one character and check whether an "a" matches there. The engine cannot step back because there is no character before "t", so the lookbehind fails. The engine moves on to the next character, "h". Again it temporarily steps back one character and checks whether an "a" matches. It finds a "t", so the lookbehind fails again.

The lookbehind keeps failing until the engine reaches the "m" in the string, where the positive lookbehind matches. Because the lookbehind is zero-length, the current position in the string is still "m". The next token is <<b>>, which fails to match "m". The next character is the second "a" in the string. The engine temporarily steps back one character and finds that <<a>> does not match "m".

The next character is the first "b" in the string. The engine temporarily steps back one character and finds that the lookbehind is satisfied, and <<b>> matches "b". So the entire regular expression matches, and it returns the first "b" in the string.
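The outcome of this walkthrough can be confirmed with a short sketch, assuming Python's re module. The offset is 0-based, so the first "b" of "thingamabob" is at offset 8.

```python
import re

first_b = re.search(r"(?<=a)b", "thingamabob")
print(first_b.group(), first_b.start())  # b 8

# The character just before the match is the "a" the lookbehind required.
preceding = "thingamabob"[first_b.start() - 1]
print(preceding)  # a
```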

(5) Applications of lookaround

Let's look at an example that finds a 6-letter word containing "cat".

First, we can solve the problem without lookaround, for example:

<<cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat>>

Easy enough! But when the requirement changes to finding a word of 6 to 12 letters that contains "cat", "dog" or "mouse", this approach becomes unwieldy.

Let's look at a solution that uses lookahead. In this example there are two basic requirements: first, the word must have 6 letters, and second, it must contain "cat".

The regular expression that satisfies the first requirement is <<\b\w{6}\b>>. The one that satisfies the second is <<\b\w*cat\w*\b>>.

Combining the two gives the following regular expression:

<<(?=\b\w{6}\b)\b\w*cat\w*\b>>

The detailed matching process is left to the reader. Note, however, that lookahead does not consume characters, so once a word has been found to satisfy the 6-letter condition, the engine continues matching the rest of the regular expression from the position where the lookahead started.

Finally, after some optimization, you get the following regular expression:

<<\b(?=\w{6}\b)\w{0,3}cat\w*>>
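The three approaches can be compared on a sample sentence. This is a sketch assuming Python's re module; the word list is made up, and the alternation version is wrapped in \b(?:...)\b here (an addition not in the text) so that it does not match inside longer words such as "concatenate".

```python
import re

text = "bobcat catnip cats concatenate muscat"

by_alternation = re.findall(
    r"\b(?:cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat)\b", text)
by_lookahead = re.findall(r"(?=\b\w{6}\b)\b\w*cat\w*\b", text)
optimized = re.findall(r"\b(?=\w{6}\b)\w{0,3}cat\w*", text)

print(by_alternation)  # ['bobcat', 'catnip', 'muscat']
print(by_lookahead)    # ['bobcat', 'catnip', 'muscat']
print(optimized)       # ['bobcat', 'catnip', 'muscat']
```

With the lookahead versions, widening the length requirement only means changing \w{6} inside the lookahead, while the alternation version would need a new alternative for every possible position of "cat".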

15. Conditional tests in regular expressions

The syntax for a conditional test is <<(?ifthen|else)>>. The "if" part can be a lookahead or lookbehind expression. With a lookahead, the syntax becomes <<(?(?=regex)then|else)>>, where the else part is optional.

If the if part evaluates to true, the regex engine attempts to match the then part; otherwise it attempts to match the else part.

Remember that lookahead does not consume any characters, so the then or else part is matched starting from the position where the if test began.
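Support for conditionals varies between flavors. Python's re module, used for this sketch, does not accept a lookaround as the "if" part; it only supports a condition on whether a numbered or named group took part in the match, written (?(1)then|else). The example below therefore illustrates the then/else mechanism with that form, not the lookahead form from the text. The pattern and inputs are made up: a number that may be wrapped in parentheses, but only if both parentheses are present.

```python
import re

# Group 1 is the optional "(". If it matched, a ")" is required later.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

outcomes = {s: balanced.match(s) is not None
            for s in ["42", "(42)", "(42", "42)"]}
print(outcomes)
# {'42': True, '(42)': True, '(42': False, '42)': False}
```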

16. Adding comments to a regular expression

The syntax for a comment inside a regular expression is: <<(?#comment)>>

Example: adding comments to a regular expression that matches a valid date:

(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])
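Python's re module accepts the (?#comment) syntax, so the commented date pattern can be tried as written. A sketch with made-up sample dates; note that the pattern checks the shape of a date only and does not validate, for example, the number of days in a month.

```python
import re

date = re.compile(
    r"(?#year)(19|20)\d\d[- /.]"
    r"(?#month)(0[1-9]|1[012])[- /.]"
    r"(?#day)(0[1-9]|[12][0-9]|3[01])"
)

# The comments are ignored; only the three capturing groups remain.
parts = date.fullmatch("2024-02-29").groups()
print(parts)  # ('20', '02', '29')

bad_month = date.fullmatch("2024-13-01")
print(bad_month)  # None
```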

This concludes the introduction to regular expressions. I hope it helps you.

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
