Detailed explanation of Regular Expression grouping and assertions

Source: Internet
Author: User

Tip: this article assumes a basic knowledge of regular expressions.

 

Assertions are usually presented as an advanced regular expression feature, not because they are complicated, but because the concept is abstract and hard to grasp. Today, let's explain them in a simple way.

Without assertions, an expression can only return text that follows a fixed pattern; it cannot return arbitrary text on its own.

For example, suppose the HTML source contains the tag <title>XXX</title>. The only parts we know in advance are the fixed <title> and </title>. So to obtain the page title (XXX), the best we can write is an expression like <title>.*</title>. But that matches the complete <title>XXX</title> tag, not just the page title XXX.
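The problem can be seen in a minimal sketch. The article names no programming language, so Python's `re` module and the sample HTML string are assumptions made for illustration.

```python
import re

# Hypothetical sample page, used only for illustration.
html = "<html><head><title>My Page</title></head></html>"

# Without assertions, the fixed tags are part of the match.
match = re.search(r"<title>.*</title>", html)
whole_tag = match.group(0)
print(whole_tag)  # <title>My Page</title>
```

The match contains the tags as well as the title text, which is exactly the limitation described above.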

To solve this problem, we need assertions.

Before talking about assertions, the reader should first understand grouping, which makes assertions easier to follow.

A group is written with () in a regular expression. As I understand it, a group has two functions:

 

1. Treat part of an expression as a unit, so that repetition can be applied to the whole group, sometimes with surprisingly useful results.

2. Once a group exists, refer back to it with a backreference to simplify the expression.

 

 

Let's look at the first function. A simple expression for matching an IP address can be written as follows:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

On closer inspection there is a pattern: \.\d{1,3} can be treated as a unit, that is, as a group, and that group is then repeated three times. The expression becomes:

\d{1,3}(\.\d{1,3}){3}

This is more concise.
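As a rough check that the grouped form accepts the same strings as the long form, here is a small Python sketch. The sample strings are made up for illustration, and the dots are escaped so that they match a literal ".".

```python
import re

# Long form and grouped form of the same IP address expression.
flat = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
grouped = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

# Hypothetical sample inputs; "10.0.0" has only three parts.
samples = ["192.168.0.1", "10.0.0", "1.2.3.4"]

flat_results = [bool(flat.fullmatch(s)) for s in samples]
grouped_results = [bool(grouped.fullmatch(s)) for s in samples]

print(flat_results)     # [True, False, True]
print(grouped_results)  # [True, False, True]
```

Neither form checks that each number is in the range 0 to 255; both only check the shape of the string.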

Now for the second function. To match the <title>XXX</title> tag, a simple regular expression can be written like this:

<title>.*</title>

The word title appears twice in this expression, exactly the same both times. A group lets us write it only once:

<(title)>.*</\1>

This is a practical use of a backreference. The entire expression always counts as group 0; here group 0 is <(title)>.*</\1>. The remaining groups are numbered from left to right, in the order of their opening parentheses, so the (title) group is group 1.

The \1 syntax refers to the text content of a group; \1 refers to the text matched by group 1. This simplifies the expression: write title once, put it in a group, and refer to it later.
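The group numbering can be inspected directly. The following is a sketch using Python's `re` module, which is an assumption since the article names no language; the input strings are invented for the example.

```python
import re

pattern = re.compile(r"<(title)>.*</\1>")

m = pattern.search("<head><title>My Page</title></head>")
group0 = m.group(0)  # the whole match
group1 = m.group(1)  # the text captured by the first pair of parentheses

print(group0)  # <title>My Page</title>
print(group1)  # title

# The closing tag must repeat the text captured by group 1,
# so a misspelled closing tag does not match.
mismatch = pattern.search("<title>My Page</titel>")
print(mismatch)  # None
```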

Inspired by this, can we simplify the IP address expression from before? The original expression is \d{1,3}(\.\d{1,3}){3}, in which \d{1,3} appears twice. Simplified with a backreference, it would look like this:

(\d{1,3})(\.\1){3}

Here \d{1,3} is placed in a group as (\d{1,3}), which is group 1, and (\.\1) is group 2. Inside group 2, the \1 syntax refers to the text content of group 1.

Testing shows that this expression is wrong. Why?

As I have been stressing, a backreference refers only to text content, not to the regular expression!

That is, once the group has matched, a backreference to it stands for the text that was actually matched. It references the result, not the expression.

Therefore, the expression (\d{1,3})(\.\1){3} only matches IP addresses made of four identical numbers, for example 123.123.123.123.

With that, the reader has mastered the legendary backreference. It really is that simple.
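The failure is easy to reproduce. The sketch below uses Python's `re` module (an assumption, as the article names no language). The last part, which reuses the sub-expression through string formatting, is a suggestion of this example and not something the article proposes.

```python
import re

# A backreference repeats the matched text, not the sub-expression.
backref = re.compile(r"(\d{1,3})(\.\1){3}")

same = backref.fullmatch("123.123.123.123")
different = backref.fullmatch("192.168.0.1")

print(bool(same))       # True
print(bool(different))  # False

# To reuse the expression itself, build the pattern from a string instead.
octet = r"\d{1,3}"
reuse = re.compile(rf"{octet}(\.{octet}){{3}}")
print(reuse.pattern)                          # \d{1,3}(\.\d{1,3}){3}
print(bool(reuse.fullmatch("192.168.0.1")))   # True
```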

 

Next, let's talk about assertion.

 

An assertion states that text matching a certain rule must appear before or after a string.

Take the example from the beginning of the article: we want XXX, which follows no fixed pattern, but <title> is always in front of it and </title> is always behind it. That is enough.

To require that <title> appears before XXX, use a positive lookbehind assertion: (?<=<title>).*

To require that </title> appears after XXX, use a positive lookahead assertion: .*(?=</title>)

Combining the two gives (?<=<title>).*(?=</title>)

In this way, XXX can be matched.
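Here is the combined expression in a short sketch, assuming Python's `re` module and an invented sample page. Python requires the lookbehind pattern to have a fixed width, which the literal <title> satisfies.

```python
import re

# Hypothetical sample page, used only for illustration.
html = "<html><head><title>My Page</title></head></html>"

# Lookbehind fixes what comes before; lookahead fixes what comes after.
m = re.search(r"(?<=<title>).*(?=</title>)", html)
title = m.group(0)

print(title)  # My Page
```

Only the text between the tags is returned; the tags themselves are checked but not included.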

The reader may well be confused at this point. Don't worry; let's take it slowly.

Once you know the rule it is very simple: whether an assertion looks ahead or behind is always relative to XXX, that is, relative to the target string.

If the condition comes after the target string, use a lookahead assertion and place it after the target in the expression.

If the condition comes before the target string, use a lookbehind assertion and place it before the target in the expression.

If the condition must be satisfied, the assertion is positive.

If the condition must not be satisfied, the assertion is negative.

An assertion is only a condition. It helps you find the string you actually need, but it is not itself part of the match!
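That an assertion consumes no text can be checked by using one on its own. This is a small sketch assuming Python's `re` module and an invented input string.

```python
import re

# A lone lookahead matches a position, not any characters.
m = re.search(r"(?=</title>)", "<title>My Page</title>")

span = m.span()        # start and end of the match are the same index
matched = m.group(0)   # the matched text is empty

print(span)           # (14, 14)
print(repr(matched))  # ''
```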

(?=X)

Zero-width positive lookahead assertion. Matching continues only if the subexpression X matches to the right of the current position. For example, \w+(?=\d) matches a word that is followed by a digit, without including that digit in the match. This construct does not backtrack.

(?!X)

Zero-width negative lookahead assertion. Matching continues only if the subexpression X does not match to the right of the current position. For example, \w+(?!\d) matches a word that is not followed by a digit.

(?<=X)

Zero-width positive lookbehind assertion. Matching continues only if the subexpression X matches to the left of the current position. For example, (?<=19)99 matches an instance of 99 that follows 19. This construct does not backtrack.

(?<!X)

Zero-width negative lookbehind assertion. Matching continues only if the subexpression X does not match to the left of the current position. For example, (?<!19)99 matches an instance of 99 that does not follow 19.
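The four examples can be tried out as follows. This is a sketch assuming Python's `re` module; the input strings are invented. Because \w also matches digits, some results may differ from what a first reading suggests, as the comments point out.

```python
import re

text = "abc1 def"

# (?=X): "abc" is followed by the digit 1, which is not part of the match.
followed_by_digit = re.findall(r"\w+(?=\d)", text)
print(followed_by_digit)  # ['abc']

# (?!X): \w+ also consumes the digit, so "abc1" (followed by a space) matches.
not_followed_by_digit = re.findall(r"\w+(?!\d)", text)
print(not_followed_by_digit)  # ['abc1', 'def']

# (?<=X): only the 99 directly preceded by 19 matches (index 2 in "1999").
after_19 = [m.start() for m in re.finditer(r"(?<=19)99", "1999 2099")]
print(after_19)  # [2]

# (?<!X): both 99s are preceded by something other than 19.
not_after_19 = [m.start() for m in re.finditer(r"(?<!19)99", "2099 1899")]
print(not_after_19)  # [2, 7]

# Caveat: in "1999" the negative form still finds the overlapping 99
# at index 1, because only "1" precedes it.
overlap = re.search(r"(?<!19)99", "1999").start()
print(overlap)  # 1
```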

 

 

As the syntax shows, an assertion uses the grouping parentheses with a question mark added right after the opening parenthesis. The question mark means that the group is non-capturing: it has no number and cannot be used in a backreference; it serves only as an assertion.
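One way to confirm that assertion groups receive no number is to count the capturing groups of a compiled pattern. The sketch assumes Python's `re` module, whose compiled patterns expose a `groups` attribute with that count.

```python
import re

# Two assertion groups, no capturing group.
lookaround = re.compile(r"(?<=<title>).*(?=</title>)")

# One ordinary (capturing) group.
capturing = re.compile(r"<(title)>.*</\1>")

lookaround_count = lookaround.groups
capturing_count = capturing.groups

print(lookaround_count)  # 0
print(capturing_count)   # 1
```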

This is the end of the tutorial. I hope you enjoyed reading it!

 

 
