Regular Expressions: A Detailed Explanation of Balancing Groups

Source: Internet
Author: User
Is this article suitable for you?

To get the most out of this article, you should already understand the basics of how regular expression matching works. For example, anyone familiar with regular expressions knows that ".*?" can match the text "Asp163", but do you know how that match actually proceeds? If not, the content below may not be suitable for you, and the usage of balancing groups may be too hard to follow. I therefore suggest that you first read about the matching principle of the NFA regular expression engine. (Sorry, that article is not finished yet; if you are reading this sentence, I am still working on it. Preparing a document that is easy to understand and easy to explain takes time. I have long wanted to produce a regular expression tutorial, because the well-known technical articles on the internet really require readers who already have a solid regex foundation: they tend to skip details, and written descriptions alone are not ideal. So I am looking into recording a video course. If you are interested, please wait, and feel free to throw eggs or flowers. I have not eaten eggs for a long time. Thank you.)

How a general regular expression tutorial introduces balancing groups

If you want to match a structure that can be nested, you have to use a balancing group. For example, how do you capture the content inside the longest pair of angle brackets in a string such as "XX <AA <BBB> AA> YY"?

The following syntax constructs are required:

(?<group>) Names the captured content "group" and pushes it onto the stack.
(?<-group>) Pops the most recently pushed capture named "group" off the stack. If the stack is already empty, this group fails to match.
(?(group)yes|no) If a capture named "group" exists on the stack, matching continues with the "yes" part; otherwise it continues with the "no" part.
(?!) A zero-width negative lookahead. Because it contains no subexpression, any attempt to match it always fails.

If you are not a programmer (or you are a programmer unfamiliar with the concept of a stack), you can think of the first three constructs this way: the first writes one more "group" on the blackboard, the second erases one "group" from the blackboard, and the third checks whether any "group" is still written on the blackboard. If there is, matching continues with the "yes" part; otherwise it continues with the "no" part.

What we need to do is write a "group" on the blackboard every time we encounter a left bracket and erase one every time we encounter a right bracket. At the end, we check whether any is still on the blackboard. If so, there are more left brackets than right brackets, and the match should fail. (To make it clearer, I use the (?'group') syntax here.)

 

<                         # the outermost left bracket
    [^<>]*                # content after the outermost left bracket that is not a bracket
    (
        (
            (?'Open'<)    # a left bracket is encountered: write an "Open" on the blackboard
            [^<>]*        # match the non-bracket content after the left bracket
        )+
        (
            (?'-Open'>)   # a right bracket is encountered: erase one "Open"
            [^<>]*        # match the non-bracket content after the right bracket
        )+
    )*
    (?(Open)(?!))         # before the outermost right bracket, check whether any "Open" has not been erased; if so, the match fails
>                         # the outermost right bracket
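The blackboard procedure can be sketched outside of regex. The following is a minimal Python illustration, not the .NET engine itself: Python's re module has no balancing groups, so the function name outermost_angle_span and the integer counter standing in for the blackboard are assumptions made for this example. Unlike the expression above, the sketch also counts the outermost bracket, and it does not retry from a later starting position when the brackets are unbalanced.

```python
from typing import Optional


def outermost_angle_span(text: str) -> Optional[str]:
    """Return the first balanced <...> span in text, or None if it never closes."""
    start = text.find("<")
    if start == -1:
        return None
    blackboard = 0  # number of "Open" marks currently on the blackboard
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "<":
            blackboard += 1          # write one "Open"
        elif ch == ">":
            blackboard -= 1          # erase one "Open"
            if blackboard == 0:      # nothing left: the outermost pair just closed
                return text[start:i + 1]
    return None  # an "Open" was never erased, so the match fails


print(outermost_angle_span("XX <AA <BBB> AA> YY"))  # <AA <BBB> AA>
```

A plain counter is enough here because only the number of open brackets matters; the .NET stack additionally remembers the captured text of each bracket.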

 

Why did I write this article?

After reading the introduction above, do you understand it? Before I understood how regular expression matching works, I could read that introduction to balancing groups but could only memorize it as a template, not use it flexibly. So I read a great deal of material on regular expressions. I would like to thank lxcnn for his technical documents and the book "Mastering Regular Expressions", which gave me a deeper and more systematic understanding of regular expressions. Building on them, I am writing this summary from my own learning experience, partly to archive my study notes. If it also resolves your doubts, so much the better.

I will not analyze the code above for the moment. First I will explain the concepts and background knowledge behind balancing groups.

The expressions below were tested with the Expresso tool.

Concept and purpose of balancing groups

As the name suggests, "balance" here means symmetry. A balancing group mainly combines several regular expression syntax rules to match nested structures whose delimiters occur in pairs. The term has a narrow definition and a broad one. The narrow balancing group is the (?<Close-Open>Expression) syntax defined in .NET. The broad balancing group is not a fixed syntax rule but a combined use of several syntax rules, and it is what people usually mean when they say "balancing group". Unless otherwise stated, "balancing group" in this article refers to the broad definition.

The "matching principle" of balancing groups

Note that "matching principle" is in quotation marks. At first I really did think of it as the principle, but I now hope you will not read it that way. At the very least, I do not want the word to mislead you.

The matching behavior of a balancing group can be explained in terms of a stack. Let's start with an example and then explain based on it.

Source string: a+(b*(c+d))/e+f-(g/(h-i))*j

Regular expression: \(((?<Open>\()|(?<-Open>\))|[^()])*(?(Open)(?!))\)

Requirement: match the content enclosed in paired ().

Output: (b*(c+d)) and (g/(h-i))

Below I split the regular expression above across lines and add comments, so that the structure is easier to see:

\(                      # a literal "("
    (                   # grouping construct, limits the scope of the "*" quantifier
        (?<Open>\()     # named capture group: on "(", the Open count increases by 1. "Open" is only the group's name; you can name it as you like
        |               # alternation
        (?<-Open>\))    # narrow balancing group: on ")", the Open count decreases by 1
        |               # alternation
        [^()]+          # any characters other than parentheses
    )*                  # the subexpression above occurs zero or more times
    (?(Open)(?!))       # check whether any Open remains; if so, the match fails, otherwise do nothing
\)                      # a literal ")"

 

For a nested structure, the start and end tags are fixed. In this example the start tag is "(" and the end tag is ")". Looking at what lies between them, the characters fall into three types: "(", ")", and any character other than these two.

Analysis of the matching process for the regular expression above (important):

1. First, find the first "(" as the start of the match. That is, the first line of the expression above, "\(", matches the first "(" in a+(b*(c+d))/e+f-(g/(h-i))*j (the one right after "a+").

2. From step 2 onward, each time a "(" is matched, an Open capture is pushed onto the stack and the count increases by 1.

3. From step 3 onward, each time a ")" is matched, the most recently pushed Open capture is popped from the stack and the count decreases by 1.

Consider this carefully. The first line of the expression, "\(", matches the "(" before b. Next, the "(" before c is matched, and the count increases by 1. Matching continues to the ")" after d, and the count decreases by 1. Note that at this point the text matched so far is "(b*(c+d)" and the count on the stack is 0, yet the expression still tries to match forward. At the ")" that follows "d)", (?<Open>\() fails within the alternation, (?<-Open>\)) also fails [think about why it fails here], and of course [^()]+ fails too. The engine then hands control to (?(Open)(?!)), which checks whether the count on the stack is 0. Since it is 0, the "no" branch is taken; because this conditional has no "no" branch, nothing is done and control passes to the following "\)". This "\)" matches the next ")", that is, the final parenthesis of "(b*(c+d))".
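The push and pop steps above can be imitated with an explicit stack. This is only a Python sketch of the counting, under the assumption that the input is well formed; it is not how the .NET engine is implemented, it performs no backtracking, and the name balanced_groups is made up for this example.

```python
from typing import List


def balanced_groups(text: str) -> List[str]:
    """Collect each outermost balanced (...) span, scanning left to right."""
    results = []
    stack = []  # positions of unmatched "(", playing the role of the Open stack
    for i, ch in enumerate(text):
        if ch == "(":
            stack.append(i)              # push: count plus 1
        elif ch == ")" and stack:
            start = stack.pop()          # pop: count minus 1
            if not stack:                # count is 0: an outermost pair has closed
                results.append(text[start:i + 1])
    return results


print(balanced_groups("a+(b*(c+d))/e+f-(g/(h-i))*j"))
# ['(b*(c+d))', '(g/(h-i))']
```

For unbalanced input the sketch and the regular expression can disagree, because the regex engine retries from later positions and backtracks, while this loop makes a single pass.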

Matching Process

First the opening "(" is matched, and matching then continues until one of the following two situations occurs, at which point control is handed to (?(Open)(?!)).

Think about why control is handed to (?(Open)(?!)) in these two situations:

A) The Open count on the stack is already 0 and a ")" is encountered.

B) The end of the string is reached.

At this point control is handed to (?(Open)(?!)), which checks whether Open still holds a capture. If the count is 0, it does not, so the "no" branch is taken. Because this conditional has no "no" branch, nothing is done and control passes to the following "\)".

In case A, "\)" matches the ")" that follows, and the match succeeds.
In case B, the engine backtracks until "\)" matches successfully; otherwise the entire expression fails to match.

Because .NET provides the narrow balancing group construct "(?<Close-Open>Expression)", the captures on the stack can be counted dynamically: when a start tag is matched, a capture is pushed and the count increases by 1; when an end tag is matched, a capture is popped and the count decreases by 1. Finally, the expression checks whether any Open remains on the stack. If so, the start and end tags are not paired and the match fails, so the engine backtracks or reports failure. If not, the start and end tags are paired, and matching continues with the subexpressions that follow.

The "(?!)" construct needs some explanation. It is a negative lookahead, whose complete syntax is "(?!Expression)". Here the Expression is empty, and an empty expression matches at every position, so the negative lookahead can never succeed and any attempt to match it fails. Its role is to report a match failure when Open is not paired.
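Both "(?!)" and the conditional construct can be tried in isolation. The sketch below uses Python's re module, which supports these two constructs although it has no balancing groups; the variable names and the small test pattern are assumptions chosen for illustration. Python writes a named group as (?P<open>...), and the conditional tests whether that group took part in the match rather than whether a stack is non-empty.

```python
import re

# (?!) is a negative lookahead around an empty expression, so it never succeeds.
never = re.compile(r"(?!)")
print(never.search("any text at all"))        # None

# If the optional group "open" captured a "(", the conditional runs (?!) and fails.
# There is no "no" branch, so when "open" did not take part, nothing is required.
pattern = re.compile(r"(?P<open>\()?\w+(?(open)(?!))")
print(pattern.fullmatch("abc") is not None)   # True
print(pattern.fullmatch("(abc") is not None)  # False
```

The second pattern rejects "(abc" for the same reason (?(Open)(?!)) rejects an unpaired "(": the condition holds, so the always-failing lookahead is reached.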

The following is an example:
<table><tr><td id="td1"> </td><td id="td2"><table><tr><td>snhame</td><td>f</td></tr></table></td><td></td></tr> </table>

The above is a fragment of HTML. The task is to extract the <td id="td2"> element, up to and including its closing </td>, so that it can be deleted. In the past we would have matched the tag directly, with something like <td\s*id="td2">[\s\S]+?\</td>. But then a problem arises: what we extract is not what we want.

<td id="td2">
<table>
<tr>
<td>snhame</td>

The reason is simple. The expression matches the nearest </td> tag, without knowing that this tag does not belong to it. Some readers may think of removing the "?" to make the quantifier greedy. But then the problem is even bigger: it matches far too much.

<td id="td2">
<table>
<tr>
<td>snhame</td>
<td>f</td>
</tr>
</table>
</td>
<td></td>

This result is not what we want either. So I will use a balancing group to solve the problem.

<td\s*id="td2"[^>]*>((?<mm><td[^>]*>)+|(?<-mm></td>)|[\s\S])*?(?(mm)(?!))</td>

The matching result is:

<td id="td2">
<table>
<tr>
<td>snhame</td>
<td>f</td>
</tr>
</table>
</td>

This is exactly what we want.
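The same pairing idea can be written out as an explicit depth counter. This is a hedged Python sketch rather than a translation of the .NET expression: the function extract_td and its parameters are invented for this example, it assumes lowercase tags as in the HTML fragment above, and for real HTML a proper parser is usually the safer choice.

```python
import re
from typing import Optional


def extract_td(html: str, td_id: str) -> Optional[str]:
    """Return the <td id="..."> element including its own closing </td>."""
    opening = re.search(r'<td\s+id="' + re.escape(td_id) + r'"[^>]*>', html)
    if opening is None:
        return None
    depth = 0  # nested <td> tags still open inside the target cell (the "mm" count)
    for tag in re.finditer(r"<td[^>]*>|</td>", html[opening.end():]):
        if tag.group() != "</td>":
            depth += 1                   # like (?<mm><td[^>]*>): push
        elif depth > 0:
            depth -= 1                   # like (?<-mm></td>): pop
        else:
            # depth is 0, so this </td> belongs to the target cell itself
            return html[opening.start():opening.end() + tag.end()]
    return None


html = ('<table><tr><td id="td1"> </td><td id="td2"><table><tr><td>snhame</td>'
        '<td>f</td></tr></table></td><td></td></tr> </table>')
print(extract_td(html, "td2"))
# <td id="td2"><table><tr><td>snhame</td><td>f</td></tr></table></td>
```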

Note: at first I wrote the expression this way, with a greedy quantifier:

<td\s*id="td2"[^>]*>((?<mm><td[^>]*>)+|(?<-mm></td>)|[\s\S])*(?(mm)(?!))</td>

The matching result is:

<td id="td2">
<table>
<tr>
<td>snhame</td>
<td>f</td>
</tr>
</table>
</td>
<td></td>

An open question

The following code is offered only for discussion.

Text content: e+f(-(g/(h-i))*j
Regular expression:

\(  (    (?<mm>\()    |    (?<-mm>\))    |    .  )*?  (?(mm)(?!))\)

The matching result is: (-(g/(h-i))
