Using regular expressions in C#/.NET to match nested HTML tags

Original address http://www.cnblogs.com/qiantuwuliang/archive/2011/06/11/2078329.html

Overview

Regular expressions are essential for text parsing, for example in web server log analysis and web browser development. Many advanced text editors support a subset of regular expression syntax, and being familiar with regular expressions often lets you get twice the result with half the effort. Counting the lines of code in a file, for example, takes only a single regular expression. Matching nested HTML tags is one of the harder problems in applying regular expressions because it draws on more advanced syntax, which also makes it worth studying.

Ideas

Any complex regular expression is composed of simple sub-expressions. To write a complex regular expression, you need to know how to break it down into simple parts, and you need to consider the problem from the perspective of the regular expression engine. To learn how the engine works, the book Mastering Regular Expressions (its Chinese edition is titled "精通正则表达式") is recommended. It is a good book.

First, define the problem we want to solve: find the innerHTML of the tag with a specific id in an HTML text.

The biggest difficulty is that HTML tags can be nested. How do we find the closing tag that corresponds to a given start tag?

We can think of it like this: first match the start tag, say a div (<div). From then on, every time a nested div start tag is encountered, push it onto a stack, and every time a div end tag is encountered, pop the stack. If the stack is already empty when an end tag is encountered, that end tag is the correct closing tag.
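The stack idea can be sketched without any regular expression at all. The helper below is hypothetical (FindClosingTag and the sample string are not from the original article) and deliberately simplified: it compares raw text, is case-sensitive, and would also treat a tag such as <divx> as a nested <div>.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: scans from `from` (just after the start tag) and returns
// the index of the closing tag that balances it, or -1 if there is none.
static int FindClosingTag(string html, string tag, int from)
{
    string open = "<" + tag, close = "</" + tag + ">";
    var stack = new Stack<int>();               // positions of nested start tags
    for (int i = from; i < html.Length; i++)
    {
        if (string.CompareOrdinal(html, i, close, 0, close.Length) == 0)
        {
            if (stack.Count == 0) return i;     // empty stack: this is our closing tag
            stack.Pop();                        // this end tag closes a nested tag
            i += close.Length - 1;
        }
        else if (string.CompareOrdinal(html, i, open, 0, open.Length) == 0)
        {
            stack.Push(i);                      // nested start tag
            i += open.Length - 1;
        }
    }
    return -1;
}

string html = "<div id=\"footer\">a<div>b</div>c</div><div>d</div>";
int bodyStart = html.IndexOf('>') + 1;          // first character after the start tag
int end = FindClosingTag(html, "div", bodyStart);
string inner = html.Substring(bodyStart, end - bodyStart);
Console.WriteLine(inner);                       // a<div>b</div>c
```

The rest of the article expresses the same push and pop steps inside a single regular expression.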

I can think this way because I know the features of regular expressions: balancing groups can implement the stack operations just described. To write complex regular expressions, you need at least a passing knowledge of the advanced features, so that you have a direction when thinking about a problem.
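As a minimal illustration of how a .NET balancing group behaves like a stack, the sketch below checks balanced parentheses rather than HTML. The group name Open and the sample strings are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// (?<Open>\()    pushes a capture onto the "Open" stack
// (?<-Open>\))   pops one capture and fails if the stack is empty
// (?(Open)(?!))  fails the match if anything is still on the stack
var balanced = new Regex(@"^(?:[^()]|(?<Open>\()|(?<-Open>\)))*(?(Open)(?!))$");

Console.WriteLine(balanced.IsMatch("(a(b)c)")); // True
Console.WriteLine(balanced.IsMatch("(a(b)c"));  // False: one "(" is never popped
Console.WriteLine(balanced.IsMatch("a)b("));    // False: ")" arrives with an empty stack
```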

Implementation

Assume that the text to be matched is valid HTML. The following HTML code, copied from my blog, is our test text. We need to match the innerHTML of the footer div and capture the tag name.

<div style="background-color:gray;" id="footer">
<a id="gotop" href="#" onclick="Mgjs.gotop();return false;">Top</a>
<a id="powered" href="http://wordpress.org/">WordPress</a>
<div id="copyright">
Copyright &copy; 2009 simple life - Kevin Yang's blog</div>
<div id="themeinfo">
Theme <a href="http://www.neoease.com/">Mg12</a>.
Valid <a href="http://validator.w3.org/check?uri=referer">XHTML 1.1</a>
and <a href="http://jigsaw.w3.org/css-validator">CSS 3</a>.
</div>
</div>

We use the Expresso tool to build and test the regular expressions.

Match start tag

The features of the start tag are easy to extract: it starts with an angle bracket, followed by a run of letters (the tag name), and somewhere in its long string of attributes (non-angle-bracket characters) it contains id=footer, with id matched case-insensitively. Note that footer may be enclosed in double quotation marks or single quotation marks, or not quoted at all. The regular expression is as follows:

<(?<HtmlTag>[\w]+)[^>]*\s[iI][dD]=(?<Quote>["']?)footer(?(Quote)\k<Quote>)["']?[^>]*>

The above regular expression needs some explanation:

1. The angle bracket < has a special meaning in regular expression syntax only where it encloses a group name, as in an explicitly named capture group. The angle bracket at the very beginning of the expression is unambiguous in that context, so it matches literally whether or not it is escaped.

2. (?<groupname>regex) defines a named group. We define a group named HtmlTag to store the matched HTML tag name. The Quote group is used later in the match.

3. (?(groupname)then|else) is a conditional construct: if the group groupname has captured something, the then pattern is matched; otherwise the else pattern is matched. In the expression above, we first try to match a quotation mark to the left of the string footer and store it in the Quote group. We then apply the conditional to the right of footer: if Quote matched a quotation mark on the left, the same quotation mark must also be matched on the right. In this way the different ways of writing the id are matched precisely.
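To see the start-tag expression at work, the following sketch runs it against a few made-up start tags covering the double-quoted, single-quoted, and unquoted forms. The sample strings are assumptions for illustration; the pattern is the one given above, written as a C# verbatim string (so each " is doubled).

```csharp
using System;
using System.Text.RegularExpressions;

var startTag = new Regex(
    @"<(?<HtmlTag>[\w]+)[^>]*\s[iI][dD]=(?<Quote>[""']?)footer(?(Quote)\k<Quote>)[""']?[^>]*>");

string[] samples =
{
    "<div style=\"x\" id=\"footer\">",  // double quotes
    "<div id='footer' class=\"y\">",    // single quotes
    "<span id=footer>",                 // no quotes
    "<div id=\"header\">",              // different id
};

foreach (string s in samples)
{
    Match m = startTag.Match(s);
    if (m.Success)
    {
        string tag = m.Groups["HtmlTag"].Value;
        string quote = m.Groups["Quote"].Value;
        Console.WriteLine(s + " -> tag=" + tag + ", quote=[" + quote + "]");
    }
    else
    {
        Console.WriteLine(s + " -> no match");
    }
}
```

The first three samples match and the last one does not. For the unquoted sample the Quote group captures an empty string, so the backreference \k<Quote> matches nothing and the tag is still accepted.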

Match closing tag

((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>

After the start tag has been matched, the HTML text that follows falls into three cases:

A. The start tag of a nested div (<div) is matched. In this case, capture it into the Nested group.

B. The closing tag of a nested div is matched. In this case, release the most recently captured Nested group.

C. Any other text. Note that the lazy quantifier .*? must be used here; otherwise the match may run past the final closing tag.

With (regex1|regex2|regex3)* the three cases are combined as alternatives and repeated any number of times, after which the final closing tag is matched. Here (?<-Nested>) releases the most recent capture of the Nested group. The full syntax is (?<N-M>regex): it removes the latest capture of group M and stores the text between that capture and the current position in group N; if N is omitted, the capture of M is simply released. If M has no capture left, the construct fails to match, which is why case B fails on the outermost closing tag and leaves it for the end of the expression.
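The difference the Nested stack makes can be seen by comparing it with a plain lazy match. In this sketch the start tag is simplified to <(?<HtmlTag>\w+)> (an assumption made to keep the example short), and the sample string is made up.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<div>a<div>b</div>c</div><div>d</div>";

// Naive lazy match: stops at the first </div>, which belongs to the nested div.
string naive = Regex.Match(html, @"<div>.*?</div>").Value;

// Balanced match: the closing-tag expression from the text, placed after a
// simplified start tag that only captures the tag name.
string balanced = Regex.Match(
    html,
    @"<(?<HtmlTag>\w+)>((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>",
    RegexOptions.Singleline).Value;

Console.WriteLine(naive);    // <div>a<div>b</div>
Console.WriteLine(balanced); // <div>a<div>b</div>c</div>
```

At the last </div> of the first element the Nested stack is empty, so alternative B fails, alternative C matches nothing, and the trailing </\k<HtmlTag>> consumes the tag.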

Update: I focused too much on the analysis and forgot to give the complete regular expression. Sorry about that.

<(?<HtmlTag>[\w]+)[^>]*\s[iI][dD]=(?<Quote>["']?)footer(?(Quote)\k<Quote>)["']?[^>]*>((?<Nested><\k

The above regular expression matches any HTML tag with id=footer.

Note that this regular expression requires the Singleline option (RegexOptions.Singleline) so that . also matches line breaks.
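The sketch below shows the effect of the Singleline option on a shortened version of the test HTML. The pattern is assembled here by concatenating the start-tag and closing-tag expressions from the sections above, which is an assumption about what the complete expression looks like; the trailing div with id="after" is added only to show where the match stops.

```csharp
using System;
using System.Text.RegularExpressions;

string html =
    "<div style=\"background-color:gray;\" id=\"footer\">\n" +
    "<a id=\"powered\" href=\"http://wordpress.org/\">WordPress</a>\n" +
    "<div id=\"copyright\">Copyright</div>\n" +
    "<div id=\"themeinfo\">Theme <a href=\"http://www.neoease.com/\">mg12</a>.</div>\n" +
    "</div>\n" +
    "<div id=\"after\">not part of the footer</div>";

// Start-tag expression followed by the closing-tag expression.
string pattern =
    @"<(?<HtmlTag>[\w]+)[^>]*\s[iI][dD]=(?<Quote>[""']?)footer(?(Quote)\k<Quote>)[""']?[^>]*>" +
    @"((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*</\k<HtmlTag>>";

// With Singleline, "." also matches "\n", so the match can span several lines.
Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Console.WriteLine(m.Success);                 // True
Console.WriteLine(m.Groups["HtmlTag"].Value); // div
Console.WriteLine(m.Value);                   // the whole footer div, up to its own </div>

// Without Singleline the match cannot get past the first line break.
bool withoutSingleline = Regex.IsMatch(html, pattern, RegexOptions.IgnoreCase);
Console.WriteLine(withoutSingleline);         // False
```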

To domoxz: if you only want to match the p tag, replace HtmlTag in the regular expression above with p.

One more practical requirement: if you want to match the innerHTML of any HTML tag with id=footer, replace the regular expression with the following:

<(?<HtmlTag>[\w]+)[^>]*\s[iI][dD]=(?<Quote>["']?)footer(?(Quote)\k<Quote>)["']?[^>]*>(((?<Nested><\k
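One possible way to read the innerHTML out of a match is to wrap the repeated part in a capture group of its own. The sketch below uses a named group called InnerHtml for this; that name, the way the expression is assembled, and the sample string are assumptions for illustration, not necessarily what the original expression used.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern =
    @"<(?<HtmlTag>[\w]+)[^>]*\s[iI][dD]=(?<Quote>[""']?)footer(?(Quote)\k<Quote>)[""']?[^>]*>" +
    // Hypothetical InnerHtml group around the nested-tag loop.
    @"(?<InnerHtml>((?<Nested><\k<HtmlTag>[^>]*>)|</\k<HtmlTag>>(?<-Nested>)|.*?)*)" +
    @"</\k<HtmlTag>>";

string html = "<p>x</p><div id='footer'>a<div>b</div>c</div><div>d</div>";

Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
string tagName = m.Groups["HtmlTag"].Value;
string innerHtml = m.Groups["InnerHtml"].Value;

Console.WriteLine(tagName);   // div
Console.WriteLine(innerHtml); // a<div>b</div>c
```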

C# reference code:

// Requires: using System.Text.RegularExpressions;
// tp4rtbcont holds the input text, tp4txtregx the pattern, tp4rtbresult the output.
MatchCollection m = Regex.Matches(this.tp4rtbcont.Text, this.tp4txtregx.Text.Trim(), RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.Singleline);

if (m.Count > 0)
{
    foreach (Match subm in m)
    {
        // Groups[0] is the whole match; c is not declared in this excerpt (presumably a counter).
        this.tp4rtbresult.Text += c.ToString() + ". " + subm.Groups[0].Value.Replace("amp;", "") + "\r\n++++\r\n";
    }
}
else
{
    this.tp4rtbresult.Text += "match failed..." + "\r\n++++\r\n";
}
