Using regular expressions to match nested HTML tags in Java/JavaScript

Source: Internet
Author: User

Transferred from: http://www.jb51.net/article/24422.htm

I previously wrote an article on using regular expressions to solve the problem of matching nested HTML tags (using regular expressions to match nested HTML tags). That approach relied on balancing groups, an advanced feature that apparently only the .NET and Perl regex engines support, so it is not very portable.

A general-purpose regular expression for matching HTML tags

While looking through my site logs recently, I found that someone had reposted on their blog a regular expression for matching HTML tags that I wrote some years ago. Since I have been working on related things lately, I got interested again, took it back, and revised it into the version below. It may still miss some cases, and corrections are welcome. It is known to handle complex content embedded in <script> poorly, but it is sufficient for plain HTML and works well enough for building analysis tools.

<script type="text/javascript">
// Test input: a mix of well-formed tags, a non-ASCII tag name, and an attribute value containing quotes and a line break
var str = "<br/><br/><br><br >< Chinese ><div><div id=a> worry-free script 0 && testver<500) alert('Test'), \"\n onerror='alert(\"test\")'/></div>";
// Matches either a tag (with optionally quoted attribute values) or an HTML comment
var reg = /<(?:(?:\/?[A-Za-z]\w*\b(?:[=\s](['"]?)[\s\S]*?\1)*)|(?:!--[\s\S]*?--))\/?>/g;
alert(str.match(reg).join("\n----------------------------------------------------\n"));
</script>

A reader left a comment saying that using the expression directly in Java produces an error. I checked and found that the Java regex engine supports comparatively few features. Named groups cannot be used in version 1.6 (they appear to be supported from 1.7); using them reports the error below. Balancing groups are not supported at all, so matching arbitrarily deep nesting in Java is not realistic.

java.util.regex.PatternSyntaxException: Look

I searched online for a long time without finding a perfect solution. However, we can match HTML tags nested to a finite depth. The idea is much simpler than the unbounded case and needs none of those advanced features.
Example:

<div id='container'>
  <div style='background-color:gray;' id='footer'>
    <a id='gotop' href='#' onclick='mgjs.gotop(); return false;'>Top</a>
    <a id='powered' href='http://wordpress.org/'>WordPress</a>
    <div id='copyright'>
      Copyright © 2009 Simple Life -- Kevin Yang's blog
    </div>
    <div id='themeinfo'>
      Theme by <a href='http://www.neoease.com/'>mg12</a>. Valid <a href='http://validator.w3.org/check?uri=referer'>XHTML 1.1</a>
      and <a href='http://jigsaw.w3.org/css-validator/'>CSS 3</a>.
    </div>
  </div>
</div>

In the example above, we want to match the div whose id is footer, including the divs nested inside it. Suppose we know in advance that the footer div contains at most one level of nested divs. Deeper nesting is covered later.
Matching the footer's start and end tags is simple:

<div[^>]*id='footer'[^>]*>...</div>
(the ... part will be filled in shortly)

The content between the start and end tags can only be one of two things:
Content A: a div tag that contains no nested div
Content B: any other content
The full content is then a repetition of these two kinds. As a regular expression:

(<div[^>]*>.*?</div>|.)*?

Note that the final question mark is required. Without it, the repetition is greedy and runs past the footer's own closing tag, so the match ends at the wrong </div>.
The regular expression that matches at most one level of nested divs is therefore:

<div[^>]*id='footer'[^>]*>(<div[^>]*>.*?</div>|.)*?</div>
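To see the effect of the trailing question mark, here is a minimal JavaScript sketch that runs the one-level pattern against a shortened version of the sample markup. The variable names and the trimmed markup are illustrative assumptions, and `.` is written as `[\s\S]` because `.` does not match line breaks by default in JavaScript.

```javascript
// Shortened, illustrative version of the sample markup: the footer div
// holds two sibling divs, each only one level deep.
const sampleHtml =
  "<div id='container'>" +
  "<div style='background-color:gray;' id='footer'>" +
  "<a id='gotop' href='#'>Top</a>" +
  "<div id='copyright'>Copyright 2009</div>" +
  "<div id='themeinfo'>Theme by <a href='#'>mg12</a>.</div>" +
  "</div>" +
  "</div>";

// One-level pattern with a lazy outer quantifier (*?).
const lazyPattern =
  /<div[^>]*id='footer'[^>]*>(<div[^>]*>[\s\S]*?<\/div>|[\s\S])*?<\/div>/;

// Same pattern with a greedy outer quantifier (*), for comparison only.
const greedyPattern =
  /<div[^>]*id='footer'[^>]*>(<div[^>]*>[\s\S]*?<\/div>|[\s\S])*<\/div>/;

const lazyMatch = sampleHtml.match(lazyPattern)[0];
const greedyMatch = sampleHtml.match(greedyPattern)[0];

// The lazy version stops at the footer's own closing tag; the greedy
// version also swallows the closing tag of the surrounding container.
console.log(lazyMatch);
console.log(greedyMatch.length - lazyMatch.length); // length of the extra "</div>"
```

In this sketch the greedy match is exactly the lazy match plus the container's closing tag, which is the mismatch the note about the question mark warns about.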

So what if the footer tag contains at most two levels of nested divs?
That is not difficult either: we only need to replace the .*? inside the content A part with the same alternation (a nested div, or any single character). The result is:

<div[^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>.*?</div>|.)*?</div>|.)*?</div>
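The following JavaScript sketch compares the one-level and two-level patterns on markup that really does nest two levels deep. The markup and variable names are made up for illustration, and `[\s\S]` again stands in for `.` so that line breaks would also be matched.

```javascript
// Illustrative markup: the footer contains a div that itself contains a div.
const nestedHtml =
  "<div id='footer'>" +
  "<div class='outer'><div class='inner'>x</div>y</div>" +
  "</div><div id='after'>z</div>";

// Pattern that allows at most one level of nested divs.
const oneLevel =
  /<div[^>]*id='footer'[^>]*>(<div[^>]*>[\s\S]*?<\/div>|[\s\S])*?<\/div>/;

// Pattern that allows at most two levels of nested divs.
const twoLevel =
  /<div[^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>[\s\S]*?<\/div>|[\s\S])*?<\/div>|[\s\S])*?<\/div>/;

const oneLevelMatch = nestedHtml.match(oneLevel)[0];
const twoLevelMatch = nestedHtml.match(twoLevel)[0];

// The one-level pattern mistakes the outer div's closing tag for the
// footer's, so its match is one "</div>" too short.
console.log(oneLevelMatch);
// The two-level pattern ends at the footer's real closing tag.
console.log(twoLevelMatch);
```

The takeaway is that the pattern must allow at least as many levels as the markup actually contains; a pattern that is too shallow still matches, but it ends at the wrong closing tag.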

Following the same pattern, to match at most three levels of nested divs, the regular expression becomes:

<div[^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>(<div[^>]*>.*?</div>|.)*?</div>|.)*?</div>|.)*?</div>
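Since each extra level is produced by the same substitution, the pattern can also be generated in a loop. The sketch below is one possible way to do that in JavaScript; the function name `buildNestedPattern` and its parameters are hypothetical, it uses non-capturing groups `(?:...)` and `[\s\S]` instead of the plain groups and `.` shown above, and it assumes the id contains no regex metacharacters.

```javascript
// Hypothetical helper: builds a pattern for a div with the given id that
// contains at most `depth` levels of nested divs.
// Assumption: `id` contains no characters that are special in a regex.
function buildNestedPattern(id, depth) {
  const anyChar = "[\\s\\S]";
  // Depth 0: no nested div allowed, just lazy "anything".
  let body = anyChar + "*?";
  for (let level = 0; level < depth; level++) {
    // Each level wraps the previous body: either a nested div whose
    // content is matched by the previous body, or any single character.
    body = "(?:<div[^>]*>" + body + "</div>|" + anyChar + ")*?";
  }
  return new RegExp("<div[^>]*id='" + id + "'[^>]*>" + body + "</div>");
}

// Illustrative markup with three levels of divs inside the footer.
const deepHtml =
  "<div id='footer'><div><div><div>a</div>b</div>c</div>d</div><div>e</div>";

const depth3Match = buildNestedPattern("footer", 3).exec(deepHtml)[0];
const depth2Match = buildNestedPattern("footer", 2).exec(deepHtml)[0];

console.log(depth3Match); // ends at the footer's own closing tag
console.log(depth2Match); // too shallow: ends one closing tag early
```

Generating the pattern keeps the depth limit explicit in one place, but the limit is still finite: markup nested deeper than `depth` is matched only up to an earlier closing tag.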

So as long as your HTML structure is not particularly complex, that is, the nesting is not very deep, you can use this method to match nested HTML tags.
This regular expression works in both Java and JavaScript because it uses no advanced features.
