Matching nested HTML tags with regular expressions in Java/JS

Source: Internet
Author: User
Tags: html tags
A general regular expression for matching HTML tags

While reading my site logs recently, I noticed that someone had visited a post I wrote on my blog several years ago about matching HTML tags with a regular expression. I happen to be working on something related, so it caught my interest right away. I took the old expression and reworked it into the version below. Some cases may still be missed, and corrections are welcome. It is known to handle complex content embedded in <script> poorly, but it is good enough for plain HTML and works well for small analysis tools.
The code is as follows:

<script type="text/javascript">
// Part of this sample string is missing in the source; the surviving text is kept.
var str = "<br/><br/><br><br ><Chinese><div><div id=a>worry-free script 0 && testver<500) alert('test');\\n onerror='alert(\"test\")'/></div><pr ><script type=\"test/javascript\" defer>alert(\"Just a test!\");<\/script>hello.<input type=text value=\"worry-free script\" ><br/></><!--annotation--><ucren><!--re-<note>--><b>123</b>1<2<3,3<4>1<b><!--Three notes>>>-->";
// The quantifier after the attribute group is missing in the source; "*" is assumed.
var reg = /<(?:(?:\/?[a-zA-Z]\w*\b(?:[=\s](['"]?)[\s\S]*?\1)*)|(?:!--[\s\S]*?--))\/?>/g;
alert(str.match(reg).join("\n----------------------------------------------------\n"));
</script>
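To try the idea outside a browser, here is a minimal sketch for Node.js. The pattern is a reconstruction of the expression above (the "*" after the attribute group is an assumption), and the helper name extractTags and the sample string are made up for illustration.

```javascript
// Tag-or-comment pattern, reconstructed from the expression in the article.
// Assumption: the attribute group is repeated with "*".
const TAG_PATTERN = /<(?:(?:\/?[a-zA-Z]\w*\b(?:[=\s](['"]?)[\s\S]*?\1)*)|(?:!--[\s\S]*?--))\/?>/g;

// Hypothetical helper: return every tag or comment found in the input.
function extractTags(html) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return html.match(TAG_PATTERN) || [];
}

const sample = "<p class='x'>hi</p><!-- note --><br/> 1<2";
console.log(extractTags(sample));
// The four matches are: <p class='x'>, </p>, <!-- note -->, <br/>
```

The stray "<" in "1<2" is skipped because it is followed by neither a letter nor "!--".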


A reader left a message saying that using this expression directly in Java raises an error. I checked and found that Java's regex engine supports relatively few advanced features. Named groups cannot be used in Java 1.6 (they appear to be supported as of 1.7), and balancing groups are not supported at all, so matching nesting of unlimited depth with a single expression is not realistic. Using a named group in 1.6 produces the following error:

java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index XX

I searched online for a long time without finding a perfect solution. However, we can match HTML tags nested to a finite depth. The idea is much simpler than the unlimited case and does not need any advanced features.
Example:

<div id='container'>
  <div style='background-color:gray' id='footer'>
    <a id='gotop' href='#' onclick='MGJS.goTop();return false;'>Top</a>
    <a id='powered' href='http://wordpress.org/'>WordPress</a>
    <div id='copyright'>
      Copyright © 2009 Simple Life -- Kevin Yang's blog
    </div>
    <div id='themeinfo'>
      Theme by <a href='http://www.neoease.com/'>mg12</a>. Valid <a href='http://validator.w3.org/check?uri=referer'>XHTML 1.1</a>
      and <a href='http://jigsaw.w3.org/css-validator/'>CSS 3</a>.
    </div>
  </div>
</div>

In the example above, we want to match the div whose id is footer, including the divs nested inside it. Assume we know in advance that the footer div contains at most one level of nested div. Deeper nesting is covered a little later.
Matching the footer's start and end tags is simple:

<div [^>]*id='footer'[^>]*>...(the ellipsis will be filled in shortly)</div>


The content between the start and end tags comes in two kinds:
Content A: a div tag that has no further div nested inside it
Content B: any other content
The body is then a repetition of these two kinds of content, expressed as follows:

(<div[^>]*>.*?</div>|.)*?

Note that the final question mark is required. Without it, greedy matching makes the expression run past the footer's own closing tag and end at a later </div>.
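The effect of that final question mark can be seen in a small sketch. It combines the start and end tags shown earlier with the body expression; the input string and variable names are made up for illustration.

```javascript
// Made-up input: the footer div is followed by an unrelated div.
const html = "<div id='footer'><div>inner</div>text</div><div id='other'>x</div>";

// Lazy repetition (*?): stops at the first closing tag that completes a match.
const lazy = /<div [^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>/;
// Greedy repetition (*): runs on to the last closing tag in the input.
const greedy = /<div [^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*<\/div>/;

const lazyMatch = html.match(lazy)[0];
const greedyMatch = html.match(greedy)[0];

console.log(lazyMatch);   // <div id='footer'><div>inner</div>text</div>
console.log(greedyMatch); // the whole input, including the unrelated div
```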
So the regular expression that matches a div with at most one level of nested div is:

<div [^>]*id='footer'[^>]*>(<div[^>]*>.*?</div>|.)*?</div>
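One practical detail when trying this on markup that spans several lines: in JavaScript "." does not match line breaks by default, so the sketch below adds the "s" (dotAll) flag; in Java the corresponding option is Pattern.DOTALL. The flag is an addition made here, not part of the expression above, and the shortened sample markup is made up for illustration.

```javascript
// Shortened version of the sample markup, one element per line.
const page = [
  "<div id='container'>",
  "  <div style='background-color:gray' id='footer'>",
  "    <a id='gotop' href='#'>Top</a>",
  "    <div id='copyright'>Copyright</div>",
  "    <div id='themeinfo'>Theme by mg12</div>",
  "  </div>",
  "</div>",
].join("\n");

// One-level pattern; the "s" flag lets "." cross line breaks.
const oneLevel = /<div [^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>/s;
// Same pattern without the flag, for comparison.
const withoutDotAll = /<div [^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>/;

const found = page.match(oneLevel);
console.log(found ? found[0] : "no match"); // the footer div, from its start tag to its own </div>
console.log(page.match(withoutDotAll));     // null: "." cannot get past the first line break
```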

What if the footer tag has up to two levels of nested div?
It is not difficult: we only need to replace the .*? inside the "Content A" part above with the one-level body (<div[^>]*>.*?</div>|.)*?. The modified expression is:

<div [^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>.*?</div>|.)*?</div>|.)*?</div>
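To see why the extra level matters, the following sketch runs the one-level and two-level expressions against a made-up input whose footer contains two levels of nested div. The variable names are for illustration only.

```javascript
// Made-up input: footer > div > div.
const nested = "<div id='footer'><div><div>a</div>b</div>c</div>";

const oneLevel = /<div [^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>/;
const twoLevel = /<div [^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>|.)*?<\/div>/;

const shortMatch = nested.match(oneLevel)[0];
const fullMatch = nested.match(twoLevel)[0];

console.log(shortMatch); // <div id='footer'><div><div>a</div>b</div>   (cut short)
console.log(fullMatch);  // the complete footer div
```

The one-level expression pairs the first inner start tag with the first </div> it meets, so it closes the footer one tag too early.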

By now you can probably see how to write the expression for up to three levels of nested div:

<div [^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>(<div[^>]*>.*?</div>|.)*?</div>|.)*?</div>|.)*?</div>

So as long as your HTML structure is not particularly complex, meaning the nesting is not very deep, you can use this method to match nested HTML tags.
It works in both Java and JavaScript because it does not rely on any advanced regex features.
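Since each extra level is the same substitution applied once more, the expression can also be generated. The sketch below is one possible way to do that, not part of the original article: buildNestedDivRegex is a hypothetical helper, it uses non-capturing groups and the "s" flag (both additions), and it assumes the id contains no regex metacharacters.

```javascript
// Hypothetical helper: build the pattern for a div with the given id that
// contains at most `depth` levels of nested div.
// Assumption: `id` contains no regex metacharacters (it is inserted unescaped).
function buildNestedDivRegex(id, depth) {
  // Innermost content: anything, matched lazily.
  let body = ".*?";
  for (let level = 0; level < depth; level++) {
    // Wrap the previous body: a nested div ("Content A") or any single character ("Content B").
    body = "(?:<div[^>]*>" + body + "</div>|.)*?";
  }
  // "s" (dotAll) lets "." match line breaks.
  return new RegExp("<div [^>]*id='" + id + "'[^>]*>" + body + "</div>", "s");
}

// Made-up input with three levels of div inside the footer.
const deep = "<div id='footer'><div><div><div>a</div>b</div>c</div>d</div>";

console.log(deep.match(buildNestedDivRegex("footer", 3))[0] === deep); // true
console.log(deep.match(buildNestedDivRegex("footer", 2))[0] === deep); // false: ends one </div> early
```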