Using regular expressions in Java/JavaScript to match nested HTML tags

Source: Internet
Author: User
Tags expression engine

Common HTML Tag Normalization

While reading my site logs recently, I noticed that someone had reposted on their blog a regular expression for matching HTML tags from a few years back. Since I have been working on something related lately, it caught my interest, and I took it back to revise it. The result is shown below. Some cases may still be missed, and corrections are welcome. A known weakness is that it handles complex content embedded in <script> poorly, but it is good enough for pure HTML and works well in small analysis tools. The code is as follows:

<script type="text/javascript">
var str = "<br/> <br> <Chinese> <div id=a>carefree script 0&testver<500)alert('test');\"\n onerror='alert(\"test\")'/> </div>";
var reg = /<(?:(?:\/?[A-Za-z]\w*\b(?:[=\s](['"]?)[\s\S]*?\1)*)|(?:!--[\s\S]*?--))\/?>/g;
alert(str.match(reg).join("\n----------------------------------------------------\n"));
</script>

Some readers left comments saying that the expression throws an error when used directly in Java. I looked into it and found that the Java regular expression engine supports fewer features. In version 1.6 you cannot use named groups (they are supported as of version 1.7), and a look-behind whose length has no obvious upper bound is rejected with the error shown below, not to mention balancing groups, which are not available at all. Matching infinitely nested tags is therefore unrealistic. The error is as follows:

java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index XX

I searched the internet for a long time without finding a perfect solution. However, we can match nested HTML tags up to a limited number of levels. The idea is much simpler than the unlimited case and needs none of those advanced features.
Example. The code is as follows:

<div id='container'>
<div style='background-color:gray;' id='footer'>
<a id='gotop' href='#' onclick='mgjs.goTop();return false;'>Top</a>
<a id='powered' href='http://wordpress.org/'>WordPress</a>
<div id='copyright'>
Copyright 2009 Simple Life -- Kevin Yang's blog
</div>
<div id='themeinfo'>
Theme by <a href='http://www.neoease.com/'>mg12</a>. Valid <a href='http://validator.w3.org/check?uri=referer'>XHTML 1.1</a>
and <a href='http://jigsaw.w3.org/css-validator/'>CSS 3</a>.
</div>
</div>
</div>

In the example above we want to match the div whose id is footer, together with the divs nested inside it, and we assume we know in advance that the footer div contains at most one level of nested div. Deeper nesting is discussed later.
Matching the start and end tags of footer is simple. The code is as follows:

<div[^>]*id='footer'[^>]*>......(the ellipsis will be filled in later)</div>

The content between the start and end tags falls into one of two cases:
Content A: a div tag that itself contains no nested div
Content B: any other content
The content is then simply these two cases repeated any number of times. The regular expression is as follows:

(<div[^>]*>.*?</div>|.)*?

Note that the question mark at the end is required. Without it the quantifier is greedy, and the match may run past the closing tag of footer to a later </div>.
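A minimal JavaScript sketch of the difference between the lazy and the greedy quantifier, assuming a hypothetical one-line sample in which the footer div is followed by an unrelated sibling div. The sample string and variable names are illustrative and not from the original post. In a JavaScript regex literal the slash in the closing tag has to be escaped.

```javascript
// Hypothetical sample: the footer div is followed by an unrelated sibling div.
const html = "<div id='footer'>a<div>x</div>b</div><div id='other'>c</div>";

// Lazy outer quantifier (*?): stops at the first </div> that completes the match.
const lazy = /<div[^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>/;

// Greedy outer quantifier (*): consumes as much as it can, then backtracks
// only as far as the last </div> in the input.
const greedy = /<div[^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*<\/div>/;

const lazyMatch = html.match(lazy)[0];
const greedyMatch = html.match(greedy)[0];

console.log(lazyMatch);   // <div id='footer'>a<div>x</div>b</div>
console.log(greedyMatch); // the whole string, including the 'other' div
```

With the greedy form, the unrelated sibling div is swallowed into the match, which is the mismatch the question mark prevents.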
Putting this together, the regular expression for matching at most one level of nested div is as follows:

<div[^>]*id='footer'[^>]*>(<div[^>]*>.*?</div>|.)*?</div>
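As a rough check, the one-level expression can be run against a shortened version of the sample page. This is only a sketch: the sample is abbreviated, the variable names are illustrative, and the `s` (dot-all) flag is an assumption added here so that the dot also matches the line breaks in the multi-line input. A naive pattern without the nested-div alternative is included for comparison.

```javascript
// Abbreviated, multi-line version of the sample page (illustrative only).
const html = [
  "<div id='container'>",
  "<div style='background-color:gray;' id='footer'>",
  "<a id='gotop' href='#'>Top</a>",
  "<div id='copyright'>",
  "Copyright 2009 Simple Life",
  "</div>",
  "<div id='themeinfo'>",
  "Theme by <a href='http://www.neoease.com/'>mg12</a>.",
  "</div>",
  "</div>",
  "</div>",
].join("\n");

// Naive pattern: stops at the first </div>, which closes 'copyright', not 'footer'.
const naive = /<div[^>]*id='footer'[^>]*>.*?<\/div>/s;

// One-level pattern: a nested div is consumed as a unit.
const oneLevel = /<div[^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>/s;

const naiveMatch = html.match(naive)[0];
const oneLevelMatch = html.match(oneLevel)[0];

console.log(naiveMatch.split("\n").length);    // 5 lines: cut off inside footer
console.log(oneLevelMatch.split("\n").length); // 9 lines: the complete footer div
```

The one-level match starts at the footer's opening tag and ends at its own closing tag, leaving the closing tag of the container outside the match.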

So what if the footer tag contains up to two levels of nested div?
It is not difficult. We only need to replace the .*? inside Content A with the whole content pattern. The modified expression is as follows:

<div[^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>.*?</div>|.)*?</div>|.)*?</div>
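The following sketch compares the one-level and two-level expressions on a hypothetical input with two levels of nesting. The sample string and variable names are made up for illustration.

```javascript
// Hypothetical sample: footer contains a div that itself contains a div.
const html = "<div id='footer'><div>a<div>b</div>c</div>tail</div>";

// One-level pattern: the nested div may not contain another div.
const oneLevel =
  /<div[^>]*id='footer'[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>/;

// Two-level pattern: the inner .*? is replaced by the content pattern itself.
const twoLevel =
  /<div[^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>.*?<\/div>|.)*?<\/div>|.)*?<\/div>/;

const oneLevelMatch = html.match(oneLevel)[0];
const twoLevelMatch = html.match(twoLevel)[0];

console.log(oneLevelMatch); // <div id='footer'><div>a<div>b</div>c</div>
console.log(twoLevelMatch); // the whole string, ending at footer's own </div>
```

The one-level expression still reports a match, but it stops too early and drops the trailing text, so choosing a depth that is too small fails silently.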

By now you probably know how to write the regular expression for matching up to three levels of nested div. The code is as follows:

<div[^>]*id='footer'[^>]*>(<div[^>]*>(<div[^>]*>(<div[^>]*>.*?</div>|.)*?</div>|.)*?</div>|.)*?</div>
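Since each extra level repeats the same substitution, the expression can also be generated instead of written by hand. The sketch below is one possible way to do that. The function names `buildNestedDivSource` and `buildNestedDivRegex` are hypothetical, the `id` is assumed to contain no regex metacharacters, and the `s` flag is an added assumption so that multi-line HTML also matches.

```javascript
// Builds the pattern source for a div with the given id that may contain
// divs nested at most `depth` levels deep.
// Assumption: `id` contains no regex metacharacters.
function buildNestedDivSource(id, depth) {
  // Depth 0: no nested div is expected inside, so plain lazy content.
  let content = ".*?";
  for (let i = 0; i < depth; i++) {
    // Wrap the previous level: either a div holding that content, or any character.
    content = "(<div[^>]*>" + content + "</div>|.)*?";
  }
  return "<div[^>]*id='" + id + "'[^>]*>" + content + "</div>";
}

// Compiles the source with the dot-all flag so the dot also matches line breaks.
function buildNestedDivRegex(id, depth) {
  return new RegExp(buildNestedDivSource(id, depth), "s");
}

// Hypothetical sample with three levels of nesting inside footer.
const html = "<div id='footer'><div><div><div>x</div></div></div>tail</div>";
const threeLevelMatch = html.match(buildNestedDivRegex("footer", 3))[0];

console.log(buildNestedDivSource("footer", 3));
console.log(threeLevelMatch === html); // true
```

For a depth of 3, the generated source is the same string as the three-level expression written out by hand.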

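So as long as your HTML structure is not very complex, that is, the nesting is not very deep, you can use this method to match nested HTML tags.
Because it uses no advanced features, this regular expression works in both Java and JavaScript. Note that the dot does not match line breaks by default, so for HTML that spans several lines enable dot-all mode (Pattern.DOTALL in Java, the s flag in JavaScript).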
