[Repost] Matching nested HTML tags with regular expressions

Original link https://msdn.microsoft.com/zh-cn/ff686933.aspx

This article is from Kevin Yang's blog.

Overview

Regular expressions are an essential skill for text processing, for example in web server log analysis or front-end web development. Many advanced text editors support a subset of regular expression syntax, and knowing regular expressions often makes everyday work more efficient: counting lines of code, for instance, takes a single expression. Matching nested HTML tags is one of the harder problems in applied regular expressions because it draws on a lot of syntax, including some advanced features, which also makes it well worth studying.

Ideas

Every complex regular expression is built from simple sub-expressions. To write one, you need a solid grasp of the basics, and you also need to think from the regex engine's point of view. For how regex engines work, I recommend the book "Mastering Regular Expressions" (the title of its Chinese edition translates as "Proficient in Regular Expressions"). It is a very good book.

First, let us pin down the problem we want to solve: extract the innerHTML of the tag with a particular id from a piece of HTML text.

The main difficulty is that HTML tags nest: how do we find the closing tag that corresponds to the specified start tag?

We can approach it like this: match the first start tag, say a div (<div); from then on, every time a nested div start tag is encountered, push it onto a stack, and every time a div end tag is encountered, pop the stack. If the stack is already empty when an end tag is encountered, that end tag is the closing tag we are looking for.

I can think of it this way because I know what regular expressions can do, in particular that balancing groups (a .NET regex feature) can perform the stack operations just described. So to write complex regular expressions you need to know at least some of the advanced features; they give you a way to approach the problem.
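The push/pop idea can be sketched outside the regex engine as well. The following Python sketch is not the author's regular expression: Python's `re` module has no balancing groups, so a plain depth counter stands in for the stack. The function name `inner_html` and the sample string are hypothetical, and the sketch assumes valid HTML with no self-closing tags of the same name.

```python
import re

def inner_html(html, tag, open_end):
    """Return the text between a start tag that ends at index open_end
    and its matching closing tag, or None if no closing tag is found."""
    # One token pattern for both "<tag ...>" and "</tag>"; group 1 holds
    # the slash for a closing tag and is empty for a start tag.
    token = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(tag), re.IGNORECASE)
    depth = 0  # number of nested start tags still open (the "stack")
    for m in token.finditer(html, open_end):
        if m.group(1) == "":
            depth += 1          # nested start tag: push
        elif depth > 0:
            depth -= 1          # closes a nested tag: pop
        else:
            return html[open_end:m.start()]  # stack empty: the real closer
    return None

sample = '<div id="footer"><a>Top</a><div id="inner">text</div> tail</div><div>after</div>'
start = re.search(r'<div[^>]*\bid="footer"[^>]*>', sample)
print(inner_html(sample, "div", start.end()))
# prints: <a>Top</a><div id="inner">text</div> tail
```

A counter is enough here because only one tag name is tracked; the balancing-group approach keeps the same bookkeeping inside the regex engine itself.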

Implementation

We assume that the text to be matched is valid HTML. The following HTML, copied from my blog, serves as the test text. We want to match the innerHTML of the footer div and also capture the tag name.

<div style="" id="Footer">
    <a id="Gotop" href="#" onclick="Mgjs.gotop(); return false;">Top</a>
    <a id="powered" href="http://wordpress.org/">WordPress</a>
    <div id="Copyright">
        Copyright &copy; 2009 Simple Life--kevin Yang's blog
    </div>
    <div id="Themeinfo">
        Theme by <a href="http://www.neoease.com/">MG12</a>. Valid <a href="http://validator.w3.org/check?uri=referer">xhtml 1.1</a> and
        <a href="http://jigsaw.w3.org/css-validator/">css 3</a>

We will use the Expresso tool to build and test the regular expressions.

Matching the start tag

The start tag is easy to characterize: an opening angle bracket, followed by a run of letters (the tag name), then a stretch of attributes (characters other than angle brackets) containing id=footer, where id is matched case-insensitively. Note that footer may be wrapped in double quotes, in single quotes, or in nothing at all. The regular expression is as follows:

< (? 

A few notes on the regular expression above:

1. The angle bracket < is a special character in regular expressions: it encloses the group name in a named capture group. In this context, however, the opening angle bracket is unambiguous, so the expression behaves the same with or without an escape character in front of it.

2. The (?<groupname>regex) form defines a named group. Here we define the Htmltag group to hold the matched HTML tag name. The quote group is used for a back-reference.

3. (?(groupname)then|else) is a conditional: if the group groupname has captured something, the then branch is matched; otherwise the else branch is matched. In the regular expression above, we first try to match a quotation mark to the left of the footer string and put it into the Leftquote group. To the right of footer we then apply the conditional: if the Leftquote group matched earlier, the right-hand side must match the Leftquote group as well. This way all the ways of writing the id are matched accurately.
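As an illustration of notes 2 and 3, here is a minimal Python sketch of the quote handling. It is not the author's .NET expression, which is truncated above. Python writes named groups as `(?P<name>...)` and back-references as `(?P=name)`, while the conditional `(?(name)then|else)` has the same shape. The group names `tag` and `quote`, the name `ID_FOOTER`, and the trailing lookahead are assumptions made for this sketch.

```python
import re

ID_FOOTER = re.compile(
    r"""<(?P<tag>[a-z][a-z0-9]*)   # tag name
        [^<>]*?\s                  # any earlier attributes, then whitespace
        id\s*=\s*
        (?P<quote>["'])?           # optional opening quote
        footer
        (?(quote)(?P=quote))       # if a quote was opened, the same one must close
        (?=[\s/>])                 # the value must end here (added for this sketch)
        [^<>]*>                    # rest of the start tag
    """,
    re.IGNORECASE | re.VERBOSE,
)

for text in ('<div style="" id="Footer" >', "<p id='footer'>", "<p id=footer>",
             "<div id='footer\">", '<div id="footer2">'):
    m = ID_FOOTER.search(text)
    print(text, "->", m.group("tag") if m else None)
```

The first three inputs match and report their tag name; the last two are rejected, one because the closing quote differs from the opening one and one because the value is not exactly footer.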

Matching the closing tag

((?<nested><\k

After the start tag has been matched, the HTML text that follows falls into three cases:

A. A nested div start tag (<div). In this case it must be captured into the nested group.

B. A closing tag that pairs with a nested div start tag. In this case the previously captured nested group must be released.

C. Any other text. Note that this must use the lazy form .*? to turn off greedy matching; otherwise the match may run past the final closing tag.

Writing (regex1|regex2|regex3)* combines the three cases as alternatives and repeats them any number of times, after which the final closing tag is matched. Here (?<-nested>) releases a previously captured nested group. The full syntax is (?<n-m>regex): it releases the most recent capture of group m and stores the text between that capture and the current position in group n; if n is omitted, the capture of m is simply released.
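The remark in case C about greedy matching can be seen in a small Python sketch. The sample string is hypothetical, and lazy quantifiers behave the same way in Python's `re` as in .NET.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the end and backs up only as far as the LAST </b>.
greedy = re.search(r"<b>(.*)</b>", html).group(1)

# Lazy: .*? stops at the FIRST </b> that lets the match succeed.
lazy = re.search(r"<b>(.*?)</b>", html).group(1)

print(repr(greedy))  # 'one</b> and <b>two'
print(repr(lazy))    # 'one'
```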

Update: the discussion above concentrated on the analysis and never gave the complete regular expression, for which I apologize.

< (? 

The regular expression above matches any HTML tag with id=footer.

Note that this regular expression requires the Singleline option (singleline=true) so that the dot also matches line breaks.
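The effect of that option can be illustrated in Python, where the flag corresponding to .NET's Singleline is `re.DOTALL`. This is a sketch on a hypothetical multi-line string, not the author's test text.

```python
import re

text = "<div>\nline one\nline two\n</div>"

# Without the flag, "." stops at a line break, so the pattern cannot
# reach the closing tag on a later line.
without_dotall = re.search(r"<div>(.*?)</div>", text)

# With re.DOTALL, "." also matches "\n" and the whole block is captured.
with_dotall = re.search(r"<div>(.*?)</div>", text, re.DOTALL)

print(without_dotall)               # None
print(repr(with_dotall.group(1)))   # '\nline one\nline two\n'
```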

In reply to DOMOXZ's question: to match p tags, simply replace Htmltag in the regular expression above with p.
