Methods and techniques for matching nested HTML tags with PHP regular expressions

Last Update:2018-07-26 Source: Internet

Author: User


When reprinting, please cite the source: http://blog.csdn.net/donglynn/article/details/35788879

Regular expressions are a very useful programming skill. Grabbing a single piece of information from an HTML page, such as the text in <title>title</title>, is usually easy. More often, though, we need to scrape specific content from multiple <div></div> blocks on a list page, where the <div></div> blocks are nested and several pieces of information must be captured from each repeating <div></div> block. In addition, a web page's source differs from an ordinary string: it contains many carriage returns, line feeds, and tabs, any of which can cause a match to fail. Beginners often cannot tell which step is going wrong, and the sight of a sophisticated regular expression can be discouraging enough to make them give up on the problem.

After several days of experimenting, I arrived at the following methods and tips. Corrections and discussion are welcome.

Note the following points and steps:

1. A / inside the pattern must be escaped as \/, otherwise PHP reports an error:

preg_match_all() [function.preg-match-all]: Unknown modifier
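
As a minimal sketch of note 1, the snippet below uses a hypothetical one-line sample string in $html. It shows the escaped form next to an alternative that the text does not mention: choosing a different delimiter such as #, so that / needs no escaping at all.

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<div><a href="1.jpg"></a></div>';

// An unescaped "/" inside "/.../" ends the pattern early; PHP then reads
// the remaining characters as modifiers and warns "Unknown modifier":
// preg_match_all('/<a href="([^<>]+)"></a>/', $html, $m);

// Fix 1: escape every "/" inside the pattern as "\/".
$escaped = '/<a href="([^<>]+)"><\/a>/';

// Fix 2 (an alternative not covered in the text): use another delimiter.
$altDelimiter = '#<a href="([^<>]+)"></a>#';

preg_match_all($escaped, $html, $m1);
preg_match_all($altDelimiter, $html, $m2);

// Both patterns capture the same href value.
var_dump($m1[1], $m2[1]);
```

When a pattern contains many slashes, as HTML closing tags do, the alternative delimiter tends to be easier to read.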

2. Write the regular expression as a single-quoted string, with ' delimiting the string and / marking the start and end of the pattern, for example '/reg pattern/'. That way the double quotes " inside the regular expression do not have to be escaped.

For example:

$partten = '/<div class="goods_item"><a href="([^<>]+)" target="_blank"><img src="([^<>]+)" alt="([^<>]+)" width="" height="" \/>/';
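
To illustrate note 2, here is a small sketch comparing the two quoting styles. The variable names $single and $double are made up for this example.

```php
<?php
// Single-quoted PHP string: double quotes inside need no escaping.
$single = '/<a href="([^<>]+)" target="_blank">/';

// Double-quoted equivalent: every inner double quote must be escaped.
$double = "/<a href=\"([^<>]+)\" target=\"_blank\">/";

// Both literals produce exactly the same pattern string.
var_dump($single === $double);
```

Single quotes also stop PHP from interpolating $variables inside the pattern, which avoids a second class of surprises.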

3. Remove all line breaks, tabs, and carriage returns from the source first. HTML that is formatted for readability contains these characters, and they prevent the pattern from matching.

$str = preg_replace("/[\t\n\r]+/", "", $str);
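
The effect of note 3 can be sketched as follows. The multi-line sample in $html and the shortened pattern are assumptions made for illustration; only the preg_replace call comes from the text.

```php
<?php
// Hypothetical multi-line source, as it might appear in a page's HTML.
$html = "<div class=\"goods\">\r\n\t<a href=\"http://url1111\">\r\n\t</a>\r\n</div>";

$pattern = '/<div class="goods"><a href="([^<>]+)"><\/a><\/div>/';

// Before cleaning, the line breaks and tabs between tags block the match.
$before = preg_match($pattern, $html);

// Remove tabs, line feeds, and carriage returns, as in note 3.
$clean = preg_replace('/[\t\n\r]+/', '', $html);
$after = preg_match($pattern, $clean, $m);

var_dump($before, $after, $m[1]);
```

Indentation made of spaces is not removed by this character class. Allowing \s* between tags in the pattern is one alternative that leaves the input untouched.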

4. The information we want is usually the value of an attribute inside an HTML element, so exclude < and > from the capture group. Otherwise a single match runs from the first occurrence all the way to the last one in the string.

For example, given $string = '<div><a href="1.jpg"></a></div><div><a href="2.jpg"></a></div><div><a href="3.jpg"></a></div>',

the pattern $partten = '/<div><a href="(.+)"/'; produces a single match whose captured value is: 1.jpg"></a></div><div><a href="2.jpg"></a></div><div><a href="3.jpg

This is because the regular expression does not confine the match to the first hyperlink, <a href="1.jpg"></a>. The greedy (.+) keeps going until the last double quote in the string.

Therefore, to capture the href attribute of each of the three hyperlinks, confine each match to a single tag such as <a href="1.jpg">. The fix is simple: replace (.+) with ([^<>]+). The capture then cannot contain < or >, so it cannot run past the end of the tag it started in.
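
The difference described in note 4 can be checked with a short sketch using the three-link sample string. The lazy quantifier (.+?) in the last call is an additional option that the text does not discuss, and the variable names are chosen for this example.

```php
<?php
// The three-link sample string from note 4.
$string = '<div><a href="1.jpg"></a></div>'
        . '<div><a href="2.jpg"></a></div>'
        . '<div><a href="3.jpg"></a></div>';

// Greedy (.+) runs to the last double quote in the string: one long match.
preg_match_all('/<a href="(.+)"/', $string, $greedy);

// ([^<>]+) cannot cross a tag boundary, so each match stays inside one tag.
preg_match_all('/<a href="([^<>]+)"/', $string, $bounded);

// Lazy (.+?) stops at the first closing quote (not covered in the text).
preg_match_all('/<a href="(.+?)"/', $string, $lazy);

var_dump(count($greedy[1]), $bounded[1], $lazy[1]);
```

The negated class also documents the intent directly in the pattern: the value must not contain tag delimiters.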

With the steps above in place, the nesting of the HTML tags can be ignored entirely, and the content we want can be scraped from every repeating div block on a page. A complete example follows.

<?php
// HTML to be matched
$html = '<div class="goods">
<a href="http://url1111" target="_blank">
<img src="http://1111.jpg" alt="alt1111" width="" height="" />
</a>
</div>
<div class="goods">
<a href="http://url2222" target="_blank">
<img src="http://2222.jpg" alt="alt2222" width="" height="" />
</a>
</div>
<div class="goods">
<a href="http://url3333" target="_blank">
<img src="http://3333.jpg" alt="alt3333" width="" height="" />
</a>
</div>';

// Remove line breaks, tabs, and other special characters; echo $html to see the effect
$html = preg_replace("/[\t\n\r]+/", "", $html);

// Match expression. Note two points: it is enclosed in '/.../', and every / inside it is escaped as \/
$partern = '/<div class="goods"><a href="([^<>]+)" target="_blank"><img src="([^<>]+)" alt="([^<>]+)" width="" height="" \/><\/a><\/div>/';

// Run the match
preg_match_all($partern, $html, $result);

// Print the results
var_dump($result);
?>

The output contains four sub-arrays. The first holds the full text of each match, and the following three hold the values captured by the three groups in the pattern:

array(4) {
  [0]=> array(3) {
    [0]=> string(144) "<div class="goods"><a href="http://url1111" target="_blank"><img src="http://1111.jpg" alt="alt1111" width="" height="" /></a></div>"
    [1]=> string(144) "<div class="goods"><a href="http://url2222" target="_blank"><img src="http://2222.jpg" alt="alt2222" width="" height="" /></a></div>"
    [2]=> string(144) "<div class="goods"><a href="http://url3333" target="_blank"><img src="http://3333.jpg" alt="alt3333" width="" height="" /></a></div>"
  }
  [1]=> array(3) {
    [0]=> string(14) "http://url1111"
    [1]=> string(14) "http://url2222"
    [2]=> string(14) "http://url3333"
  }
  [2]=> array(3) {
    [0]=> string(15) "http://1111.jpg"
    [1]=> string(15) "http://2222.jpg"
    [2]=> string(15) "http://3333.jpg"
  }
  [3]=> array(3) {
    [0]=> string(7) "alt1111"
    [1]=> string(7) "alt2222"
    [2]=> string(7) "alt3333"
  }
}
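
The result above is grouped by capture group, which is the default ordering of preg_match_all. When one row per div block is more convenient, the standard PREG_SET_ORDER flag regroups the result by match. The following is a sketch under assumptions: the two-block sample HTML is already stripped of line breaks, and its <img> attributes are inferred from the captured values rather than taken from the original page.

```php
<?php
// Cleaned sample HTML; the <img> attributes are assumed for illustration.
$html = '<div class="goods"><a href="http://url1111" target="_blank">'
      . '<img src="http://1111.jpg" alt="alt1111" width="" height="" /></a></div>'
      . '<div class="goods"><a href="http://url2222" target="_blank">'
      . '<img src="http://2222.jpg" alt="alt2222" width="" height="" /></a></div>';

$pattern = '/<div class="goods"><a href="([^<>]+)" target="_blank">'
         . '<img src="([^<>]+)" alt="([^<>]+)" width="" height="" \/><\/a><\/div>/';

// PREG_SET_ORDER groups the captures by match instead of by group number.
preg_match_all($pattern, $html, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    // $row[0] is the full match; $row[1], $row[2], $row[3] are href, src, alt.
    echo $row[1], ' | ', $row[2], ' | ', $row[3], PHP_EOL;
}
```

Each element of $rows then describes one div block, so the href, image source, and alt text of an item stay together.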

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
