PHP regular expression methods and tips for matching nested HTML tags

Last Update:2014-06-30 Source: Internet

Author: User

Tags: php, regular expression

When reprinting, please cite the source: http://blog.csdn.net/donglynn/article/details/35788879

Regular expressions are a very useful programming skill. In general, it is easy to extract a single piece of information from an HTML page, such as the text inside <title></title>. More often, though, we want to scrape a list page built from many repeated <div></div> blocks. The <div></div> blocks are nested, and what we need is several pieces of information from each repeated block. On top of that, the source of a web page is not an ordinary string: it contains a large number of carriage returns, line feeds, and tabs, which cause the match to fail. Beginners often cannot tell which step is at fault, and the sight of a highly sophisticated regular expression can be frustrating enough to make them give up on the problem.

After several days of research, I finally worked out the following methods and techniques. Corrections and discussion are welcome.

Take a look at the following points and steps:

1. Note that a literal / inside the pattern must be escaped as \/. Otherwise PHP reports an error:

preg_match_all() [function.preg-match-all]: Unknown modifier
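To make point 1 concrete, here is a minimal sketch. The sample string and variable names are made up for illustration. It shows the escaped form used in this article and, as alternatives the article itself does not use, a different delimiter and preg_quote().

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<div><a href="1.jpg"></a></div>';

// An unescaped slash, as in '/</a>/', ends the pattern early:
// PHP then reads the rest as modifiers and warns "Unknown modifier".

// Option 1: escape every literal slash as \/ (the approach in this article).
$escaped = '/<\/a><\/div>/';
$countEscaped = preg_match($escaped, $html);

// Option 2 (an alternative): pick a delimiter that does not occur in the pattern.
$hash = '#</a></div>#';
$countHash = preg_match($hash, $html);

// Option 3 (an alternative): let preg_quote() escape literal text,
// passing the delimiter as the second argument.
$quoted = '/' . preg_quote('</a></div>', '/') . '/';
$countQuoted = preg_match($quoted, $html);

var_dump($countEscaped, $countHash, $countQuoted);
```

All three patterns should find the closing tags. Which form to use is mostly a matter of readability.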

2. Write the regular expression as a single-quoted string with / as the delimiter at the beginning and the end, such as '/reg pattern/'. With this notation, the double quotes inside the regular expression do not have to be escaped.

For example:

$pattern = '/<div class="goods_item"><a href="([^<>]+)" target="_blank"><img src="([^<>]+)" alt="([^<>]+)" width="" height="" \/>/';
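The following sketch of point 2 uses a made-up input string and writes the same pattern once in single quotes and once in double quotes. The two are equivalent. The single-quoted form is simply easier to read, because the double quotes around the HTML attribute values need no backslashes.

```php
<?php
// Hypothetical sample input, for illustration only.
$html = '<a href="http://example.com/a" target="_blank">';

// Single-quoted PHP string: the double quotes in the pattern are written as-is.
$single = '/<a href="([^<>]+)" target="_blank">/';

// Double-quoted PHP string: every double quote must be escaped with a backslash.
$double = "/<a href=\"([^<>]+)\" target=\"_blank\">/";

preg_match($single, $html, $m1);
preg_match($double, $html, $m2);

var_dump($single === $double); // the two pattern strings are identical
echo $m1[1], "\n";             // the captured href value
```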

3. Remove all line breaks, tabs, carriage returns, and similar characters first. HTML source that is formatted for readability contains these characters, and they will cause the match to fail.

$str = preg_replace("/[\t\n\r]+/", "", $str);
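As a quick illustration of point 3, the sketch below runs one pattern against a pretty-printed fragment before and after stripping. The fragment and the variable names are assumptions made for this example.

```php
<?php
// Hypothetical, pretty-printed input: tabs for indentation, one tag per line.
$html = "<div class=\"goods\">\n\t<a href=\"1.jpg\">\n\t</a>\n</div>";

$pattern = '/<div class="goods"><a href="([^<>]+)"><\/a><\/div>/';

// Against the raw source the pattern fails, because of the \n and \t between tags.
$before = preg_match($pattern, $html);

// Strip tabs, line feeds and carriage returns, then try again.
$flat = preg_replace("/[\t\n\r]+/", "", $html);
$after = preg_match($pattern, $flat, $m);

var_dump($before, $after);
echo $flat, "\n";
```

Note that this character class leaves ordinary spaces alone, which is deliberate: removing them would also damage the spaces between attributes. A page indented with spaces rather than tabs would therefore need extra handling, for example \s* between tags in the pattern.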

4. The information we want to match is usually the value of an attribute inside an HTML element, so the capture should exclude < and >. Otherwise a single match swallows everything up to the last possible ending point.

For example, given $string = '<div><a href="1.jpg"></a></div><div><a href="2.jpg"></a></div><div><a href="3.jpg"></a></div>';

$pattern = '/<div><a href="(.+)"/'; The captured result is: 1.jpg"></a></div><div><a href="2.jpg"></a></div><div><a href="3.jpg

This is because nothing in the pattern above confines the match to the first hyperlink <a href="1.jpg"></a>: .+ is greedy and takes as much as it can.

Therefore, to match the href attributes of all three hyperlinks, the match has to be limited to <a href="1.jpg">. The method is simple: replace (.+) with ([^<>]+). The capture then cannot contain < or >, so it cannot run past the end of the current tag, which keeps each match within a single HTML tag.
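The difference between the two patterns can be checked with a short script. This is only a sketch built on the three-link string from point 4. The names $greedy and $bounded are made up here.

```php
<?php
$string = '<div><a href="1.jpg"></a></div><div><a href="2.jpg"></a></div><div><a href="3.jpg"></a></div>';

// Greedy (.+): one match whose capture runs to the last double quote in the string.
preg_match_all('/<div><a href="(.+)"/', $string, $greedy);

// ([^<>]+): the capture cannot cross a tag boundary, so each link matches separately.
preg_match_all('/<div><a href="([^<>]+)"/', $string, $bounded);

echo count($greedy[1]), "\n"; // number of matches with the greedy pattern
print_r($bounded[1]);         // the href values found with the bounded pattern
```

A lazy quantifier, (.+?), would also stop at the first closing quote. The negated character class is the approach this article uses.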

With the steps above, you can ignore the HTML tag nesting problem entirely and extract the content you are interested in from every repeated div block on a page. A complete example follows.

<?php
// HTML code to be matched
$html = '<div class="goods"><a href="http://url1111" target="_blank"><img src="http://1111.jpg" alt="alt1111" width="" height="" /></a></div><div class="goods"><a href="http://url2222" target="_blank"><img src="http://2222.jpg" alt="alt2222" width="" height="" /></a></div><div class="goods"><a href="http://url3333" target="_blank"><img src="http://3333.jpg" alt="alt3333" width="" height="" /></a></div>';

// Remove line breaks, tabs and similar characters; echo $html to see the effect
$html = preg_replace("/[\t\n\r]+/", "", $html);

// The match expression. Note two points: it is enclosed in '/ /', and every literal / is escaped as \/
$pattern = '/<div class="goods"><a href="([^<>]+)" target="_blank"><img src="([^<>]+)" alt="([^<>]+)" width="" height="" \/><\/a><\/div>/';

// Run the match
preg_match_all($pattern, $html, $result);

// Print the result
var_dump($result);
?>

The output contains four sub-arrays. The first holds the full text of every match, and the next three hold the values of the three capture groups in the pattern:

array(4) {
  [0]=>
  array(3) {
    [0]=>
    string(132) "<div class="goods"><a href="http://url1111" target="_blank"><img src="http://1111.jpg" alt="alt1111" width="" height="" /></a></div>"
    [1]=>
    string(132) "<div class="goods"><a href="http://url2222" target="_blank"><img src="http://2222.jpg" alt="alt2222" width="" height="" /></a></div>"
    [2]=>
    string(132) "<div class="goods"><a href="http://url3333" target="_blank"><img src="http://3333.jpg" alt="alt3333" width="" height="" /></a></div>"
  }
  [1]=>
  array(3) {
    [0]=>
    string(14) "http://url1111"
    [1]=>
    string(14) "http://url2222"
    [2]=>
    string(14) "http://url3333"
  }
  [2]=>
  array(3) {
    [0]=>
    string(15) "http://1111.jpg"
    [1]=>
    string(15) "http://2222.jpg"
    [2]=>
    string(15) "http://3333.jpg"
  }
  [3]=>
  array(3) {
    [0]=>
    string(7) "alt1111"
    [1]=>
    string(7) "alt2222"
    [2]=>
    string(7) "alt3333"
  }
}
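If one record per <div> block is more convenient than one array per capture group, preg_match_all() also accepts the PREG_SET_ORDER flag. The sketch below assumes markup of the same shape as the example above. The $items array and its key names are hypothetical.

```php
<?php
// Sample markup, assumed to have the same shape as the example above (two blocks for brevity).
$html = '<div class="goods"><a href="http://url1111" target="_blank">'
      . '<img src="http://1111.jpg" alt="alt1111" width="" height="" /></a></div>'
      . '<div class="goods"><a href="http://url2222" target="_blank">'
      . '<img src="http://2222.jpg" alt="alt2222" width="" height="" /></a></div>';

$pattern = '/<div class="goods"><a href="([^<>]+)" target="_blank">'
         . '<img src="([^<>]+)" alt="([^<>]+)" width="" height="" \/><\/a><\/div>/';

// PREG_SET_ORDER groups the results by match instead of by capture group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$items = [];
foreach ($matches as $m) {
    // $m[0] is the whole block; $m[1], $m[2] and $m[3] are the three captures.
    $items[] = ['href' => $m[1], 'src' => $m[2], 'alt' => $m[3]];
}

print_r($items);
```

Each element of $items then describes one goods block, which is usually the shape wanted when saving scraped records.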
