Web page crawling: a summary of web page crawling frameworks in PHP
Source: http://www.ido321.com/1158.html
To capture the content of a web page, you need to parse the DOM tree, find the specified nodes, and then extract the content you want. Doing this by hand is somewhat cumbersome, so the author summarizes several common, easy-to-use web page scraping frameworks below. If you are familiar with jQuery selectors, these frameworks will feel quite natural.
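To see what these frameworks save you from, here is a minimal sketch of the manual approach using only PHP's built-in DOMDocument and DOMXPath. The HTML string and the helper name `find_divs_by_class` are made-up illustrations, not taken from the site or any library:

```php
<?php
// Manual DOM parsing with PHP's built-in extensions (no framework).
// Return the text of every <div> whose class attribute is exactly $class.
function find_divs_by_class(string $html, string $class): array
{
    $doc = new DOMDocument();
    // @ suppresses warnings about imperfect real-world markup
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query('//div[@class="' . $class . '"]') as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

// Example with a made-up HTML fragment:
$html = '<div class="focus">First</div><div class="other">Skip</div><div class="focus">Second</div>';
foreach (find_divs_by_class($html, 'focus') as $text) {
    echo $text, "<br>\n";
}
```

Compare this with the one-line selector calls the frameworks below offer for the same job.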
1. Ganon
Project address: http://code.google.com/p/ganon/
Document: http://code.google.com/p/ganon/w/list
Test: grab every div element on my site's homepage whose class attribute is "focus", and output its class value.
<?php
include 'ganon.php';
$html = file_get_dom('http://www.ido321.com/');
foreach ($html('div[class="focus"]') as $element) {
    echo $element->class, "<br>\n";
}
?>
Result:
2. phpQuery
Project address: http://code.google.com/p/phpquery/
Document: https://code.google.com/p/phpquery/wiki/Manual
Test: grab the article tag elements on my site's homepage and output the HTML of each one's h2 tag.
<?php
include 'phpQuery/phpQuery.php';
phpQuery::newDocumentFile('http://www.ido321.com/');
$artlist = pq("article");
foreach ($artlist as $title) {
    echo pq($title)->find('h2')->html() . "<br/>";
}
?>
Result:
3. Simple-Html-Dom
Address: http://simplehtmldom.sourceforge.net/
Document: http://simplehtmldom.sourceforge.net/manual.htm
Test: capture all links on the home page of my website
<?php
include 'simple_html_dom.php';
// A DOM can be created from either a URL or a file
$html = file_get_html('http://www.ido321.com/');
// Find all images
// foreach ($html->find('img') as $element)
//     echo $element->src . '<br>';
// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';
?>
Result: (partial)
4. Snoopy
Project address: http://code.google.com/p/phpquery/
Document: http://code.google.com/p/phpquery/wiki/Manual
Test: capture the homepage of my website
<?php
include("Snoopy.class.php");
$url = "http://www.ido321.com";
$snoopy = new Snoopy;
$snoopy->fetch($url);        // fetch the full page content
echo $snoopy->results;       // display the result
// $snoopy->fetchtext($url);  // fetch the text content only (HTML stripped)
// $snoopy->fetchlinks($url); // fetch the links
// $snoopy->fetchform($url);  // fetch the form
?>
Result:
5. Writing a crawler by hand
If your coding skills are up to it, you can write your own web crawler to capture pages. There are plenty of articles about this online, which the author won't repeat here; if you are interested, search Baidu for "php web crawler".
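As a rough orientation, the core of a hand-written crawler is: fetch a page, extract its links, and queue them for the next round. Below is a minimal sketch under that assumption; `extract_links` is a hypothetical helper name, the parsing uses only PHP's built-in DOM extension, and the crawl loop is left as comments because it needs network access:

```php
<?php
// Extract all non-empty href values from an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ suppresses warnings on messy markup
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Crawl loop sketch (network access assumed, depth/politeness omitted):
// $queue = ['http://www.ido321.com/'];
// $seen  = [];
// while ($url = array_shift($queue)) {
//     if (isset($seen[$url])) continue;
//     $seen[$url] = true;
//     $html = file_get_contents($url);
//     foreach (extract_links($html) as $link) {
//         $queue[] = $link;
//     }
// }
```

A real crawler would also need URL normalization, a visited-set with a size bound, and rate limiting, which the sketch omits.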
PS: Resource sharing
For a list of common open-source crawler projects, see: http://blog.chinaunix.net/uid-22414998-id-3774291.html
Q: How can a PHP web crawler collect part of a website?
A: You can use the simple_html_dom class to collect the data. If you know jQuery, you will pick it up quickly. Good luck.
Q: How can a crawler extract a page's keywords and summary for search?
A: Strip the markup first with strip_tags($string), then work on the plain text.
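To make that concrete, here is a small sketch of using the built-in strip_tags() to turn fetched HTML into a plain-text excerpt suitable for a summary index. The helper name `make_summary` and the sample HTML are illustrative, not from any library:

```php
<?php
// Turn an HTML fragment into a short plain-text summary.
function make_summary(string $html, int $length = 80): string
{
    $text = strip_tags($html);                       // drop all HTML tags
    $text = trim(preg_replace('/\s+/', ' ', $text)); // collapse whitespace
    return mb_substr($text, 0, $length);             // keep the leading excerpt
}

echo make_summary('<h1>Title</h1> <p>Some   article text.</p>'), "\n";
// Output: Title Some article text.
```

Note that strip_tags() does not insert spaces where tags were removed, so adjacent block elements with no whitespace between them will have their text run together; keyword extraction usually needs tokenization on top of this.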