PHP Crawler Practice

Source: Internet
Author: User

I used to write crawlers with Node.js and Cheerio, but the asynchronous callbacks in Node.js were too painful to put up with.

Later I found the HtmlPageDom library for PHP. It offers a jQuery-like selector syntax and handles Chinese text.

Installation:

    composer require wa72/htmlpagedom

1. Read a simple web page, for example:

    require 'vendor/autoload.php';

    use \Wa72\HtmlPageDom\HtmlPageCrawler;

    $url = "http://news.cnblogs.com/";
    $dom = HtmlPageCrawler::create(file_get_contents($url));
    print $dom; // output the content

2. Analyze the page using jQuery selector syntax.

For example, to extract all the links on the first page of the cnblogs (Blog Park) news site:

    $news_list = $dom->filter("#news_list");
    $news_entry = $news_list->filter(".news_entry");
    $urls = [];
    $i = 0;
    $url_cnt = $news_entry->count();
    // print $url_cnt; // 30; searching the page for "posted in" in the browser
    // also finds 30 matches, which confirms the count
    while ($i < $url_cnt) {
        $urls[] = $news_entry->eq($i)->filter('a')->eq(0)->attr("href");
        ++$i;
    }

You may wonder why this does not use foreach.

The reason is that iterating over $news_entry->children() yields DOMElement objects instead of HtmlPageCrawler objects, so you cannot call filter on them without first wrapping each one again with HtmlPageCrawler::create().
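For comparison, here is a minimal sketch of the same extraction using only PHP's built-in DOM extension (DOMDocument and DOMXPath) instead of HtmlPageDom. With this approach foreach works, because each DOMElement can be passed to XPath as the context node. The sample markup and the function name firstLinks are assumptions made for illustration; the real page structure may differ.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = <<<HTML
<div id="news_list">
  <div class="news_entry"><a href="/n/1/">First</a> <a href="/n/1/#c">comments</a></div>
  <div class="news_entry"><a href="/n/2/">Second</a></div>
</div>
HTML;

// Return the href of the first link inside every .news_entry under #news_list.
function firstLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector "#news_list .news_entry".
    $entries = $xpath->query(
        '//*[@id="news_list"]//*[contains(concat(" ", normalize-space(@class), " "), " news_entry ")]'
    );

    $urls = [];
    foreach ($entries as $entry) {
        // The second argument makes the query relative to the current entry.
        $first = $xpath->query('.//a[@href]', $entry)->item(0);
        if ($first instanceof DOMElement) {
            $urls[] = $first->getAttribute('href');
        }
    }
    return $urls;
}

print_r(firstLinks($html));
```

The trade-off is that you write XPath by hand instead of jQuery-style selectors.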

3. Extract the news text

    $content = HtmlPageCrawler::create(file_get_contents($url . $urls[0]));

    print $content->filter("#news_body")->text();

4. Notes

Some websites are not encoded in UTF-8, so you have to transcode their content with iconv.
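The transcoding step can be sketched in isolation as follows. This version uses mb_check_encoding and mb_convert_encoding from the mbstring extension rather than iconv. The helper name toUtf8 and the choice of GBK as the fallback encoding are assumptions for illustration.

```php
<?php
// Return $html as UTF-8. Input that is already valid UTF-8 is returned unchanged;
// anything else is assumed to be in $fallback (GBK here, an assumption).
function toUtf8(string $html, string $fallback = 'GBK'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

// The two characters "中文" encoded as GBK bytes.
$gbk = "\xD6\xD0\xCE\xC4";
echo toUtf8($gbk), "\n";
```

Checking for valid UTF-8 first avoids guessing between encodings: a byte sequence that is valid UTF-8 can also look like valid GBK, so detection order matters.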

You can wrap this in a function, where $base is the root URL, because in many cases the links are relative.

    function httpGet($url, $base = null) {
        if ($base) {
            $url = $base . $url; // prepend the root URL to a relative link
        }
        $html = file_get_contents($url);
        // Try UTF-8 first: UTF-8 bytes can also pass as valid GBK
        $encode = mb_detect_encoding($html, "UTF-8,GBK", true);
        if (stripos($encode, "UTF") !== false) {
            return HtmlPageCrawler::create($html);
        } else {
            $utf_html = iconv("GBK", "UTF-8", $html);
            return HtmlPageCrawler::create($utf_html);
        }
    }
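Simple concatenation of the root URL and the link only works for one kind of relative link. As a rough sketch, a small resolver can distinguish absolute, protocol-relative, root-relative and path-relative links. The function name resolveUrl and the example paths are hypothetical, and the sketch does not normalize "../" segments or handle query-only links.

```php
<?php
// Resolve $link against $base. A simplified sketch, not a full RFC 3986 resolver.
function resolveUrl(string $base, string $link): string
{
    // Already absolute: the link starts with a scheme such as "http:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative link: reuse the scheme of the base URL.
    if (strpos($link, '//') === 0) {
        return $scheme . ':' . $link;
    }
    // Root-relative link: append to scheme and host.
    if (strpos($link, '/') === 0) {
        return $origin . $link;
    }
    // Path-relative link: replace everything after the last slash of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $link;
}

echo resolveUrl('http://news.cnblogs.com/', '/n/123/'), "\n";
```

The resolved URL can then be passed to file_get_contents in place of the concatenated one.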

If you get the HTML with the html() function, the output is HTML-entity encoded; you can decode it with html_entity_decode.

You can also use strip_tags to strip tags from the HTML, optionally keeping the tags you allow.
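A short illustration of both functions on hand-written sample strings (the strings are made up for this example):

```php
<?php
// Entity-encoded markup, similar to what an html() call might hand back.
$encoded = '&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;';

// Turn the entities back into real characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Remove every tag, leaving only the text.
$text = strip_tags($decoded);

// Remove every tag except <b>, passed as the list of allowed tags.
$kept = strip_tags('<p>Hello <b>world</b></p>', '<b>');

echo $decoded, "\n", $text, "\n", $kept, "\n";
```

Note that the second argument of strip_tags lists the tags to keep, not the tags to remove.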

An ID is unique, but classes and tags are not, so when you select by class or tag you have to use eq(0) to get the element, even if there is only one match.

jQuery has a has() function to check whether an element contains a given tag, but HtmlPageCrawler lacks it, so I added one manually.

In HtmlPageCrawler.php, below the hasClass function, add the following code:

    public function has($name) {
        foreach ($this->children() as $node) {
            if ($node instanceof \DOMElement) {
                $tagName = $node->tagName;
                // Exact, case-insensitive comparison; a substring match
                // would make has('a') also match <table> or <span>
                if (strcasecmp($tagName, $name) === 0) {
                    return true;
                }
            }
        }
        return false;
    }
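Files under vendor/ are overwritten whenever Composer updates the package, so a change made there is easy to lose. As an alternative, here is a minimal sketch of a standalone helper that works on a plain DOMNode and needs no library changes. The name hasChildTag is hypothetical. Like the method above, it only inspects direct children, and it compares tag names exactly while ignoring case.

```php
<?php
// Return true if $parent has a direct child element named $tagName (case-insensitive).
function hasChildTag(DOMNode $parent, string $tagName): bool
{
    foreach ($parent->childNodes as $child) {
        if ($child instanceof DOMElement && strcasecmp($child->tagName, $tagName) === 0) {
            return true;
        }
    }
    return false;
}

// Sample markup made up for this example.
$doc = new DOMDocument();
$doc->loadHTML('<div id="box"><span>text</span><table><tr><td>x</td></tr></table></div>');
$box = $doc->getElementsByTagName('div')->item(0);

var_dump(hasChildTag($box, 'span'));
var_dump(hasChildTag($box, 'a'));
```

Since iterating a crawler yields DOMElement objects, each of them can be passed to this helper directly.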
