Crawling a Kugou song list with the PHPCrawl crawler library

This article shows how to use the PHPCrawl crawler library to capture a Kugou song list; if you are interested in PHP, read on.

After watching some videos about web crawlers, I wanted to crawl something myself. Recently there was a fierce sticker war going on on Facebook and I wanted to crawl all the stickers, but I couldn't find a suitable VPN in time. So instead, as in the video, I crawled a song list: capturing Kugou's January songs and their short descriptions to my local machine. The code is a bit messy and I am not satisfied with it, so I didn't really want to show it. But then I thought: this is my first crawler, a "first time" in my life worth recording. So... here is the unsightly code below.

(Ps. I simply added to, deleted from, and modified the example.php file that ships with the PHPCrawl library. Because the amount of captured data is small, I didn't bother with multi-processing or anything like that, but reading the PHPCrawl documentation I found that the library already encapsulates every feature I could think of, so it would be easy to add. Next time I plan to use it to crawl some "big data" and analyze it with a visualization tool. I'm a little excited already.)

<?php
header("Content-Type: text/html; charset=utf-8");

// It may take a while to crawl a site...
set_time_limit(10000);

include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br />")
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";

        $url = $DocInfo->url;
        $pat = "/http:\/\/www\.kugou\.com\/yy\/special\/single\/\d+\.html/";
        if (preg_match($pat, $url) > 0) {
            $this->parseSonglist($DocInfo);
        }
        flush();
    }

    public function parseSonglist($DocInfo)
    {
        $content = $DocInfo->content;
        $songlistArr = array();
        $songlistArr['raw_url'] = $DocInfo->url;

        // Parse the song-list title.
        // (The original regex was truncated when this article's HTML was
        // stripped; the pattern below is a plausible reconstruction and may
        // need adjusting against the live page markup.)
        $matches = array();
        $pat = "/Name:<\/span>([^<]+)/";
        $ret = preg_match($pat, $content, $matches);
        if ($ret > 0) {
            $songlistArr['title'] = $matches[1];
        } else {
            $songlistArr['title'] = '';
        }

        // Parse the songs.
        // (This regex was also lost in the source; the pattern below is a
        // guess that captures each song title from an anchor's title attribute.)
        $pat = "/<a[^>]*title=\"([^\"]+)\"[^>]*>/";
        $matches = array();
        preg_match_all($pat, $content, $matches);
        $songlistArr['songs'] = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            $song_title = $matches[1][$i];
            array_push($songlistArr['songs'], array('title' => $song_title));
        }

        echo "<pre>";
        print_r($songlistArr);
        echo "</pre>";
    }
}
$crawler = new MyCrawler();

// URL to crawl
$start_url = "http://www.kugou.com/yy/special/index/1-0-2.html";
$crawler->setURL($start_url);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Link extension: follow song-list detail pages and index pages
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/single/\d+\.html$# i");
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/index/\d+-2\.html$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// A traffic-limit of 0 means the crawling size is not limited
// (for a quick test you could cap it, e.g. 1 MB = 1000 * 1024 bytes)
$crawler->setTrafficLimit(0);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:" . $lb;
echo "Links followed: " . $report->links_followed . $lb;
echo "Documents received: " . $report->files_received . $lb;
echo "Bytes received: " . $report->bytes_received . " bytes" . $lb;
echo "Process runtime: " . $report->process_runtime . " sec" . $lb;
?>
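The script above only prints each parsed song list with print_r(), while the stated goal was to save the data to the local machine. As a minimal sketch of that last step, the hypothetical helper below (not part of PHPCrawl; the function name and file name are my own) appends each parsed array to a JSON-lines file, which could be called from parseSonglist() instead of the echo/print_r block:

```php
<?php
// Hypothetical helper: persist one parsed song list as a line of JSON.
// Appending one object per line lets results from repeated crawls accumulate.
function saveSonglist(array $songlistArr, string $path)
{
    $line = json_encode($songlistArr, JSON_UNESCAPED_UNICODE) . "\n";
    file_put_contents($path, $line, FILE_APPEND | LOCK_EX);
}

// Example call with the same structure parseSonglist() builds:
$songlistArr = array(
    'raw_url' => 'http://www.kugou.com/yy/special/single/123.html',
    'title'   => 'January Hits',
    'songs'   => array(array('title' => 'Song A'), array('title' => 'Song B')),
);
saveSonglist($songlistArr, 'songlists.jsonl');
```

JSON_UNESCAPED_UNICODE keeps Chinese song titles readable in the output file instead of \uXXXX escapes.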

The above introduced how to use the PHPCrawl crawler library to capture a Kugou song list. I hope it is helpful to friends interested in PHP.
