This article introduces using the PHPCrawl crawler library to capture Kugou playlists; if you are interested in PHP tutorials, read on. After watching some videos about web crawlers, I wanted to crawl something myself. Recently there were fierce meme-sticker battles on Facebook and I wanted to crawl a pile of sticker packs, but I couldn't find a suitable VPN for a while, so I settled for crawling playlists like in the videos: capturing Kugou's January songs and their brief introductions to my local machine. The code is a bit messy and I am not satisfied with it, and I'd rather not show it, but since this is my first crawler I want to record a "first" in my life. So... here is the unsightly code below. (P.S. I directly added to, deleted from, and modified the example.php file that ships with the PHPCrawl library. Because the amount of captured data is small, I didn't bother with multiple processes, but reading the PHPCrawl documentation I found the library already encapsulates every feature I could think of, so that would be easy to add. Next time I plan to crawl some "big data" with it and analyze the data with a visualization tool. I'm a little excited already.)
<?php
header("Content-Type: text/html; charset=utf-8");
// It may take a while to crawl a site ...
set_time_limit(10000);
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br />")
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";

        $url = $DocInfo->url;
        // Only playlist detail pages get parsed
        $pat = "/http:\/\/www\.kugou\.com\/yy\/special\/single\/\d+\.html/";
        if (preg_match($pat, $url) > 0) {
            $this->parseSonglist($DocInfo);
        }
        flush();
    }
    public function parseSonglist($DocInfo)
    {
        $content = $DocInfo->content;
        $songlistArr = array();
        $songlistArr['raw_url'] = $DocInfo->url;

        // Parse the playlist introduction/title.
        // Note: the two patterns below were mangled when this article was
        // published (the "<" inside them swallowed the rest of each line),
        // so these are plausible reconstructions, not the exact originals.
        $matches = array();
        $pat = "/<\/span>([^<]+)</";
        $ret = preg_match($pat, $content, $matches);
        if ($ret > 0) {
            $songlistArr['title'] = $matches[1];
        } else {
            $songlistArr['title'] = '';
        }

        // Parse the songs
        $pat = "/<a[^>]*class=\"song\"[^>]*>([^<]+)<\/a>/";
        $matches = array();
        preg_match_all($pat, $content, $matches);
        $songlistArr['songs'] = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            $song_title = $matches[1][$i];
            array_push($songlistArr['songs'], array('title' => $song_title));
        }
        echo "<pre>";
        print_r($songlistArr);
        echo "</pre>";
    }
}
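To see the parsing step in isolation, here is a standalone sketch of the same logic that runs without PHPCrawl. The regex patterns and the sample HTML fragment are illustrative assumptions (the exact Kugou page markup, and the original patterns, were mangled in this article):

```php
<?php
// Standalone version of parseSonglist's extraction logic. The markup it
// assumes (a "Name:" span before the title, <a class="song"> for tracks)
// is hypothetical, for demonstration only.
function parseSonglistHtml(string $content, string $url): array
{
    $songlistArr = array();
    $songlistArr['raw_url'] = $url;

    // Playlist title: text between a </span> and the next tag
    $matches = array();
    if (preg_match('/<\/span>([^<]+)</', $content, $matches) > 0) {
        $songlistArr['title'] = trim($matches[1]);
    } else {
        $songlistArr['title'] = '';
    }

    // Song titles: anchor text of every song link
    $matches = array();
    preg_match_all('/<a[^>]*class="song"[^>]*>([^<]+)<\/a>/', $content, $matches);
    $songlistArr['songs'] = array();
    for ($i = 0; $i < count($matches[0]); $i++) {
        array_push($songlistArr['songs'], array('title' => $matches[1][$i]));
    }
    return $songlistArr;
}

// Example with a minimal HTML fragment:
$html = '<span>Name:</span>January Hits<div>'
      . '<a class="song" href="#">Song A</a>'
      . '<a class="song" href="#">Song B</a></div>';
print_r(parseSonglistHtml($html, 'http://www.kugou.com/yy/special/single/1.html'));
```

Running it prints an array with `title => January Hits` and two entries under `songs`, which is the same shape the crawler's `print_r` output has.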
$crawler = new MyCrawler();

// URL to crawl
$start_url = "http://www.kugou.com/yy/special/index/1-0-2.html";
$crawler->setURL($start_url);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Links to follow: playlist detail pages and further index pages
// (the index-page pattern was garbled in the original article; this is a
// best guess consistent with the start URL).
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/single/\d+\.html$#i");
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/index/\d+-\d+-2\.html$#i");
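The follow-rules are plain PCRE patterns, so you can sanity-check them against sample URLs before starting a crawl. The index-page pattern below is the same best-guess reconstruction used above; the sample URLs are made up for illustration:

```php
<?php
// Check which sample URLs the two follow-rules would accept.
$singleRule = '#http://www\.kugou\.com/yy/special/single/\d+\.html$#i';
$indexRule  = '#http://www\.kugou\.com/yy/special/index/\d+-\d+-2\.html$#i';

$urls = array(
    'http://www.kugou.com/yy/special/single/12345.html', // playlist detail page
    'http://www.kugou.com/yy/special/index/1-0-2.html',  // index page (start URL)
    'http://www.kugou.com/yy/static/logo.png',           // matches neither rule
);

foreach ($urls as $u) {
    $followed = preg_match($singleRule, $u) || preg_match($indexRule, $u);
    echo $u . ' => ' . ($followed ? 'follow' : 'skip') . "\n";
}
```

The first two URLs print `follow`, the image URL prints `skip`, which is exactly the behavior the crawler's link filter will show.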
// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// setTrafficLimit(0) means the crawl size is not limited
// (for testing you could pass e.g. 1000 * 1024 to stop after roughly 1 MB)
$crawler->setTrafficLimit(0);

// That's enough, now here we go
$crawler->go();
// At the end, after the process has finished, print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();
if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";
echo "Summary:" . $lb;
echo "Links followed: " . $report->links_followed . $lb;
echo "Documents received: " . $report->files_received . $lb;
echo "Bytes received: " . $report->bytes_received . " bytes" . $lb;
echo "Process runtime: " . $report->process_runtime . " sec" . $lb;
?>
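The stated goal was to save the captured playlists to the local machine, but the script above only `print_r`'s them. A minimal sketch of that missing step, writing one JSON file per playlist (the `saveSonglist` helper and its file-naming scheme are my own additions, not part of the original script):

```php
<?php
// Persist one parsed playlist array (as built by parseSonglist) to disk.
// The file name is taken from the numeric id in the playlist URL, falling
// back to an md5 of the URL when no id is present.
function saveSonglist(array $songlistArr, string $dir): string
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    if (preg_match('/(\d+)\.html/', $songlistArr['raw_url'], $m)) {
        $name = $m[1];
    } else {
        $name = md5($songlistArr['raw_url']);
    }
    $path = $dir . '/' . $name . '.json';
    file_put_contents(
        $path,
        json_encode($songlistArr, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT)
    );
    return $path;
}
```

Calling `saveSonglist($songlistArr, './songlists')` at the end of `parseSonglist` instead of (or in addition to) the `print_r` would give you a local, re-loadable copy of every captured playlist.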
The above introduced using the PHPCrawl crawler library to capture Kugou playlists. I hope it is helpful to friends interested in PHP tutorials.