Crawling a Kugou song list with the PHPCrawl crawler library

This article shows how to use the PHPCrawl crawler library to capture a Kugou song list; if you are interested in PHP, read on.

After watching some videos about web crawlers, I wanted to crawl something myself. Recently there was a fierce sticker war going on on Facebook and I wanted to crawl all the stickers, but I couldn't find a suitable VPN in time. So instead, as in the video, I crawled a song list: capturing Kugou's January songs and their short descriptions to my local machine. The code is a bit messy and I am not satisfied with it, so I didn't really want to show it. But then I thought: this is my first crawler, a "first time" in my life worth recording. So... here is the unsightly code below.

(Ps. I simply added to, deleted from, and modified the example.php file that ships with the PHPCrawl library. Because the amount of captured data is small, I didn't bother with multi-processing or anything like that, but reading the PHPCrawl documentation I found that the library already encapsulates every feature I could think of, so it would be easy to add. Next time I plan to use it to crawl some "big data" and analyze it with a visualization tool. I'm a little excited already.)

<?php
header("Content-Type: text/html; charset=utf-8");

// It may take a while to crawl a site...
set_time_limit(10000);

include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br />")
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";

        $url = $DocInfo->url;
        $pat = "/http:\/\/www\.kugou\.com\/yy\/special\/single\/\d+\.html/";
        if (preg_match($pat, $url) > 0) {
            $this->parseSonglist($DocInfo);
        }
        flush();
    }

    public function parseSonglist($DocInfo)
    {
        $content = $DocInfo->content;
        $songlistArr = array();
        $songlistArr['raw_url'] = $DocInfo->url;

        // Parse the song-list title.
        // (The original regex was truncated when this article's HTML was
        // stripped; the pattern below is a plausible reconstruction and may
        // need adjusting against the live page markup.)
        $matches = array();
        $pat = "/Name:<\/span>([^<]+)/";
        $ret = preg_match($pat, $content, $matches);
        if ($ret > 0) {
            $songlistArr['title'] = $matches[1];
        } else {
            $songlistArr['title'] = '';
        }

        // Parse the songs.
        // (This regex was also lost in the source; the pattern below is a
        // guess that captures each song title from an anchor's title attribute.)
        $pat = "/<a[^>]*title=\"([^\"]+)\"[^>]*>/";
        $matches = array();
        preg_match_all($pat, $content, $matches);
        $songlistArr['songs'] = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            $song_title = $matches[1][$i];
            array_push($songlistArr['songs'], array('title' => $song_title));
        }

        echo "<pre>";
        print_r($songlistArr);
        echo "</pre>";
    }
}
$crawler = new MyCrawler();

// URL to crawl
$start_url = "http://www.kugou.com/yy/special/index/1-0-2.html";
$crawler->setURL($start_url);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Link extension: follow song-list detail pages and index pages
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/single/\d+\.html$# i");
$crawler->addURLFollowRule("#http://www\.kugou\.com/yy/special/index/\d+-2\.html$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// A traffic-limit of 0 means the crawling size is not limited
// (for a quick test you could cap it, e.g. 1 MB = 1000 * 1024 bytes)
$crawler->setTrafficLimit(0);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:" . $lb;
echo "Links followed: " . $report->links_followed . $lb;
echo "Documents received: " . $report->files_received . $lb;
echo "Bytes received: " . $report->bytes_received . " bytes" . $lb;
echo "Process runtime: " . $report->process_runtime . " sec" . $lb;
?>
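The script above only prints each parsed song list with print_r(), while the stated goal was to save the data to the local machine. As a minimal sketch of that last step, the hypothetical helper below (not part of PHPCrawl; the function name and file name are my own) appends each parsed array to a JSON-lines file, which could be called from parseSonglist() instead of the echo/print_r block:

```php
<?php
// Hypothetical helper: persist one parsed song list as a line of JSON.
// Appending one object per line lets results from repeated crawls accumulate.
function saveSonglist(array $songlistArr, string $path)
{
    $line = json_encode($songlistArr, JSON_UNESCAPED_UNICODE) . "\n";
    file_put_contents($path, $line, FILE_APPEND | LOCK_EX);
}

// Example call with the same structure parseSonglist() builds:
$songlistArr = array(
    'raw_url' => 'http://www.kugou.com/yy/special/single/123.html',
    'title'   => 'January Hits',
    'songs'   => array(array('title' => 'Song A'), array('title' => 'Song B')),
);
saveSonglist($songlistArr, 'songlists.jsonl');
```

JSON_UNESCAPED_UNICODE keeps Chinese song titles readable in the output file instead of \uXXXX escapes.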

The above introduced how to use the PHPCrawl crawler library to capture a Kugou song list. I hope it is helpful to friends interested in PHP.
