A lightweight, simple crawler implemented in PHP



I recently needed to collect some data, and saving pages from the browser one by one is really cumbersome, nor is it conducive to storage and retrieval. So I wrote a small crawler and set it loose on the internet; so far it has crawled nearly a million pages. I am now looking for ways to process this data.

Structure of the crawler:
The principle of a crawler is actually very simple: analyze a downloaded page, find the links in it, download those links, analyze the downloads in turn, and repeat the cycle. For data storage, a database is the first choice because it makes retrieval easy; as for the development language, anything that supports regular expressions will do. I chose MySQL for the database and PHP for the development script: it supports Perl-compatible regular expressions, connects to MySQL easily, supports HTTP downloads, and can be deployed on both Windows and Linux.
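In outline, that loop might look like the sketch below. This is a simplified illustration only; the full spider.php listed at the end of the article is the actual implementation, and the starting URL here is a placeholder.

<?php
// Simplified sketch of the crawl loop described above -- not the real implementation.
$start_url = "http://example.com/";      // hypothetical starting point
$queue = array($start_url);              // URLs still waiting to be downloaded
$seen  = array();                        // URLs that have already been processed

while ($queue) {
    $url = array_shift($queue);
    if (isset($seen[$url])) {
        continue;                        // avoid downloading the same page twice
    }
    $seen[$url] = true;

    $html = @file_get_contents($url);    // the real script uses cURL for the download
    if ($html === false) {
        continue;
    }
    // ... store $html, e.g. INSERT it into a MySQL table ...

    // extract links (see the regular expressions below) and queue the new ones
    if (preg_match_all("#<a[^>]+href=(['\"])(.+)\\1#isU", $html, $m)) {
        foreach ($m[2] as $link) {
            $queue[] = $link;
        }
    }
}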

Regular Expressions:
Regular expressions are a basic tool for working with text; to pull the links and images out of the HTML, I use the following regular expressions.

Copy the code as follows:
"#<a[^>]+href=(['\"])(.+)\\1#isU"    # matches links
"#<img[^>]+src=(['\"])(.+)\\1#isU"   # matches images

Other issues:
One thing to watch out for is that URLs which have already been downloaded must not be downloaded again, and some pages link to each other in loops, so this has to be dealt with. My approach is to compute the MD5 value of each processed URL and store it in the database, so that before downloading a URL it can be checked against the history. There are of course better algorithms; if you are interested, you can find them online.
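A minimal sketch of that check might look like the following; the table and column names here (url_history, url_md5) are assumptions for illustration, not the ones used in the downloadable source.

<?php
// Hypothetical MD5-based URL de-duplication; table/column names are assumptions.
function url_already_seen(mysqli $db, $url) {
    $md5  = md5($url);
    $stmt = $db->prepare("SELECT 1 FROM url_history WHERE url_md5 = ?");
    $stmt->bind_param("s", $md5);
    $stmt->execute();
    $stmt->store_result();
    return $stmt->num_rows > 0;
}

function mark_url_seen(mysqli $db, $url) {
    $md5  = md5($url);
    $stmt = $db->prepare("INSERT IGNORE INTO url_history (url_md5) VALUES (?)");
    $stmt->bind_param("s", $md5);
    $stmt->execute();
}

$db  = new mysqli("localhost", "user", "password", "net_spider");
$url = "http://example.com/some/page.html";
if (!url_already_seen($db, $url)) {
    // ... download and store the page ...
    mark_url_seen($db, $url);
}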

Related protocols:
Crawlers have their own protocol too: the robots.txt file defines which parts of a site may be crawled. Because my time was limited, I did not implement this feature.
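For reference, a very naive robots.txt check could look like the sketch below. It is not part of the author's crawler, and a real implementation would also need to honour User-agent sections, Allow rules, and wildcards.

<?php
// Naive robots.txt check -- illustration only, not part of the original crawler.
function is_path_disallowed($site_base, $path) {
    $robots = @file_get_contents(rtrim($site_base, "/") . "/robots.txt");
    if ($robots === false) {
        return false;                       // no robots.txt: assume everything is allowed
    }
    foreach (explode("\n", $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)) {
            if (strpos($path, $m[1]) === 0) {
                return true;                // path starts with a disallowed prefix
            }
        }
    }
    return false;
}

var_dump(is_path_disallowed("http://example.com", "/private/page.html"));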

Other notes:
PHP supports class-based programming, so I wrote the crawler's main classes as follows (a rough skeleton is sketched after the list):
1. URL handling (web_site_info): mainly processes URLs and parses out the domain name, and so on.
2. Database operations (mysql_insert.php): handles everything related to the database.
3. History handling: records the URLs that have already been processed.
4. The crawler itself.
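As a rough illustration, the division of responsibilities could be sketched like this; the method names below are hypothetical, and the real interfaces are in the downloadable source.

<?php
// Hypothetical skeletons only -- the actual class and method names are in the download.
class web_site_info {
    // URL handling: parse URLs, extract the domain, resolve relative paths, etc.
    public function get_domain($url) { return parse_url($url, PHP_URL_HOST); }
}

class mysql_insert {
    // Database operations: connect and run all crawler-related queries.
    private $db;
    public function __construct($host, $user, $pass, $name) {
        $this->db = new mysqli($host, $user, $pass, $name);
    }
}

class history {
    // History handling: remember which URLs have already been processed.
    private $seen = array();
    public function is_seen($url) { return isset($this->seen[md5($url)]); }
    public function mark($url)    { $this->seen[md5($url)] = true; }
}

class spider {
    // The crawler itself: download pages, extract links, drive the loop.
}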

Existing problems and deficiencies

This crawler runs well when the amount of data is small, but with a large volume of data the history-handling class is not very efficient. By adjusting the database structure and indexing the relevant fields, the speed improved, but the data still has to be read constantly, and this may also be related to how PHP itself implements arrays: loading 100,000 history entries at once is very slow.
Multi-threading is not supported; only one URL can be processed at a time.
PHP itself has a memory usage limit; once, while crawling at a depth of 20 pages, the program ran out of memory and was killed.
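A common workaround (not something the original script does) is to raise PHP's memory limit, either in php.ini, on the command line, or at the top of the script:

<?php
// Raise the memory limit for this run; this eases the symptom, it does not fix the growth.
ini_set('memory_limit', '512M');
echo ini_get('memory_limit'), "\n";   // confirm the new value

The same thing can be done per run with php -d memory_limit=512M -f spider.php ...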

The source code can be downloaded from the following URL:

http://xiazai.jb51.net/201506/other/net_spider.rar


To use it, create a net_spider database in MySQL, then create the related tables with db.sql, and set the MySQL username and password in config.php.
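For example, config.php might contain something along these lines (the variable names are only a guess for illustration; check the actual file in the download):

<?php
// Hypothetical config.php -- the real variable names are defined in the downloaded source.
$db_host = 'localhost';
$db_user = 'root';
$db_pass = 'your_password';
$db_name = 'net_spider';

The database itself can be created beforehand with CREATE DATABASE net_spider; before importing db.sql.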
Finally, run the crawler.

Copy the code as follows:
php -f spider.php depth url

and it will start working. For example:

Copy the code as follows:
php -f spider.php http://news.sina.com.cn

Looking back on it, writing a crawler is actually not that complicated; the hard part is storing and retrieving the data. In my current database the largest data table is already 15 GB, and I am still trying to work out how to handle this data; MySQL queries are already feeling a bit overwhelmed. For that, I really admire Google.
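For reference, the full spider.php source is reproduced below.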

<?php
# Download a page
function curl_get($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    $result = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($code != '404' && $result) {
        return $result;
    }
    curl_close($ch);
}

# Extract the URL links from a page
function get_page_urls($spider_page_result, $base_url) {
    $get_url_result = preg_match_all("/<[a|A].*?href=[\'\"]{0,1}([^>\'\" ]*).*?>/", $spider_page_result, $out);
    if ($get_url_result) {
        return $out[1];
    } else {
        return;
    }
}

# Convert relative paths to absolute paths
function xdtojd($base_url, $url_list) {
    if (is_array($url_list)) {
        foreach ($url_list as $url_item) {
            if (preg_match("/^(http:\/\/|https:\/\/|javascript:)/", $url_item)) {
                $result_url_list[] = $url_item;
            } else {
                if (preg_match("/^\//", $url_item)) {
                    $real_url = $base_url . $url_item;
                } else {
                    $real_url = $base_url . "/" . $url_item;
                }
                # $real_url = 'http://www.sumpay.cn/' . $url_item;
                $result_url_list[] = $real_url;
            }
        }
        return $result_url_list;
    } else {
        return;
    }
}

# Remove URLs that belong to other sites
function other_site_url_del($jd_url_list, $url_base) {
    if (is_array($jd_url_list)) {
        foreach ($jd_url_list as $all_url) {
            echo $all_url;
            if (strpos($all_url, $url_base) === 0) {
                $all_url_list[] = $all_url;
            }
        }
        return $all_url_list;
    } else {
        return;
    }
}

# Remove duplicate URLs
function url_same_del($array_url) {
    if (is_array($array_url)) {
        $insert_url = array();
        $pizza = file_get_contents("/tmp/url.txt");
        if ($pizza) {
            $pizza = explode("\r\n", $pizza);
            foreach ($array_url as $array_value_url) {
                if (!in_array($array_value_url, $pizza)) {
                    $insert_url[] = $array_value_url;
                }
            }
            if ($insert_url) {
                foreach ($insert_url as $key => $insert_url_value) {
                    # URLs that differ only in parameter values are treated as duplicates here
                    $update_insert_url = preg_replace('/=[^&]*/', '=leesec', $insert_url_value);
                    foreach ($pizza as $pizza_value) {
                        $update_pizza_value = preg_replace('/=[^&]*/', '=leesec', $pizza_value);
                        if ($update_insert_url == $update_pizza_value) {
                            unset($insert_url[$key]);
                            continue;
                        }
                    }
                }
            }
        } else {
            $insert_url = array();
            $insert_new_url = array();
            $insert_url = $array_url;
            foreach ($insert_url as $insert_url_value) {
                $update_insert_url = preg_replace('/=[^&]*/', '=leesec', $insert_url_value);
                $insert_new_url[] = $update_insert_url;
            }
            $insert_new_url = array_unique($insert_new_url);
            foreach ($insert_new_url as $key => $insert_new_url_val) {
                $insert_url_bf[] = $insert_url[$key];
            }
            $insert_url = $insert_url_bf;
        }
        return $insert_url;
    } else {
        return;
    }
}

$current_url = $argv[1];
$fp_puts = fopen("/tmp/url.txt", "ab");   // record the URL list (append)
$fp_gets = fopen("/tmp/url.txt", "r");    // read the saved URL list back
$url_base_url = parse_url($current_url);
if ($url_base_url['scheme'] == '') {
    $url_base = "http://" . $url_base_url['host'];
} else {
    $url_base = $url_base_url['scheme'] . "://" . $url_base_url['host'];
}

do {
    $spider_page_result = curl_get($current_url);
    #var_dump($spider_page_result);
    $url_list = get_page_urls($spider_page_result, $url_base);
    #var_dump($url_list);
    if (!$url_list) {
        continue;
    }
    $jd_url_list = xdtojd($url_base, $url_list);
    #var_dump($jd_url_list);
    $result_url_arr = other_site_url_del($jd_url_list, $url_base);
    var_dump($result_url_arr);
    $result_url_arr = url_same_del($result_url_arr);
    #var_dump($result_url_arr);
    if (is_array($result_url_arr)) {
        $result_url_arr = array_unique($result_url_arr);
        foreach ($result_url_arr as $new_url) {
            fputs($fp_puts, $new_url . "\r\n");
        }
    }
} while ($current_url = fgets($fp_gets, 1024));   // keep fetching the next URL from the list

preg_match_all("/<a[^>]+href=[\"']([^\"']+)[\"'][^>]+>/", $spider_page_result, $out);
# echo a href
#var_dump($out[1]);
?>

