A lightweight, simple crawler implemented in PHP - PHP Tutorial

Source: Internet
Author: User
Tags: text processing

I recently needed to collect some data. Saving pages by hand in a browser is tedious and makes the data hard to store and search, so I wrote a small crawler and set it loose on the web. So far it has crawled quite a few web pages, and I am now working out how to process the data.

Crawler structure:
The crawling principle is actually very simple: download a page, parse it to find the links it contains, download those links, parse them in turn, and repeat. For storage, a database is the first choice because it makes retrieval easy, so I picked MySQL. The development language only needs to support regular expressions, so I chose PHP: it supports Perl-compatible regular expressions, connects to MySQL easily, can download pages over HTTP, and can be deployed on both Windows and Linux.
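For orientation, here is a minimal sketch of that loop, assuming an in-memory queue and the link pattern introduced in the next section. It only illustrates the principle; it is not the author's implementation, which appears in full at the end of this article.

<?php
// Minimal crawl-loop sketch: download, extract links, enqueue, repeat.
// The seed URL is a placeholder; the real crawler stores its URL list on disk/in MySQL.
$queue   = array('http://example.com/');   // seed URL (placeholder)
$visited = array();

while ($url = array_shift($queue)) {
    if (isset($visited[$url])) {
        continue;                           // skip pages we have already fetched
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);       // download the page
    if ($html === false) {
        continue;
    }

    // find every href on the page and push it back onto the queue
    if (preg_match_all('#<a[^>]+href=([\'"])(.+)\1#isU', $html, $matches)) {
        foreach ($matches[2] as $link) {
            $queue[] = $link;               // a real crawler must resolve relative URLs first
        }
    }
}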

Regular expressions:
Regular expressions are the basic tool for processing text. To extract the links and images from an HTML page, I use the following patterns.

The code is as follows:
"#] + Href = (['\"]) (. +) \ 1 # isU "processing link
"#] + Src = (['\"]) (. +) \ 1 # isU "process images

Other issues:
One thing to watch out for when writing a crawler is that URLs which have already been downloaded must not be downloaded again, yet links between pages form loops, so this has to be handled explicitly. My solution is to compute the MD5 hash of each processed URL and store it in the database; before downloading, the crawler checks whether that hash is already present. There are certainly better algorithms; if you are interested, you can find them online.
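A minimal sketch of that idea, assuming a hypothetical history table with a unique url_md5 column and a mysqli connection (the table, column, and function names here are mine, not those used in the downloadable source):

<?php
// Sketch of MD5-based URL deduplication against MySQL.
// Table/column names (history, url_md5, url) are assumptions for illustration.
function url_already_seen(mysqli $db, $url) {
    $md5  = md5($url);
    $stmt = $db->prepare("SELECT 1 FROM history WHERE url_md5 = ? LIMIT 1");
    $stmt->bind_param("s", $md5);
    $stmt->execute();
    $stmt->store_result();
    $seen = $stmt->num_rows > 0;
    $stmt->close();
    return $seen;
}

function remember_url(mysqli $db, $url) {
    $md5  = md5($url);
    $stmt = $db->prepare("INSERT IGNORE INTO history (url_md5, url) VALUES (?, ?)");
    $stmt->bind_param("ss", $md5, $url);
    $stmt->execute();
    $stmt->close();
}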

Related protocols:
Crawlers also have conventions to follow: a site's robots.txt file defines which parts of the site crawlers are allowed to traverse. Due to limited time, I did not implement this.
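The author did not implement this, but for reference, a very rough sketch of the idea, checking a path against the Disallow lines of a site's robots.txt, might look like the following. It deliberately ignores per-agent sections, Allow lines, and wildcards.

<?php
// Very rough robots.txt check: fetch /robots.txt and test a path against its
// Disallow lines. This is a simplification of the real protocol (no wildcards,
// no Allow lines, no per-agent sections).
function is_path_disallowed($site_base, $path) {
    $robots = @file_get_contents(rtrim($site_base, '/') . '/robots.txt');
    if ($robots === false) {
        return false;                       // no robots.txt: assume allowed
    }
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)) {
            if ($m[1] !== '' && strpos($path, $m[1]) === 0) {
                return true;
            }
        }
    }
    return false;
}

// Example: is_path_disallowed('http://news.sina.com.cn', '/private/page.html');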

Other notes:
PHP supports object-oriented programming, so I split the crawler into a few main classes (a rough skeleton of how they fit together is sketched after this list):
1. URL handling: web_site_info mainly parses URLs and analyzes domain names.
2. Database operations: mysql_insert.php handles everything related to the database.
3. History handling: records the URLs that have already been processed.
4. The crawl itself.
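A rough skeleton of how these four parts might fit together; the class and method names here are illustrative, the actual names are in the downloadable source (web_site_info, mysql_insert.php, etc.).

<?php
// Illustrative skeleton only; not the author's actual class definitions.
class WebSiteInfo {                 // 1. URL handling / domain analysis
    public function get_host($url) {
        $parts = parse_url($url);
        return isset($parts['host']) ? $parts['host'] : '';
    }
}

class MysqlInsert {                 // 2. database operations
    private $db;
    public function __construct($host, $user, $pass, $name) {
        $this->db = new mysqli($host, $user, $pass, $name);
    }
    public function insert_page($url, $html) {  // table name "pages" is assumed
        $stmt = $this->db->prepare("INSERT INTO pages (url, html) VALUES (?, ?)");
        $stmt->bind_param("ss", $url, $html);
        $stmt->execute();
        $stmt->close();
    }
}

class History {                     // 3. remember which URLs were processed
    private $seen = array();
    public function has($url) { return isset($this->seen[md5($url)]); }
    public function add($url) { $this->seen[md5($url)] = true; }
}

class Spider {                      // 4. the crawl itself
    public function fetch($url) {
        return @file_get_contents($url);
    }
}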

Existing problems and deficiencies

This crawler runs well on small amounts of data, but with a large amount of data the history-handling class is not very efficient. The relevant database columns are indexed to improve speed, but the data still has to be read constantly, which may be related to how PHP implements arrays: loading 100,000 history records at once is very slow.
Multithreading is not supported; only one URL can be processed at a time.
PHP itself has a memory usage limit. While crawling a page at depth 20, the process once exhausted its memory and was killed.
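If memory is the limiting factor, the limit can be raised when launching the script with PHP's standard -d option; the value below is only an example.

The code is as follows:
php -d memory_limit=512M -f spider.php 20 http://news.sina.com.cn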

The source code can be downloaded from the following URL:

http://xiazai.jb51.net/201506/other/net_spider.rar


Create a net_spider database in MySQL, use db.sql to create the related tables, and set the MySQL username and password in config.php.
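The exact contents of config.php ship with the download; a hypothetical example of the kind of settings it needs (the constant names here are invented for illustration) would be:

<?php
// Hypothetical config.php: the real file in the download defines the MySQL
// connection settings; the constant names below are only an illustration.
define('DB_HOST', 'localhost');
define('DB_USER', 'root');
define('DB_PASS', 'your_password');
define('DB_NAME', 'net_spider');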
Finally, start the crawler from the command line.

The code is as follows:
php -f spider.php <depth (a number)> <url>

and the crawl begins. For example:

The code is as follows:
php -f spider.php 20 http://news.sina.com.cn

Now I feel that writing a crawler is not that complicated; what is difficult is storing and retrieving the data. The largest table in my database is already 15 GB, and I am still trying to figure out how to process that data. Queries in MySQL are starting to struggle at this size. I really admire Google.

<? Php # load page function curl_get ($ url) {$ ch = curl_init (); curl_setopt ($ ch, CURLOPT_URL, $ url); curl_setopt ($ ch, CURLOPT_RETURNTRANSFER, 1 ); curl_setopt ($ ch, CURLOPT_HEADER, 1); $ result = curl_exec ($ ch); $ code = curl_getinfo ($ ch, CURLINFO_HTTP_CODE); if ($ code! = '20140901' & $ result) {return $ result;} curl_close ($ ch) ;}# obtain the page url link function get_page_urls ($ spider_page_result, $ base_url) {$ get_url_result = preg_match_all ("/<[a | A]. *? Href = [\ '\ "] {0, 1} ([^> \' \" \] *). *?> /", $ Spider_page_result, $ out); if ($ get_url_result) {return $ out [1];} else {return ;}} # relative path to absolute path function xdtojd ($ base_url, $ url_list) {if (is_array ($ url_list) {foreach ($ url_list as $ url_item) {if (preg_match ("/^ (http: \/| https: \/| javascript :)/", $ url_item )) {$ result_url_list [] = $ url_item;} else {if (preg_match ("/^ \/", $ url_item) {$ real_url = $ base_url. $ url_item;} else {$ real_url = $ base_url. "/". $ url_item;} # $ real_url =' http://www.sumpay.cn/ '. $ Url_item; $ result_url_list [] = $ real_url; }}return $ result_url_list;} else {return ;}# delete other sites urlfunction other_site_url_del ($ jd_url_list, $ url_base) {if (is_array ($ jd_url_list) {foreach ($ jd_url_list as $ all_url) {echo $ all_url; if (strpos ($ all_url, $ url_base) === 0) {$ all_url_list [] = $ all_url;} return $ all_url_list;} else {return ;}# delete the same URLfunction url_same_del ($ array_url) {if (is_array ($ array_ur L) {$ insert_url = array (); $ pizza = file_get_contents ("/tmp/url.txt"); if ($ pizza) {$ pizza = explode ("\ r \ n", $ pizza); foreach ($ array_url as $ array_value_url) {if (! In_array ($ array_value_url, $ pizza) {$ insert_url [] = $ array_value_url;} if ($ insert_url) {foreach ($ insert_url as $ key => $ insert_url_value) {# here, only the same parameters are used for deduplicated processing $ update_insert_url = preg_replace ('/= [^ &] */', '= leesec', $ insert_url_value ); foreach ($ pizza as $ pizza_value) {$ update_pizza_value = preg_replace ('/= [^ &] */', '= lesec', $ pizza_value ); if ($ update_insert_url = $ update_pizza_value) {unset ($ insert_url [$ key]); Continue ;}}} else {$ insert_url = array (); $ insert_new_url = array (); $ insert_url = $ array_url; foreach ($ insert_url as $ insert_url_value) {$ update_insert_url = preg_replace ('/= [^ &] */', '= leesec', $ insert_url_value); $ insert_new_url [] = $ update_insert_url ;} $ insert_new_url = array_unique ($ insert_new_url); foreach ($ insert_new_url as $ key => $ response) {$ insert_url_bf [] = $ insert_url [$ key];} $ insert _ Url = $ insert_url_bf;} return $ insert_url;} else {return; }}$ current_url = $ argv [1]; $ fp_puts = fopen ("/tmp/url.txt ", "AB"); // record the url list $ fp_gets = fopen ("/tmp/url.txt", "r "); // Save the url list $ url_base_url = parse_url ($ current_url); if ($ url_base_url ['scheme '] = "") {$ url_base = "http ://". $ url_base_url ['host'];} else {$ url_base = $ url_base_url ['scheme ']. "://". $ url_base_url ['host'];} do {$ spider_page_result = curl_get ($ curre Nt_url); # var_dump ($ spider_page_result); $ url_list = get_page_urls ($ spider_page_result, $ url_base); # var_dump ($ url_list); if (! 
$ Url_list) {continue;} $ jd_url_list = xdtojd ($ url_base, $ url_list); # var_dump ($ jd_url_list); $ scheme = aggregate ($ jd_url_list, $ url_base ); var_dump ($ scheme); $ scheme = url_same_del ($ result_url_arr); # var_dump ($ scheme); if (is_array ($ scheme) {$ result_url_arr = array_unique ($ scheme ); foreach ($ result_url_arr as $ new_url) {fputs ($ fp_puts, $ new_url. "\ r \ n "); }}} While ($ current_url = fgets ($ fp_gets, 1024 )); // continuously obtain the url preg_match_all ("/] + href = [\" '] ([^ \ "'] +) [\ "'] [^>] +>/", $ spider_page_result, $ out); # echo a href # var_dump ($ out [1]);?>
