This article introduces a lightweight, simple crawler implemented in PHP. It summarizes some crawler basics, such as the crawler structure and the regular expressions involved, and then provides the implementation code for reference.
Recently I needed to collect some data. Saving pages by hand in a browser is tedious and makes storage and retrieval awkward, so I wrote a small crawler and set it loose on the internet. So far it has crawled a sizeable number of web pages, and I am now working out how to process the data.
Crawler structure:
The crawling principle is actually very simple: analyze a downloaded page, pick out the links in it, download those links, then analyze and download again, over and over. For data storage, a database is the obvious choice because it makes retrieval easy; for the development language, the only real requirement is regular-expression support. I chose MySQL for the database and PHP for the crawler script: it supports Perl-compatible regular expressions, connects to MySQL easily, supports HTTP downloads, and can be deployed on both Windows and Linux.
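As a rough sketch of this fetch-and-follow loop (not the implementation given later in the article; the start URL, the lack of a depth limit, and the absence of a database are all simplifications), the idea looks roughly like this in PHP:

<?php
// Minimal sketch: download pages, collect their links, and keep going.
$queue   = array('https://www.php1.cn/');   // URLs still to download
$visited = array();                         // URLs already downloaded

while ($url = array_shift($queue)) {
    if (isset($visited[$url])) {
        continue;                           // skip anything we have already fetched
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);       // download the page over HTTP
    if ($html === false) {
        continue;
    }

    // Extract link targets with the link regex described in the next section.
    preg_match_all("#<a[^>]+href=(['\"])(.+)\\1#isU", $html, $matches);
    foreach ($matches[2] as $link) {
        $queue[] = $link;                   // analyze and download these in later iterations
    }
}
?>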
Regular expression:
Regular expressions are a basic tool for processing text. To pick out the links and images in an HTML page, I use the following regular expressions.
The code is as follows:
"#] + Href = (['\"]) (. +) \ 1 # isU "processing link
"#] + Src = (['\"]) (. +) \ 1 # isU "process images
Other issues:
One thing to watch out for when writing a crawler is that URLs which have already been downloaded must not be downloaded again, yet the links between pages form loops, so this has to be handled explicitly. My solution is to compute the MD5 hash of each processed URL and store it in the database, then check that value to decide whether a URL has already been downloaded. There are certainly better algorithms; if you are interested, you can look them up online.
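A minimal sketch of this check, assuming a hypothetical table url_history(url_md5 CHAR(32) UNIQUE) and a PDO connection; the table and function names are illustrative, not the actual schema from db.sql:

<?php
// Returns true if the URL has not been seen before and records it; false if already downloaded.
function mark_url_if_new(PDO $db, $url) {
    $hash = md5($url);                                   // fixed-length key, easy to index
    $stmt = $db->prepare("SELECT 1 FROM url_history WHERE url_md5 = ?");
    $stmt->execute(array($hash));
    if ($stmt->fetchColumn()) {
        return false;                                    // already crawled, skip it
    }
    $ins = $db->prepare("INSERT INTO url_history (url_md5) VALUES (?)");
    $ins->execute(array($hash));
    return true;
}
?>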
Related protocols:
Crawlers also have conventions to follow: a site's robots.txt file declares which parts of the site crawlers are allowed to traverse. Due to limited time, however, this feature is not implemented here.
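For reference only, a very naive version of such a check (not part of this crawler, and it ignores User-agent sections and wildcard rules) could look like this:

<?php
// Naive robots.txt check: returns false if any Disallow rule is a prefix of the path.
function robots_allows($base_url, $path) {
    $robots = @file_get_contents(rtrim($base_url, '/') . '/robots.txt');
    if ($robots === false) {
        return true;                       // no robots.txt: assume the path is allowed
    }
    foreach (explode("\n", $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)) {
            if ($m[1] !== '' && strpos($path, $m[1]) === 0) {
                return false;
            }
        }
    }
    return true;
}
?>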
Other notes:
PHP supports class-based programming, so I split the main parts of the crawler into classes (a rough skeleton is sketched after this list).
1. URL processing: web_site_info, mainly used to process URLs and parse domain names.
2. Database operations: mysql_insert.php, which handles all database-related operations.
3. History processing, which records the URLs that have already been processed.
4. The crawling class itself.
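A rough skeleton of how these pieces might fit together; apart from web_site_info and mysql_insert.php, the class, method, and table names below are guesses for illustration, not the original API:

<?php
class web_site_info {                 // 1. URL processing: parse and analyze URLs
    public function get_host($url) { return parse_url($url, PHP_URL_HOST); }
}

class mysql_insert {                  // 2. database operations (credentials read from config.php)
    private $db;
    public function __construct(PDO $db) { $this->db = $db; }
    public function save_page($url, $html) {
        // "pages" is a placeholder table name, not necessarily what db.sql creates
        $stmt = $this->db->prepare("INSERT INTO pages (url, content) VALUES (?, ?)");
        $stmt->execute(array($url, $html));
    }
}

class history {                       // 3. history: remember which URLs were already processed
    private $seen = array();
    public function is_new($url) {
        $key = md5($url);
        if (isset($this->seen[$key])) return false;
        $this->seen[$key] = true;
        return true;
    }
}
// 4. the crawling class ties these together: fetch, extract, filter, store, repeat.
?>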
Existing problems and deficiencies:
This crawler runs well with a small amount of data, but with a large amount the history-processing class becomes inefficient. The relevant database fields are indexed to improve speed, but the data still has to be read constantly, which may be related to how PHP implements arrays: loading 100,000 history records at once is very slow.
Multithreading is not supported; only one URL can be processed at a time.
PHP itself has a memory usage limit. Once the crawl reached a depth of 20 on a page, the program was killed for running out of memory.
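A partial workaround, not used in the original script, is to raise PHP's memory limit for the run (the value here is only an example):

ini_set('memory_limit', '512M');   // in spider.php, before the crawl starts
// or equivalently from the shell: php -d memory_limit=512M -f spider.php 20 [url]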
Create a net_spider database in MySQL, then use db.sql to create the related tables, and set the MySQL username and password in config.php.
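The contents of config.php are not shown in the article; a minimal sketch of what it might hold (the constant names are assumptions):

<?php
// config.php (sketch): database credentials used by the crawler.
define('DB_HOST', 'localhost');
define('DB_NAME', 'net_spider');
define('DB_USER', 'root');          // set your MySQL username here
define('DB_PASS', '');              // set your MySQL password here
?>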
Finally, run
php -f spider.php [depth] [url]
and the crawler starts working. For example:
php -f spider.php 20 https://www.php1.cn/
Now I feel that writing a crawler is not that complicated; the hard part is storing and retrieving the data. The largest table in my database is already 15 GB, and I am still figuring out how to process that data. MySQL is starting to feel inadequate for queries at this scale. I really admire Google.
The code of spider.php is as follows (the top of the file was cut off here; the missing curl_get() download helper is sketched after the listing):

<?php
# get the urls in a downloaded page
function get_page_urls($spider_page_result, $base_url) {
    $get_url_result = preg_match_all("/<[a|A].*?href=[\'\"]{0,1}([^>\'\"\ ]*).*?>/", $spider_page_result, $out);
    if ($get_url_result) {
        return $out[1];
    } else {
        return;
    }
}

# relative path to absolute path
function xdtojd($base_url, $url_list) {
    if (is_array($url_list)) {
        $result_url_list = array();
        foreach ($url_list as $url_item) {
            if (preg_match("/^(http:\/\/|https:\/\/|javascript:)/", $url_item)) {
                $result_url_list[] = $url_item;
            } else {
                if (preg_match("/^\//", $url_item)) {
                    $real_url = $base_url . $url_item;
                } else {
                    $real_url = $base_url . "/" . $url_item;
                }
                # $real_url = 'http://www.sumpay.cn/' . $url_item;
                $result_url_list[] = $real_url;
            }
        }
        return $result_url_list;
    } else {
        return;
    }
}

# delete other sites' urls
function other_site_url_del($jd_url_list, $url_base) {
    if (is_array($jd_url_list)) {
        $all_url_list = array();
        foreach ($jd_url_list as $all_url) {
            echo $all_url;
            if (strpos($all_url, $url_base) === 0) {
                $all_url_list[] = $all_url;
            }
        }
        return $all_url_list;
    } else {
        return;
    }
}

# delete urls that have already been recorded
function url_same_del($array_url) {
    if (is_array($array_url)) {
        $insert_url = array();
        $pizza = file_get_contents("/tmp/url.txt");
        if ($pizza) {
            $pizza = explode("\r\n", $pizza);
            foreach ($array_url as $array_value_url) {
                if (!in_array($array_value_url, $pizza)) {
                    $insert_url[] = $array_value_url;
                }
            }
            if ($insert_url) {
                foreach ($insert_url as $key => $insert_url_value) {
                    # treat urls that differ only in parameter values as duplicates
                    $update_insert_url = preg_replace('/=[^&]*/', '=leesec', $insert_url_value);
                    foreach ($pizza as $pizza_value) {
                        $update_pizza_value = preg_replace('/=[^&]*/', '=leesec', $pizza_value);
                        if ($update_insert_url == $update_pizza_value) {
                            unset($insert_url[$key]);
                            continue;
                        }
                    }
                }
            }
        } else {
            $insert_url = array();
            $insert_new_url = array();
            $insert_url = $array_url;
            foreach ($insert_url as $insert_url_value) {
                $update_insert_url = preg_replace('/=[^&]*/', '=leesec', $insert_url_value);
                $insert_new_url[] = $update_insert_url;
            }
            $insert_new_url = array_unique($insert_new_url);
            foreach ($insert_new_url as $key => $response) {
                $insert_url_bf[] = $insert_url[$key];
            }
            $insert_url = $insert_url_bf;
        }
        return $insert_url;
    } else {
        return;
    }
}

$current_url = $argv[1];
$fp_puts = fopen("/tmp/url.txt", "ab");   // append newly found urls to the list
$fp_gets = fopen("/tmp/url.txt", "r");    // read the url list back for crawling
$url_base_url = parse_url($current_url);
if ($url_base_url['scheme'] == "") {
    $url_base = "http://" . $url_base_url['host'];
} else {
    $url_base = $url_base_url['scheme'] . "://" . $url_base_url['host'];
}
do {
    $spider_page_result = curl_get($current_url);
    # var_dump($spider_page_result);
    $url_list = get_page_urls($spider_page_result, $url_base);
    # var_dump($url_list);
    if (!$url_list) {
        continue;
    }
    $jd_url_list = xdtojd($url_base, $url_list);                     // relative to absolute
    # var_dump($jd_url_list);
    $result_url_arr = other_site_url_del($jd_url_list, $url_base);   // keep same-site urls only
    var_dump($result_url_arr);
    $result_url_arr = url_same_del($result_url_arr);                 // drop already-recorded urls
    # var_dump($result_url_arr);
    if (is_array($result_url_arr)) {
        $result_url_arr = array_unique($result_url_arr);
        foreach ($result_url_arr as $new_url) {
            fputs($fp_puts, $new_url . "\r\n");
        }
    }
} while ($current_url = trim(fgets($fp_gets, 1024)));                // keep reading urls from the list

preg_match_all("/<a[^>]+href=[\"']([^\"']+)[\"'][^>]+>/", $spider_page_result, $out);
# echo a href
# var_dump($out[1]);
?>
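The curl_get() function called in the main loop belongs to the truncated top of the file. A minimal sketch of such a download helper, with option choices that are assumptions rather than the original code:

<?php
# Sketch of the missing download helper: fetch a URL with cURL and return the HTML.
function curl_get($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // give up on slow pages
    curl_setopt($ch, CURLOPT_USERAGENT, 'simple-php-spider');
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
?>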