A lightweight and simple crawler implemented in PHP

This article introduces a lightweight and simple crawler implemented in PHP. It summarizes some crawler knowledge, such as the crawler structure and regular expressions, and then provides the implementation code for reference.

Recently I needed to collect data. Saving pages by hand in a browser is very troublesome and not conducive to storage and retrieval, so I wrote a small crawler and set it loose on the Internet; so far it has crawled a large number of web pages, and I am now working out how to process the data.

Crawler structure:
The crawling principle is actually very simple: download a page, parse out its links, download those links, parse them in turn, and repeat. For data storage, a database is the first choice because it makes retrieval easy; the development language only needs to support regular expressions. I chose MySQL for the database and PHP for the script: it supports Perl-compatible regular expressions, connects to MySQL easily, supports HTTP downloads, and can be deployed on both Windows and Linux.
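To make this loop concrete, here is a minimal sketch of the idea only; it is not the crawler described below (it uses file_get_contents() instead of cURL, keeps the URL queue in a plain PHP array instead of MySQL, and does not convert relative links to absolute ones):

<?php
// Minimal crawl loop: fetch a page, extract its links, queue the unseen ones.
$queue = array('https://www.php1.cn/'); // seed URL
$seen  = array();                       // URLs already fetched

while ($url = array_shift($queue)) {
    if (isset($seen[$url])) {
        continue;                       // already downloaded, skip
    }
    $seen[$url] = true;

    $html = @file_get_contents($url);   // download the page (the real crawler uses cURL)
    if ($html === false) {
        continue;
    }

    // find links on the page and queue the ones we have not seen yet
    if (preg_match_all('#<a[^>]+href=([\'"])(.+)\1#isU', $html, $matches)) {
        foreach ($matches[2] as $link) {
            if (!isset($seen[$link])) {
                $queue[] = $link;
            }
        }
    }
}
?>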

Regular expressions:
Regular expressions are a basic tool for processing text. To extract the links and images from HTML, the following regular expressions can be used.


The code is as follows:


"#] + Href = (['\"]) (. +) \ 1 # isU "processing link
"#] + Src = (['\"]) (. +) \ 1 # isU "process images

Other issues:
One thing to pay attention to when writing a crawler is that URLs that have already been downloaded must not be downloaded again, yet some web page links form loops, so this problem has to be solved. My solution is to compute the MD5 value of each processed URL and store it in the database, so checking whether a URL has already been downloaded becomes a simple lookup. There are of course better algorithms; if you are interested, you can look them up online.
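As a minimal sketch of that idea, assuming a history table named url_history with a CHAR(32) url_md5 column (the real table layout is defined in db.sql and may differ), the check could look like this:

<?php
// Deduplication by MD5: a URL is downloaded only if its hash is not yet in the table.
// Assumed table: CREATE TABLE url_history (url_md5 CHAR(32) PRIMARY KEY);
function url_already_seen(mysqli $db, $url) {
    $md5  = md5($url);
    $stmt = $db->prepare("SELECT 1 FROM url_history WHERE url_md5 = ?");
    $stmt->bind_param("s", $md5);
    $stmt->execute();
    $seen = (bool) $stmt->get_result()->fetch_row();
    $stmt->close();
    return $seen;
}

function mark_url_seen(mysqli $db, $url) {
    $md5  = md5($url);
    $stmt = $db->prepare("INSERT IGNORE INTO url_history (url_md5) VALUES (?)");
    $stmt->bind_param("s", $md5);
    $stmt->execute();
    $stmt->close();
}
?>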

Related protocols:
Crawlers also have conventions to follow: a site's robots.txt file defines which parts of the site crawlers are allowed to traverse. However, because my time was limited, this feature is not implemented.

Other notes:
PHP supports class-based programming, so I split the main crawler functionality into classes; a rough skeleton follows the list.
1. URL processing: web_site_info, mainly used to process URLs and parse out domain names.
2. Database operations: mysql_insert.php, which handles all database-related operations.
3. History processing: records the URLs that have already been processed.
4. The crawler itself.
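The class and file contents themselves are not reproduced in this article; the skeleton below is purely illustrative of how the four parts fit together (all names except web_site_info and mysql_insert.php are made up):

<?php
require_once 'config.php';      // database credentials (see the setup section below)

class WebSiteInfo {             // 1. URL processing: parse URLs, extract the domain
    public function domainOf($url) { return parse_url($url, PHP_URL_HOST); }
}

class MysqlInsert {             // 2. database operations (mysql_insert.php)
    public function savePage($url, $html) { /* INSERT the page into the database */ }
}

class History {                 // 3. history: remember which URLs were processed
    public function seen($url) { /* look up md5($url) in the history table */ return false; }
    public function add($url)  { /* store md5($url) */ }
}

class Spider {                  // 4. the crawler: download, extract links, recurse
    public function crawl($url, $depth) { /* main crawl loop */ }
}
?>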

Existing problems and deficiencies

This crawler runs well with a small amount of data, but with a large amount of data the efficiency of the history-processing class is not very high. Although the relevant database fields are indexed to improve speed, the data still has to be read constantly, which may be related to PHP's array implementation; loading 100,000 history records at once is very slow.
Multithreading is not supported; only one URL can be processed at a time.
PHP itself has a memory usage limit. On one occasion, when crawling a page at depth 20, the program exhausted its memory and was killed.
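One possible mitigation, not part of the original crawler, is simply to raise PHP's memory limit for the script:

<?php
// Raise the memory limit for this run; the same setting can also be made in php.ini
// or passed on the command line with php -d memory_limit=512M.
ini_set('memory_limit', '512M');
?>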


Create a net_spider database in MySQL, then use db.sql to create the related tables, and set the MySQL username and password in config.php.
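config.php itself is not shown in the article; a minimal version would presumably look something like this (the constant names are only illustrative):

<?php
// config.php: database connection settings (names are illustrative; the real
// config.php may use different keys or a different structure).
define('DB_HOST', 'localhost');
define('DB_USER', 'root');
define('DB_PASS', 'your_password');
define('DB_NAME', 'net_spider');
?>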
Finally, run:

php -f spider.php [depth (a number)] [url]

and the crawler starts working. For example:

php -f spider.php 20 https://www.php1.cn/

Now I feel that writing a crawler is not that complicated; what is hard is storing and retrieving the data. In my current database, the largest table is already 15 GB, and I am still figuring out how to process this data; MySQL queries are already starting to feel inadequate. I really admire Google.


 \ '\ "\] *). *?> /", $ Spider_page_result, $ out); if ($ get_url_result) {return $ out [1];} else {return ;}} # relative path to absolute path function xdtojd ($ base_url, $ url_list) {if (is_array ($ url_list) {foreach ($ url_list as $ url_item) {if (preg_match ("/^ (http: \/| https: \/| javascript :)/", $ url_item )) {$ result_url_list [] = $ url_item;} else {if (preg_match ("/^ \/", $ url_item) {$ real_url = $ base_url. $ url_item;} else {$ real_url = $ base_url. "/". $ url_item;} # $ real_url =' http://www.sumpay.cn/ '. $ Url_item; $ result_url_list [] = $ real_url; }}return $ result_url_list;} else {return ;}# delete other sites urlfunction other_site_url_del ($ jd_url_list, $ url_base) {if (is_array ($ jd_url_list) {foreach ($ jd_url_list as $ all_url) {echo $ all_url; if (strpos ($ all_url, $ url_base) === 0) {$ all_url_list [] = $ all_url;} return $ all_url_list;} else {return ;}# delete the same URLfunction url_same_del ($ array_url) {if (is_array ($ array_ur L) {$ insert_url = array (); $ pizza = file_get_contents ("/tmp/url.txt"); if ($ pizza) {$ pizza = explode ("\ r \ n", $ pizza); foreach ($ array_url as $ array_value_url) {if (! In_array ($ array_value_url, $ pizza) {$ insert_url [] = $ array_value_url;} if ($ insert_url) {foreach ($ insert_url as $ key => $ insert_url_value) {# here, only the same parameters are used for deduplicated processing $ update_insert_url = preg_replace ('/= [^ &] */', '= leesec', $ insert_url_value ); foreach ($ pizza as $ pizza_value) {$ update_pizza_value = preg_replace ('/= [^ &] */', '= lesec', $ pizza_value ); if ($ update_insert_url = $ update_pizza_value) {unset ($ insert_url [$ key]); Continue ;}}} else {$ insert_url = array (); $ insert_new_url = array (); $ insert_url = $ array_url; foreach ($ insert_url as $ insert_url_value) {$ update_insert_url = preg_replace ('/= [^ &] */', '= leesec', $ insert_url_value); $ insert_new_url [] = $ update_insert_url ;} $ insert_new_url = array_unique ($ insert_new_url); foreach ($ insert_new_url as $ key => $ response) {$ insert_url_bf [] = $ insert_url [$ key];} $ insert _ Url = $ insert_url_bf;} return $ insert_url;} else {return; }}$ current_url = $ argv [1]; $ fp_puts = fopen ("/tmp/url.txt ", "AB"); // record the url list $ fp_gets = fopen ("/tmp/url.txt", "r "); // Save the url list $ url_base_url = parse_url ($ current_url); if ($ url_base_url ['scheme '] = "") {$ url_base = "http ://". $ url_base_url ['host'];} else {$ url_base = $ url_base_url ['scheme ']. "://". $ url_base_url ['host'];} do {$ spider_page_result = curl_get ($ curre Nt_url); # var_dump ($ spider_page_result); $ url_list = get_page_urls ($ spider_page_result, $ url_base); # var_dump ($ url_list); if (! $ Url_list) {continue;} $ jd_url_list = xdtojd ($ url_base, $ url_list); # var_dump ($ jd_url_list); $ scheme = aggregate ($ jd_url_list, $ url_base ); var_dump ($ scheme); $ scheme = url_same_del ($ result_url_arr); # var_dump ($ scheme); if (is_array ($ scheme) {$ result_url_arr = array_unique ($ scheme ); foreach ($ result_url_arr as $ new_url) {fputs ($ fp_puts, $ new_url. "\ r \ n "); }}} While ($ current_url = fgets ($ fp_gets, 1024 )); // continuously obtain the url preg_match_all ("/] + href = [\" '] ([^ \ "'] +) [\ "'] [^>] +>/", $ spider_page_result, $ out); # echo a href # var_dump ($ out [1]);?>
