I often need to extract data from a large number of web pages (more than 1500 pages). I have tried many approaches; all of them work, but most are not very efficient.
At the beginning I used LWP::Simple's get() to download pages and extract data sequentially. This method is easy to control and reliable: if the download is interrupted, you can check data integrity and resume where you left off. The full page data is never written to the local hard disk; only the small amount of extracted data is stored, so disk activity is light, but the download speed is low;
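The resumable sequential approach can be sketched as follows. This is only a minimal illustration: the page numbering scheme, the file names, and the fetch_page() stub are my own assumptions; in real use fetch_page() would call LWP::Simple's get().

#!/usr/bin/perl
# Minimal sketch of sequential downloading with a resume checkpoint.
use strict;
use warnings;

my $checkpoint_file = "progress.txt";
my $total_pages     = 1500;

# Read the last successfully processed page number (0 if starting fresh).
sub read_checkpoint {
    my ($file) = @_;
    open(my $fh, '<', $file) or return 0;
    my $n = <$fh>;
    close $fh;
    chomp $n;
    return ($n =~ /^\d+$/) ? $n : 0;
}

# Record the last successfully processed page number.
sub write_checkpoint {
    my ($file, $n) = @_;
    open(my $fh, '>', $file) or die "Cannot write $file: $!";
    print $fh "$n\n";
    close $fh;
}

# Stub standing in for a real download, e.g.:
#   use LWP::Simple;
#   my $html = get("http://example.com/page/$n");
sub fetch_page {
    my ($n) = @_;
    return "<html>page $n</html>";
}

my $start = read_checkpoint($checkpoint_file) + 1;
for my $n ($start .. $total_pages) {
    my $html = fetch_page($n);
    next unless defined $html;              # skip (or retry) failed pages
    # ... extract and save the small amount of data you need here ...
    write_checkpoint($checkpoint_file, $n); # safe to interrupt and resume
}

Because the checkpoint is only advanced after a page is fully processed, killing the script at any point loses at most one page of work.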
To speed up the download process, you can use third-party software to fetch the page data first and then extract from the downloaded files, so I started using Teleport. Teleport supports multi-threaded downloading and is at least 3-5 times faster. However, when downloading a large number of pages with Teleport, some downloads fail, and the program does not automatically detect and re-download failed pages; you usually have to re-run it manually once to verify data integrity, which costs extra time. Still, this method is efficient and stable, and I used it for a long time;
Later, on the CU forum, I saw the multithreading model posted by fairy. I converted my code to multi-threaded downloading, which was much faster than Teleport, but the model was hard to control: downloads often stalled for long stretches, and the failure rate was high (about 10%). Failed tasks could not be re-queued for download through the threads (at least I found no way to do it), the progress could not be displayed in real time, and after finishing, the program often hung indefinitely and had to be closed manually. I also tried fairy's multi-process model; it had the same problems as the multithreaded one and consumed far more resources. So I gave up, and lost hope in Perl's threads and multi-processing;
Later, Mo Yan, an admin of Perl China's official QQ group, offered the ultimate solution: use LWP::ConnCache to establish persistent connections, combined with multithreading (thread pool mode), to issue web requests at high speed. This solution is very efficient, with a download speed 3-5 times that of Teleport; it is easy to control and can display the download progress in real time.
I am sharing it here so we can all make progress together.
(The following code makes 1000 requests to the Baidu homepage as the test case. Test environment: Windows XP, ActivePerl 5.10.1007, 1 Mbit/s China Telecom broadband. The test data is for reference only.)
1. Sequential requests without persistent connections. Each request opens a new connection to the server, which wastes a lot of time; the average speed is about 0.7 requests per second;
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Benchmark;

my $TT0 = new Benchmark;
my $url = "http://www.baidu.com";
my $request_times = 1000;

print "\n Now begin testing ... \n";

my $lwp = new LWP::UserAgent(agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; CIBA)');

for (1..$request_times) {
    my $request = HTTP::Request->new(GET => $url);
    $request->header(Accept => 'text/html');
    my $response = $lwp->request($request);
    if ($response->is_success) {
        print " $_\tOK!\n";
    }
    else {
        print " $_\tFailed!\n";
        redo;
    }
}

my $TT1 = new Benchmark;
my $td = Benchmark::timediff($TT1, $TT0);
$td = Benchmark::timestr($td);
my ($sec) = ($td =~ /(\d+).*/);
my $speed = sprintf("%0.1f", $request_times / $sec);
print "\n Time expend: $td\n Average Speed: $speed Times Per Second\n\n Press Enter to close me ... \7";
<STDIN>;
2. Sequential requests with a persistent connection. Repeated requests to the same server reuse a single connection. The average speed is about 7.2 requests per second, a significant improvement;
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use LWP::ConnCache;
use Benchmark;

my $TT0 = new Benchmark;
my $url = "http://www.baidu.com";
my $request_times = 1000;

print "\n Now begin testing ... \n";

my $lwp = new LWP::UserAgent(agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; CIBA)');
my $conncache = new LWP::ConnCache;
$lwp->conn_cache($conncache);

for (1..$request_times) {
    my $request = HTTP::Request->new(GET => $url);
    $request->header(Accept => 'text/html');
    my $response = $lwp->request($request);
    if ($response->is_success) {
        print " $_\tOK!\n";
    }
    else {
        print " $_\tFailed!\n";
        redo;
    }
}

my $TT1 = new Benchmark;
my $td = Benchmark::timediff($TT1, $TT0);
$td = Benchmark::timestr($td);
my ($sec) = ($td =~ /(\d+).*/);
my $speed = sprintf("%0.1f", $request_times / $sec);
print "\n Time expend: $td\n Average Speed: $speed Times Per Second\n\n Press Enter to close me ... \7";
<STDIN>;
3. Multi-threaded requests combined with persistent connections (thread pool mode). This is the most efficient approach of the three;

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use LWP::UserAgent;
use LWP::ConnCache;
use Benchmark;

my $TT0 = new Benchmark;
my $url = "http://www.baidu.com";
my $request_times = 1000;

print "\n Now begin testing ... \n";

my $lwp = new LWP::UserAgent(agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; CIBA)');
my $conncache = new LWP::ConnCache;
$lwp->conn_cache($conncache);

my $data_queue   = new Thread::Queue;   # tasks waiting to be processed
my $result_queue = new Thread::Queue;   # results of processed tasks
my $processing_count :shared = 0;
my $MAX_THREADS = 10;
my $num = 1;

for (my $n = 0; $n < $MAX_THREADS; $n++) {
    threads->create(\&thread_io);
}

foreach my $data (1..$request_times) {
    # Throttle: don't let the task queue grow far beyond the pool size.
    if ($data_queue->pending() > $MAX_THREADS * 2) {
        select(undef, undef, undef, 0.02);
        redo;
    }
    $data_queue->enqueue($data);
    if ($result_queue->pending() > 0) {
        while (my $result = $result_queue->dequeue_nb()) {
            if ($result) {
                print " $num\tOK!\n";
            }
            else {
                print " $num\tFailed!\n";
            }
            $num++;
        }
    }
}

# Drain the queues: wait until all in-flight tasks have finished.
while ($processing_count > 0 or $data_queue->pending() > 0 or $result_queue->pending() > 0) {
    select(undef, undef, undef, 0.02);
    while (my $result = $result_queue->dequeue_nb()) {
        if ($result) {
            print " $num\tOK!\n";
        }
        else {
            print " $num\tFailed!\n";
        }
        $num++;
    }
}

foreach my $thread (threads->list()) {
    $thread->detach();
}

my $TT1 = new Benchmark;
my $td = Benchmark::timediff($TT1, $TT0);
$td = Benchmark::timestr($td);
my ($sec) = ($td =~ /(\d+).*/);
my $speed = sprintf("%0.1f", $request_times / $sec);
print "\n Time expend: $td\n Average Speed: $speed Times Per Second\n\n Press Enter to close me ... \7";
<STDIN>;

##########################################################################################
sub thread_io {
    while (my $data = $data_queue->dequeue()) {
        { lock $processing_count; ++$processing_count; }
        my $result = get_html($data);
        $result_queue->enqueue($result);
        { lock $processing_count; --$processing_count; }
    }
}

sub get_html {
    my $no = shift;
    my $request = HTTP::Request->new(GET => $url);
    $request->header(Accept => 'text/html');
    my $response = $lwp->request($request);
    if ($response->is_success) {
        return(1);
    }
    else {
        $data_queue->enqueue($no);  # re-queue the failed request
        return(0);
    }
}
This multithreaded model is Mo Yan's original work. It uses thread pool mode with two queues: one feeds tasks to the worker threads, the other collects the processing results. Failed tasks are added back to the task queue, ensuring every request eventually succeeds. The model is easy to control, highly reliable, displays progress in real time, and is very efficient.
Note: LWP::ConnCache does not work with LWP::Simple; it works with LWP::UserAgent.
I would like to express my special thanks to Mo Yan.