perl中用多線程和持續串連實現高速WEB請求

來源:互聯網
上載者:User

我經常需要提取大量的(1500頁以上)網頁資料,曾嘗試過很多方法,雖然都能實現,但效率都不是太高。

剛開始用LWP::Simple(get)按順序邊下載邊提取,這種方法很容易控制,也很可靠,下載中途中斷了可以通過檢查資料的完整性斷點續傳,下載的網頁資料並不存入本地硬碟,僅儲存提取後的少量資料,硬碟操作少,但下載效率很低;

為了加快下載速度,考慮用第三方下載軟體先下載網頁資料,再用程式從已下載的網頁中提取資料。所以開始使用Teleport下載網頁資料,Teleport支援多線程下載,速度提高了至少3-5倍。但用Teleport下載大量網頁的時候,也會有下載失敗的情況,程式本身並不自動檢測並重新下載,通常需要手工重複下載一次以檢驗資料的完整性,這需要額外消耗一些時間。這種方法效率較高,也很穩定,我使用了很長時間;

後來在CU論壇看到仙子發的多執行緒模式,就將自己的代碼改造成多線程下載,效率比Teleport快了不少,但仙子給的多執行緒模式很難控制,下載過程中經常發獃很久,下載失敗率很高(10%左右),下載失敗的任務無法再繼續通過多線程下載(我沒找到方法),不能即時顯示下載進度,在下載後期經常假死,處於無限期等待狀態,程式不再繼續運行,不得不手工關閉。仙子發的多進程模型也嘗試過,多線程中的問題,多進程同樣存在,而且資源消耗非常大。所以不得不放棄,曾對PERL的多線程和多進程不抱希望;

再後來,Perl China官方QQ群群主莫言給了個終極解決方案,使用LWP::ConnCache建立持續串連,並結合多線程(線程池方式)實現高速WEB資料請求。這種方案非常高效,下載速度是Teleport的3-5倍,而且易於控制,能即時顯示下載進度。

現共用給大家,以求共同進步。

(以下代碼以請求1000次百度首頁為例,下載測試環境為:XP,ActivePerl 5.10.1007,1M電信寬頻,測試資料僅供參考)

1.順序請求,不使用持續串連,程式每請求一次需要與伺服器建立一次串連,這會耗費大量時間,平均下載速度約為每秒0.7次;

#!/usr/bin/perluse strict;use warnings;use LWP::UserAgent;use Benchmark;my $TT0 = new Benchmark;my $url = "http://www.baidu.com";my $request_times = 1000; print "\n Now begin testing ... \n";my $lwp = new LWP::UserAgent(agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; CIBA)');for(1..$request_times) {    my $request = HTTP::Request->new(GET=>$url);    $request->header(Accept=>'text/html');    my $response = $lwp->request($request);    if ($response->is_success) {        print " $_\tOK!\n";    }    else {          print " $_\tFaild!\n";        redo;       }}my $TT1 = new Benchmark;my $td = Benchmark::timediff($TT1, $TT0);$td = Benchmark::timestr($td);my ($sec) = ($td =~ /(\d+).*/);my $speed = sprintf("%0.1f",$request_times/$sec);print "\n Time expend: $td\n Average Speed: $speed Times Per Second\n\n Press Enter to close me ... \7";

2.順序請求,使用持續串連,向同一伺服器多次發送請求僅需建立一次串連,平均下載速度約為每秒7.2次,下載效率有明顯提高;

use strict;use warnings;use LWP::UserAgent;use LWP::ConnCache;use Benchmark;my $TT0 = new Benchmark;my $url = "http://www.baidu.com";my $request_times = 1000; print "\n Now begin testing ... \n";my $lwp = new LWP::UserAgent(agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; CIBA)');my $conncache = new LWP::ConnCache;$lwp->conn_cache($conncache);for(1..$request_times) {    my $request = HTTP::Request->new(GET=>$url);    $request->header(Accept=>'text/html');    my $response = $lwp->request($request);    if ($response->is_success) {        print " $_\tOK!\n";    }    else {          print " $_\tFaild!\n";        redo;       }}my $TT1 = new Benchmark;my $td = Benchmark::timediff($TT1, $TT0);$td = Benchmark::timestr($td);my ($sec) = ($td =~ /(\d+).*/);my $speed = sprintf("%0.1f",$request_times/$sec);print "\n Time expend: $td\n Average Speed: $speed Times Per Second\n\n Press Enter to close me ... \7";

#!/usr/bin/perluse strict;use warnings;use threads;use threads::shared;use Thread::Queue;use LWP::UserAgent;use LWP::ConnCache;use Benchmark;my $TT0 = new Benchmark;my $url = "http://www.baidu.com";my $request_times = 1000;print "\n Now begin testing ... \n";my $lwp = new LWP::UserAgent(agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; CIBA)');my $conncache = new LWP::ConnCache;$lwp->conn_cache($conncache);my $data_queue = new Thread::Queue;my $result_queue = new Thread::Queue;my $processing_count :shared = 0;my $MAX_THREADS = 10;my $num = 1;for (my $n = 0; $n < $MAX_THREADS; $n++){    threads->create(\&thread_io);}foreach my $data(1..$request_times){    if ($data_queue ->pending() > $MAX_THREADS * 2)    {        select(undef, undef, undef, 0.02);        redo;    }    $data_queue->enqueue($data);    if ($result_queue->pending() > 0)    {        while (my $result = $result_queue->dequeue_nb())   {            if($result) { print " $num\tOK!\n"; }            else { print " $num\tFailed!\n"; }            $num++;        }    }}while ($processing_count > 0 or $data_queue->pending() > 0 or $result_queue->pending() > 0){    select(undef, undef, undef, 0.02);    while (my $result = $result_queue->dequeue_nb())    {        if($result) { print " $num\tOK!\n"; }        else { print " $num\tFailed!\n"; }        $num++;    }}foreach my $thread (threads->list()){    $thread->detach();}my $TT1 = new Benchmark;my $td = Benchmark::timediff($TT1, $TT0);$td = Benchmark::timestr($td);my ($sec) = ($td =~ /(\d+).*/);my $speed = sprintf("%0.1f",$request_times/$sec);print "\n Time expend: $td\n Average Speed: $speed Times Per Second\n\n Press Enter to close me ... \7";<STDIN>;##########################################################################################sub thread_io(){    while (my $data = $data_queue->dequeue())    {        {            lock $processing_count; ++$processing_count;        }        my $result = get_html($data);        $result_queue->enqueue($result);        {            lock $processing_count;            --$processing_count;        }    }}sub get_html {    my $no = shift;    my $request = HTTP::Request->new(GET=>$url);    $request->header(Accept=>'text/html');    my $response = $lwp->request($request);    if ($response->is_success) {        return(1);    }    else {        $data_queue->enqueue($no);   #enquenue error request              return(0);    }}                                                    

該多執行緒模式為莫言原創,採用線程池方式,建2個隊列,一個負責向線程隊列新增工作,另一個負責管理工作的處理結果。請求失敗的任務,可以重新排入佇列,以保證每個請求的有效性。該模型易於控制,可靠性高,能即時顯示任務的處理進度,而且效率高。

注意:LWP::ConnCache不支援LWP::Simple,支援LWP::UserAgent。

在此特別感謝莫言。

相關文章

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.