PHP's cURL extension is really powerful. It provides curl_multi_init(), a batch-processing interface that lets several requests run concurrently, which can be used to fetch many pages at once and speed up an ordinary web crawler.
A simple fetch function:
function http_get_multi($urls)
{
    $count = count($urls);
    $data  = [];
    $chs   = [];

    // Create a batch (multi) cURL handle
    $mh = curl_multi_init();

    // Create a cURL resource for each URL, set its options, and add it to the batch
    for ($i = 0; $i < $count; $i++) {
        $chs[$i] = curl_init();
        curl_setopt($chs[$i], CURLOPT_RETURNTRANSFER, 1); // return the body, don't print it
        curl_setopt($chs[$i], CURLOPT_URL, $urls[$i]);
        curl_setopt($chs[$i], CURLOPT_HEADER, 0);
        curl_multi_add_handle($mh, $chs[$i]);
    }

    // Execute the batch of handles
    do {
        $mrc = curl_multi_exec($mh, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);

    while ($active && $mrc == CURLM_OK) {
        // Wait for activity on any handle instead of busy-looping
        if (curl_multi_select($mh) != -1) {
            do {
                $mrc = curl_multi_exec($mh, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }

    // Collect the content of each handle; false on error
    for ($i = 0; $i < $count; $i++) {
        $content  = curl_multi_getcontent($chs[$i]);
        $data[$i] = (curl_errno($chs[$i]) == 0) ? $content : false;
    }

    // Remove and close all handles
    for ($i = 0; $i < $count; $i++) {
        curl_multi_remove_handle($mh, $chs[$i]);
        curl_close($chs[$i]);
    }
    curl_multi_close($mh);

    return $data;
}
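One caveat, raised by the CSDN article in the references below: a single slow or unreachable server can hold up the whole batch. A minimal sketch of a guard, assuming the $chs handles from the setup loop above (the 3 s / 10 s values are arbitrary placeholders, not values from the original code):

// Sketch: per-handle timeouts keep one slow or dead server from
// stalling the whole batch. Tune the values to your own needs.
for ($i = 0; $i < $count; $i++) {
    curl_setopt($chs[$i], CURLOPT_CONNECTTIMEOUT, 3); // abort if connecting takes over 3 s
    curl_setopt($chs[$i], CURLOPT_TIMEOUT, 10);       // abort if the whole transfer takes over 10 s
}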
The following test calls it (the get() function used for comparison is the one from http://www.cnblogs.com/whatmiss/p/7114954.html):
// Build a long URL list by repeating a handful of pages
// (the random query string makes each URL unique)
$url = [
    'http://www.baidu.com',
    'http://www.163.com',
    'http://www.sina.com.cn',
    'http://www.qq.com',
    'http://www.sohu.com',
    'http://www.douban.com',
    'http://www.cnblogs.com',
    'http://www.taobao.com',
    'http://www.php.net',
];
$urls = [];
for ($i = 0; $i < 10; $i++) {
    foreach ($url as $r) {
        $urls[] = $r . '/?v=' . rand();
    }
}
// Concurrent requests
$t1 = microtime(true);
$datas = http_get_multi($urls);
foreach ($datas as $key => $data) {
    // Record the result; remember to create the log folder first
    file_put_contents('log/multi_' . $key . '.txt', $data);
}
$t2 = microtime(true);
echo $t2 - $t1;
echo '<br />';
// Synchronous requests, one at a time; get() is from http://www.cnblogs.com/whatmiss/p/7114954.html
$t1 = microtime(true);
foreach ($urls as $key => $url) {
    // Record the result; remember to create the log folder first
    file_put_contents('log/get_' . $key . '.txt', get($url));
}
$t2 = microtime(true);
echo $t2 - $t1;
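The get() helper itself lives in the post linked above; for completeness, a minimal single-request version might look like the sketch below (my assumption, not the linked post's exact code), mirroring the options used in http_get_multi():

// Assumed single-URL fetcher for the sequential comparison
function get($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body, don't print it
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}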
The test results show an obvious gap, and it keeps widening as the number of requests grows:
Concurrent (s)     Sequential (s)
2.4481401443481    21.68923997879
8.925509929657     24.73141503334
3.2431850433352    23.384337902069
3.2841880321503    24.754415035248
3.2091829776764    29.068662881851
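Averaged over the five runs, that is roughly 4.2 s for the concurrent version against 24.7 s for the sequential one, close to a 6x speedup.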
References (thanks to the original authors):
http://php.net/manual/zh/function.curl-multi-init.php
http://www.tuicool.com/articles/auiEBb
http://blog.csdn.net/liylboy/article/details/39669963 (this article discusses a possible timeout problem)
Also, the article below claims that concurrent requests are no faster, and even slightly slower. That result strikes me as very strange; I don't see how it could come out that way:
http://www.webkaka.com/tutorial/php/2013/102843/ (Use PHP's cURL to make concurrent requests for remote files / crawl remote web pages)