PHP's cURL extension is really powerful. It provides curl_multi_init(), a batch-processing interface that lets several requests run concurrently, which can be used to fetch many pages at once and speed up an ordinary web crawler.
A simple fetch function:
function http_get_multi($urls)
{
    $count = count($urls);
    $data  = [];
    $chs   = [];

    // Create a batch (multi) cURL handle
    $mh = curl_multi_init();

    // Create a cURL resource for each URL, set its options, and add it to the batch
    for ($i = 0; $i < $count; $i++) {
        $chs[$i] = curl_init();
        curl_setopt($chs[$i], CURLOPT_RETURNTRANSFER, 1); // return the body, don't print it
        curl_setopt($chs[$i], CURLOPT_URL, $urls[$i]);
        curl_setopt($chs[$i], CURLOPT_HEADER, 0);
        curl_multi_add_handle($mh, $chs[$i]);
    }

    // Execute the batch of handles
    do {
        $mrc = curl_multi_exec($mh, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);

    while ($active && $mrc == CURLM_OK) {
        // Wait for activity on any handle instead of busy-looping
        if (curl_multi_select($mh) != -1) {
            do {
                $mrc = curl_multi_exec($mh, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }

    // Collect the content of each handle; false on error
    for ($i = 0; $i < $count; $i++) {
        $content  = curl_multi_getcontent($chs[$i]);
        $data[$i] = (curl_errno($chs[$i]) == 0) ? $content : false;
    }

    // Remove and close all handles
    for ($i = 0; $i < $count; $i++) {
        curl_multi_remove_handle($mh, $chs[$i]);
        curl_close($chs[$i]);
    }
    curl_multi_close($mh);

    return $data;
}
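One caveat, raised by the CSDN article in the references below: a single slow or unreachable server can hold up the whole batch. A minimal sketch of a guard, assuming the $chs handles from the setup loop above (the 3 s / 10 s values are arbitrary placeholders, not values from the original code):

// Sketch: per-handle timeouts keep one slow or dead server from
// stalling the whole batch. Tune the values to your own needs.
for ($i = 0; $i < $count; $i++) {
    curl_setopt($chs[$i], CURLOPT_CONNECTTIMEOUT, 3); // abort if connecting takes over 3 s
    curl_setopt($chs[$i], CURLOPT_TIMEOUT, 10);       // abort if the whole transfer takes over 10 s
}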
The following test calls it (the get() function used for comparison is the one from http://www.cnblogs.com/whatmiss/p/7114954.html):
// Build a long URL list by repeating a handful of pages
// (the random query string makes each URL unique)
$url = [
    'http://www.baidu.com',
    'http://www.163.com',
    'http://www.sina.com.cn',
    'http://www.qq.com',
    'http://www.sohu.com',
    'http://www.douban.com',
    'http://www.cnblogs.com',
    'http://www.taobao.com',
    'http://www.php.net',
];
$urls = [];
for ($i = 0; $i < 10; $i++) {
    foreach ($url as $r) {
        $urls[] = $r . '/?v=' . rand();
    }
}
// Concurrent requests
$t1 = microtime(true);
$datas = http_get_multi($urls);
foreach ($datas as $key => $data) {
    // Record the result; remember to create the log folder first
    file_put_contents('log/multi_' . $key . '.txt', $data);
}
$t2 = microtime(true);
echo $t2 - $t1;
echo '<br />';
// Synchronous requests, one at a time; get() is from http://www.cnblogs.com/whatmiss/p/7114954.html
$t1 = microtime(true);
foreach ($urls as $key => $url) {
    // Record the result; remember to create the log folder first
    file_put_contents('log/get_' . $key . '.txt', get($url));
}
$t2 = microtime(true);
echo $t2 - $t1;
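The get() helper itself lives in the post linked above; for completeness, a minimal single-request version might look like the sketch below (my assumption, not the linked post's exact code), mirroring the options used in http_get_multi():

// Assumed single-URL fetcher for the sequential comparison
function get($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body, don't print it
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}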
The test results show an obvious gap, and it keeps widening as the number of requests grows:
Concurrent (s)     Sequential (s)
2.4481401443481    21.68923997879
8.925509929657     24.73141503334
3.2431850433352    23.384337902069
3.2841880321503    24.754415035248
3.2091829776764    29.068662881851
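Averaged over the five runs, that is roughly 4.2 s for the concurrent version against 24.7 s for the sequential one, close to a 6x speedup.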
References (thanks to the original authors):
http://php.net/manual/zh/function.curl-multi-init.php
http://www.tuicool.com/articles/auiEBb
http://blog.csdn.net/liylboy/article/details/39669963 (this article discusses a possible timeout problem)
Also, the article below claims that concurrent requests are no faster, and even slightly slower. That result strikes me as very strange; I don't see how it could come out that way:
http://www.webkaka.com/tutorial/php/2013/102843/ (Use PHP's cURL to make concurrent requests for remote files / crawl remote web pages)