In real projects or small personal tools (news aggregation, commodity price monitoring, price comparison), you usually need to fetch data from third-party websites or API endpoints. When there is a queue of URLs to process, you can improve performance with simple concurrency built on the curl_multi_* family of functions that curl provides.
This article explores two concrete implementations and gives a simple performance comparison between them.
1. The classic curl concurrency mechanism and its problems
The classic curl implementation is easy to find online; for example, the following is based on the implementation in the PHP online manual:
function classic_curl($urls, $delay) {
    $queue = curl_multi_init();
    $map = array();

    foreach ($urls as $url) {
        // Create cURL resources
        $ch = curl_init();

        // Set URL and other appropriate options
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_NOSIGNAL, true);

        // Add handle
        curl_multi_add_handle($queue, $ch);
        $map[$url] = $ch;
    }

    $active = null;

    // Execute the handles
    do {
        $mrc = curl_multi_exec($queue, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);

    while ($active > 0 && $mrc == CURLM_OK) {
        if (curl_multi_select($queue, 0.5) != -1) {
            do {
                $mrc = curl_multi_exec($queue, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }

    // Process the responses only after every transfer has finished
    $responses = array();
    foreach ($map as $url => $ch) {
        $responses[$url] = callback(curl_multi_getcontent($ch), $delay);
        curl_multi_remove_handle($queue, $ch);
        curl_close($ch);
    }

    curl_multi_close($queue);
    return $responses;
}
First, all URLs are pushed into the concurrent queue, then the concurrent transfers run, and only after every request has completed do data parsing and other post-processing begin. In practice, because of network conditions, some URLs return earlier than others, but classic curl concurrency must wait for the slowest URL before processing starts. Waiting means the CPU sits idle. If the URL queue is short, this idle time is acceptable, but with a very long queue the waste becomes unacceptable.
2. The improved rolling curl concurrency method
On closer analysis, the classic approach clearly leaves room for optimization: process each URL as soon as its request completes, while waiting for the other URLs to return, instead of waiting for the slowest one before processing anything. This avoids leaving the CPU idle. The implementation is as follows:
function rolling_curl($urls, $delay) {
    $queue = curl_multi_init();
    $map = array();

    foreach ($urls as $url) {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_NOSIGNAL, true);

        curl_multi_add_handle($queue, $ch);
        // Map the handle back to its URL. The (string) cast works while handles
        // are resources (PHP < 8); on PHP 8+, where handles are objects,
        // spl_object_id($ch) can serve as the key instead.
        $map[(string) $ch] = $url;
    }

    $responses = array();
    do {
        while (($code = curl_multi_exec($queue, $active)) == CURLM_CALL_MULTI_PERFORM);

        if ($code != CURLM_OK) { break; }

        // A request was just completed -- find out which one
        while ($done = curl_multi_info_read($queue)) {
            // Get the info and content returned on the request
            $info = curl_getinfo($done['handle']);
            $error = curl_error($done['handle']);
            $results = callback(curl_multi_getcontent($done['handle']), $delay);
            $responses[$map[(string) $done['handle']]] = compact('info', 'error', 'results');

            // Remove the curl handle that just completed
            curl_multi_remove_handle($queue, $done['handle']);
            curl_close($done['handle']);
        }

        // Block for data in/output; error handling is done by curl_multi_exec
        if ($active > 0) {
            curl_multi_select($queue, 0.5);
        }
    } while ($active);

    curl_multi_close($queue);
    return $responses;
}
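To see why processing on arrival helps, here is a toy timing model. It is only an illustration under simplifying assumptions (responses arrive at fixed times, each response takes a fixed time to process, and processing is strictly sequential); the function names and arrival times are hypothetical and are not part of the benchmark.

```php
<?php
// Toy timing model (an illustration, not a measurement).
// $arrivals: seconds at which each response arrives.
// $delay:    seconds needed to process one response.

function classic_finish_time(array $arrivals, float $delay): float {
    if (count($arrivals) === 0) {
        return 0.0;
    }
    // Processing starts only after the slowest response is in.
    return max($arrivals) + count($arrivals) * $delay;
}

function rolling_finish_time(array $arrivals, float $delay): float {
    sort($arrivals);
    $clock = 0.0;
    foreach ($arrivals as $t) {
        // Start on a response once it has arrived and the previous one is done.
        $clock = max($clock, $t) + $delay;
    }
    return $clock;
}

$arrivals = array(0.02, 0.03, 0.05, 0.10); // hypothetical arrival times
printf("classic: %.3f s\n", classic_finish_time($arrivals, 0.005));
printf("rolling: %.3f s\n", rolling_finish_time($arrivals, 0.005));
```

With these hypothetical arrival times and a 5 ms delay, the model gives 0.120 s for classic and 0.105 s for rolling. With zero delay both finish together with the slowest response, so in this model the gain comes entirely from overlapping processing with waiting.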
3. Performance comparison of the two concurrent implementations
The before-and-after performance comparison was run on a Linux host. The URL queue used in the test is as follows:
http://item.taobao.com/item.htm?id=14392877692
http://item.taobao.com/item.htm?id=16231676302
http://item.taobao.com/item.htm?id=17037160462
http://item.taobao.com/item.htm?id=5522416710
http://item.taobao.com/item.htm?id=16551116403
http://item.taobao.com/item.htm?id=14088310973
A brief explanation of the experimental design and the format of the results: to make the results reliable, each experiment was repeated 20 times. In each run, given the same set of URLs, the elapsed time in seconds was measured for Classic (the classic concurrency mechanism) and Rolling (the improved concurrency mechanism). The shorter time wins (Winner), and the time saved (Excellence, in seconds) and the relative improvement (Excel. %) are calculated. To stay close to real requests while keeping the experiment simple, the returned results are only run through a simple regular expression match, with no other complex operations. In addition, to determine how the result-processing callback affects the comparison, usleep is used to simulate real-world data processing logic (such as extraction, word segmentation, or writing to a file or database).
The callback function used in the performance test is:
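The measurement described above can be sketched roughly as follows. This is a sketch under assumptions, not the original test script: time_call and compare_row are hypothetical helper names, the percentage formula (the difference divided by the slower of the two timings) is inferred from the rows of the result tables, and the two closures are stand-ins for the real classic and rolling calls.

```php
<?php
// Sketch of one benchmark round; helper names are hypothetical.

// Measure the wall-clock time of a single call, in seconds.
function time_call(callable $fn): float {
    $start = microtime(true);
    $fn();
    return microtime(true) - $start;
}

// Build one result row from the two timings.
function compare_row(float $classic, float $rolling): array {
    $excellence = $classic - $rolling;   // positive when rolling is faster
    $slower = max($classic, $rolling);
    return array(
        // A tie is counted for rolling here; that choice is arbitrary.
        'winner'     => $excellence >= 0 ? 'rolling' : 'classic',
        'excellence' => round($excellence, 4),
        // Assumption: percentage is relative to the slower timing.
        'percent'    => $slower > 0 ? round($excellence / $slower * 100, 2) : 0.0,
    );
}

// Stand-ins for the real classic and rolling fetches.
$classic = function () { usleep(2000); };
$rolling = function () { usleep(1000); };

for ($i = 1; $i <= 3; $i++) {
    $row = compare_row(time_call($classic), time_call($rolling));
    printf("%2d %-8s %8.4f %7.2f%%\n", $i, $row['winner'], $row['excellence'], $row['percent']);
}
```

In a real run the two closures would wrap the classic and rolling functions with the same URL list and delay, and the loop would run 20 times.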
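function callback($data, $delay) {
    // The regular expression is truncated in the source; only the opening '/ survives.
    preg_match_all('/…', $data, $matches);
    // Simulate real-world processing time (in microseconds)
    usleep($delay);
    return compact('data', 'matches');
}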
With no delay in the data-processing callback: rolling curl is marginally ahead on average, but the improvement is not significant.
------------------------------------------------------------
Delay: 0 microseconds, equals 0 milliseconds
------------------------------------------------------------
Counter   Classic   Rolling   Winner    Excellence   Excel. %
------------------------------------------------------------
1         0.1193    0.0390    rolling    0.0803       67.31%
2         0.0556    0.0477    rolling    0.0079       14.21%
3         0.0461    0.0588    classic   -0.0127      -21.6%
4         0.0464    0.0385    rolling    0.0079       17.03%
5         0.0534    0.0448    rolling    0.0086       16.1%
6         0.0540    0.0714    classic   -0.0174      -24.37%
7         0.0386    0.0416    classic   -0.0030       -7.21%
8         0.0357    0.0398    classic   -0.0041      -10.3%
9         0.0437    0.0442    classic   -0.0005       -1.13%
10        0.0319    0.0348    classic   -0.0029       -8.33%
11        0.0529    0.0430    rolling    0.0099       18.71%
12        0.0503    0.0581    classic   -0.0078      -13.43%
13        0.0344    0.0225    rolling    0.0119       34.59%
14        0.0397    0.0643    classic   -0.0246      -38.26%
15        0.0368    0.0489    classic   -0.0121      -24.74%
16        0.0502    0.0394    rolling    0.0108       21.51%
17        0.0592    0.0383    rolling    0.0209       35.3%
18        0.0302    0.0285    rolling    0.0017        5.63%
19        0.0248    0.0553    classic   -0.0305      -55.15%
20        0.0137    0.0131    rolling    0.0006        4.38%
------------------------------------------------------------
Average   0.0458    0.0436    rolling    0.0022        4.8%
------------------------------------------------------------
Summary: classic wins 10 times, while rolling wins 10 times
With a 5 ms delay in the data-processing callback: rolling curl wins every run, with a performance improvement of about 40% or more.
------------------------------------------------------------
Delay: 5000 microseconds, equals 5 milliseconds
------------------------------------------------------------
Counter   Classic   Rolling   Winner    Excellence   Excel. %
------------------------------------------------------------
1         0.0658    0.0352    rolling    0.0306       46.5%
2         0.0728    0.0367    rolling    0.0361       49.59%
3         0.0732    0.0387    rolling    0.0345       47.13%
4         0.0783    0.0347    rolling    0.0436       55.68%
5         0.0658    0.0286    rolling    0.0372       56.53%
6         0.0687    0.0362    rolling    0.0325       47.31%
7         0.0787    0.0337    rolling    0.0450       57.18%
8         0.0676    0.0391    rolling    0.0285       42.16%
9         0.0668    0.0351    rolling    0.0317       47.46%
10        0.0603    0.0317    rolling    0.0286       47.43%
11        0.0714    0.0350    rolling    0.0364       50.98%
12        0.0627    0.0215    rolling    0.0412       65.71%
13        0.0617    0.0401    rolling    0.0216       35.01%
14        0.0721    0.0226    rolling    0.0495       68.65%
15        0.0701    0.0428    rolling    0.0273       38.94%
16        0.0674    0.0352    rolling    0.0322       47.77%
17        0.0452    0.0425    rolling    0.0027        5.97%
18        0.0596    0.0366    rolling    0.0230       38.59%
19        0.0679    0.0480    rolling    0.0199       29.31%
20        0.0657    0.0338    rolling    0.0319       48.55%
------------------------------------------------------------
Average   0.0671    0.0354    rolling    0.0317       47.24%
------------------------------------------------------------
Summary: classic wins 0 times, while rolling wins 20 times
The comparison above shows that rolling curl is the better choice for processing a URL queue concurrently. When the queue is very large (1000+ URLs), you can also cap the number of concurrent requests, for example at 20: each time one URL has returned and been processed, one URL that has not yet been requested is added to the queue. Code written this way is more robust and will not hang or crash because the concurrency is too high. For a detailed implementation, see: http://code.google.com/p/rolling-curl/
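A minimal sketch of that capped queue might look like the following. It is not the code of the rolling-curl project: the function name windowed_rolling_curl and the parameters $window and $callback are hypothetical, and the sketch tags each handle with CURLOPT_PRIVATE so that a finished handle can be matched back to its URL.

```php
<?php
// Sketch: rolling curl with a cap on in-flight requests (names are illustrative).

function windowed_rolling_curl(array $urls, int $window, callable $callback): array {
    $queue = curl_multi_init();
    $pending = array_values($urls);   // URLs not yet requested
    $inflight = array();              // private id => URL currently being fetched
    $responses = array();
    $nextId = 0;

    // Start the next pending request and tag its handle with a private id.
    $start = function () use (&$pending, &$inflight, &$nextId, $queue) {
        $url = array_shift($pending);
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_PRIVATE, (string) $nextId);
        curl_multi_add_handle($queue, $ch);
        $inflight[$nextId++] = $url;
    };

    // Fill the window with the first requests.
    while (count($pending) > 0 && count($inflight) < $window) {
        $start();
    }

    while (count($inflight) > 0) {
        do {
            $code = curl_multi_exec($queue, $active);
        } while ($code == CURLM_CALL_MULTI_PERFORM);
        if ($code != CURLM_OK) { break; }

        while ($done = curl_multi_info_read($queue)) {
            $ch = $done['handle'];
            $id = (int) curl_getinfo($ch, CURLINFO_PRIVATE);
            $responses[$inflight[$id]] = $callback(curl_multi_getcontent($ch));
            unset($inflight[$id]);
            curl_multi_remove_handle($queue, $ch);
            curl_close($ch);
            // One request finished, so one pending URL may enter the window.
            if (count($pending) > 0) { $start(); }
        }

        if (count($inflight) > 0) {
            curl_multi_select($queue, 0.5);
        }
    }

    curl_multi_close($queue);
    return $responses;
}
```

A call such as windowed_rolling_curl($urls, 20, function ($data) use ($delay) { return callback($data, $delay); }) would keep at most 20 requests in flight. Responses are keyed by URL, so duplicate URLs in the queue overwrite each other in this sketch.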