PHP cURL synchronous/asynchronous concurrent crawling


Building on the previous article, I have cleaned up the code and merged the synchronous collection method into the class.

The class relies on a separate LProxy proxy class, which you will need to implement yourself; if you do not need a proxy, simply remove the proxy-related code.

Synchronous call example:

$this->catcher = new LCatcher(10, 20, false, true, true);

list($code, $content) = $this->catcher->get($url);
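get() blocks until the page is fetched and returns a two-element array($code, $content), which is why list() destructuring works above. A minimal sketch of handling that return shape (stubGet() is a made-up stand-in for a real network fetch, since LCatcher needs LProxy and a live connection):

```php
<?php
// Stub standing in for LCatcher::get(); the real method blocks on cURL.
// The URL and page body here are invented for illustration only.
function stubGet($url)
{
    // A real fetch would return the HTTP status code and the raw body.
    return array(200, '<html><body>hello</body></html>');
}

list($code, $content) = stubGet('http://example.com/');

if ($code == 200) {
    echo "Length: " . strlen($content) . "\n";
}
```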

Concurrent collection:

$this->catcher = new LCatcher(10, $asynNum, true);

$this->catcher->pushJob($this, $url, 9);

$this->catcher->run();

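pushJob() accepts any object that exposes a callback($code, $content) method; once a page arrives (or fails with a non-200 code), done() invokes it. A minimal consumer might look like this (the class name PageCollector and the way results are stored are illustrative, not from the original article):

```php
<?php
// Hypothetical consumer of LCatcher: keeps every successfully fetched page.
class PageCollector
{
    public $pages = array();

    // LCatcher::done() calls this with the HTTP code and the (optionally
    // transcoded) body once a job finishes.
    public function callback($code, $content)
    {
        if ($code == 200) {
            $this->pages[] = $content;
        }
    }
}

$collector = new PageCollector();

// In real use: $catcher->pushJob($collector, $url, 9); $catcher->run();
// Here we simulate what done() would do for two finished jobs:
$collector->callback(200, '<html>ok</html>');
$collector->callback(404, '');

echo count($collector->pages) . "\n"; // only the 200 response is kept
```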

The complete code is as follows:

/**
 * Concurrent asynchronous collector
 */
class LCatcher {
    // Runtime parameters:
    public $timeout = 10;          // default timeout, in seconds
    public $useProxy = true;       // whether to use a proxy server
    public $concurrentNum = 20;    // number of concurrent jobs
    public $autoUserAgent = true;  // whether to rotate the User-Agent automatically
    public $autoFollow = false;    // whether to follow 301/302 redirects automatically

    /**
     * Create a collector
     * @param number $timeout
     * @param number $concurrentNum concurrency
     * @param bool $useProxy whether to use a proxy
     */
    public function __construct($timeout = 10, $concurrentNum = 20, $useProxy = true, $autoFollow = false, $autoUserAgent = true) {
        $this->timeout = $timeout;
        $this->concurrentNum = $concurrentNum;
        $this->useProxy = $useProxy;
        $this->autoFollow = $autoFollow;
        $this->autoUserAgent = $autoUserAgent;
    }

    /**
     * Serial (blocking) collection
     *
     * @param string $url address to collect
     * @param bool $must whether the page is known to exist (expect 200 and a complete body)
     * @param bool $iconv whether to transcode GBK -> UTF-8
     * @param string|bool $referer
     * @return array array($code, $content); blocks the current process until the data is collected
     */
    public function get($url, $must = true, $iconv = true, $referer = false) {
        $url = trim($url);
        static $lastUrl;
        echo "\r\nURL: $url\r\n";

        if ($referer === true) {
            $referer = $lastUrl;
        } elseif (!$referer) {
            $referer = '';
        }

        // Retry until it succeeds or gives up
        while (true) {
            list($ch, $proxy) = $this->createHandle($url, $referer);

            // Start crawling
            $begin = microtime(true);
            $content = curl_exec($ch);
            $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            $end = microtime(true);
            $errno = curl_errno($ch); // error code
            $error = curl_error($ch); // error message

            // Close the connection
            curl_close($ch);

            // Errors here are most likely the proxy's fault, not the origin site's.
            // (The exact status codes below are a reconstruction; the original literals were corrupted.)
            if ($errno or $code >= 500 or !$content or $code == 400 or $code == 403 or $code == 407) {
                // Mark this proxy as failed
                if ($this->useProxy) {
                    LProxy::failure($proxy);
                }

                // Display the error message
                if ($errno) {
                    if ($errno == 28) {
                        $error = 'timeout of ' . $this->timeout . 's';
                    }
                    echo "\r\nProxy: $proxy\r\n";
                    echo "Curl error: $errno ($error)\r\n";
                    continue;
                }

                if ($code >= 500) {
                    echo "\r\nProxy: $proxy\r\n";
                    echo "Http Code: $code\r\n";
                    continue;
                }
            }

            // '</html>' is an assumed end-of-page marker; the original needle was lost in extraction
            if ($must and ($code != 200 or !strpos($content, '</html>'))) {
                if ($code != 200) {
                    echo "\r\nProxy: $proxy\r\n";
                    echo "Http Code: $code\r\n";
                    continue;
                }
                if (!strpos($content, '</html>')) {
                    echo "\r\nProxy: $proxy\r\n";
                    echo "Not End, Length: " . strlen($content) . "\r\n";
                    continue;
                }
            }

            // Succeeded
            break;
        }

        // The fetch succeeded
        if ($this->useProxy) {
            LProxy::success($proxy);
        }

        if ($iconv) {
            $content = self::iconv($content);
        }
        echo 'Http Code: ' . $code . "\tUsed: " . round($end - $begin, 2) . "\tLength: " . strlen($content) . "\r\n";

        $lastUrl = $url;

        // Return the result
        return array(
            $code,
            $content
        );
    }


    // Job stack; priority 0 is the highest
    private $jobStack = array();

    /**
     * Add an asynchronous job
     * @param object $obj callback object; must have a callback($code, $content) method
     * @param string $url address to collect
     * @param number $major priority, 0 is highest
     * @param bool $iconv whether to transcode (GBK -> UTF-8)
     * @param string|bool $referer
     */
    public function pushJob($obj, $url, $major = 0, $iconv = true, $referer = false) {
        $major = max(0, intval($major)); // clamp negative priorities to 0 (this line was garbled in the source)
        if (!isset($this->jobStack[$major])) {
            $this->jobStack[$major] = array();
        }
        $this->jobStack[$major][] = array(
            'obj' => $obj,
            'url' => $url,
            'iconv' => $iconv,
            'referer' => $referer
        );
        return $this;
    }

    // Set of handles currently being collected
    private $map = array();

    // The master multi handle
    private $chs;

    /**
     * Create a fetch handle
     * @param string $url address to crawl
     * @param string $referer
     * @return array array($ch, $proxy)
     */
    private function createHandle($url, $referer = '') {
        // Build the handle
        $ch = curl_init($url);

        // Build the options
        $opt = array(
            CURLOPT_RETURNTRANSFER => true,                // return the result instead of printing it
            CURLOPT_TIMEOUT => $this->timeout,             // timeout
            CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1, // HTTP 1.1
            CURLOPT_REFERER => $referer,                   // previous page
            CURLOPT_COOKIE => '',                          // no cookie
            CURLOPT_FOLLOWLOCATION => $this->autoFollow,   // whether to follow 301/302 automatically
            CURLOPT_USERAGENT => $this->autoUserAgent ? $this->agents[rand(0, count($this->agents) - 1)] : '', // pick a random User-Agent

            // Headers below; capture the headers of a normal request with FireBug or similar
            CURLOPT_HTTPHEADER => array(
                "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*" . "/*;q=0.8",
                "Accept-Language: zh-CN,zh;q=0.8",
                "Connection: keep-alive"
            )
        );

        // Apply the cURL options
        curl_setopt_array($ch, $opt);

        // Decide whether to use a proxy server
        if ($this->useProxy) {
            $proxy = LProxy::get();
            if (!$proxy) {
                dump('no valid proxy');
                exit;
            }
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
        } else {
            $proxy = '';
        }
        return array($ch, $proxy);
    }

    /**
     * Move jobs from the pending stack into the in-progress set
     */
    private function fillMap() {
        while (count($this->map) < $this->concurrentNum) {
            $job = false;

            // Start with the highest priority (lowest key)
            $keys = array_keys($this->jobStack);
            sort($keys);
            foreach ($keys as $i) {
                if (!isset($this->jobStack[$i]) or !count($this->jobStack[$i])) {
                    continue;
                }
                $job = array_pop($this->jobStack[$i]);
                break; // take one job and stop (without this break, jobs popped from lower levels would be lost)
            }

            // No pending jobs
            if (!$job) {
                break;
            }

            list($ch, $proxy) = $this->createHandle($job['url'], $job['referer']);

            $job['proxy'] = $proxy;

            // Add it to the master multi handle
            curl_multi_add_handle($this->chs, $ch);

            // Record it in the in-progress map
            $this->map[strval($ch)] = $job;
        }
        return;
    }

    /**
     * Process one finished job
     * @param array $done an entry returned by curl_multi_info_read()
     */
    private function done($done) {
        $ch = $done['handle'];    // the finished easy handle
        $errno = curl_errno($ch); // error code
        $error = curl_error($ch); // error message
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // HTTP code

        // Look it up in the in-progress set
        $job = $this->map[strval($ch)];

        // URL of the page
        $url = $job['url'];

        // Transfer statistics
        $chInfo = curl_getinfo($ch);

        // Collected content
        $result = curl_multi_getcontent($ch);

        // Transcode if the job asked for it
        if ($job['iconv']) {
            $result = self::iconv($result);
        }

        // Content length
        $length = strlen($result);

        // Errors here are most likely the proxy's fault, not the origin site's.
        // (The exact status codes below are a reconstruction; the original literals were corrupted.)
        if ($errno or $code >= 500 or $length == 0 or $code == 400 or $code == 401 or $code == 403 or $code == 407) {
            // Mark this proxy as failed
            if ($job['proxy']) {
                LProxy::failure($job['proxy']);
            }

            // Display the error message
            if ($errno) {
                if ($errno == 28) {
                    $error = 'timeout of ' . $this->timeout . 's';
                }
                echo "\r\nURL: $url\r\n";
                echo "Curl error: $errno ($error)\r\n";
            } elseif ($code >= 500) {
                echo "\r\nURL: $url\r\n";
                echo "Http Code: $code\r\n";
            }

            // Push the job back onto the stack so another proxy can retry it,
            // carrying the iconv/referer flags along
            $this->pushJob($job['obj'], $job['url'], 9999, $job['iconv'], $job['referer']);

            // Keep going
            return;
        }

        // The target site itself returned an error, e.g. 302 or 404
        if ($code != 200) {
            echo "\r\nURL: $url\r\n";
            echo "Proxy: " . $job['proxy'] . "\r\n";
            echo "Code: $code Length: $length\r\n";

            // Invoke the callback to process the collected content
            $job['obj']->callback($code, $result);

            // Keep going
            return;
        }

        // The fetch succeeded
        if ($job['proxy']) {
            LProxy::success($job['proxy']);
        }
        echo "\r\nURL: $url\r\n";
        echo "Http Code: $code\tUsed: " . round($chInfo['total_time'], 2) . "\tLength: $length\r\n";

        // Invoke the callback object's method to process the collected content
        $job['obj']->callback($code, $result);
    }

    /**
     * Start concurrent collection after jobs have been pushed onto the stack
     */
    public function run() {
        // Master multi handle
        $this->chs = curl_multi_init();

        // Load the first batch of jobs
        $this->fillMap();

        do { // drive the concurrent requests and keep checking their status
            do { // keep calling while curl asks us to
                $status = curl_multi_exec($this->chs, $active);
            } while ($status == CURLM_CALL_MULTI_PERFORM);

            // Some sub-requests may have finished; fetch and process them one by one
            while (true) {
                $done = curl_multi_info_read($this->chs);
                if (!$done) {
                    break;
                }

                // Process the finished transfer
                $this->done($done);

                // Remove this job
                unset($this->map[strval($done['handle'])]);
                curl_multi_remove_handle($this->chs, $done['handle']);
                curl_close($done['handle']);

                // Top up with new jobs
                $this->fillMap();
            }

            // No more jobs: quit
            if (($status != CURLM_OK or !$active) and !count($this->map)) {
                break;
            }

            // curl_multi_select($this->chs, 0.5); // optional: block up to ~0.5s instead of busy-waiting
        } while (true); // transfers still in progress
    }

    // User-Agents to pick from at random
    private $agents = array(
        'Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.2152; .NET CLR 3.5.30729; .NET CLR 1.1.4322; CBA; InfoPath.2; SE 2.X MetaSr 1.0; AskTB5.6; SE 2.X MetaSr 1.0)',
        'ia_archiver (+http://www.alexa.com/site/help/webmasters; [email protected])',
        'Mozilla/5.0 (compatible; YoudaoBot/1.0; http://www.youdao.com/help/webmaster/spider/;)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; MyIE9; .NET CLR 2.0.50727; InfoPath.1; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.2152; .NET CLR 3.5.30729)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Trident/4.0; EmbeddedWB 14.52 from: http://www.bsalsa.com/ EmbeddedWB 14.52; InfoPath.3; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; Shuame)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 SE 2.X MetaSr 1.0',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/3.5 Safari/536.11',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; InfoPath.1)',
        'Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0',
        'Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html)',
        'Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1; DigExt)',
        'Mozilla/5.0 (compatible; MJ12bot/v1.4.4; http://www.majestic12.co.uk/bot.php?+)',
        'msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)',
        'Mozilla/5.0 (compatible; MSIE 6.0; Windows XP)',
        'Mozilla/5.0 (compatible; CompSpyBot/1.0; +http://www.compspy.com/spider.html)',
        '360spider-image',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.21 (KHTML, like Gecko) spider6 Safari/537.21',
        'NIS Nutch Spider/Nutch-1.7',
        'Baiduspider',
        'Mozilla/5.0 (compatible; CompSpyBot/1.0; +http://www.compspy.com/spider.html)',
        'Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])',
        'Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)',
        'Mozilla/5.0 (compatible; YYSpider; +http://www.yunyun.com/spider.html)',
        'Mozilla/5.0 (compatible; ZumBot/1.0; http://help.zum.com/inquiry)',
    );

    /**
     * Transcode: GBK => UTF-8
     * @param string $str
     * @return string
     */
    static public function iconv($str) {
        $ret = mb_convert_encoding($str, 'utf-8', 'gbk');
        if ($ret) {
            return $ret;
        }
        return $str;
    }

    /**
     * Take the substring of $content between the markers $begin and $end
     */
    static public function mid($content, $begin = false, $end = false) {
        if ($begin !== false) {
            $b = mb_stripos($content, $begin);
            if ($b === false) {
                return false;
            }
            // mb_strlen, not strlen: mb_substr offsets count characters, not bytes
            $content = mb_substr($content, $b + mb_strlen($begin));
        }
        if ($end !== false) {
            $e = mb_stripos($content, $end);
            if ($e === false) {
                return false;
            }
            if ($e == 0) {
                $content = '';
            } else {
                $content = mb_substr($content, 0, $e);
            }
        }
        return $content;
    }
    /**
     * Strip HTML entities (e.g. &nbsp;)
     * @param string $content
     * @return mixed
     */
    static public function filterHard($content) {
        $content = preg_replace('/&[^;]*;/', '', $content);
        return $content;
    }

    /**
     * Strip all HTML tags
     * @param string $content
     * @return mixed
     */
    static public function filterTag($content) {
        return preg_replace('/<[^>]*>/', '', $content);
    }

    /**
     * Strip HTML comments, iframes, scripts, hyperlinks and certain specific tags.
     * NOTE: the tag text of most of the original patterns was lost when this
     * article was extracted; the patterns below are a best-effort reconstruction.
     * @param string $content
     * @return mixed
     */
    static public function filterDetail($content) {
        $content = preg_replace('/<\!\-\-.*?\-\->/sm', '', $content);
        $content = preg_replace(
            array(
                '/<\!\-\-.*?\-\->/sm',        // HTML comments
                '/<script.*?<\/script>/smi',  // scripts (tag assumed)
                '/<iframe.*?<\/iframe>/smi',  // iframes (tag assumed)
                '/<a[^>]*>/mi',               // opening anchors (tag assumed)
                '/<\/a[^>]*>/mi',             // closing anchors (tag assumed)
                '/<span[^>]*>/i',             // tag assumed
                '/<\/span[^>]*>/i',           // tag assumed
                '/<img[^>]*src="http:\/\/www.wtai.cn\/[^>]*>/i',         // images from this host (tag assumed)
                '/<img[^>]*src="http:\/\/www.aids\-china.com\/[^>]*>/i', // images from this host (tag assumed)
            ), '', $content
        );

        // The original also ran str_replace() over a few literal strings here,
        // but those strings were lost in extraction.
        return $content;
    }

    /**
     * Strip comments, scripts, iframes and empty paragraphs.
     * NOTE: the tag text of the script/iframe/style patterns was lost in
     * extraction; the versions below are assumed.
     */
    static public function filterFrameAndScript($content) {
        // Run twice so that overlapping matches exposed by the first pass are also removed
        for ($i = 0; $i < 2; $i++) {
            $content = preg_replace(
                array(
                    '/<\!\-\-.*?\-\->/sm',       // HTML comments
                    '/<script.*?<\/script>/smi', // scripts (tag assumed)
                    '/<iframe.*?<\/iframe>/smi', // iframes (tag assumed)
                    '/<p[^>]*>\s*<\/p>/im',      // empty paragraphs
                    '/<style.*?<\/style>/smi',   // tag assumed
                ), '', $content
            );
        }
        return $content;
    }
}
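The static helpers can be used on their own: mid() clips the text between two markers, and filterTag() strips the remaining markup. A quick standalone illustration (the functions below mirror the class methods so the snippet runs without LCatcher or LProxy; the input HTML is made up):

```php
<?php
// Standalone mirrors of LCatcher::mid() and LCatcher::filterTag(),
// reproduced here so the example is self-contained.
function mid($content, $begin, $end)
{
    $b = mb_stripos($content, $begin);
    if ($b === false) {
        return false;
    }
    $content = mb_substr($content, $b + mb_strlen($begin));
    $e = mb_stripos($content, $end);
    if ($e === false) {
        return false;
    }
    return $e == 0 ? '' : mb_substr($content, 0, $e);
}

function filterTag($content)
{
    // Remove every <...> tag, keeping only the text
    return preg_replace('/<[^>]*>/', '', $content);
}

$html = '<div id="t"><h1>Title</h1><b>body</b></div>';
$inner = mid($html, '<div id="t">', '</div>'); // '<h1>Title</h1><b>body</b>'
echo filterTag($inner) . "\n"; // Titlebody
```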
