Using get_htmls, a parallel page-collection function built on curl

Source: Internet
Author: User
The first article used get_html() to implement simple data collection. Because the pages are collected one at a time, the total transfer time is the sum of the download times of all pages: if one page takes 1 second, 10 pages take 10 seconds. Fortunately, curl also provides parallel processing.

To write a function for parallel collection, we first need to understand what kinds of pages will be collected and what requests those pages need. Only then can we write a reasonably general function.
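The timing argument above can be sketched as simple arithmetic. The per-page times below are illustrative values rather than measurements, and the parallel figure is an idealized bound that assumes all downloads overlap completely:

```php
<?php
// Hypothetical download times in seconds for 10 pages (illustrative, not measured).
$pageTimes = array_fill(0, 10, 1.0);

// One-by-one collection: each download waits for the previous one, so the times add up.
$sequential = array_sum($pageTimes);

// Idealized parallel collection: all downloads overlap, so the slowest page dominates.
$parallel = max($pageTimes);

printf("sequential: %.1f s, parallel (ideal): %.1f s\n", $sequential, $parallel);
```

In practice the parallel time will be somewhat higher than this bound, since bandwidth and the remote server are shared between the transfers.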


Function requirement analysis:

What is returned?

Naturally, the HTML of every page, collected into a single array.

What parameters are passed?

When writing get_html(), we saw that additional curl settings can be passed in through an options array. The multi-page collection function must keep this ability.

What type of parameters?

Whether we are requesting a web page's HTML or calling a web API, GET and POST requests usually target the same page or interface each time and differ only in their parameters. So the parameter types would be:

get_htmls($url, $options);

$url is a string.

$options is a two-dimensional array; the parameters for each page form one inner array.

That seems to solve the problem. However, I searched the curl manual and found no option for passing GET parameters. For GET requests, therefore, the only choice is to pass the URLs as an array of complete URLs and to add a method parameter.
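To make the two call shapes concrete, here is a minimal sketch that expands both into a flat list of URL/options pairs. The helper name plan_requests, the example.com URLs, and the 'demo-agent' string are assumptions made for illustration. They are not part of the article's code, and the helper performs no network requests:

```php
<?php
// Hypothetical helper: shows how the arguments are interpreted for each method.
function plan_requests($urls, array $options, $method = 'get') {
    $plan = array();
    if ($method == 'get') {
        // GET: an array of complete URLs sharing one one-dimensional options array.
        foreach ($urls as $key => $url) {
            $plan[$key] = array('url' => $url, 'options' => $options);
        }
    } elseif ($method == 'post') {
        // POST: a single URL string and one options array per request.
        foreach ($options as $key => $option) {
            $plan[$key] = array('url' => $urls, 'options' => $option);
        }
    }
    return $plan;
}

$getPlan = plan_requests(
    array('http://example.com/?page=1', 'http://example.com/?page=2'),
    array(CURLOPT_USERAGENT => 'demo-agent')
);
$postPlan = plan_requests(
    'http://example.com/login',
    array(
        array(CURLOPT_POSTFIELDS => 'username=user1'),
        array(CURLOPT_POSTFIELDS => 'username=user2'),
    ),
    'post'
);
echo count($getPlan), ' GET requests, ', count($postPlan), " POST requests\n";
```

Note the asymmetry: in GET mode the options are shared by every request, while in POST mode each request carries its own options.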


The function prototype is therefore get_htmls($urls, $options = array(), $method = 'get').
The code is as follows:
function get_htmls($urls, $options = array(), $method = 'get') {
    $mh = curl_multi_init();
    $curls = array();
    $htmls = array();
    if ($method == 'get') { // GET is the most common way to pass values
        foreach ($urls as $key => $url) {
            $ch = curl_init($url);
            $options[CURLOPT_RETURNTRANSFER] = true; // return the body instead of printing it
            $options[CURLOPT_TIMEOUT] = 5;
            curl_setopt_array($ch, $options);
            $curls[$key] = $ch;
            curl_multi_add_handle($mh, $curls[$key]);
        }
    } elseif ($method == 'post') { // pass values by POST
        foreach ($options as $key => $option) {
            $ch = curl_init($urls); // in POST mode $urls is a single URL string
            $option[CURLOPT_RETURNTRANSFER] = true;
            $option[CURLOPT_TIMEOUT] = 5;
            $option[CURLOPT_POST] = true;
            curl_setopt_array($ch, $option);
            $curls[$key] = $ch;
            curl_multi_add_handle($mh, $curls[$key]);
        }
    } else {
        exit("Parameter error!\n");
    }
    do {
        $mrc = curl_multi_exec($mh, $active);
        curl_multi_select($mh); // wait for activity to reduce CPU load; without this call the loop busy-waits
    } while ($active);
    foreach ($curls as $key => $ch) {
        $html = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        $htmls[$key] = $html;
    }
    curl_multi_close($mh);
    return $htmls;
}
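The function above returns only the page bodies, so a failed or timed-out transfer cannot be told apart from an empty page. One possible extension, sketched here for the GET case only, reads per-transfer result codes with curl_multi_info_read(). The name get_htmls_checked and the shape of its return value are assumptions for illustration, not part of the original function:

```php
<?php
// Hypothetical GET-only variant that reports an error message per URL.
function get_htmls_checked(array $urls, array $options = array()) {
    $mh = curl_multi_init();
    $curls = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        // As in the article's function, these two settings override the caller's.
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT => 5,
        ) + $options);
        $curls[$key] = $ch;
        curl_multi_add_handle($mh, $ch);
    }
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active && $status === CURLM_OK && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000); // select failed; back off briefly instead of spinning
        }
    } while ($active && $status === CURLM_OK);

    // Collect the CURLE_* result code of every finished transfer.
    $codes = array();
    while ($info = curl_multi_info_read($mh)) {
        $key = array_search($info['handle'], $curls, true);
        if ($key !== false) {
            $codes[$key] = $info['result'];
        }
    }

    $results = array();
    foreach ($curls as $key => $ch) {
        $code = isset($codes[$key]) ? $codes[$key] : CURLE_OK;
        $results[$key] = array(
            'html' => $code === CURLE_OK ? curl_multi_getcontent($ch) : null,
            'error' => $code === CURLE_OK ? null : curl_strerror($code),
        );
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

A caller can then skip entries whose 'error' value is not null instead of parsing an empty string as HTML.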

GET requests usually differ only in their URL parameters. Our function is meant for data collection, and collection typically works through one category of pages at a time, so the URLs look like this:

http://www.baidu.com/s?wd=shili&pn=0&ie=UTF-8

http://www.baidu.com/s?wd=shili&pn=10&ie=UTF-8

http://www.baidu.com/s?wd=shili&pn=20&ie=UTF-8

http://www.baidu.com/s?wd=shili&pn=30&ie=UTF-8

http://www.baidu.com/s?wd=shili&pn=40&ie=UTF-8

The five pages above follow a regular pattern: only the value of pn changes.
The code is as follows:
$urls = array();
for ($i = 1; $i <= 5; $i++) {
    $urls[] = 'http://www.baidu.com/s?wd=shili&pn=' . (($i - 1) * 10) . '&ie=UTF-8';
}
$option[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/123456';
$htmls = get_htmls($urls, $option);
foreach ($htmls as $html) {
    echo $html; // the HTML is available here for further processing
}
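As an alternative to concatenating the query string by hand, the paged URLs can be built with PHP's http_build_query(), which also URL-encodes the values. The helper name build_page_urls and its parameters are assumptions for illustration:

```php
<?php
// Hypothetical helper: builds one URL per result page, changing only pn.
function build_page_urls($base, $keyword, $pages, $pageSize = 10) {
    $urls = array();
    for ($i = 0; $i < $pages; $i++) {
        $query = http_build_query(array(
            'wd' => $keyword,
            'pn' => $i * $pageSize, // offset of the first result on this page
            'ie' => 'UTF-8',
        ), '', '&');
        $urls[] = $base . '?' . $query;
    }
    return $urls;
}

$urls = build_page_urls('http://www.baidu.com/s', 'shili', 5);
print_r($urls);
```

The resulting array can be passed to get_htmls() in place of the hand-built $urls. This matters mostly when the keyword contains spaces or non-ASCII characters, which must be encoded before they go into a URL.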

Simulating a typical POST request:

Write a post.php file.
The code is as follows:
<?php
if (isset($_POST['username']) && isset($_POST['password'])) {
    echo 'user name: ' . $_POST['username'] . ' password: ' . $_POST['password'];
} else {
    echo 'request error!';
}

Then call it as follows.
The code is as follows:
$url = 'http://localhost/yourpath/post.php'; // replace with your own path
$options = array();
for ($i = 1; $i <= 5; $i++) {
    $option[CURLOPT_POSTFIELDS] = 'username=user' . $i . '&password=pass' . $i;
    $options[] = $option;
}
$htmls = get_htmls($url, $options, 'post');
foreach ($htmls as $html) {
    echo $html; // the HTML is available here for further processing
}
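The POST body above is concatenated by hand, which works for simple values but breaks if a value contains characters such as & or =. A minimal sketch of building the same bodies with http_build_query() follows. The helper name build_login_bodies is an assumption for illustration:

```php
<?php
// Hypothetical helper: builds the URL-encoded POST bodies for user1..userN.
function build_login_bodies($count) {
    $bodies = array();
    for ($i = 1; $i <= $count; $i++) {
        $bodies[] = http_build_query(array(
            'username' => 'user' . $i,
            'password' => 'pass' . $i,
        ), '', '&');
    }
    return $bodies;
}

$bodies = build_login_bodies(5);
echo $bodies[0], "\n";

// Special characters inside a value are escaped instead of splitting the field.
$tricky = http_build_query(array('password' => 'a&b=c'), '', '&');
echo $tricky, "\n";
```

Each string in $bodies can be assigned to CURLOPT_POSTFIELDS in the same way as the hand-built string in the loop above.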

With this, the get_htmls function can handle basic data collection tasks.

My explanation here may not be entirely clear; corrections and suggestions are welcome.
