This is a series. It cannot be written in a day or two, so it is being published one article at a time.
General outline:
1. curl data collection series: the single-page fetch function get_html
2. curl data collection series: the multi-page parallel fetch function get_htmls
3. curl data collection series: the regular-expression processing function get_matches
4. curl data collection series: code separation
5. curl data collection series: the parallel logic control function web_spider
Single-page fetching is the most commonly used function in data collection. When a server restricts access, it is sometimes the only method that works: slow, but simple to control. So it is important to write a reusable curl function for it.
Baidu and NetEase are familiar to most readers, so the home pages of these two sites serve as the examples here.
The simplest form:
$url = 'http://www.baidu.com';
$ch = curl_init($url);
// Return the response body as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Give up after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html !== false) {
    echo $html;
}
Because this is used so often, it can be wrapped in a function that sets the options with curl_setopt_array:
function get_html($url, $options = array()) {
    // These two options are always set, whatever the caller passes
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return false;
    }
    return $html;
}
Usage:
$url = 'http://www.baidu.com';
echo get_html($url);
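As written, get_html always sets the timeout to 5 seconds, even if the caller passes a different value. A minimal sketch of one way to let caller options win over the defaults is shown here. The helper name build_curl_options is hypothetical and not part of the original article, and the sketch assumes the PHP curl extension is loaded so that the CURLOPT_* constants exist.

```php
<?php
// Hypothetical helper (not from the original article): merge the caller's
// options over a set of defaults.
function build_curl_options(array $options = array()) {
    $defaults = array(
        CURLOPT_TIMEOUT => 5, // seconds, the same default get_html uses
    );
    // Array union keeps the left-hand value when a key exists on both sides,
    // so anything the caller passed takes priority over the defaults.
    $merged = $options + $defaults;
    // The body must come back as a string, so this option is always forced.
    $merged[CURLOPT_RETURNTRANSFER] = true;
    return $merged;
}

$merged = build_curl_options(array(CURLOPT_TIMEOUT => 10));
var_dump($merged[CURLOPT_TIMEOUT]);        // the caller's value
var_dump($merged[CURLOPT_RETURNTRANSFER]); // the forced value
```

The result could be passed straight to curl_setopt_array. Whether the timeout should be overridable is a design choice; the article's version keeps it fixed.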
Sometimes specific options have to be passed to get the right page. For example, try fetching the NetEase home page:
$url = 'http://www.163.com';
echo get_html($url);
The output is blank. Write a function around curl_getinfo to see what happened:
function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    // Run the request; only the transfer information is needed here
    curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info;
}
$url = 'http://www.163.com';
var_dump(get_info($url));
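The array returned by curl_getinfo has many keys, and only a few matter when diagnosing a blank page. The following is a small sketch of a helper that reduces it to a one-line summary. The name summarize_info and the sample array are made up for illustration; http_code and redirect_url are keys that curl_getinfo returns.

```php
<?php
// Hypothetical helper (not from the original article): summarize the array
// returned by curl_getinfo().
function summarize_info(array $info) {
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    if ($code >= 300 && $code < 400) {
        // redirect_url holds the target of the redirect when curl knows it
        $target = empty($info['redirect_url']) ? 'unknown location' : $info['redirect_url'];
        return "redirect ($code) to $target";
    }
    if ($code >= 200 && $code < 300) {
        return "ok ($code)";
    }
    if ($code === 0) {
        // curl reports 0 when no HTTP response was received at all
        return 'no HTTP response';
    }
    return "error ($code)";
}

// Made-up sample data shaped like curl_getinfo() output, for illustration only.
$sample = array('http_code' => 302, 'redirect_url' => 'http://example.com/');
echo summarize_info($sample), "\n"; // prints "redirect (302) to http://example.com/"
```

With the get_info function from the article, the call would be summarize_info(get_info($url)).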
The output of get_info shows an http_code of 302, a redirect, so an extra option has to be passed to follow it:
$url = 'http://www.163.com';
$options = array();
// Follow the Location header sent with the 302 response
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url, $options);
Now a page comes back, but it is not the same page a desktop browser gets.
It seems the options are still not enough. The server checks what kind of device the client is and decides which version of the page to return.
So it looks like we need to send a User-Agent.
$url = 'http://www.163.com';
$options = array();
$options[CURLOPT_FOLLOWLOCATION] = true;
// Identify as a desktop Firefox browser
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
echo get_html($url, $options);
OK, now the page comes out. So this basic get_html function can be extended in this way, simply by passing more options.
Of course, there are other ways to do it. When you know the exact address of the NetEase page, it can be fetched directly:
$url = 'http://www.163.com/index.html';
echo get_html($url);
This also fetches the page normally.
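When get_html returns false, nothing says why. As a rough sketch, a variant can return curl's own error number and message alongside the body, using curl_errno and curl_error. The function name get_html_verbose and the shape of the returned array are assumptions made for this example, not part of the original article.

```php
<?php
// Hypothetical variant of get_html (the name get_html_verbose is made up):
// returns the body together with curl's error details.
function get_html_verbose($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    // Read the error details before the handle goes away; the handle is
    // released when $ch goes out of scope at the end of the function.
    return array(
        'html'  => $html,           // string on success, false on failure
        'errno' => curl_errno($ch), // 0 when no error occurred
        'error' => curl_error($ch), // empty string when no error occurred
    );
}

$result = get_html_verbose('http://www.baidu.com');
if ($result['html'] === false) {
    echo 'curl error ', $result['errno'], ': ', $result['error'], "\n";
} else {
    echo strlen($result['html']), " bytes received\n";
}
```

This reports transport-level failures such as timeouts or DNS errors. An HTTP status like the 302 seen earlier is not a curl error, so curl_getinfo is still needed for that case.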