curl data collection series: the single-page collection function get_html
When collecting data, you often need curl plus regular expressions to extract what you want. Drawing on my work experience, I am sharing some commonly used custom functions I have written on this blog. If you spot anything that could be done better, please let me know.
This series will take more than a day or two to write.
Outline:
1. curl data collection series: the single-page collection function get_html
2. curl data collection series: the multi-page parallel collection function get_htmls
3. curl data collection series: the regular expression processing function get_matches
4. curl data collection series: code separation
5. curl data collection series: the parallel logic control function web_spider
...
Single-page collection is the most common task in data collection. When a server restricts access, fetching one page at a time is often the only option: it is slow, but easy to control. That is why a reusable curl wrapper function is worth writing.
Baidu and NetEase are familiar to everyone, so we will use the home pages of these two sites as examples.
The simplest method:
$url = 'http://www.baidu.com';
$ch = curl_init($url);
// Return the response as a string instead of printing it directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Give up after 5 seconds.
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html !== false) {
    echo $html;
}
Because this pattern is used so often, we can wrap it in a function built on curl_setopt_array:
function get_html($url, $options = array()) {
    // Always return the body as a string and time out after 5 seconds.
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    // curl_exec returns false on failure.
    if ($html === false) {
        return false;
    }
    return $html;
}
$url = 'http://www.baidu.com';
echo get_html($url);
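One limitation of get_html is that it returns false on failure without saying why. The following is a minimal sketch of a variant that also reports curl's error code and message. The name get_html_result and the shape of the returned array are assumptions made for illustration, not part of the original function.

```php
<?php
// Hypothetical variant of get_html that also reports why a request failed.
function get_html_result($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    // Read the error details while the handle is still available.
    // The handle itself is released when $ch goes out of scope.
    return array(
        'html'  => $html,            // page body, or false on failure
        'errno' => curl_errno($ch),  // 0 means no error
        'error' => curl_error($ch),  // human-readable message, '' if none
    );
}
```

A caller can check whether $result['html'] is false and, if so, log $result['errno'] and $result['error'] to tell a timeout apart from, say, a DNS failure.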
Sometimes you need to pass specific options to get the correct page. For example, try fetching the NetEase home page:
$url = 'http://www.163.com';
echo get_html($url);
The output is blank, so let's write a function that uses curl_getinfo to see what happened:
function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    // The body is not needed here; the request is made only to collect transfer details.
    $html = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info;
}

$url = 'http://www.163.com';
var_dump(get_info($url));
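The array returned by curl_getinfo has many fields. When diagnosing a fetch, http_code and redirect_url are usually the most telling. Below is a rough sketch of a helper that turns such an array into a one-line diagnosis. The name describe_info and its messages are hypothetical, and the sample values are made up.

```php
<?php
// Hypothetical helper: summarize an array shaped like curl_getinfo() output.
function describe_info(array $info) {
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    if ($code === 0) {
        // curl reports http_code 0 when no HTTP response was received.
        return 'no HTTP response (connection failed or timed out)';
    }
    if ($code >= 300 && $code < 400) {
        $target = !empty($info['redirect_url']) ? $info['redirect_url'] : 'unknown';
        return "redirect ($code) to $target";
    }
    if ($code >= 200 && $code < 300) {
        return "ok ($code)";
    }
    return "error ($code)";
}

// Sample input with made-up values.
echo describe_info(array('http_code' => 302, 'redirect_url' => 'http://example.com/new')), "\n";
// prints "redirect (302) to http://example.com/new"
```

In practice this would be called as describe_info(get_info($url)).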
The output shows http_code 302, which means the request was redirected, so we need to pass an option that tells curl to follow redirects:
$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url, $options);
Why does this page look different from what we see in a browser on our computer?
It seems the options are still not enough. The server decides which version of the page to return based on the client's device, and since it cannot identify ours, it returns a generic version.
So it looks like we need to send a User-Agent header:
$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
echo get_html($url, $options);
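Since following redirects and sending a User-Agent tend to be needed together, the two options can be bundled into a preset. This is only a sketch: browser_options is a hypothetical helper, and the redirect cap of 5 set through CURLOPT_MAXREDIRS is an assumed value chosen to stop redirect loops.

```php
<?php
// Hypothetical helper: browser-like defaults that the caller can still override.
function browser_options(array $extra = array()) {
    $defaults = array(
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5, // assumed cap, stops redirect loops
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0',
    );
    // Array union keeps the integer option keys; entries in $extra win over $defaults.
    // array_merge would renumber the integer keys and break the option constants.
    return $extra + $defaults;
}
```

With this in place the call becomes echo get_html($url, browser_options()); and a different User-Agent can be supplied by passing array(CURLOPT_USERAGENT => '...') as the argument.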
Okay, now the page comes out as expected. Through its $options parameter, get_html can handle extended requirements like these.
Of course, there are other ways to achieve this. If you already know the exact URL of the NetEase page, you can simply collect it directly:
$url = 'http://www.163.com/index.html';
echo get_html($url);
In this way, the data can be collected normally.
That's all for today!