This is a series that cannot be written in a day or two.
Outline:
1. get_html, the single-page collection function of the curl data collection series
2. get_htmls, the multi-page parallel collection function of the curl data collection series
3. get_matches, the regular expression processing function of the curl data collection series
4. Code separation of the curl data collection series
5. web_spider, the parallel logic control function of the curl data collection series
Single-page collection is the most common task in data collection. Sometimes, when access to the server is restricted, this slow one-page-at-a-time approach is the only one available, but it is easy to control. It is therefore important to write a reusable wrapper around the common curl calls.
Everyone is familiar with Baidu and NetEase, so we will use the home pages of these two sites as examples.
The simplest method:
The code is as follows:

$url = 'http://www.baidu.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html !== false) {
    echo $html;
}
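If the result is false, the standard curl_error and curl_errno functions can tell you why the request failed. A quick sketch of that kind of check:

$ch = curl_init('http://www.baidu.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html === false) {
    // curl_errno / curl_error describe the failure (timeout, DNS error, etc.)
    echo 'curl error (' . curl_errno($ch) . '): ' . curl_error($ch);
}
curl_close($ch);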
Since this is used so frequently, we can wrap it in a function with curl_setopt_array:
The code is as follows:

function get_html($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return false;
    }
    return $html;
}
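One design note: because get_html assigns CURLOPT_RETURNTRANSFER and CURLOPT_TIMEOUT after receiving the caller's array, the caller cannot override those two options. If you want them to act as overridable defaults instead, a minimal variant (hypothetical, not part of this series' final code) could merge arrays with PHP's + operator, which keeps the left-hand side's keys on conflict:

function get_html_v2($url, $options = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    );
    // caller-supplied keys win; missing keys fall back to the defaults
    $options = $options + $defaults;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}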
The code is as follows:

$url = 'http://www.baidu.com';
echo get_html($url);
Sometimes you need to pass specific options to get the correct page. For example, to fetch the NetEase (163.com) home page:
The code is as follows:

$url = 'http://www.163.com';
echo get_html($url);
The output is blank, nothing at all. Let's write a function around curl_getinfo to see what happened:
The code is as follows:

function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info;
}

$url = 'http://www.163.com';
var_dump(get_info($url));
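Rather than scanning the whole var_dump by eye, you can also pick out the interesting fields directly; http_code and redirect_count are standard keys in the array curl_getinfo returns:

$info = get_info('http://www.163.com');
echo 'http_code: ' . $info['http_code'] . "\n";
echo 'redirects: ' . $info['redirect_count'] . "\n";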
We can see that http_code is 302, a redirect, so an extra option needs to be passed to follow it:
The code is as follows:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url, $options);
But why is this page different from the one we see in a browser on our own computer? The options are still not enough: the server checks which kind of client is visiting and returns a generic version when it cannot tell. It seems we need to send a User-Agent:
The code is as follows:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/123456';
echo get_html($url, $options);
OK, now the page comes out. Through its $options parameter, get_html can basically cover these extended requirements.
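Since CURLOPT_FOLLOWLOCATION and CURLOPT_USERAGENT will be needed for many sites, it can be convenient to bundle them once. A small helper (hypothetical, just to illustrate reusing the $options array):

function browser_options() {
    return array(
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/123456',
    );
}

echo get_html('http://www.163.com', browser_options());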
Of course, there are other ways to achieve this. If you already know the final URL of the NetEase page, you can collect it directly:
The code is as follows:

$url = 'http://www.163.com/index.html';
echo get_html($url);
In this way, the data can be collected normally.
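As a final check, get_info can confirm that the direct URL answers without a redirect (assuming the URL above is still reachable):

$info = get_info('http://www.163.com/index.html');
echo $info['http_code']; // expect 200 when the direct page is served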