Using the single-page collection function get_html in curl-based data collection

Source: Internet
Author: User

This is a series that will take more than a day or two to write.

Outline:

1. curl data collection series: the single-page collection function get_html

2. curl data collection series: the parallel multi-page collection function get_htmls

3. curl data collection series: the regular expression processing function get_matches

4. curl data collection series: code separation

5. curl data collection series: the parallel logic control function web_spider

Single-page collection is the most common operation in data collection. When a server restricts access, collecting one page at a time is sometimes the only option: it is slow, but easy to control. A general-purpose curl wrapper function is therefore well worth writing.
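
As a rough illustration of "slow but easy to control", the sketch below fetches URLs one at a time with a pause between requests. The name collect_slowly, the $fetch callable and the delay value are assumptions made for this example and are not part of the series. A stand-in fetcher is used so the sketch runs without network access.

```php
<?php
// Hypothetical helper: fetch URLs one at a time, pausing between requests.
// $fetch is any callable that takes a URL and returns the HTML or false.
function collect_slowly(array $urls, callable $fetch, $delayMicroseconds = 1000000) {
    $pages = array();
    $remaining = count($urls);
    foreach ($urls as $url) {
        $pages[$url] = $fetch($url);
        $remaining--;
        // Pause between requests, but not after the last one.
        if ($remaining > 0 && $delayMicroseconds > 0) {
            usleep($delayMicroseconds);
        }
    }
    return $pages;
}

// Stand-in fetcher (an assumption for this sketch); a real one would use curl.
$fake_fetch = function ($url) {
    return '<html>' . $url . '</html>';
};
$pages = collect_slowly(array('http://www.baidu.com', 'http://www.163.com'), $fake_fetch, 1000);
echo count($pages), " pages collected\n";
```

Passing the fetcher as a callable keeps the pacing logic separate from the request logic, so the delay can be tuned without touching the curl code.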

Baidu and NetEase are familiar to most readers, so we will use the home pages of these two sites as examples.

The simplest method:

$url = 'http://www.baidu.com';
$ch = curl_init($url);
// Return the response as a string instead of printing it directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Give up if the whole request takes longer than 5 seconds.
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html !== false) {
    echo $html;
}
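
The simplest method prints nothing when the request fails, which makes failures hard to diagnose. The following is a minimal sketch of how the failure reason could be surfaced with curl_errno and curl_error. The function name fetch_with_error and the shape of its return array are assumptions for illustration.

```php
<?php
// Hypothetical helper: same request as the simplest method,
// but it also reports why a request failed.
function fetch_with_error($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $html = curl_exec($ch);
    // curl_errno is 0 on success; curl_error holds a readable message.
    return array(
        'html'  => $html,
        'errno' => curl_errno($ch),
        'error' => curl_error($ch),
    );
}

// A URL with a made-up scheme fails immediately, without touching the network.
$result = fetch_with_error('nosuchscheme://example');
if ($result['html'] === false) {
    echo 'Request failed (', $result['errno'], '): ', $result['error'], "\n";
}
```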

Because this pattern is used so often, it is worth wrapping in a function built on curl_setopt_array:

function get_html($url, $options = array()) {
    // These two options are always applied, overriding any caller-supplied values.
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    // curl_exec returns false on failure; compare strictly rather than assign.
    if ($html === false) {
        return false;
    }
    return $html;
}
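
Note that get_html always sets the timeout to 5 seconds, even if the caller passes a different value. If caller-supplied options should win instead, one possible approach is sketched below. The helper name merge_curl_options is hypothetical and this variation is not something the article itself does.

```php
<?php
// Hypothetical helper: merge caller options over a set of defaults.
function merge_curl_options(array $options, array $defaults) {
    // The + operator keeps the left-hand value when a key exists on both
    // sides and preserves integer keys. array_merge would renumber the
    // integer CURLOPT_* keys, so it is not suitable here.
    return $options + $defaults;
}

$defaults = array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 5,
);
$merged = merge_curl_options(array(CURLOPT_TIMEOUT => 20), $defaults);
var_dump($merged[CURLOPT_TIMEOUT]);        // int(20): the caller's value wins
var_dump($merged[CURLOPT_RETURNTRANSFER]); // bool(true): the default still applies
```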

get_html can then be called like this:

$url = 'http://www.baidu.com';
echo get_html($url);

Sometimes you need to pass specific options to get the correct page. For example, try fetching the NetEase home page:

$url = 'http://www.163.com';
echo get_html($url);

The output is blank, so we can write a function around curl_getinfo to see what happened:

function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    // The body is not needed here; the request only populates the transfer info.
    curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info;
}

$url = 'http://www.163.com';
var_dump(get_info($url));
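
The array returned by curl_getinfo is fairly long, so it can help to pick out the few fields that explain a blank page. The sketch below assumes the standard http_code, redirect_url and total_time keys. The helper name summarize_info is hypothetical, and the sample input is hand-written for illustration rather than captured from a real request.

```php
<?php
// Hypothetical helper: keep only the curl_getinfo fields useful for debugging.
function summarize_info(array $info) {
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    return array(
        'http_code'    => $code,
        // Any 3xx status means the server is sending us somewhere else.
        'is_redirect'  => $code >= 300 && $code < 400,
        'redirect_url' => isset($info['redirect_url']) ? $info['redirect_url'] : '',
        'total_time'   => isset($info['total_time']) ? $info['total_time'] : null,
    );
}

// Illustrative input only: a made-up array, not real output from any site.
$sample = array(
    'http_code'    => 302,
    'redirect_url' => 'http://example.com/target',
    'total_time'   => 0.1,
);
print_r(summarize_info($sample));
```

With the article's get_info defined, the same helper could be applied as summarize_info(get_info($url)).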

The output shows http_code 302: the request is being redirected, so we need to pass an option that tells curl to follow the redirect:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url, $options);
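
When following redirects, it can be useful to cap how many hops curl will take. The sketch below pairs CURLOPT_FOLLOWLOCATION with CURLOPT_MAXREDIRS. The helper name redirect_options and the limit values are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: options that follow redirects, up to an explicit limit.
function redirect_options($maxRedirects = 5) {
    return array(
        CURLOPT_FOLLOWLOCATION => true,
        // Set the redirect cap explicitly rather than relying on a default.
        CURLOPT_MAXREDIRS      => $maxRedirects,
    );
}

$options = redirect_options(3);
var_dump($options[CURLOPT_FOLLOWLOCATION]); // bool(true)
var_dump($options[CURLOPT_MAXREDIRS]);      // int(3)
```

With get_html from the article defined, the result could be passed straight through as get_html($url, redirect_options(3)).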

Why does this page look different from the one we see in a browser on our own computer?

It seems the options are still not enough. The server decides which version of the page to return based on the client's device, and here it has returned a generic version.

It looks like we need to send a User-Agent header:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/123456';
echo get_html($url, $options);

Now the expected page comes back. Extended behavior like this can be handled by passing options to get_html, without changing the function itself.

Of course, there are other ways to achieve this. If you already know the exact address of the NetEase home page, you can simply collect it directly:

$url = 'http://www.163.com/index.html';
echo get_html($url);

Collected this way, the page is returned normally.
