This is a series that cannot be written in a day or two.
Outline:
1. get_html, the single-page collection function of the curl data collection series
2. get_htmls, the multi-page parallel collection function of the curl data collection series
3. get_matches, the regular expression processing function of the curl data collection series
4. Code separation of the curl data collection series
5. web_spider, the parallel logic control function of the curl data collection series
Single-page collection is the most common task in data collection. Sometimes, when access to the server is restricted, this slow one-page-at-a-time approach is the only one available, but it is easy to control. It is therefore important to write a reusable wrapper around the common curl calls.
Everyone is familiar with Baidu and NetEase, so we will use the home pages of these two sites as examples.
The simplest method:
The code is as follows:

$url = 'http://www.baidu.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html !== false) {
    echo $html;
}
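If the result is false, the standard curl_error and curl_errno functions can tell you why the request failed. A quick sketch of that kind of check:

$ch = curl_init('http://www.baidu.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
if ($html === false) {
    // curl_errno / curl_error describe the failure (timeout, DNS error, etc.)
    echo 'curl error (' . curl_errno($ch) . '): ' . curl_error($ch);
}
curl_close($ch);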
Since this is used so frequently, we can wrap it in a function with curl_setopt_array:
The code is as follows:

function get_html($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return false;
    }
    return $html;
}
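One design note: because get_html assigns CURLOPT_RETURNTRANSFER and CURLOPT_TIMEOUT after receiving the caller's array, the caller cannot override those two options. If you want them to act as overridable defaults instead, a minimal variant (hypothetical, not part of this series' final code) could merge arrays with PHP's + operator, which keeps the left-hand side's keys on conflict:

function get_html_v2($url, $options = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    );
    // caller-supplied keys win; missing keys fall back to the defaults
    $options = $options + $defaults;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}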
The code is as follows:

$url = 'http://www.baidu.com';
echo get_html($url);
Sometimes you need to pass specific options to get the correct page. For example, to fetch the NetEase (163.com) home page:
The code is as follows:

$url = 'http://www.163.com';
echo get_html($url);
The output is blank, nothing at all. Let's write a function around curl_getinfo to see what happened:
The code is as follows:

function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info;
}

$url = 'http://www.163.com';
var_dump(get_info($url));
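Rather than scanning the whole var_dump by eye, you can also pick out the interesting fields directly; http_code and redirect_count are standard keys in the array curl_getinfo returns:

$info = get_info('http://www.163.com');
echo 'http_code: ' . $info['http_code'] . "\n";
echo 'redirects: ' . $info['redirect_count'] . "\n";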
We can see that http_code is 302, a redirect, so an extra option needs to be passed to follow it:
The code is as follows:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url, $options);
But why is this page different from the one we see in a browser on our own computer? The options are still not enough: the server checks which kind of client is visiting and returns a generic version when it cannot tell. It seems we need to send a User-Agent:
The code is as follows:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/123456';
echo get_html($url, $options);
OK, now the page comes out. Through its $options parameter, get_html can basically cover these extended requirements.
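Since CURLOPT_FOLLOWLOCATION and CURLOPT_USERAGENT will be needed for many sites, it can be convenient to bundle them once. A small helper (hypothetical, just to illustrate reusing the $options array):

function browser_options() {
    return array(
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/123456',
    );
}

echo get_html('http://www.163.com', browser_options());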
Of course, there are other ways to achieve this. If you already know the final URL of the NetEase page, you can collect it directly:
The code is as follows:

$url = 'http://www.163.com/index.html';
echo get_html($url);
In this way, the data can be collected normally.
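As a final check, get_info can confirm that the direct URL answers without a redirect (assuming the URL above is still reachable):

$info = get_info('http://www.163.com/index.html');
echo $info['http_code']; // expect 200 when the direct page is served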