curl Data Collection Series: The Single-Page Collection Function get_html

Source: Internet
Author: User

When collecting data, you often need to combine curl with regular expressions to extract what you want. Drawing on my work experience, I am sharing here some of the custom functions I use most often. Please point out anything that could be improved.

This series will take more than a day or two to write.

Outline:

1. Single-page collection function get_html

2. Parallel multi-page collection function get_htmls

3. Regular expression processing function get_matches

4. Code separation

5. Parallel logic control function web_spider

...

Single-page collection is the most common operation in data collection. When a server restricts access, collecting one page at a time is slow but easy to control. A general-purpose curl wrapper function is therefore well worth writing.

Baidu and NetEase are familiar to most readers, so we will use the home pages of these two sites as examples.

 

The simplest method:

    $url = 'http://www.baidu.com';
    $ch = curl_init($url);
    // Return the page as a string instead of printing it directly.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Give up after 5 seconds.
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $html = curl_exec($ch);
    if ($html !== false) {
        echo $html;
    }
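When curl_exec returns false, the snippet above prints nothing, which makes failures hard to tell apart from empty pages. The following is a minimal sketch, assuming you want the reason for the failure as well; the helper name fetch_with_error and its $timeout parameter are hypothetical and not part of the original functions. It relies only on curl_errno and curl_error from PHP's curl extension.

```php
<?php
// Hypothetical helper (not from the original article): fetch a page and
// keep the error details instead of silently returning nothing.
function fetch_with_error($url, $timeout = 5) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $html = curl_exec($ch);
    // curl_errno() is 0 on success; curl_error() holds a readable message.
    $result = array(
        'html'  => $html,
        'errno' => curl_errno($ch),
        'error' => curl_error($ch),
    );
    // The handle is released when $ch goes out of scope.
    return $result;
}
```

A caller can check whether the 'html' entry is false and, if so, log the 'errno' and 'error' entries before deciding whether to retry.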

Because fetching a page is such a frequent task, it is worth wrapping in a function, using curl_setopt_array to apply the options:

    function get_html($url, $options = array()) {
        // Always return the page as a string and time out after 5 seconds.
        $options[CURLOPT_RETURNTRANSFER] = true;
        $options[CURLOPT_TIMEOUT] = 5;
        $ch = curl_init($url);
        curl_setopt_array($ch, $options);
        $html = curl_exec($ch);
        curl_close($ch);
        if ($html === false) {
            return false;
        }
        return $html;
    }

    $url = 'http://www.baidu.com';
    echo get_html($url);
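One detail of get_html is worth noting: it assigns CURLOPT_RETURNTRANSFER and CURLOPT_TIMEOUT after receiving $options, so a caller cannot change the timeout. The following is a minimal sketch of one alternative, assuming you would rather let caller-supplied values win. The names merge_curl_options and get_html_with_defaults are hypothetical and not part of the original series.

```php
<?php
// Hypothetical helper (not from the original article): merge caller options
// with defaults so that values supplied by the caller take precedence.
function merge_curl_options(array $options = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT        => 5,    // default timeout in seconds
    );
    // The + operator keeps the left-hand value when a key exists in both arrays.
    return $options + $defaults;
}

// Hypothetical variant of get_html that lets the caller override the defaults.
function get_html_with_defaults($url, array $options = array()) {
    $ch = curl_init($url);
    curl_setopt_array($ch, merge_curl_options($options));
    // Returns false on failure, the page body otherwise.
    // The handle is released when $ch goes out of scope.
    return curl_exec($ch);
}
```

With this variant, passing array(CURLOPT_TIMEOUT => 10) raises the timeout to 10 seconds, while calls without options behave like the original function.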

Sometimes specific options must be passed to get the correct page. Take the NetEase home page, for example:

    $url = 'http://www.163.com';
    echo get_html($url);

The output is blank, so let's write a function that uses curl_getinfo to see what happened:

    function get_info($url, $options = array()) {
        $options[CURLOPT_RETURNTRANSFER] = true;
        $options[CURLOPT_TIMEOUT] = 5;
        $ch = curl_init($url);
        curl_setopt_array($ch, $options);
        // The body is not needed here; we only want the transfer details.
        $html = curl_exec($ch);
        $info = curl_getinfo($ch);
        curl_close($ch);
        return $info;
    }

    $url = 'http://www.163.com';
    var_dump(get_info($url));
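The array that curl_getinfo returns has many entries, and the two most useful for this kind of diagnosis are http_code and redirect_url. The following is a minimal sketch, assuming you want a one-line summary rather than a full var_dump; the helper name summarize_info is hypothetical, and the sample array is made-up data rather than a real response from any site.

```php
<?php
// Hypothetical helper (not from the original article): condense the array
// returned by curl_getinfo() into a one-line summary.
function summarize_info(array $info) {
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    $summary = 'HTTP ' . $code;
    // 3xx status codes mean the server is pointing us at another URL.
    if ($code >= 300 && $code < 400) {
        $target = !empty($info['redirect_url']) ? $info['redirect_url'] : '(unknown)';
        $summary .= ', redirects to ' . $target;
    }
    return $summary;
}

// Sample data for illustration only; the redirect target is made up.
echo summarize_info(array(
    'http_code'    => 302,
    'redirect_url' => 'http://example.com/',
)), "\n";
```

In real use, the argument would be the return value of get_info($url).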

The http_code is 302, which is a redirect, so we need to pass an option that tells curl to follow it:

    $url = 'http://www.163.com';
    $options[CURLOPT_FOLLOWLOCATION] = true;
    echo get_html($url, $options);

Why does this page look different from what we see in a browser on our own computer?

It seems the options are still not enough. The server appears to check what kind of device the client is running on and, failing to identify ours, returns a generic version of the page.

So it seems we need to send a User-Agent through CURLOPT_USERAGENT:

 

    $url = 'http://www.163.com';
    $options[CURLOPT_FOLLOWLOCATION] = true;
    $options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
    echo get_html($url, $options);

 

 

Now the page comes out correctly. The get_html function can handle this kind of extension through its $options parameter.

Of course, there are other ways to achieve this. If you know the exact URL of the NetEase page, you can collect it directly:

    $url = 'http://www.163.com/index.html';
    echo get_html($url);

In this way, the data can be collected normally.

That's all for today!
