Several Methods for Crawling Pages with PHP cURL

Source: Internet
Author: User

cURL is mainly used to fetch data. Other methods, such as fsockopen and file_get_contents, can fetch data as well, but they are convenient only for pages that can be accessed directly. Pages protected by access control, or pages that are visible only after logging in, are harder to fetch with them.

1. Use the PHP curl module to fetch a PHP page

The first example fetches a page and keeps the result instead of printing it.

The code is as follows:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://localhost/mytest/phpinfo.php");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // if this line is commented out, the response is output directly
$result = curl_exec($ch);
curl_close($ch);
?>
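The example above does not check whether the request succeeded. A minimal sketch of the same request with error reporting might look like the following. The function name fetch_page, the timeout values, and the shape of the returned array are assumptions made for illustration and are not part of the original article.

```php
<?php
// Hypothetical helper (not from the original article): fetch a URL and
// report failures instead of silently returning false.
function fetch_page(string $url, int $timeoutSeconds = 10): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch);                              // false on failure
    $result = array(
        'ok'     => $body !== false,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no HTTP response was received
        'body'   => $body === false ? '' : $body,
        'error'  => $body === false ? curl_error($ch) : '',
    );
    curl_close($ch);
    return $result;
}
```

A caller could then write $r = fetch_page("http://localhost/mytest/phpinfo.php"); and inspect $r['ok'], $r['status'], and $r['error'] before using $r['body'].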


2. Fetch through a proxy

Why use a proxy for crawling? Take Google as an example. If you fetch Google's data too frequently within a short period, Google restricts your IP address and further requests fail. When that happens, you can switch to another proxy and fetch again.

 

The code is as follows:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.hzhuti.com");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
curl_setopt($ch, CURLOPT_PROXY, "<proxy-host>:8080"); // proxy address as host:port; substitute your own proxy host
// curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:password'); // add this line if the proxy requires a username and password
$result = curl_exec($ch);
curl_close($ch);
?>
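Switching to another proxy after a block can be sketched as a loop over a list of proxies. The function names proxy_options and fetch_via_proxies, the timeout values, and the rule that only a 2xx status counts as success are assumptions made for illustration.

```php
<?php
// Build the curl options for one proxy. $proxy is "host:port";
// $userpwd is "user:password", or null when the proxy needs no login.
function proxy_options(string $url, string $proxy, ?string $userpwd = null): array
{
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => false,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxy,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 15,
    );
    if ($userpwd !== null) {
        $options[CURLOPT_PROXYUSERPWD] = $userpwd;
    }
    return $options;
}

// Try each proxy in turn and return the first successful body, or null
// when every proxy failed (or the list is empty).
function fetch_via_proxies(string $url, array $proxies): ?string
{
    foreach ($proxies as $proxy) {
        $ch = curl_init();
        curl_setopt_array($ch, proxy_options($url, $proxy));
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        if ($body !== false && $status >= 200 && $status < 300) {
            return $body;
        }
        // This proxy failed or was blocked; move on to the next one.
    }
    return null;
}
```

Treating any non-2xx status as a failure is a simplification; a real crawler would probably distinguish a block from an ordinary error page.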

3. Fetch data after a POST

Submitting data deserves its own section: when you use curl you often need to send data to the server as well as read from it, so it is important to get right.

The code is as follows:

<?php
$ch = curl_init();
/* Note that the data to be submitted cannot be a two-dimensional (or deeper) array.
 * This works: array('name' => serialize(array('tank', 'zhang')), 'sex' => 1, 'birth' => '20101010')
 * This results in an error: array('name' => array('tank', 'zhang'), 'sex' => 1, 'birth' => '20101010') */
$data = array('name' => 'test', 'sex' => 1, 'birth' => '20101010');
curl_setopt($ch, CURLOPT_URL, 'http://localhost/mytest/curl/upload.php');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_exec($ch);
?>

In upload.php, call print_r($_POST);. Fetching upload.php with curl then outputs:

Array ( [name] => test [sex] => 1 [birth] => 20101010 )
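Besides serialize(), one way around the nested-array limitation is to encode the data yourself with http_build_query() and pass the resulting string to CURLOPT_POSTFIELDS. The following is a minimal sketch; the helper name encode_post_fields is an assumption made for illustration. Note that a string body is sent as application/x-www-form-urlencoded, whereas an array is sent as multipart/form-data.

```php
<?php
// Hypothetical helper: encode possibly nested POST data as an
// application/x-www-form-urlencoded string.
function encode_post_fields(array $data): string
{
    return http_build_query($data);
}

$data = array('name' => array('tank', 'zhang'), 'sex' => 1, 'birth' => '20101010');
$body = encode_post_fields($data);

// $body can now be passed to curl as a string:
// curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
```

On the receiving side, PHP decodes the bracketed keys, so $_POST['name'] in upload.php arrives as an array again.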

4. Fetch pages with page access control

Three methods of page access control

We often see pages that ask for a username and password before showing any content.


Apache page access control
Why do we need such control? We want to show different content to different people and to protect information. Although this kind of protection is relatively weak, it is still useful.

I. Use the htpasswd command to generate a password file

The code is as follows:

[zhangy@BlackGhost test]$ htpasswd -c ./access tank    // generate a password file; -c creates a new file (run htpasswd -h for help)
New password:                                           // enter the password
Re-type new password:                                   // repeat the password
Adding password for user tank
[zhangy@BlackGhost test]$ cat access                    // view the password file
tank:Uj5B3qIF/BNdI                                      // the username is in plaintext and the password is hashed

The password file has now been generated.

II. Page access control methods

1. Configure access control in httpd.conf or httpd-vhosts.conf

 

The code is as follows:

Listen 10004
NameVirtualHost *:10004
<VirtualHost *:10004>
DocumentRoot "/home/zhangy/www/test"
ServerName *:10004
BandwidthModule On
ForceBandWidthModule On
Bandwidth all 1024000
MinBandwidth all 50000
LargeFileLimit * 500 50000
MaxConnection all 2

ErrorLog "/home/zhangy/apache/blog.51yip.com.com-error.log"
CustomLog "/home/zhangy/apache/blog.51yip.com-access.log" common
# Take a look at the following configuration
<Directory /home/zhangy/www/test>
AuthType Basic
AuthName "access test"
AuthUserFile /home/zhangy/www/test/access
Require valid-user
</Directory>

</VirtualHost>

2. Use a .htaccess file for control

Create a .htaccess file in the root directory of test.

The code is as follows:

[zhangy@BlackGhost test]$ vi .htaccess     // open the file and add the access control content
[zhangy@BlackGhost test]$ cat .htaccess    // the content of .htaccess is as follows
AuthType Basic
AuthName "access test"
AuthUserFile /home/zhangy/www/test/access
Require valid-user

3. You can also perform access control without using password files.

The code is as follows:

<?php
define('ADMIN_USERNAME', 'tank');   // admin username
define('ADMIN_PASSWORD', 'tank');   // admin password

// Login check
if (!isset($_SERVER['PHP_AUTH_USER']) || !isset($_SERVER['PHP_AUTH_PW']) ||
    $_SERVER['PHP_AUTH_USER'] != ADMIN_USERNAME || $_SERVER['PHP_AUTH_PW'] != ADMIN_PASSWORD) {
    header('WWW-Authenticate: Basic realm="access test"');
    header("HTTP/1.0 401 Unauthorized");

    echo <<<EOB
<html><body>
<h1>Rejected!</h1>
<big>Wrong Username or Password!</big>
</body></html>
EOB;
    exit;
}
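All three methods above rely on HTTP Basic authentication, so curl can fetch such a page by sending the username and password with the request. The following is a minimal sketch; the function names basic_auth_options and basic_auth_header, and the URL and password in the usage comment, are assumptions made for illustration.

```php
<?php
// Hypothetical helper: options for fetching a page protected by
// HTTP Basic authentication.
function basic_auth_options(string $url, string $user, string $password): array
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPAUTH       => CURLAUTH_BASIC,
        CURLOPT_USERPWD        => $user . ':' . $password,
    );
}

// The request header that Basic authentication sends, shown only to
// illustrate what curl does with CURLOPT_USERPWD.
function basic_auth_header(string $user, string $password): string
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $password);
}

// Usage (assumed local address, based on the "Listen 10004" configuration):
// $ch = curl_init();
// curl_setopt_array($ch, basic_auth_options('http://localhost:10004/', 'tank', 'your-password'));
// $page = curl_exec($ch);
// curl_close($ch);
```

Because Basic authentication only base64-encodes the credentials, it is best used over HTTPS.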

List of curl-related functions:

curl_init - initialize a cURL session
curl_setopt - set an option for a cURL transfer
curl_exec - execute a cURL session
curl_close - close a cURL session
curl_version - return the current cURL version

curl_init - initialize a cURL session
Description
resource curl_init([string url])
The curl_init() function initializes a new session and returns a cURL handle for use with the curl_setopt(), curl_exec(), and curl_close() functions. (Since PHP 8.0 the handle is a CurlHandle object rather than a resource.) If the optional url parameter is provided, the CURLOPT_URL option is set to its value. Otherwise, set the URL manually with curl_setopt().

Example 1. Initialize a new cURL session and fetch a web page

The code is as follows:

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.zend.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>
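The optional url parameter of curl_init() and the curl_version() function from the list can be sketched as follows. The request itself is left commented out so that the snippet runs without network access, which is a choice made for this illustration.

```php
<?php
// Passing the URL to curl_init() replaces the separate CURLOPT_URL call.
$ch = curl_init("http://www.zend.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $page = curl_exec($ch);   // uncomment to perform the request
curl_close($ch);

// curl_version() returns an array describing the installed cURL library.
$info = curl_version();
echo "cURL version: " . $info['version'] . "\n";
```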

