PHP curl simulates logging in and fetching data


Using the PHP cURL extension, you can simulate a login and crawl data that is only accessible to a logged-in user. The process is as follows (personal summary):

1. First, analyze the HTML source of the login page to obtain some necessary information (a minimal sketch of inspecting the login form follows this list):

(1) The address of the login page;

(2) The address of the verification code (CAPTCHA) image;

(3) The name of each field the login form submits, and the form's submission method;

(4) The address the login form is submitted to;

(5) The address of the page containing the data to be crawled.
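In practice you usually just view the page source in a browser, but as a minimal sketch, the same inspection can be done in code: fetch the login page with cURL and print each form's action, method, and input field names (the URL is a placeholder; DOMDocument is PHP's built-in DOM extension):

$login_url = 'http://www.xxxxx';                    // placeholder: the login page address

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $login_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);                             // suppress warnings from imperfect real-world HTML

foreach ($dom->getElementsByTagName('form') as $form) {
    echo 'action: ', $form->getAttribute('action'), '  method: ', $form->getAttribute('method'), "\n";
    foreach ($form->getElementsByTagName('input') as $input) {
        echo '  field: ', $input->getAttribute('name'), "\n";
    }
}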

2. Obtain the cookies and store them (for websites that use cookies):

$login_url = 'http://www.xxxxx';                    // login page address
$cookie_file = dirname(__FILE__) . "/pic.cookie";   // cookie file storage location (custom)

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $login_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);  // save the cookies this request sets
curl_exec($ch);
curl_close($ch);

3. Obtain the verification code and store it (for websites that use a CAPTCHA):

$verify_url = "http://www.xxxx";                    // verification code (CAPTCHA image) address

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $verify_url);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // send the cookies saved in step 2
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$verify_img = curl_exec($ch);
curl_close($ch);

$fp = fopen("./verify/verifycode.png", 'wb');       // write the fetched image to a local file
fwrite($fp, $verify_img);
fclose($fp);

Note: Since the CAPTCHA cannot be recognized automatically, what I do here is save the CAPTCHA image to a local file and then display it in an HTML page of my own project, so the user can read it and type it in. Once the user has filled in the account, password, and verification code and clicked the submit button, you proceed to the next step.
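For illustration, a minimal sketch of that page; the target script collect.php and the field names are hypothetical and only need to match whatever your own script reads from $_POST:

<?php
// Show the saved CAPTCHA image and collect the login details from the user.
// collect.php and the field names below are placeholders for your own project.
?>
<form action="collect.php" method="post">
    Account: <input type="text" name="account"> <br>
    Password: <input type="password" name="password"> <br>
    Verification code: <input type="text" name="verifyCode">
    <img src="./verify/verifycode.png" alt="verification code"> <br>
    <input type="submit" value="Log in">
</form>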

4. Simulate submitting the login form:

$post_url = 'http://www.xxxx';                      // login form submission address

// Data submitted by the form (the field names depend on the target form; the values come from user input)
$post = "username={$account}&password={$password}&seccodeverify={$verifyCode}";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $post_url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);        // setting CURLOPT_POSTFIELDS makes this a POST request
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // send the cookies saved in step 2
curl_exec($ch);
curl_close($ch);
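A side note on building $post: if the account or password can contain characters such as & or =, it is safer to let http_build_query() URL-encode each value. A minimal sketch with the same field names as above:

$post = http_build_query(array(
    'username'      => $account,
    'password'      => $password,
    'seccodeverify' => $verifyCode,
));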

5. Crawl data:

$data_url = "http://www.xxxx";                      // address of the page containing the data

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $data_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // must be 1 so curl_exec() returns the page instead of printing it
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // send the logged-in session's cookies
$data = curl_exec($ch);
curl_close($ch);

At this point the page at the data address has been fetched and stored in the string variable $data.

It is important to note that what you crawl down is the HTML source of a web page, meaning the string contains not only the data you want but also a lot of HTML tags and other things you don't. So, to extract the data you need, you have to analyze the HTML of the page that holds it and then use string functions, regular expression matching, and so on to pull out the parts you want.
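For example, a minimal sketch of the regular-expression approach; the <td class="price"> markup is purely hypothetical, and the pattern has to be written against the actual HTML of the target page:

// Collect every value wrapped in the (hypothetical) <td class="price"> cells of $data.
if (preg_match_all('/<td class="price">([^<]+)<\/td>/', $data, $matches)) {
    foreach ($matches[1] as $price) {               // $matches[1] holds the text captured by the parentheses
        echo trim($price), "\n";
    }
}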

The above approach works for ordinary websites that use plain HTTP. If you want to simulate a login to a website that uses HTTPS, you need to add the following handling:

1. Skip SSL certificate verification:

curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);

2. Set a user agent:

$UserAgent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506; .NET CLR 3.5.21022; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

curl_setopt($curl, CURLOPT_USERAGENT, $UserAgent);

Note: Without these settings, the simulated login will not succeed.

Simulating a login with the code above generally works, but in practice every site needs its own specific handling. For example, some sites use a different character encoding, so the page you crawl comes back garbled; in that case you need to convert the encoding, e.g. $data = iconv("gb2312", "utf-8", $data); to convert GB2312/GBK-encoded content to UTF-8 (a small sketch follows below).

Some sites with higher security requirements, such as online banking, put the verification code inside an inline frame; there you first have to crawl the iframe's page, extract the CAPTCHA address from it, and only then fetch the CAPTCHA itself.

Some sites (online banking again) submit the form from JavaScript and do some processing, such as encryption, before the submit, so a direct POST will not log you in; you have to perform the same processing before submitting. If you can work out exactly what the JS code does, for example which encryption algorithm it uses, you can reproduce that processing and then submit the data successfully. But if you cannot tell what it is doing, say you know the data is encrypted but not with which algorithm, then you cannot reproduce it and the simulated login will fail. Online banking is the classic case: a bank control processes the user's password and verification code in the JS code before the form is submitted, and since we have no idea what it does, we cannot simulate it. So if you think that after reading this article you can simulate a login to an online banking site, you are being naive; do you really think a bank's site can be impersonated that easily? Of course, if you can crack the bank's control, that is another story. Why do I feel so strongly about this? Because I ran into exactly this problem myself; the less said the better, it still brings tears to my eyes...
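As a minimal sketch of that encoding conversion, assuming the mbstring extension is available (the list of candidate encodings is just an example):

// Detect the page's encoding and convert to UTF-8 only when necessary.
$encoding = mb_detect_encoding($data, array('UTF-8', 'GBK', 'GB2312'), true);
if ($encoding !== false && $encoding !== 'UTF-8') {
    $data = iconv($encoding, 'UTF-8//IGNORE', $data);   // //IGNORE drops characters that cannot be converted
}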
