A recent project required me to do some data scraping, which meant using PHP's curl function library. I had never touched anything this fancy before, and today is my fourth day of studying curl, so I am writing this post to record what I have learned over the past few days. It covers a few things to watch out for when using curl to simulate a login and fetch data, and it also introduces Charles, a cross-platform (Windows, Linux, Mac) packet capture tool (the software is paid, but it can be used without paying). If you want to get started as quickly as possible, you must be familiar with two things: the role of the HTTP protocol, and curl's option parameters.
I. Introduction to curl
This is curl's Wikipedia page: https://zh.wikipedia.org/wiki/CURL. Interested readers can take a look.
Communication protocols supported by curl: FTP, FTPS, HTTP, HTTPS, TFTP, SFTP, Gopher, SCP, Telnet, DICT, FILE, LDAP, LDAPS, IMAP, POP3, SMTP, and RTSP.
curl can do a lot of things. The ones we use most often are simulating a login, crawling data, and uploading and downloading files; the more advanced features are probably rarely needed. What I have been working with recently is simulated login.
II. Using curl
curl is simple to use. There are four main steps:
1>. curl_init(); // initialize a curl session
2>. curl_setopt(); // set the curl transfer options (this is the most important step and also the most complex, because it has a great many parameters; see the official PHP documentation: http://php.net/manual/zh/function.curl-setopt.php)
3>. curl_exec(); // execute the curl session
4>. curl_close(); // close the curl session
To use curl well, you have to read through its transfer options (http://php.net/manual/zh/function.curl-setopt.php). You do not need to memorize them all; a general impression is enough, as long as you remember the handful that are used every day.
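The four steps above can be sketched as one small helper. This is only a minimal sketch under assumptions: the function name fetch_url and the 5-second default timeout are made up for illustration and are not part of the curl API.

```php
<?php
// Hypothetical helper (not from the original post): runs the four curl steps in order.
function fetch_url(string $url, int $timeout = 5): string
{
    $ch = curl_init();                                  // 1. initialize the session
    curl_setopt($ch, CURLOPT_URL, $url);                // 2. set the transfer options
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     //    return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); //    seconds to wait while connecting
    $body = curl_exec($ch);                             // 3. execute the session
    $error = curl_error($ch);                           //    keep the error text before closing
    curl_close($ch);                                    // 4. close the session (has no effect since PHP 8.0)
    if ($body === false) {
        throw new RuntimeException("curl request failed: $error");
    }
    return $body;
}
```

A call would look like $html = fetch_url('http://xxxxxxxxxxxxxx'); with the placeholder replaced by a real address. Throwing on failure is a design choice of this sketch, so that a failed request is not mistaken for an empty page.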
III. Simulating a login with curl
1. Preparation: A simulated login means imitating the login request that a browser sends. So before simulating a login to a site, we need to capture its traffic and analyze what happens when logging in: which parameters are passed, whether they are encrypted and with what method, what the request header contains, which request method is used, which protocol is used (HTTP or HTTPS), whether there is an image verification code (captcha), whether the login address answers with a 302 redirect, and so on. Only when you have looked at all of these and figured them out can you simulate the login. (Of course, you also need an account and password for the site first; otherwise what would you log in with?)
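While analyzing a site it helps to see the status code, the redirect target, and the cookies the server sets at a glance. The sketch below summarizes the raw header block of one response, such as the text curl returns when CURLOPT_HEADER is enabled. The function name summarize_response_head and the sample response are assumptions made up for illustration.

```php
<?php
// Hypothetical helper: summarize the raw header block of a single HTTP response.
function summarize_response_head(string $rawHead): array
{
    $lines = preg_split('/\r\n|\n/', trim($rawHead));
    $summary = ['status' => null, 'location' => null, 'cookies' => []];

    // Status line, for example "HTTP/1.1 302 Found"
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $lines[0], $m)) {
        $summary['status'] = (int) $m[1];
    }
    foreach (array_slice($lines, 1) as $line) {
        if (strpos($line, ':') === false) {
            continue; // not a "Name: value" header line
        }
        [$name, $value] = explode(':', $line, 2);
        $name = strtolower(trim($name)); // header names are case-insensitive
        $value = trim($value);
        if ($name === 'location') {
            $summary['location'] = $value; // where a 3xx response redirects to
        } elseif ($name === 'set-cookie') {
            // keep only "name=value" and drop attributes such as Path or HttpOnly
            $summary['cookies'][] = trim(explode(';', $value, 2)[0]);
        }
    }
    return $summary;
}

// Made-up sample response, for illustration only.
$sample = "HTTP/1.1 302 Found\r\nLocation: /login\r\nSet-Cookie: PHPSESSID=abc123; path=/; HttpOnly\r\n\r\n";
print_r(summarize_response_head($sample));
```

If the status is 302, the Location value is the address the browser would request next, and the simulated login has to follow the same path.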
2. Getting started:
A. Obtain the cookie. This step is crucial: to simulate a login you must first obtain the site's cookie. In the past, forging the User-Agent was enough to capture data, but that no longer works. Without a cookie you have no identity, and without an identity the site will reject everything you do. Put simply, the cookie is your identifier. Here is the code that obtains the cookie:
#1. Obtain the cookie
$cookie_file = dirname(__FILE__) . '/cookie.txt'; // file in which the cookies are saved
$login_url = "http://xxxxxxxxxxxxxx"; // login page URL
$cookie_curl = curl_init();
$timeout = 5;
curl_setopt($cookie_curl, CURLOPT_URL, $login_url);
curl_setopt($cookie_curl, CURLOPT_RETURNTRANSFER, 1); // return the result of curl_exec() as a string instead of printing it
curl_setopt($cookie_curl, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds to wait while trying to connect; 0 waits indefinitely
curl_setopt($cookie_curl, CURLOPT_COOKIEJAR, $cookie_file); // received cookies are written to this file when the handle is closed
$contents = curl_exec($cookie_curl);
curl_close($cookie_curl);
unset($cookie_curl); // since PHP 8.0 curl_close() has no effect; the cookie file is written once the handle is destroyed
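To confirm that the cookie was really obtained, the saved file can be read back. curl stores cookies in the Netscape cookie file format: one cookie per line with seven tab-separated fields. The sketch below parses that format; the function name read_cookie_jar is an assumption made up for illustration.

```php
<?php
// Hypothetical helper: read a Netscape-format cookie file written through CURLOPT_COOKIEJAR.
function read_cookie_jar(string $path): array
{
    $cookies = [];
    $lines = is_readable($path) ? file($path, FILE_IGNORE_NEW_LINES) : [];
    foreach ($lines as $line) {
        // curl marks HttpOnly cookies with a "#HttpOnly_" prefix; other "#" lines are comments
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif (trim($line) === '' || $line[0] === '#') {
            continue;
        }
        // Fields: domain, subdomain flag, path, secure flag, expiry, name, value
        $fields = explode("\t", $line);
        if (count($fields) === 7) {
            $cookies[$fields[5]] = $fields[6];
        }
    }
    return $cookies;
}
```

Calling read_cookie_jar(dirname(__FILE__) . '/cookie.txt') after step 1 should return a non-empty array; an empty array suggests that the site set no cookie or that the file has not been written yet.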
B. Get the image verification code (captcha). Sites without an image captcha can skip this step. It is also an important link in the chain: if the captcha is not obtained correctly, you will not be able to log in.
#2. Get the verification code
$cookie_file = dirname(__FILE__) . '/cookie.txt'; // file in which the cookies are saved
$verify_code_url = "http://xxxxxxxxxxxx"; // URL of the captcha image
$verify_code_referer = "http://xxxxxxxxxxxx"; // login page URL
$verify_curl = curl_init();
curl_setopt($verify_curl, CURLOPT_URL, $verify_code_url);
curl_setopt($verify_curl, CURLOPT_COOKIEFILE, $cookie_file); // send the cookies saved in the first step
curl_setopt($verify_curl, CURLOPT_HEADER, 0); // when enabled, the response headers are included in the output
curl_setopt($verify_curl, CURLOPT_HTTPHEADER, array($login_url_header['2'])); // header required for the captcha request ($login_url_header is defined elsewhere and not shown here)
curl_setopt($verify_curl, CURLOPT_REFERER, $verify_code_referer); // content of the "Referer:" request header, i.e. the source of the visit
curl_setopt($verify_curl, CURLOPT_RETURNTRANSFER, 1); // return the result of curl_exec() as a string instead of printing it
$img = curl_exec($verify_curl);
curl_close($verify_curl);
$fp = fopen("verifycode.jpg", "w"); // write the captcha to an image file
fwrite($fp, $img);
fclose($fp);
Notes: (1) The captcha image URL is usually followed by a parameter, namely a timestamp of some kind (millisecond precision, China Standard Time and the like). These parameters may need to be processed by a function; in my case I had to use urlencode. (2) CURLOPT_HTTPHEADER, CURLOPT_REFERER, and CURLOPT_USERAGENT: if the method above does not work, try setting these options as well. (3) As for reading the captcha, some people online pause the script for 20 seconds, look at the image by eye, write the value into a txt file, and then read it back with file_get_contents. You can of course use any other method that is convenient. (Code: sleep(20); $code = file_get_contents("./code_bj.txt");)
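As a sketch of note (1), the helper below appends a millisecond timestamp to the captcha URL. The function name captcha_url and the parameter name "t" are assumptions; the real parameter name and format have to be taken from the packet capture.

```php
<?php
// Hypothetical helper: append a millisecond timestamp to the captcha URL.
// The parameter name "t" is an assumption; use the name seen in the captured request.
function captcha_url(string $baseUrl, string $param = 't'): string
{
    $millis = (int) floor(microtime(true) * 1000); // Unix time in milliseconds (assumes 64-bit PHP)
    $separator = (strpos($baseUrl, '?') === false) ? '?' : '&';
    return $baseUrl . $separator . urlencode($param) . '=' . $millis;
}

echo captcha_url('http://xxxxxxxxxxxx'), "\n";

// When the script runs from the command line, the code can be typed in
// instead of pausing with sleep() and reading a txt file:
// $code = trim(fgets(STDIN));
```

If the site expects a date string rather than a number, the value would need to be passed through urlencode before it is appended.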
C. Assemble the data to be sent. Whether the data is encrypted, and with what method, has to be handled here. The data cannot be a two-dimensional array; it must be either a one-dimensional array (key => value) or a string such as "&username=zhangsan&password=123456&code=2z2s". Both formats work. If you use the one-dimensional array format, pass it through the http_build_query function before sending. The code is as follows:
// The submitted parameters can take two formats
#1. One-dimensional array (must not be a two-dimensional array)
$post = array(
    'username' => 'zhangsan',
    'password' => '123456',
    'code' => '2z2s',
);
#1.1 The one-dimensional array format is passed through http_build_query before sending
curl_setopt($curl, CURLOPT_POST, 1); // submit with POST
curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query($post));
#2. String format
$post = "&username=zhangsan&password=123456&code=2z2s";
#2.1 The string format is sent as it is
curl_setopt($curl, CURLOPT_POST, 1); // submit with POST
curl_setopt($curl, CURLOPT_POSTFIELDS, $post);
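The two formats end up as the same kind of request body, which a short standalone example makes visible. It uses only the sample values from the code above. One related point from the PHP documentation: when an array is passed directly to CURLOPT_POSTFIELDS, curl sends it as multipart/form-data, while a string is sent as application/x-www-form-urlencoded, which is why the array goes through http_build_query first.

```php
<?php
$post = array(
    'username' => 'zhangsan',
    'password' => '123456',
    'code'     => '2z2s',
);

// A one-dimensional array becomes a URL-encoded "key=value&key=value" string.
$body = http_build_query($post);
echo $body, "\n"; // username=zhangsan&password=123456&code=2z2s

// Values are percent-encoded, so characters such as "&" and "=" survive the transfer.
echo http_build_query(array('password' => 'a&b=c')), "\n"; // password=a%26b%3Dc
```

A hand-written string gets no such encoding, so with the string format any special characters in the values have to be passed through urlencode by hand.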
3. Simulated login. The code is as follows:
#3. Simulated login
$cookie_file = dirname(__FILE__) . '/cookie.txt'; // file in which the cookies are saved
$submit_url = 'http://xxxxxxxxxxxx'; // URL the data is submitted to (the action address of the login form)
$submit_referer = "http://xxxxxxxxxxxx"; // login page URL
$submit_curl = curl_init(); // initialize the curl session
curl_setopt($submit_curl, CURLOPT_URL, $submit_url); // address the login is submitted to
curl_setopt($submit_curl, CURLOPT_HEADER, 0); // whether to include the response headers in the output
curl_setopt($submit_curl, CURLOPT_RETURNTRANSFER, 1); // return the result of curl_exec() as a string instead of printing it
curl_setopt($submit_curl, CURLOPT_COOKIEFILE, $cookie_file); // read the cookies to send from the specified file
curl_setopt($submit_curl, CURLOPT_REFERER, $submit_referer); // source of the visit
curl_setopt($submit_curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36"); // browser identification
curl_setopt($submit_curl, CURLOPT_HTTPHEADER, array(
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Upgrade-Insecure-Requests: 1',
    'Content-Type: application/x-www-form-urlencoded',
    'Accept-Encoding: gzip, deflate', // the response may arrive compressed; CURLOPT_ENCODING makes curl decode it
    'Accept-Language: zh-CN,zh;q=0.8',
    'Content-Length: ' . strlen($post),
));
curl_setopt($submit_curl, CURLOPT_POST, 1); // submit with POST (the data here uses the string format)
curl_setopt($submit_curl, CURLOPT_POSTFIELDS, $post); // the data to submit
$contents = curl_exec($submit_curl); // execute the curl session
curl_close($submit_curl); // close the curl session and release system resources
If you are interested, output the result above and take a look. If it is the page a logged-in user would see, the simulated login has succeeded, and you can go on to do whatever you need, for example fetching information such as the account name, avatar, and mobile phone number from the user's personal center.
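Once a page has been fetched, the values still have to be pulled out of the HTML. The sketch below uses PHP's DOM extension for that. Everything specific in it is an assumption: the function name text_by_id, the element id account-name, and the sample markup are made up, and a real page needs its own selectors. Pages that are not ASCII may also need an explicit charset declaration before DOMDocument decodes them correctly.

```php
<?php
// Hypothetical helper: return the text of the first element with the given id, or null.
// The id is inserted into the XPath expression as is, so it must not contain double quotes.
function text_by_id(string $html, string $id): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely well-formed
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up markup standing in for a personal center page.
$contents = '<html><body><span id="account-name"> zhangsan </span></body></html>';
var_dump(text_by_id($contents, 'account-name'));
```

A null result for a field that should exist is also a cheap hint that the login did not succeed and the site returned the login page again.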
IV. Summary
1. When using curl, be sure to read up on the relevant option parameters, otherwise you will run into very tedious trouble. (I mixed up CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE and spent a long time before finding the cause.)
2. Before simulating a login, carefully analyze what happens during the site's login process. Use a packet capture tool and analyze the captured data. Some sites are stricter, and you have to add a few more parameters: CURLOPT_REFERER (the source of the visit, that is, which page the request came from), CURLOPT_USERAGENT (which browser is being imitated, for example whether the request appears to come from Google Chrome or from Firefox), and CURLOPT_HTTPHEADER (this one is critical: it sets the request header sent to the URL. Look at which fields the real request header contains and, when necessary, set all of them in the simulated request; the chance of success is then very high).
3. Be patient. Do the packet capture analysis carefully at the start, and read more about the HTTP protocol and the curl parameters. I wish you success!
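Points 1 and 2 can be combined into one reusable option set, which also keeps CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE side by side so they are harder to mix up. This is a sketch under assumptions: the function name session_options is made up, and the header list is only an example of what a packet capture might show.

```php
<?php
// Hypothetical helper: build one option set that is reused for every request of a session.
function session_options(string $url, string $cookieFile, string $referer, string $userAgent): array
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_COOKIEFILE     => $cookieFile, // read: cookies to send with the request
        CURLOPT_COOKIEJAR      => $cookieFile, // write: received cookies are saved when the handle is closed
        CURLOPT_REFERER        => $referer,    // which page the request claims to come from
        CURLOPT_USERAGENT      => $userAgent,  // which browser the request claims to be
        CURLOPT_HTTPHEADER     => array(       // further request headers copied from the packet capture
            'Accept-Language: zh-CN,zh;q=0.8',
        ),
    );
}

$ch = curl_init();
curl_setopt_array($ch, session_options(
    'http://xxxxxxxxxxxx',
    dirname(__FILE__) . '/cookie.txt',
    'http://xxxxxxxxxxxx',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
));
```

Pointing both cookie options at the same file means every request sends the cookies collected so far and stores any new ones, which is how a browser session behaves.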
That is all for now. Everything above comes purely from my own practice; if there are mistakes, criticism and corrections are welcome. Thank you!
PHP curl: simulating a login to crawl data