Detailed description of cURL usage

Source: Internet
Author: User

Detailed usage of cURL, the PHP data-collection tool. Author: anonymous

Anyone who has done data collection (scraping) will be familiar with cURL. PHP's file_get_contents function can fetch remote data, but it offers little control over the request and falls short in more complex collection scenarios. This article therefore introduces the use of cURL as a collection tool.

As a first example, the following code fetches the content of a remote file with cURL:

<?php
$url = "http://git.oschina.net/yunluo/API/raw/master/notice.txt";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
$notice = curl_exec($ch);
echo $notice;
?>

This code uses cURL to fetch and display the file content. The problem is that cURL is a PHP extension: some hosts disable it for security reasons, and it may also be disabled in a local debugging environment, in which case the code fails with an error. The code was therefore rewritten as follows:

<?php
if (function_exists('curl_init')) {
    $url = "http://git.oschina.net/yunluo/API/raw/master/notice.txt";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    $dxycontent = curl_exec($ch);
    echo $dxycontent;
} else {
    echo 'Oops! It seems that your server has not enabled the curl extension, so notifications from the cloud cannot be received. Please contact your hosting provider to enable it. Ignore this message during local debugging.';
}
?>

The modified version first checks whether the curl extension is enabled on the server. If it is, the file content is displayed directly; if not, a text prompt is displayed.
That fixed the error but raised another question: the goal is only to display a piece of text, so why write so much code?
Some testing showed that file_get_contents is not slower than cURL at fetching remote file content, and with only a few files it may even be considerably faster, so the code was rewritten once more:

<?php echo file_get_contents( "http://git.oschina.net/yunluo/API/raw/master/notice.txt" ); ?>

Tools
Firefox + Firebug
"To do a good job, you must first sharpen your tools." Before analyzing the cases, let's look at how to use Firebug to obtain the information we need.
Press F12 to open Firebug. This gives the interface shown in figure (1):

1. The arrow icon is the "element selection" tool. Click it once and the icon is highlighted; as you move the mouse over the page, the corresponding markup is selected in the HTML panel. Click the content to fix the selection on that element, as shown in figure (2):
Figure (2): viewing an element in Firebug

2. Console
Output from the console.log family of functions in JavaScript appears here.
3. HTML
The HTML content. Note that what you see here is not necessarily what your collector will receive. When analyzing content for collection, always treat the page source (Ctrl + U) as authoritative. Use this panel only to locate the element structure quickly, then pick a distinctive string and find the corresponding position in the source.
For example, the HTML panel may show a tag as <div id="demo" class="demo">Demo</div> while the page source contains <div class="demo" id="demo">Demo</div>. A regular expression written against the former will not match the collected content.
4. CSS
The content of the CSS files.
5. Script
The content of the JavaScript files.
6. DOM
The content of the DOM nodes.
7. Network
The data for every request. This is the focus of collection analysis: it shows the parameters, request headers, and cookie data of each request. When a page is submitted and refreshed, enable "Persist" so that the request data stays in the panel after the refresh, as shown in figure (3):

In addition, Firefox has a Tamper Data extension for capturing request data, which can be installed if needed.
8. Cookies
Cookie data.
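
The attribute-order pitfall described under the HTML panel can be sidestepped by parsing the markup instead of matching it with a regular expression. The following is a minimal sketch of that idea, not the article's own method; `textById` is a hypothetical helper name, and the code assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical helper (not from the article): return the trimmed text of the
// element with the given id, or null when no such element exists.
function textById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; silence the parser warnings.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // Assumes $id contains no double quotes.
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Both attribute orders from the example are handled the same way.
var_dump(textById('<div id="demo" class="demo"> Demo </div>', 'demo'));
var_dump(textById('<div class="demo" id="demo"> Demo </div>', 'demo'));
```

Because the parser normalizes the markup, the lookup does not depend on how the attributes are ordered or spaced in the page source.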

In figure (1), several toggle options appear below the menu bar; the one to pay attention to is "Persist". When it is selected, the data in the content area is retained even if a submitted form refreshes the page, which is critical when analyzing submitted data.

Summary
When analyzing requests for collection, we mainly care about the request data in the "Network" panel. Use "Persist" when you need to keep request data across a page refresh, and use "Clear" to empty the panel before making the request.

Case Analysis
1. Simple collection
Simple collection here means collecting a single page with a GET request. It is simple enough that even file_get_contents can easily obtain the page.

Code snippet: file_get_contents

<?php
$url = 'http://demo.zjmainstay.cn/php/curl/simple.html';
$content = file_get_contents($url);
echo $content;

Code snippet: cURL

<?php
$url = 'http://demo.zjmainstay.cn/php/curl/simple.html';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the data instead of printing it
$content = curl_exec($ch);                   // execute and store the result
curl_close($ch);
echo $content;

2. Collection with required parameters
Here the page request needs some parameters, sent with either GET or POST. file_get_contents can also send parameters, but that is not shown here.

Code snippet: cURL GET
A search engine makes a good demonstration. For example, searching Baidu for "PHP cURL" and pressing Enter yields a link similar to http://www.baidu.com/s?ie=UTF-8&f=8&rsv_bp=1&ch=&tn=baidu&bar=&wd=PHP%20cURL. The link may differ between browsers and entry points, so do not worry if yours is not identical. By entering several keywords and observing the link, we can determine that wd is the dynamic parameter we need to supply, while the other parameters can stay unchanged. This gives the following collection code:

<?php
$keyword = 'PHP cURL';
$url = 'http://www.baidu.com/s?ie=UTF-8&f=8&rsv_bp=1&ch=&tn=baidu&bar=&wd=' . urlencode($keyword);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the data instead of printing it
$content = curl_exec($ch);                   // execute and store the result
curl_close($ch);
echo $content;

Some parameters are often unnecessary and can be removed. For example, the link above can be reduced to http://www.baidu.com/s?ie=UTF-8&wd=PHP%20cURL. The ie=UTF-8 parameter may affect the encoding of the result, so it is kept for now. With this simple code we can collect Baidu search results.
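
The reduced link can also be assembled from an array of parameters instead of by string concatenation. This is a small sketch rather than part of the original code; `buildSearchUrl` is a hypothetical helper name, and it relies only on PHP's built-in http_build_query.

```php
<?php
// Hypothetical helper (not from the article): build the reduced Baidu search
// URL from a keyword.
function buildSearchUrl(string $keyword): string
{
    $params = array(
        'ie' => 'UTF-8',   // kept because it may affect the result encoding
        'wd' => $keyword,  // the dynamic search keyword
    );
    // PHP_QUERY_RFC3986 encodes spaces as %20, matching the link in the text.
    return 'http://www.baidu.com/s?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

echo buildSearchUrl('PHP cURL'), "\n";
// prints http://www.baidu.com/s?ie=UTF-8&wd=PHP%20cURL
```

Keeping the parameters in an array makes it easy to add or drop one while testing which of them the site actually requires.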

Code snippet: cURL POST
POST requests are also common; some searches, for example, are submitted with POST, and then we need to submit the parameters the same way. PHP cURL provides two options for this: CURLOPT_POST and CURLOPT_POSTFIELDS. CURLOPT_POST specifies that the request is sent as a POST, and CURLOPT_POSTFIELDS sets the submitted parameters, either as a parameter string or as an array, for example:

curl_setopt($ch, CURLOPT_POSTFIELDS, 'ie=UTF-8&wd=PHP%20cURL');
// or, as an array (values are sent as-is, so they are not URL-encoded here)
curl_setopt($ch, CURLOPT_POSTFIELDS, array('ie' => 'UTF-8', 'wd' => 'PHP cURL'));

The following simulates a search submitted by POST. The backend uses Baidu keyword search: the client submits a keyword to my server, my server requests Baidu's search with that keyword, and the result is returned to the client.
Figure (4): use Firebug to analyze the request data and obtain the request link and request parameters to submit:

The code is as follows:

<?php
$keyword = 'PHP cURL';
// Parameter method 1: a URL-encoded string
// $post = 'wd=' . urlencode($keyword);
// Parameter method 2: an array (sent as multipart/form-data, so the value is not URL-encoded)
$post = array(
    'wd' => $keyword,
);
$url = 'http://...'; // the request URL is truncated in the source
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the data instead of printing it
curl_setopt($ch, CURLOPT_POST, 1);           // send a POST request
curl_setopt($ch, CURLOPT_POSTFIELDS, $post); // POST data; $post can be an array or a parameter string
$content = curl_exec($ch);                   // execute and store the result
curl_close($ch);
var_dump($content);

3. Collection with a Referer check
Some programs check the source URL: if the Referer is not their own site, access is denied. In that case we add the CURLOPT_REFERER option to simulate the referring page so that collection works normally.

 

<?php
$keyword = 'PHP cURL';
// Parameter method 1: a URL-encoded string
// $post = 'wd=' . urlencode($keyword);
// Parameter method 2: an array (sent as multipart/form-data, so the value is not URL-encoded)
$post = array(
    'wd' => $keyword,
);
$url = 'http://...';                   // the request URL is truncated in the source
$refer = 'http://demo.zjmainstay.cn/'; // the referring address
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the data instead of printing it
curl_setopt($ch, CURLOPT_REFERER, $refer);    // simulate the referring page
curl_setopt($ch, CURLOPT_POST, 1);            // send a POST request
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);  // POST data; $post can be an array or a parameter string
$content = curl_exec($ch);                    // execute and store the result
curl_close($ch);
var_dump($content);

The source code of search_refer.php is as follows; it uses a simple Referer check to block requests:

<?php
if (empty($_POST['wd'])) {
    exit('Deny empty params.');
}
// Referer check: deny when the Referer is missing or does not contain this host
if (empty($_SERVER['HTTP_REFERER']) || stripos($_SERVER['HTTP_REFERER'], $_SERVER['HTTP_HOST']) === false) {
    exit('Deny');
}
$keyword = addslashes(trim(strip_tags($_POST['wd'])));
$url = 'http://www.baidu.com/s?ie=UTF-8&wd=' . urlencode($keyword);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the data instead of printing it
$content = curl_exec($ch);                   // execute and store the result
curl_close($ch);
echo $content;

4. Collection with cookie support
For applications such as simulated login, submitting parameters and simulating the Referer is not enough; we also need to save and submit the corresponding cookies. PHP cURL provides options for this:
CURLOPT_COOKIE: submits cookies directly as a string
CURLOPT_COOKIEFILE: submits cookies read from a file
CURLOPT_COOKIEJAR: saves the cookies received after the request to a file

The following is an example that simulates logging in to PHP100:

 

<?php
header("Content-Type: text/html; charset=UTF-8");
$cookie_file = tempnam('./temp', 'cookies');
$login_url = "http://bbs.php100.com/login.php";
$post_fields = "cktime=36000&step=2&pwuser=username&pwpwd=password";

// Submit the login form request
$ch = curl_init($login_url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file); // store the cookies received after submission
curl_exec($ch);
curl_close($ch);

// After logging in, fetch the BBS home page
$url = "http://bbs.php100.com/index.php";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // option name garbled in the source; needed so that $contents holds the page
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // send the cookies saved by the login request
$contents = curl_exec($ch);
curl_close($ch);

// Convert the encoding for display
echo iconv('GBK', 'UTF-8', $contents);
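
The CURLOPT_COOKIE option listed above expects a string of the form "name=value; name2=value2". As a rough sketch of how such a string could be produced, the hypothetical helper below (`cookieStringFromHeaders`, not part of the article) collects the Set-Cookie lines from a raw response header block, such as the one returned when CURLOPT_HEADER is enabled. It keeps only the name=value part of each cookie and does not handle expiry, paths, or duplicate names.

```php
<?php
// Hypothetical helper (not from the article): turn the Set-Cookie lines of a
// raw header block into a string usable with CURLOPT_COOKIE.
function cookieStringFromHeaders(string $rawHeaders): string
{
    $pairs = array();
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive.
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        $cookie = trim(substr($line, strlen('Set-Cookie:')));
        // Keep "name=value" and drop attributes such as Path or HttpOnly.
        $nameValue = trim(explode(';', $cookie, 2)[0]);
        if ($nameValue !== '') {
            $pairs[] = $nameValue;
        }
    }
    return implode('; ', $pairs);
}

$headers = "HTTP/1.1 200 OK\r\nSet-Cookie: sid=abc123; Path=/; HttpOnly\r\nSet-Cookie: lang=en\r\n\r\n";
echo cookieStringFromHeaders($headers), "\n"; // prints sid=abc123; lang=en
```

For most login flows the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE pair shown in the PHP100 example is simpler, since cURL then manages the cookies itself.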

5. Compressed web page collection (gzip)
Anyone who has never dealt with compressed pages is likely to get stuck here: the collected content is garbled, and neither iconv nor the powerful mb_convert_encoding can restore it. I had no idea about compression at the time, and it drove me crazy that I could not find a fix.
Figure (5): garbled output:

Fortunately the effort paid off and I eventually found the answer: the CURLOPT_ENCODING option.
For example, gzip compression is encountered when collecting news from Sohu:

<?php
$url = 'http://news.sohu.com/';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the data instead of printing it
curl_setopt($ch, CURLOPT_ENCODING, "gzip");   // specify gzip compression
$content = curl_exec($ch);                    // execute and store the result
curl_close($ch);
echo $content;

The manual says: supported encodings are "identity", "deflate", and "gzip". If the value is an empty string "", a header listing all supported encoding types is sent.
According to this description you can also use curl_setopt($ch, CURLOPT_ENCODING, "");, but the option cannot simply be left out.
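
The garbled output in this section is simply the raw gzip byte stream, which is why character-set conversion cannot repair it. The sketch below illustrates this with a hypothetical helper, `maybeGunzip` (not part of the article), which assumes PHP's zlib extension: it checks for the gzip magic bytes and decodes the body if they are present. It is meant for content that was already collected without CURLOPT_ENCODING, not as a replacement for that option.

```php
<?php
// Hypothetical helper (not from the article): decode a body that is still
// gzip-compressed, and return anything else unchanged.
function maybeGunzip(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = @gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}

$compressed = gzencode('<html>news</html>'); // stands in for a compressed response
echo maybeGunzip($compressed), "\n";         // prints <html>news</html>
```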

6. Collection of SSL links
Some request links use https, and cURL collection may fail for them. In that case, print the error with var_dump(curl_error($ch)); and look for a solution based on the message. A common SSL error is: SSL certificate problem: unable to get local issuer certificate. Here we use the options CURLOPT_SSL_VERIFYPEER and CURLOPT_SSL_VERIFYHOST to disable SSL certificate verification. I tried disabling CURLOPT_SSL_VERIFYPEER alone and it was not always enough, so it is best to set both.
The following is sample code:

 

<?php
$searchStr = 'rc376981638hk';
$post = 'accion=LocalizaUno&numero=' . $searchStr . '&ecorreo=&numeros=';
$url = 'https://aplicacionesweb.correos.es/localizadorenvios/track.asp';
$ch = curl_init($url);                           // initialize curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);     // return the data instead of printing it
curl_setopt($ch, CURLOPT_POST, 1);               // send a POST request
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);     // POST data; can be an array or a parameter string
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // use when an SSL error is reported
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false); // use when an SSL error is reported
$contents = curl_exec($ch);                      // execute and store the result
// var_dump(curl_error($ch));                    // print the error when collection fails
curl_close($ch);
echo $contents;
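
Disabling verification makes the request work but also removes the protection that https provides. The "unable to get local issuer certificate" error usually means cURL cannot find a set of trusted CA certificates, so an alternative is to keep verification on and point cURL at a CA bundle with CURLOPT_CAINFO. The following is a sketch of that alternative, not the article's approach; `verifiedSslOptions` and `fetchVerified` are hypothetical names, and the bundle path is an assumption that must match a real PEM file on your system.

```php
<?php
// Hypothetical helper (not from the article): options that keep certificate
// verification enabled and tell cURL where the trusted CA certificates are.
function verifiedSslOptions(string $caBundlePath): array
{
    return array(
        CURLOPT_SSL_VERIFYPEER => true,          // verify the certificate chain
        CURLOPT_SSL_VERIFYHOST => 2,             // verify that the host name matches
        CURLOPT_CAINFO         => $caBundlePath, // trusted CA certificates (PEM file)
    );
}

// Hypothetical usage: fetch a page with verification enabled.
function fetchVerified(string $url, string $caBundlePath)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt_array($ch, verifiedSslOptions($caBundlePath));
    $content = curl_exec($ch);
    if ($content === false) {
        // Report the reason instead of failing silently.
        $content = 'cURL error: ' . curl_error($ch);
    }
    return $content;
}

// echo fetchVerified('https://example.com/', '/path/to/cacert.pem');
```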

7. Proxy collection
Some sites cannot be reached directly from China because of the firewall, so fetching their data requires a foreign proxy server. Likewise, when collecting a large amount of data you may need to keep switching IP addresses, which also calls for a proxy.
PHP cURL has several proxy options, including CURLOPT_PROXY, CURLOPT_PROXYPORT, and CURLOPT_PROXYUSERPWD; the others are not listed here.
CURLOPT_PROXY specifies the proxy IP
CURLOPT_PROXYPORT specifies the proxy port
CURLOPT_PROXYUSERPWD specifies the account and password for a proxy that requires authentication, as a string in the format "[username]:[password]"

Finding a proxy is left to you. Here is a list found online: cURL high-anonymity proxies.

The following is an example of proxy collection:

 

<?php
$url = 'http://demo.zjmainstay.cn/php/curl/dump_ip.php?t=' . time();
echo "Local IP: " . file_get_contents($url) . "\nForged IP: ";
$ip = '20180101.224.1.116'; // as it appears in the source; the first part is garbled, so substitute a working proxy IP
$port = '80';
// Forged request headers; not needed when using a high-anonymity proxy
$header = array(
    'X-FORWARDED-FOR:' . $ip,
    'CLIENT-IP:' . $ip,
);
$ch = curl_init($url);     // initialize curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_PROXY, $ip);
curl_setopt($ch, CURLOPT_PROXYPORT, $port);
$content = curl_exec($ch); // execute and store the result
curl_close($ch);
echo $content;

8. Multi-threaded collection
For large collection tasks, the concurrent ("multi-threaded") collection provided by PHP cURL is essential for efficiency. The examples in the manual did not seem very usable: when I first tested several of them, execution got stuck and never finished. When I tested again a few days ago, I found that Example #1 under the curl_multi_info_read function does run. Its content ends up in $res but is not printed, and the Yahoo request is slow and hangs, while the first two links return normally.
Because the manual examples were hard to use back then, I looked around and found a very capable project, CurlMulti, which wraps PHP cURL Multi in a well-designed extension and provides good collection support.
I will not go into the use of CurlMulti; demos are provided on its official site. If you run into technical difficulties with it, you can join the QQ group for discussion, where the author @Ares and other collection experts provide technical support.
The following is a simple PHP cURL Multi example:

 

<?php
$urls = array(
    "http://demo.zjmainstay.cn/php/curl/curl_multi_1.php",
    "http://demo.zjmainstay.cn/php/curl/curl_multi_2.php",
);
$mh = curl_multi_init();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, 1); // return the data instead of printing it
    curl_multi_add_handle($mh, $conn[$i]);
}
$active = null;
$res = array();
do {
    $status = curl_multi_exec($mh, $active);
    $info = curl_multi_info_read($mh);
    if (false !== $info) {
        // process the collected result
        $res[] = array(
            'content' => curl_multi_getcontent($info['handle']),
            'info' => $info,
        );
        curl_close($info['handle']);
    }
} while ($status === CURLM_CALL_MULTI_PERFORM || $active);
curl_multi_close($mh);
var_dump($res);
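
The loop in the example above polls continuously while transfers are running. A common variation waits on curl_multi_select between calls so that the script sleeps until there is network activity, and reads every result once all transfers have finished. The following is a sketch of that variation under stated assumptions, not the article's or CurlMulti's implementation; `fetchAll` is a hypothetical helper name and the one-second select timeout is arbitrary.

```php
<?php
// Hypothetical helper (not from the article): fetch several URLs concurrently
// and return their bodies, keyed like the input array.
function fetchAll(array $urls, float $selectTimeout = 1.0): array
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // needed for curl_multi_getcontent
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            // Wait for activity on any transfer instead of spinning.
            curl_multi_select($mh, $selectTimeout);
        }
    } while ($active && $status === CURLM_OK);

    $results = array();
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $results;
}

// $pages = fetchAll(array(
//     'first'  => 'http://demo.zjmainstay.cn/php/curl/curl_multi_1.php',
//     'second' => 'http://demo.zjmainstay.cn/php/curl/curl_multi_2.php',
// ));
```

Collecting the results after the loop, rather than reading one message per iteration, avoids missing a transfer when several finish at the same moment.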

9. 302 redirects (301 redirects)
In some applications, such as simulated login, a 302 redirect can cause cookies to be lost so that the login fails, as shown in figure (6):

In that case, you can use:

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

For CURLOPT_FOLLOWLOCATION, the manual says:

When enabled, any "Location:" header returned by the server is followed recursively. CURLOPT_MAXREDIRS can be used to limit the number of redirects.
As I understand it, in plain terms, cURL keeps following the next redirect and carries the cookies along in the request headers.
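
As far as I know, cURL only re-sends cookies received during a redirect chain when its cookie handling is switched on, for example through CURLOPT_COOKIEFILE or CURLOPT_COOKIEJAR. The sketch below combines redirect following with an in-memory cookie store; `redirectOptions` is a hypothetical helper name and the default limit of 5 redirects is an arbitrary assumption.

```php
<?php
// Hypothetical helper (not from the article): options for following redirects
// while keeping the cookies set along the way.
function redirectOptions(int $maxRedirects = 5): array
{
    return array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,          // follow "Location:" headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop after this many redirects
        CURLOPT_COOKIEFILE     => '',            // empty string: keep cookies in memory
    );
}

// $ch = curl_init('http://bbs.php100.com/login.php');
// curl_setopt_array($ch, redirectOptions());
// $content = curl_exec($ch);
```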

10. Simulating file upload
In the PHP manual entry for curl_setopt, CURLOPT_POSTFIELDS is described as follows:

The full data to post in an HTTP "POST" operation. To post a file, prepend the file name with @ and use the full path. The data can be passed as a urlencoded string like 'para1=val1&para2=val2&...' or as an array with the field name as key and the field data as value. If the value is an array, the Content-Type header is set to multipart/form-data.

For file uploads, this passage contains two pieces of information:

1. To upload a file, the POST data must be an array, so that the Content-Type header is set to multipart/form-data.
2. To upload a file, prefix the file name with @ and use the full path. (The @ prefix belongs to older PHP versions: it was deprecated in PHP 5.5 in favor of the CURLFile class and no longer works in PHP 7.)
Therefore, a file upload can be simulated as follows:

// Upload test.jpg from drive D. The file must exist; otherwise cURL fails without any message.
// Note: the '@' prefix only works on older PHP versions (it is not supported in PHP 7 and later).
$data = array('name' => 'Foo', 'file' => '@d:/test.jpg');
$ch = curl_init('http://localhost/upload.php');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_exec($ch);

During local testing, print $_POST and $_FILES in upload.php to verify that the upload succeeded, as in the following code:

<?php
print_r($_POST);
print_r($_FILES);

The output is similar to:

Array
(
    [name] => Foo
)
Array
(
    [file] => Array
        (
            [name] => test.jpg
            [type] => application/octet-stream
            [tmp_name] => D:\xampp\tmp\php2EA0.tmp
            [error] => 0
            [size] => 139999
        )

)
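
On PHP 5.5 and later, file uploads are expressed with the CURLFile class rather than the @ prefix. The sketch below shows the same upload in that form; `uploadFields` and `uploadFile` are hypothetical helper names, and the default MIME type is an assumption chosen to match the output shown above.

```php
<?php
// Hypothetical helper (not from the article): build the POST fields for
// uploading one file with CURLFile.
function uploadFields(string $path, string $mimeType = 'application/octet-stream'): array
{
    return array(
        'name' => 'Foo',
        // CURLFile(path, MIME type, file name reported to the server)
        'file' => new CURLFile($path, $mimeType, basename($path)),
    );
}

// Hypothetical usage: send the upload and return the server's response.
function uploadFile(string $url, string $path)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, uploadFields($path)); // array: multipart/form-data
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
}

// echo uploadFile('http://localhost/upload.php', 'd:/test.jpg');
```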

A further note on assigning CURLOPT_POSTFIELDS:
When an array is passed to CURLOPT_POSTFIELDS, cURL encodes the data as multipart/form-data; when a URL-encoded string is passed, the data is encoded as application/x-www-form-urlencoded.

That is:

curl_setopt($ch, CURLOPT_POSTFIELDS, 'param1=val1&param2=val2&...');
and
curl_setopt($ch, CURLOPT_POSTFIELDS, array('param1' => 'val1', 'param2' => 'val2', ...));
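
When the parameters are held in an array but the server expects application/x-www-form-urlencoded, the array can be converted to a string first. This is a small sketch using PHP's built-in http_build_query; `formBody` is a hypothetical helper name.

```php
<?php
// Hypothetical helper (not from the article): turn an array of fields into a
// URL-encoded body so that cURL sends application/x-www-form-urlencoded
// instead of multipart/form-data.
function formBody(array $fields): string
{
    return http_build_query($fields, '', '&');
}

echo formBody(array('param1' => 'val1', 'param2' => 'val 2')), "\n";
// prints param1=val1&param2=val+2

// curl_setopt($ch, CURLOPT_POSTFIELDS, formBody($fields));
```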

That concludes this introduction to cURL as a collection tool; I hope it helps with your learning.
