Use cURL for HTTP operations

Source: Internet
Author: User
Tags: http, cookie, set-cookie
Document directory
  • 2. Specify "User-Agent:"
  • 3. Use "-H" to modify or add an HTTP Header
  • 4. Specify "Referer:"
  • 5. Get the returned HTTP Header
  • 6. Set the username and password in HTTP Basic Authentication
  • 7. Process HTTP Cookies
  • 8. POST form information
  • Conclusion

A simple webpage retrieval:

curl http://mail.126.com

The above command fetches the mail.126.com homepage and writes it to standard output. Of course, we can also redirect the output to a file, as shown below:

curl www.126.com > index.html

Or:

curl www.126.com -o index.html

When using the above redirection commands, note two points:
First, after the output is redirected, the command still prints to the terminal, but what it prints is not the transferred content itself, but the transfer progress.
Second, fetching webpage content like this is implemented over the HTTP protocol, and the standard output of the command contains only the HTTP Body, not the HTTP Header. If we want the HTTP Header returned by the web server, we can use the "--dump-header" option, as in the following example, which writes the server's HTTP Header to a file:

curl http://mail.126.com --dump-header headers.txt > index.html
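Once the header is in a file, it is easy to post-process. The following is a small sketch (the sample headers.txt content is made up to keep the snippet self-contained and mirror the kind of response shown later in this article) that pulls one header value out of the dump:

```shell
# Create a sample header dump so the snippet runs without a network call.
cat > headers.txt <<'EOF'
HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html
Content-Length: 64058
EOF

# Extract the Content-Type value: case-insensitive match, strip everything
# up to the colon, and drop the CR that real HTTP header lines carry.
content_type=$(grep -i '^Content-Type:' headers.txt | sed 's/^[^:]*: *//' | tr -d '\r')
echo "$content_type"   # prints: text/html
```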

To learn more about curl, I used a packet capture tool to capture the packets curl sends. The HTTP request turns out to be quite simple:

GET / HTTP/1.1
User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Host: mail.126.com
Accept: */*

2. Specify "User-Agent:"

From the message captured in the previous section, we can see that the User-Agent curl uses when sending HTTP packets is:

curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3

There is nothing wrong with this User-Agent as such. The problem is that many websites now decide what data to send to a client based on the User-Agent the client sends; as far as I know, some China Telecom websites (such as the Tianji Hall) use this kind of filtering. To download the required data, you must modify the User-Agent in the request. You can use the "-A" option to modify the UA (that is, the User-Agent). For example:

curl http://mail.126.com -A "Mozilla/4.05 [en] (X11; U; Linux 2.0.32 i586)"

From the packet capture result, we can see that the UA has been modified:

GET / HTTP/1.1
User-Agent: Mozilla/4.05 [en] (X11; U; Linux 2.0.32 i586)
Host: mail.126.com
Accept: */*

Someone may ask whether there is a parameter to replace the default "Accept:" setting, just as "-A" replaces the User-Agent. There is indeed a way to set "Accept:", as the next section shows.

3. Use "-H" to modify or add an HTTP Header

"-H" can be used to modify or add an HTTP Header. Like "-A", it modifies HTTP Header content, but with a difference: "-A" can only modify "User-Agent:", while "-H" can modify any HTTP Header. For example, the following command modifies the content of "Accept:" in the HTTP Header:

curl http://mail.126.com -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"

From the packet capture result, we can see that Accept has been modified:

GET / HTTP/1.1
User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Host: mail.126.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

In fact, we can modify or add multiple HTTP Headers, including "User-Agent:". The following is an example:

curl www.126.com -H "User-Agent: UCWEB6.0" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"

From the packet capture result, we can see that both UA and Accept have been modified:

GET / HTTP/1.1
Host: www.126.com
User-Agent: UCWEB6.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
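When a script needs to set several headers, one convenient pattern (a sketch of mine, not from the original commands) is to collect them in a bash array and expand the array into repeated "-H" options:

```shell
# Each element is one complete header line.
headers=(
  "User-Agent: UCWEB6.0"
  "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
)

# Build the argument list: one -H per header.
args=()
for h in "${headers[@]}"; do
  args+=(-H "$h")
done

# The final command would then be:
#   curl www.126.com "${args[@]}"
echo "built ${#args[@]} curl arguments"
```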

4. Specify "Referer:"

After learning what "-H" can do, you may feel that "-A" is a bit redundant. There is a similarly "redundant" parameter, "-e", which adds the "Referer:" HTTP Header. Ha! In fact, these special-purpose parameters that "-H" could replace are not useless: at the very least they save the user some typing. Do you think desktop shortcuts are useless? A shortcut just saves you from opening a few directories. The following is an application example of "-e":

curl -e "www.baidu.com" -H "User-Agent: ucweb6.0" -A "ucweb7.6" www.126.com

The following is the result of packet capture:

GET / HTTP/1.1
Host: www.126.com
Accept: */*
Referer: www.baidu.com
User-Agent: ucweb6.0

Did you notice a "problem" in the above command? Yes: the UA is set twice, once with "-H" and once with "-A", and it is the "-H" setting that takes effect. In essence, content set with "-H" has a higher priority than content set with "-A" or "-e".

5. Get the returned HTTP Header

The first section described how to use "--dump-header" to write the HTTP Header to a file. This section describes how to print the HTTP Header directly to standard output, using the "-I" parameter. For example:

curl www.126.com -I

The output of the above command is:
HTTP/1.1 200 OK
Expires: Wed, 23 Jan 2013 09:00:48 GMT
Date: Wed, 23 Jan 2013 08:00:48 GMT
Server: nginx
Content-Type: text/html
Content-Length: 64058
Last-Modified: Fri, 18 Jan 2013 11:03:37 GMT
Cache-Control: max-age=3600
Accept-Ranges: bytes
Age: 265
X-Via: 1.1 tjtg100:80 (Cdn Cache Server V2.0), 1.1 gdzj19:8105 (Cdn Cache Server V2.0), 1.1 gdcz29:8361 (Cdn Cache Server V2.0)
Connection: keep-alive

Yes, only the HTTP Header information is displayed; the HTTP Body is not. (Note that "-I" makes curl issue a HEAD request.) If you want to display the HTTP Body as well, use the lowercase "-i" parameter, which issues a normal GET and prints the header followed by the body. For example:

curl www.126.com -i
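If all you need from those headers is the status code, it can be picked out of the status line. A tiny sketch (the sample status line is copied from the output above, so the snippet runs without a network call):

```shell
# First line of a "curl -I" response; field 2 is the status code.
status_line='HTTP/1.1 200 OK'
status_code=$(echo "$status_line" | awk '{print $2}')
echo "$status_code"   # prints: 200
```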

6. Set the username and password in HTTP Basic Authentication

Simply put, HTTP Basic Authentication is a user authentication mechanism in the HTTP protocol. For webpages that require it, the browser usually pops up a login window asking for a username and password. You can see such a dialog box with the following URL:

http://api.minicloud.com.cn/statuses/friends_timeline.xml

For more details, refer to the following document:

http://en.wikipedia.org/wiki/Basic_authentication_scheme

The following is an example command I made up:

curl -u licj:123 www.baidu.com

The packet capture result is as follows:

GET / HTTP/1.1
Authorization: Basic bGljajoxMjM=
User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Host: www.baidu.com
Accept: */*

The username and password are encoded with Base64 (ha! this is encoding, not encryption) and placed after "Authorization: Basic".
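You can reproduce that header value locally. A quick check (a sketch, not part of the original commands): Base64-encode "user:password" the way curl builds the Authorization header. printf is used instead of echo to avoid encoding a trailing newline.

```shell
# Encode the same licj:123 credentials shown in the capture above.
credentials=$(printf '%s' 'licj:123' | base64)
echo "Authorization: Basic $credentials"   # prints: Authorization: Basic bGljajoxMjM=
```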

Of course, you can also provide the username and password directly in the URL:

http://username:password@www.baidu.com

This is another way to pass the username and password (though it offers no real protection, since they sit in plain sight). Note that if the username and password are given directly in the URL, connecting to the target website through a proxy may fail.

7. Process HTTP Cookies

7.1 Obtain cookies
According to the curl user manual, the "-c" parameter records the cookies in the returned HTTP packet, in Netscape cookie format, to the specified file. If the file name is "-", the cookie information is written to standard output. In practice, however, there is a problem with the "-c" recording behavior:
My curl version is: curl 7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3

The following is my test connecting to Sina Weibo:
~$ curl -c - -I www.weibo.com

HTTP/1.1 200 OK
Date: Mon, 28 Jan 2013 09:44:55 GMT
Server: Apache
Set-Cookie: U_TRS1=000000db.2e0000ffb.500004897.6b1727d4; path=/; expires=Thu, 26-Jan-23 09:44:55 GMT; domain=.sina.com.cn
Set-Cookie: U_TRS2=000000db.2e485ffb.51064897.bba6e8de; path=/; domain=.sina.com.cn
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
P3P: CP="CURa ADMa DEVa PSAo PSDo OUR BUS UNI PUR INT DEM STA PRE COM NAV OTC NOI DSP COR"
DPOOL_HEADER: leto50
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=UTF-8
SINA-LB: ewyymtiuageuewzncm91cdeuymoubg9hzgjhbgfuyw=
Set-Cookie: usrhawb=usrmdins31450; Path=/

# Netscape HTTP cookie file
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

www.weibo.com	FALSE	/	FALSE	0	usrhawb	usrmdins31450
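For reference, the cookie line curl recorded follows the Netscape cookie-file layout: seven tab-separated fields, roughly domain, include-subdomains flag, path, secure flag, expiry timestamp, cookie name, and cookie value. A sketch of reading one such line (the sample mirrors the recorded line above):

```shell
# Build the sample line with explicit tabs, then split it into its fields.
line=$(printf 'www.weibo.com\tFALSE\t/\tFALSE\t0\tusrhawb\tusrmdins31450')
IFS=$'\t' read -r domain subdomains path secure expires name value <<< "$line"
echo "$name=$value"   # prints: usrhawb=usrmdins31450
```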

The returned message contains three cookies, but curl actually records only the one whose domain is the default (here, www.weibo.com)! Cookies whose domain is not the default are not recorded at all. This is not necessarily a bug, but you should be aware of it.
To save all cookies, we can use the "--dump-header" parameter mentioned earlier, which records all HTTP Headers in a file. If you would rather not go through files to handle cookies, you can write a script and process them in memory. The following is a bash script snippet I wrote, which separates the cookies from the HTTP packet:

#!/bin/bash

URL=$1
LINE_NUM=
HTTP_BODY=
HTTP_HEADER=
COOKIES=

RESPONSE=$(curl -i "$URL")
if (( 0 != $? )); then
    echo "Error: Open url fail! The url is: $URL"
    exit 1
fi

# Find the number of the blank line ("\r") that separates header from body
LINE_NUM=$(echo -n "$RESPONSE" | sed -n "/^\r$/{=;q}")
if [ -z "$LINE_NUM" ]; then
    echo "Error: Get HTTP header fail!"
    exit 1
fi

# Save the HTTP header to var: $HTTP_HEADER
HTTP_HEADER=$(echo "$RESPONSE" | sed -n "1,${LINE_NUM}p")

# Save cookies to var: $COOKIES
COOKIES=$(echo -n "$HTTP_HEADER" | grep -i "^set-cookie" | sed "s/^set-cookie/Set-Cookie/ig")

# Save HTTP body to var: $HTTP_BODY
HTTP_BODY=$(echo -n "$RESPONSE" | sed "1,${LINE_NUM}d")

RESPONSE=""
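The same header/body split can be exercised locally without a network call. A sketch (the sample response text is made up, with the "\r\n" line endings a real server would send):

```shell
# A minimal canned HTTP response: two header lines, the blank separator, a body.
RESPONSE=$(printf 'HTTP/1.1 200 OK\r\nSet-Cookie: a=1; path=/\r\n\r\n<html>body</html>')

# Line number of the bare "\r" line that ends the header section.
LINE_NUM=$(echo -n "$RESPONSE" | sed -n "/^\r$/{=;q}")

# Everything after that line is the body.
HTTP_BODY=$(echo -n "$RESPONSE" | sed "1,${LINE_NUM}d")
echo "$HTTP_BODY"   # prints: <html>body</html>
```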

7.2 Set cookies
Now let's look at how to set cookies when curl sends an HTTP request. The simplest method is the "-b" parameter. For example:
curl -b "H_PS_PSSID=1874" www.baidu.com

The preceding command sends the following request:
GET / HTTP/1.1
User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Host: www.baidu.com
Accept: */*
Cookie: H_PS_PSSID=1874

Note that you cannot use multiple "-b" options to set multiple cookies. The correct method is to combine the cookies into one string, separated by semicolons, and pass that combined string as the "-b" parameter value. For example:
curl -b "H_PS_PSSID=1874; date=20130130" www.baidu.com

The preceding command sends the following request:
GET / HTTP/1.1
User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Host: www.baidu.com
Accept: */*
Cookie: H_PS_PSSID=1874; date=20130130

In fact, the "-b" parameter value can also be a file path. The file can be a cookie file generated by the "-c" parameter, or an HTTP header file generated by "--dump-header".
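In a script, the semicolon-separated string that "-b" expects can be assembled from individual name=value pairs. A sketch (the pairs, including the extra "lang=en", are made up for illustration):

```shell
# Individual cookies as name=value pairs.
cookies=("H_PS_PSSID=1874" "date=20130130" "lang=en")

# Join with "; " and trim the trailing separator.
cookie_string=$(printf '%s; ' "${cookies[@]}")
cookie_string=${cookie_string%; }
echo "$cookie_string"   # prints: H_PS_PSSID=1874; date=20130130; lang=en

# The result would be used as: curl -b "$cookie_string" www.baidu.com
```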

8. POST form information

The key to POSTing form information is to set "Content-Type:" in the HTTP Header to "application/x-www-form-urlencoded" and to compose the HTTP Body correctly. The following is a made-up command; try analyzing it yourself. I believe that after the previous sections it will not be difficult:
curl -d "tbMemberName=happy_boby&tbPassword=123456" http://www.netyi.net/controls/loginFp.aspx

The preceding command sends the following request:
POST /controls/loginFp.aspx HTTP/1.1
User-Agent: curl/7.22.0 (i686-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
Host: www.netyi.net
Accept: */*
Content-Length: **
Content-Type: application/x-www-form-urlencoded

tbMemberName=happy_boby&tbPassword=123456
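One detail to watch: form values containing spaces, "&", "=" and so on must be percent-encoded before they go into the body. The following is a small illustrative bash function of mine (not from the original article; curl itself also offers --data-urlencode for this purpose):

```shell
# Percent-encode a single form value, byte by byte (ASCII input assumed).
urlencode() {
  local s=$1 out= c i
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case $c in
      [a-zA-Z0-9.~_-]) out+=$c ;;            # unreserved characters: copy as-is
      *) out+=$(printf '%%%02X' "'$c") ;;    # everything else: %XX of the char code
    esac
  done
  printf '%s\n' "$out"
}

urlencode "pass word&123"   # prints: pass%20word%26123
```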

Conclusion

Because of work needs, I have to visit a fixed website every day to retrieve some data. The work is not complicated, but it is time-consuming, so I wanted to write a script to fetch the data automatically every day. During the script development I took some notes, and they became the main content of this article. This article focuses on curl's HTTP operations; in fact, curl supports many common protocols besides HTTP, such as FTP. I strongly recommend that you take a look at the official curl user manual:
http://curl.haxx.se/docs/manual.html
This is not a classic, but I hope it is a practical piece.
