Python Crawler Learning: Usage of the Requests Package


The Requests library is one of the essentials for learning Python crawlers and makes crawling pages easy. This article is based mainly on its official documentation.

Requests Installation:

The current version of Requests is v2.11.1. It can be installed automatically with pip from the command line (on Windows, run cmd to open a Command Prompt), which is very handy:

> pip install requests
Collecting requests
  Downloading requests-2.11.1-py2.py3-none-any.whl (514kB)
Installing collected packages: requests
Successfully installed requests-2.11.1
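To confirm the install worked, a quick check (a minimal sketch; the printed version will be whatever pip installed) is to import the package and print its version:

python -c "import requests; print(requests.__version__)"
# 2.11.1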

Send a request to a website: requests.get(url)

>>> import requests                  # request the Douban Books site
>>> r = requests.get('https://book.douban.com/')   # returns a Response instance that contains a lot of information
>>> r.status_code                    # response status code; 200 means the server processed the request successfully
200
>>> r.headers['content-type']        # the browser decides how to parse the document from this header: text is the document type, html the subtype
'text/html; charset=utf-8'
>>> r.encoding                       # the encoding used for the requested page
'utf-8'
>>> r.text                           # the content of the requested page
...
>>> r.cookies                        # the cookies set by the page
<RequestsCookieJar[Cookie(version=0, name='bid', value='mgadkt5vxtk', port=None, port_specified=False, domain='.douban.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1508551773, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
>>> r.headers                        # the response headers
{'X-Xss-Protection': '1; mode=block', 'X-DAE-App': 'book', 'X-Content-Type-Options': 'nosniff', 'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'bid=mgadkt5vxtk; Expires=Sat, 21-Oct-17 02:09:33 GMT; Domain=.douban.com; Path=/', 'Expires': 'Sun, 1 Jan 2006 01:00:00 GMT', 'Vary': 'Accept-Encoding', 'Keep-Alive': 'timeout=30', 'X-DAE-Node': 'dis16', 'X-Douban-Newbid': 'mgadkt5vxtk', 'X-Douban-Mobileapp': '0', 'Connection': 'keep-alive', 'Pragma': 'no-cache', 'Cache-Control': 'must-revalidate, no-cache, private', 'Date': 'Fri, 21 Oct 2016 02:09:33 GMT', 'Strict-Transport-Security': 'max-age=15552000;', 'Server': 'dae', 'Content-Type': 'text/html; charset=utf-8'}
>>> r.url                            # the actual URL
u'https://book.douban.com/'

HTTP defines several methods for interacting with a server; the four basic ones are GET, POST, PUT and DELETE. A URL identifies a resource on the network, and the four methods correspond to four operations on that resource: query, modify, add and delete. The program above uses requests.get() to read the information on the given page without modifying anything, so it is effectively "read-only". The Requests library provides all the basic HTTP request methods, each of them a one-liner:

r = requests.post('https://book.douban.com/')
r = requests.put('https://book.douban.com/')
r = requests.delete('https://book.douban.com/')
r = requests.head('https://book.douban.com/')
r = requests.options('https://book.douban.com/')
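As a small illustration of the difference between these methods (a sketch; the exact status code and headers for this site will vary), HEAD fetches only the response headers, so it is a cheap way to inspect a resource without downloading its body:

import requests

r = requests.head('https://book.douban.com/')
print(r.status_code)                    # e.g. 200
print(r.headers.get('Content-Type'))    # the headers arrive as usual
print(len(r.content))                   # 0: a HEAD response carries no body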

Passing parameters in a URL: requests.get(url, params=None)

As Appendix 1 explains, a URL may carry a query string after it. Rather than building the URL by hand and appending the query string ourselves, requests.get() accepts the parameters as a dictionary through the params keyword argument. For example, to pass key1=value1 and key2=value2 to http://httpbin.org/get, you can do this:

>>> param = {'key1': 'value1', 'key2': 'value2'}   # build a dictionary
>>> param
{'key2': 'value2', 'key1': 'value1'}
>>> r = requests.get('http://httpbin.org/get', params=param)
>>> r.url
u'http://httpbin.org/get?key2=value2&key1=value1'
# the parameters are separated by &, and each parameter name is joined to its value by =

# you can also pass a list as a parameter value
>>> param = {'key1': 'value1', 'key2': ['value2', 'value3']}
>>> param
{'key2': ['value2', 'value3'], 'key1': 'value1'}
>>> r = requests.get('http://httpbin.org/get', params=param)
>>> r.url
u'http://httpbin.org/get?key2=value2&key2=value3&key1=value1'
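As a quick sanity check (a sketch, assuming http://httpbin.org is reachable), httpbin echoes the parsed query string back in its JSON response, so you can confirm how the params dictionary was serialized:

import requests

r = requests.get('http://httpbin.org/get',
                 params={'key1': 'value1', 'key2': ['value2', 'value3']})
print(r.json()['args'])   # httpbin reports the query parameters it received
# {'key1': 'value1', 'key2': ['value2', 'value3']}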

Response content:

Now we can read the content of the server's response, which is available in r.text.

>>> r = requests.get('https://www.douban.com/')
>>> r.status_code
200
>>> r.encoding
'utf-8'

Figure 1: part of "View Source" on Douban

Figure 2: part of r.text

Figure 3: part of r.content

Figures 1 and 2 show that: 1) the content of the response is the source code of the web page; 2) r.text and r.content are encoded. The red boxes mark the four Chinese characters meaning "providing books": Figure 1 shows them as hexadecimal Unicode escapes, and Figure 2 as 'iso-8859-1'-encoded text.

Figure 4: encoding of the four Chinese characters for "providing books"

When you access r.text, Requests decodes the response using r.encoding; if you change the value of r.encoding and then access r.text again, the content is decoded with the new encoding. Change the encoding from 'utf-8' to 'iso-8859-1' and then read r.text to get the content shown in Figure 5:

>>> r.encoding
'utf-8'
>>> r.encoding = 'iso-8859-1'

Figure 5: r.text decoded with 'iso-8859-1'
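The relationship can be checked directly (a sketch, assuming the page really is UTF-8 and decodes cleanly): r.text is just r.content decoded with r.encoding.

import requests

r = requests.get('https://www.douban.com/')
print(r.text == r.content.decode(r.encoding))   # True when the bytes decode cleanly
r.encoding = 'iso-8859-1'    # from here on, r.text is decoded as iso-8859-1
print(r.text[:60])           # mojibake: the bytes are UTF-8, not iso-8859-1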

So what exactly is the difference between r.text and r.content?

r.text is the content of the response in Unicode, while r.content is the content of the response in bytes. The text property tries to decode the response automatically according to the encoding property; if encoding is None, Requests falls back to chardet (a character-encoding detection library) to guess the correct encoding. In short: if you want text, use r.text; if you want binary data such as an image or a file, use r.content. When the response body is a binary file (such as a picture), the content property gives you the raw bytes of the response, for example:

import requests
from PIL import Image          # pip install pillow
from io import BytesIO

r = requests.get('http://pic6.huitu.com/res/20130116/84481_20130116142820494200_1.jpg')
i = Image.open(BytesIO(r.content))   # build the image from the raw bytes
i.save('sample.jpg', 'JPEG')         # save the image

Figure 6: the r.content of the picture
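For large binary files you may not want to hold the whole body in memory at once. Requests also supports streaming downloads (a sketch using the same sample image; the chunk size is an arbitrary choice):

import requests

# stream=True defers downloading the body; iter_content yields it in chunks
r = requests.get('http://pic6.huitu.com/res/20130116/84481_20130116142820494200_1.jpg',
                 stream=True)
with open('sample_streamed.jpg', 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)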

JSON response content:

Requests includes a built-in JSON decoder to help you work with JSON data; a brief introduction to JSON is given in Appendix 2.

>>> r = requests.get('https://github.com/timeline.json')
>>> r.json()
{u'documentation_url': u'https://developer.github.com/v3/activity/events/#list-public-events', u'message': u'Hello there, wayfaring stranger. If you\u2019re reading this then you probably didn\u2019t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.'}
>>> r.json
<bound method Response.json of <Response [410]>>
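Note that r.json() raises a ValueError if the body is not valid JSON, so in a crawler it is safer to check the status code first (a minimal sketch; as the output above shows, this particular endpoint is retired and returns 410):

import requests

r = requests.get('https://github.com/timeline.json')
if r.status_code == requests.codes.ok:
    data = r.json()       # parse the JSON body only on success
else:
    print('request failed with status', r.status_code)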

Appendix:

1. URL composition:

The basic URL format is "protocol://authorization/path?query". For example, searching Baidu for "Brad Pitt" or for "URL" produces the URLs below. In both, the protocol is https (HTTP over a Secure Sockets Layer), the authorization is www.baidu.com, the path is s, and the query is everything after the "?".

search for "Brad Pitt", out of the URL is: https://www.baidu.com/s?wd=Brad%20Pitt&rsv_spt=1&rsv_ Iqid=0xf7315fbc0002f13b&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv _enter=1&rsv_sug3=2&rsv_n=2&rsv_sug1=1&rsv_sug7=100&rsv_sug2=0&inputt=364&rsv_sug4 =530&rsv_sug=1

search "url", out of the URL is: https://www.baidu.com/s?wd=URL&rsv_spt=1&rsv_iqid=0xed90010e00032652 &issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_n=2 &rsv_sug3=1

A URL thus names three things: the resource type, the host domain that holds the resource, and the resource's file name. The general syntax of a URL is (parts in [ ] are optional): protocol://hostname[:port]/path/[;parameters][?query]#fragment

protocol: the protocol; the most commonly used is HTTP.
hostname: the host name, i.e. the Domain Name System (DNS) name or IP address of the server that holds the resource.
port: the port number, an integer, optional; each protocol has a default port (80 for HTTP), which is used when the port is omitted.
path: the path, generally indicating a directory or file address on the host.
parameters: optional, used to specify special parameters.
query: the query, used to pass parameters to dynamic web pages (for example, pages built with PHP/ASP/.NET); there can be several parameters, separated by "&", with each parameter's name and value joined by "=".
fragment: an information fragment, a string that designates a fragment within the resource.
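The standard library can split a URL into exactly these pieces, which is a convenient way to double-check the terminology (a sketch using Python 3's urllib.parse; on Python 2 the same function lives in the urlparse module):

from urllib.parse import urlparse

parts = urlparse('https://www.baidu.com/s?wd=URL&ie=utf-8#top')
print(parts.scheme)    # 'https'            -> protocol
print(parts.netloc)    # 'www.baidu.com'    -> hostname (and port, if any)
print(parts.path)      # '/s'               -> path
print(parts.query)     # 'wd=URL&ie=utf-8'  -> query string
print(parts.fragment)  # 'top'              -> fragment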

2. JSON:

JSON (JavaScript Object Notation) is a lightweight data-interchange format. As a data-exchange language it is easy for people to read and write, and easy for machines to parse and generate. JSON data consists of name:value pairs separated by commas (much like the definition of a Python dict), for example: u'documentation_url': u'https://developer.github.com/v3/activity/events/#list-public-events'.
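The resemblance to a Python dict is literal: the standard json module converts between the two (a small sketch with made-up data):

import json

text = '{"documentation_url": "https://developer.github.com/v3/", "count": 2}'
data = json.loads(text)     # JSON text -> Python dict
print(data['count'])        # 2
print(json.dumps(data))     # Python dict -> JSON text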

References:

[1] Usage of the requests library for Python crawlers: https://cuiqingcai.com/2556.html

[2] Requests: HTTP for Humans (official Chinese documentation): http://cn.python-requests.org/zh_CN/latest/

[3] Requests advanced usage (official Chinese documentation): http://docs.python-requests.org/zh_CN/latest/user/advanced.html

[4] On the difference between GET and POST in HTTP: http://www.cnblogs.com/hyddd/archive/2009/03/31/1426026.html

[5] W3School - HTML forms: http://www.w3school.com.cn/html/html_forms.asp
