requests implements most of the functionality of the HTTP protocol, providing features such as keep-alive, connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, and more. This article introduces requests, an elegant HTTP library, as part of getting started with Python crawlers; readers who need it can use it as a reference.
Objective
urllib, urllib2, urllib3, httplib, and httplib2 are all Python modules related to HTTP, and their names are thoroughly confusing. Worse, these modules differ greatly between Python 2 and Python 3; if your business code has to be compatible with both 2 and 3, writing it can be maddening.
Fortunately, there is also a stunning HTTP library called requests, which is one of the most watched Python projects on GitHub. Its author is Kenneth Reitz.
requests implements most of the functionality of the HTTP protocol, including keep-alive, connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, sessions, and so on. Most importantly, it is compatible with both Python 2 and Python 3. requests can be installed directly with pip: pip install requests
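A quick way to confirm that the installation worked is to import the library and check its version (the exact version string depends on what pip installed):

>>> import requests
>>> requests.__version__   # prints the installed version, e.g. '2.x.x'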
Send Request
>>> import requests
# GET request
>>> response = requests.get("https://foofish.net")
Response Content
The value returned by a request is a Response object, which is the encapsulation of the HTTP response data that the server returns to the browser. The main elements of a response, such as the status code, reason phrase, response headers, and response body, are all wrapped up in the Response object.
# status code
>>> response.status_code
200
# reason phrase
>>> response.reason
'OK'
# response headers
>>> for name, value in response.headers.items():
...     print("%s: %s" % (name, value))
...
Content-Encoding: gzip
Server: nginx/1.10.2
Date: Thu, Apr 16:28:01 GMT
# response body
>>> response.content
'<html><body>...10,000 words omitted here...</body>'
In addition to GET requests, requests supports all the other methods in the HTTP specification, including POST, PUT, DELETE, HEAD, and OPTIONS:
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
Query parameters
Many URLs carry a long string of parameters; we call these the query parameters of the URL. They are appended to the URL after a "?", and multiple parameters are separated by "&", for example: http://fav.foofish.net/?p=4&s=20. With requests you can use a dictionary to construct the query parameters:
>>> args = {"p": 4, "s": 20}
>>> response = requests.get("http://fav.foofish.net", params=args)
>>> response.url
'http://fav.foofish.net/?p=4&s=20'
Request Header
requests makes it simple to specify the request headers; for example, sometimes you need to set the User-Agent to masquerade as a browser in order to fool the server. Just pass a dictionary object to the headers parameter.
>>> r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
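As a further sketch (assuming httpbin.org is reachable, as in the other examples in this article), you can verify which headers the server actually received, because httpbin's /headers endpoint echoes them back:

>>> headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json'}
>>> r = requests.get('http://httpbin.org/headers', headers=headers)
>>> r.json()['headers']['User-Agent']
'Mozilla/5.0'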
Request Body
requests makes it very flexible to build the data needed for a POST request. If the server expects form data, specify the data keyword argument; if you want to send a JSON-formatted string, use the json keyword argument. In both cases the value can be passed as a dictionary.
Sending data to the server as a form:
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://httpbin.org/post", data=payload)
Sending data to the server as a JSON-formatted string:
>>> import json
>>> url = 'http://httpbin.org/post'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, json=payload)
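To see the difference between the two keyword arguments, note that httpbin.org echoes the request back: data sent with data= appears under the "form" key of the echoed JSON, while data sent with json= appears under the "json" key (a small sketch relying on httpbin's echo format):

>>> r = requests.post("http://httpbin.org/post", data={'key1': 'value1'})
>>> r.json()['form']
{'key1': 'value1'}
>>> r = requests.post("http://httpbin.org/post", json={'some': 'data'})
>>> r.json()['json']
{'some': 'data'}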
Response Body
An important part of the HTTP response message is the response body, which requests handles very flexibly. The attributes related to the response body are content, text, and json().
content is of type bytes, suitable for saving directly to the file system or sending over the network:
>>> r = requests.get("https://pic1.zhimg.com/v2-2e92ebadb4a967829dcd7d05908ccab0_b.jpg")
>>> type(r.content)
<class 'bytes'>
# save as test.jpg
>>> with open("test.jpg", "wb") as f:
...     f.write(r.content)
text is of type str; for example, an ordinary HTML page needs text in order to be parsed further:
>>> import re
>>> r = requests.get("https://foofish.net/understand-http.html")
>>> type(r.text)
<class 'str'>
>>> re.compile('xxx').findall(r.text)
If you crawl data from a third-party open platform or API and the returned content is JSON-formatted, you can call the json() method directly to get an object already processed by json.loads().
>>> r = requests.get('https://www.v2ex.com/api/topics/hot.json')
>>> r.json()
[{'id': 352833, 'title': 'In Changsha, parents live with ...
Proxy settings
When a crawler frequently scrapes content from a server, it is easy to get blocked by that server, so using a proxy is a wise choice if you want to keep crawling data smoothly. Setting a proxy also solves the problem of crawling data from outside the wall, and requests has excellent proxy support. Here I use a local shadowsocks proxy (for a SOCKS-protocol proxy, install the extra first: pip install requests[socks]):
import requests

proxies = {
    'http': 'socks5://127.0.0.1:1080',
    'https': 'socks5://127.0.0.1:1080',
}
requests.get('https://foofish.net', proxies=proxies, timeout=5)
Timeout settings
When requests sends a request, the calling thread blocks by default until a response comes back, and only then is the subsequent logic processed. If the server does not respond at all, the problem becomes serious: the whole application stays blocked and cannot handle other requests.
>>> import requests
>>> r = requests.get("http://www.google.coma")
...blocked, waiting forever...
The correct approach is to explicitly specify a timeout for every request.
>>> r = requests.get("http://www.google.coma", timeout=5)
# raises an error after 5 seconds
Traceback (most recent call last):
socket.timeout: timed out
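In a real crawler you usually do not want a timeout to crash the whole program. requests raises requests.exceptions.Timeout (a subclass of requests.exceptions.RequestException), which you can catch; a minimal sketch:

import requests

try:
    r = requests.get("http://www.google.coma", timeout=5)
except requests.exceptions.Timeout:
    # the server did not answer within 5 seconds
    print("request timed out")
except requests.exceptions.RequestException as e:
    # any other requests error (DNS failure, connection refused, ...)
    print("request failed:", e)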
Session
As described in the Python Crawler Primer – Quickly Understanding the HTTP Protocol (1), HTTP is a stateless protocol. To maintain communication state between client and server, cookie technology is used to keep both sides in step.
Some web pages require you to log in before you can crawl them. The principle of logging in is that the first time the browser logs in with a username and password, the server sends a random cookie to the client; the next time the browser requests another page, that cookie is sent to the server along with the request, so the server knows the user is already logged in.
import requests

# build a session
session = requests.Session()
# log in (the form field names depend on the target site)
session.post(login_url, data={'username': username, 'password': password})
# visit other pages after logging in
r = session.get(home_url)
session.close()
After a Session is built, the client first sends a login request, and the cookie returned by the server is automatically saved in the Session object. When the second request is made, requests automatically sends that cookie along to the server, maintaining the communication state.
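A quick way to see that a Session really carries cookies across requests is httpbin's cookie endpoints (a sketch: /cookies/set/{name}/{value} stores a cookie, and /cookies echoes back the cookies the server received):

import requests

session = requests.Session()
# the first request stores a cookie in the session
session.get('http://httpbin.org/cookies/set/name/value')
# the second request automatically sends that cookie back to the server
r = session.get('http://httpbin.org/cookies')
print(r.json())   # {'cookies': {'name': 'value'}}
session.close()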
Hands-on Project
Finally, a practical project: how to use requests to implement automatic login and send a private message to a user. I will explain this in the next article.
"Recommended"
1. Python Crawler Primer (5) – Regular Expression Example Tutorial
2. Python Crawler Primer (3) – Using requests to Build a Zhihu API
3. Summary of Python's Logical Operator and
4. Python Crawler Primer (1) – Quickly Understanding the HTTP Protocol