Python crawler tutorial: the elegant HTTP library requests (2)

Preface

urllib, urllib2, urllib3, httplib, and httplib2 are all HTTP-related Python modules, and just looking at their names shows how unfriendly the naming is. Worse, these modules differ significantly between Python 2 and Python 3; if your business code needs to be compatible with both, writing it becomes painful.

Fortunately, there is an excellent alternative: the HTTP library requests, one of the most popular Python projects on GitHub, written by Kenneth Reitz.

requests implements the vast majority of HTTP features, including Keep-Alive, connection pooling, Cookie persistence, automatic content decoding, HTTP proxies, SSL verification, connection timeouts, and Sessions. Most importantly, it is compatible with both Python 2 and Python 3. Installation is a single pip command: pip install requests

Send request

>>> import requests
>>> # GET request
>>> response = requests.get("https://foofish.net")

Response content

The value returned by a request is a Response object, which encapsulates the data the server returns to the browser in the HTTP protocol. The main elements of a response are the status code, reason phrase, response headers, and response body, and all of them are exposed as attributes of the Response object.

# Status code
>>> response.status_code
200
# Reason phrase
>>> response.reason
'OK'
# Response headers
>>> for name, value in response.headers.items():
...     print("%s: %s" % (name, value))
...
Content-Encoding: gzip
Server: nginx/1.10.2
Date: Thu, 06 Apr 2017 16:28:01 GMT
# Response body
>>> response.content

In addition to GET requests, requests supports all the other methods in the HTTP specification, including POST, PUT, DELETE, HEAD, and OPTIONS.

>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

Query Parameters

Many URLs carry a long string of parameters. These URL query parameters are appended to the URL after a "?", with multiple parameters separated by "&", for example: http://fav.foofish.net/?p=4&s=20. With requests you can build the query parameters from a dictionary:

>>> args = {"p": 4, "s": 20}
>>> response = requests.get("http://fav.foofish.net", params=args)
>>> response.url
'http://fav.foofish.net/?p=4&s=20'

Request Header

requests makes it easy to set request header fields. For example, you may want to set the User-Agent so the request looks like it comes from a browser and the server treats it as one; simply pass a dictionary to the headers parameter.

>>> r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})

Request body

requests can flexibly build the body of a POST request. If the server expects form data, use the data keyword argument; if it expects a JSON-formatted string, use the json keyword argument. In both cases the value can be a dictionary.

Transmit form data to the server

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://httpbin.org/post", data=payload)

Transmit a JSON-formatted string to the server

>>> import json
>>> url = 'http://httpbin.org/post'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, json=payload)
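The two keywords produce different request bodies. As a quick way to see the difference, here is a minimal sketch (not part of the original article) that inspects the PreparedRequest built by requests and prints the Content-Type chosen for each keyword:

import requests

# data= sends a form-encoded body
r1 = requests.post("http://httpbin.org/post", data={"key1": "value1"})
print(r1.request.headers["Content-Type"])  # application/x-www-form-urlencoded

# json= serializes the dict to JSON and sets the header accordingly
r2 = requests.post("http://httpbin.org/post", json={"some": "data"})
print(r2.request.headers["Content-Type"])  # application/json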

Response body

A very important part of the HTTP response message is the response body, and requests is flexible about how you handle it. The attributes and methods related to the response body are content, text, and json().

content is of type bytes, which is suitable for saving directly to the file system or sending over the network.

>>> r = requests.get("https://pic1.zhimg.com/v2-2e92ebadb4a967829dcd7d05908ccab0_b.jpg")
>>> type(r.content)
<class 'bytes'>
# Save as test.jpg
>>> with open("test.jpg", "wb") as f:
...     f.write(r.content)

text is of type str. For example, when an ordinary HTML page needs further analysis, use text.

>>> import re
>>> r = requests.get("https://foofish.net/understand-http.html")
>>> type(r.text)
<class 'str'>
>>> re.compile('xxx').findall(r.text)

If you crawl data from a third-party open platform or API and the returned content is in JSON format, you can call the json() method directly to get the object processed by json.loads().

>>> r = requests.get('https://www.v2ex.com/api/topics/hot.json')
>>> r.json()
[{'id': 352833, 'title': 'In Changsha, the parents live together...

Proxy Settings

When a crawler hits a server too frequently, it is easily blocked, so using a proxy is a wise way to keep crawling smoothly. A proxy also solves the problem when the data you want to crawl is blocked from your network. requests supports proxies very well. Here I use a local ShadowSocks proxy (SOCKS support requires an extra dependency: pip install requests[socks]).

import requests

proxies = {
    'http': 'socks5://127.0.0.1:1080',
    'https': 'socks5://127.0.0.1:1080',
}
requests.get('https://foofish.net', proxies=proxies, timeout=5)

Timeout settings

By default, when requests sends a request, the thread blocks until a response comes back. If the server never responds, this becomes a serious problem: the whole application is stuck and cannot handle other requests.

>>> import requests
>>> r = requests.get("http://www.google.coma")
...  # keeps blocking

The correct approach is to explicitly specify a timeout for every request.

>>> r = requests.get("http://www.google.coma", timeout=5)
# raises an error after 5 seconds
Traceback (most recent call last):
  ...
socket.timeout: timed out
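Beyond setting the timeout, the caller usually wants to handle the failure. Here is a minimal sketch (not from the original article), assuming you catch requests' own exception classes; requests.exceptions.Timeout covers both connect and read timeouts:

import requests

try:
    # Wait at most 5 seconds for the server
    r = requests.get("http://www.google.coma", timeout=5)
except requests.exceptions.Timeout:
    # Raised when the connect or read phase exceeds the timeout
    print("request timed out")
except requests.exceptions.RequestException as e:
    # Any other requests failure (DNS error, connection refused, ...)
    print("request failed:", e)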

Session

In Python crawler tutorial: a quick understanding of the HTTP protocol (1), I introduced that HTTP is a stateless protocol; to maintain the communication state between client and server, Cookie technology is used to keep both sides in sync.

Some pages require you to log in before they can be crawled. The principle of login is this: after the browser first logs in with a username and password, the server sends a random Cookie to the client; the next time the browser requests another page, it sends that cookie along with the request, so the server knows the user is already logged in.

import requests

# Build a session
session = requests.Session()
# Log in
session.post(login_url, data={"username": username, "password": password})
# A URL that can only be accessed after logging in
r = session.get(home_url)
session.close()

After a session is created and the client makes its first login request, requests automatically stores the cookie information returned by the server in the session object. On subsequent requests, requests automatically sends that cookie back to the server, maintaining the communication state.
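As a small illustration of this cookie persistence (a sketch using httpbin.org, not part of the original article), the first request below asks the server to set a cookie and the second shows that the Session sends it back automatically:

import requests

with requests.Session() as session:
    # The server sets a cookie; the Session stores it in session.cookies
    session.get("http://httpbin.org/cookies/set?name=value")
    # The stored cookie is sent automatically with the next request
    r = session.get("http://httpbin.org/cookies")
    print(r.json())  # {'cookies': {'name': 'value'}}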

Project Practice

Finally, a practical project: how to use requests to implement automatic login and send a private message to a user. The general pattern is sketched below.
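The article only names the project, so the following is just a rough sketch of the pattern, with hypothetical URLs and form fields (LOGIN_URL, MESSAGE_URL, and the field names are placeholders, not a real site's API):

import requests

LOGIN_URL = "https://example.com/login"      # hypothetical login endpoint
MESSAGE_URL = "https://example.com/message"  # hypothetical private-message endpoint

with requests.Session() as session:
    # Log in once; the Session keeps the login cookie for later requests
    session.post(LOGIN_URL, data={"username": "user", "password": "secret"})
    # Reuse the logged-in session to send a private message
    session.post(MESSAGE_URL, data={"to": "someone", "text": "hello"})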

Summary

That is all the content of this article. I hope it helps you in your study or work. If you have any questions, feel free to leave a message, and thank you for your support.
