Python crawler tutorial: the elegant HTTP library requests (2)

Preface

urllib, urllib2, urllib3, httplib, and httplib2 are all HTTP-related Python modules, and just looking at their names shows how unfriendly the naming is. Worse, these modules differ significantly between Python 2 and Python 3; if your business code needs to be compatible with both, writing it becomes painful.

Fortunately, there is an excellent alternative: the HTTP library requests, one of the most popular Python projects on GitHub, written by Kenneth Reitz.

requests implements the vast majority of HTTP features, including Keep-Alive, connection pooling, Cookie persistence, automatic content decoding, HTTP proxies, SSL verification, connection timeouts, and Sessions. Most importantly, it is compatible with both Python 2 and Python 3. Installation is a single pip command: pip install requests

Send request

>>> import requests
>>> # GET request
>>> response = requests.get("https://foofish.net")

Response content

The value returned by a request is a Response object, which encapsulates the data the server returns to the browser in the HTTP protocol. The main elements of a response are the status code, reason phrase, response headers, and response body, and all of them are exposed as attributes of the Response object.

# Status code
>>> response.status_code
200
# Reason phrase
>>> response.reason
'OK'
# Response headers
>>> for name, value in response.headers.items():
...     print("%s: %s" % (name, value))
...
Content-Encoding: gzip
Server: nginx/1.10.2
Date: Thu, 06 Apr 2017 16:28:01 GMT
# Response body
>>> response.content

In addition to GET requests, requests supports all the other methods in the HTTP specification, including POST, PUT, DELETE, HEAD, and OPTIONS.

>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

Query Parameters

Many URLs carry a long string of parameters. These URL query parameters are appended to the URL after a "?", with multiple parameters separated by "&", for example: http://fav.foofish.net/?p=4&s=20. With requests you can build the query parameters from a dictionary:

>>> args = {"p": 4, "s": 20}
>>> response = requests.get("http://fav.foofish.net", params=args)
>>> response.url
'http://fav.foofish.net/?p=4&s=20'

Request Header

requests makes it easy to set request header fields. For example, you may want to set the User-Agent so the request looks like it comes from a browser and the server treats it as one; simply pass a dictionary to the headers parameter.

>>> r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})

Request body

requests can flexibly build the body of a POST request. If the server expects form data, use the data keyword argument; if it expects a JSON-formatted string, use the json keyword argument. In both cases the value can be a dictionary.

Transmit form data to the server

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://httpbin.org/post", data=payload)

Transmit a JSON-formatted string to the server

>>> import json
>>> url = 'http://httpbin.org/post'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, json=payload)
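The two keywords produce different request bodies. As a quick way to see the difference, here is a minimal sketch (not part of the original article) that inspects the PreparedRequest built by requests and prints the Content-Type chosen for each keyword:

import requests

# data= sends a form-encoded body
r1 = requests.post("http://httpbin.org/post", data={"key1": "value1"})
print(r1.request.headers["Content-Type"])  # application/x-www-form-urlencoded

# json= serializes the dict to JSON and sets the header accordingly
r2 = requests.post("http://httpbin.org/post", json={"some": "data"})
print(r2.request.headers["Content-Type"])  # application/json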

Response body

A very important part of the HTTP response message is the response body, and requests is flexible about how you handle it. The attributes and methods related to the response body are content, text, and json().

content is of type bytes, which is suitable for saving directly to the file system or sending over the network.

>>> r = requests.get("https://pic1.zhimg.com/v2-2e92ebadb4a967829dcd7d05908ccab0_b.jpg")
>>> type(r.content)
<class 'bytes'>
# Save as test.jpg
>>> with open("test.jpg", "wb") as f:
...     f.write(r.content)

text is of type str. For example, when an ordinary HTML page needs further analysis, use text.

>>> import re
>>> r = requests.get("https://foofish.net/understand-http.html")
>>> type(r.text)
<class 'str'>
>>> re.compile('xxx').findall(r.text)

If you crawl data from a third-party open platform or API and the returned content is in JSON format, you can call the json() method directly to get the object processed by json.loads().

>>> r = requests.get('https://www.v2ex.com/api/topics/hot.json')
>>> r.json()
[{'id': 352833, 'title': 'In Changsha, the parents live together...

Proxy Settings

When a crawler hits a server too frequently, it is easily blocked, so using a proxy is a wise way to keep crawling smoothly. A proxy also solves the problem when the data you want to crawl is blocked from your network. requests supports proxies very well. Here I use a local ShadowSocks proxy (SOCKS support requires an extra dependency: pip install requests[socks]).

import requests

proxies = {
    'http': 'socks5://127.0.0.1:1080',
    'https': 'socks5://127.0.0.1:1080',
}
requests.get('https://foofish.net', proxies=proxies, timeout=5)

Timeout settings

By default, when requests sends a request, the thread blocks until a response comes back. If the server never responds, this becomes a serious problem: the whole application is stuck and cannot handle other requests.

>>> import requests
>>> r = requests.get("http://www.google.coma")
...  # keeps blocking

The correct approach is to explicitly specify a timeout for every request.

>>> r = requests.get("http://www.google.coma", timeout=5)
# raises an error after 5 seconds
Traceback (most recent call last):
  ...
socket.timeout: timed out
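Beyond setting the timeout, the caller usually wants to handle the failure. Here is a minimal sketch (not from the original article), assuming you catch requests' own exception classes; requests.exceptions.Timeout covers both connect and read timeouts:

import requests

try:
    # Wait at most 5 seconds for the server
    r = requests.get("http://www.google.coma", timeout=5)
except requests.exceptions.Timeout:
    # Raised when the connect or read phase exceeds the timeout
    print("request timed out")
except requests.exceptions.RequestException as e:
    # Any other requests failure (DNS error, connection refused, ...)
    print("request failed:", e)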

Session

In Python crawler tutorial: a quick understanding of the HTTP protocol (1), I introduced that HTTP is a stateless protocol; to maintain the communication state between client and server, Cookie technology is used to keep both sides in sync.

Some pages require you to log in before they can be crawled. The principle of login is this: after the browser first logs in with a username and password, the server sends a random Cookie to the client; the next time the browser requests another page, it sends that cookie along with the request, so the server knows the user is already logged in.

import requests

# Build a session
session = requests.Session()
# Log in
session.post(login_url, data={"username": username, "password": password})
# A URL that can only be accessed after logging in
r = session.get(home_url)
session.close()

After a session is created and the client makes its first login request, requests automatically stores the cookie information returned by the server in the session object. On subsequent requests, requests automatically sends that cookie back to the server, maintaining the communication state.
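As a small illustration of this cookie persistence (a sketch using httpbin.org, not part of the original article), the first request below asks the server to set a cookie and the second shows that the Session sends it back automatically:

import requests

with requests.Session() as session:
    # The server sets a cookie; the Session stores it in session.cookies
    session.get("http://httpbin.org/cookies/set?name=value")
    # The stored cookie is sent automatically with the next request
    r = session.get("http://httpbin.org/cookies")
    print(r.json())  # {'cookies': {'name': 'value'}}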

Project Practice

Finally, a practical project: how to use requests to implement automatic login and send a private message to a user. The general pattern is sketched below.
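The article only names the project, so the following is just a rough sketch of the pattern, with hypothetical URLs and form fields (LOGIN_URL, MESSAGE_URL, and the field names are placeholders, not a real site's API):

import requests

LOGIN_URL = "https://example.com/login"      # hypothetical login endpoint
MESSAGE_URL = "https://example.com/message"  # hypothetical private-message endpoint

with requests.Session() as session:
    # Log in once; the Session keeps the login cookie for later requests
    session.post(LOGIN_URL, data={"username": "user", "password": "secret"})
    # Reuse the logged-in session to send a private message
    session.post(MESSAGE_URL, data={"to": "someone", "text": "hello"})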

Summary

That is all the content of this article. I hope it helps you in your study or work. If you have any questions, feel free to leave a message, and thank you for your support.
