Getting started with Python crawlers: advanced use of the Urllib library

Source: Internet
Author: User


1. Set Headers

Some sites refuse direct access from programs. If they detect that a request did not come from a browser, they simply do not respond. So, to fully simulate the way a browser works, we need to set some header properties on the request.

First, open your browser's developer tools with F12 (I use Chrome) and switch to the network monitor. After logging in to a site, for example, you will see the page change to a new one. In essence this page contains a lot of content that is not loaded in one shot: many requests are made. Typically the HTML document is requested first, then the JavaScript, CSS, and other resources are loaded; only after many requests do the skeleton and flesh of the page come together and the full effect appear.

Splitting these requests apart, let's look at just the first one. You can see a Request URL, the request headers, and below that the response (the screenshot is cropped here; try it yourself). The headers carry a lot of information: the file encoding, accepted compression, the user agent of the request, and so on.

The user agent identifies who is making the request. If no identity is sent, the server may not respond, so you can set a User-Agent in the headers, as in the following example. This example only shows how to set the headers; take note of the format.

Python

import urllib
import urllib2

url = 'http://zhimaruanjian.com/'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'username': 'CQC', 'password': 'XXXX'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
page = response.read()

In this way, we set up a headers dictionary and pass it in when the Request is built. The headers are sent along with the request, and if the server identifies it as a request from a browser, it will respond.

In addition, there is "anti-hotlinking" protection to deal with: some servers check whether the Referer header in the request points to their own site, and refuse to respond if it does not. We can counter this by adding a Referer to the headers ourselves.

For example, we can build the following headers

Python

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer': 'http://www.zhihu.com/articles'}

As above, these headers are passed into the Request when it is built, and the anti-hotlinking check is handled.
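Putting it together, here is a minimal sketch of a request carrying the Referer; the URL below is just a placeholder standing in for a hotlink-protected resource:

Python

import urllib2

# Placeholder URL standing in for a hotlink-protected resource
url = 'http://www.zhihu.com/articles'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer': 'http://www.zhihu.com/articles'}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)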

Beyond these, the following header properties deserve special attention:

User-Agent: some servers or proxies use this value to decide whether the request was made by a browser.
Content-Type: when using a REST interface, the server checks this value to decide how the content of the HTTP body should be parsed. Common values:
application/xml: used in XML-RPC calls, e.g. RESTful/SOAP
application/json: used in JSON-RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When calling a RESTful or SOAP service, a wrong Content-Type setting can cause the server to deny the request.

For the rest, examine the headers your browser actually sends and set the same values when building the request.
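For instance, here is a minimal sketch of sending JSON with an explicit Content-Type; the endpoint URL is a made-up placeholder:

Python

import json
import urllib2

# Made-up endpoint for illustration only
url = 'http://example.com/api'
data = json.dumps({'username': 'CQC'})
request = urllib2.Request(url, data, {'Content-Type': 'application/json'})
response = urllib2.urlopen(request)
print response.read()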

2. Proxy settings

By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy. Some websites count how many times a given IP visits within a certain period and will ban you if there are too many accesses. So you can set up some proxy servers to do the work for you, switching to a new proxy every so often; the site never knows who is behind the mischief. So satisfying!

The following code illustrates proxy setup:

Python

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)
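Once install_opener has been called, every subsequent urllib2.urlopen call goes through the installed opener, and therefore through the proxy when it is enabled. For example:

Python

# This request now goes through the proxy if enable_proxy is True
response = urllib2.urlopen('http://zhimaruanjian.com/')
print response.read()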

3. Timeout Setting

The previous section introduced the urlopen method, whose third parameter is the timeout setting: how long to wait before giving up, which guards against sites that respond too slowly.

For example, in the code below: if the second parameter data is omitted, the timeout must be passed by keyword, naming the formal parameter; if data has been passed, it can be given positionally without the keyword.

Python

import urllib2

response = urllib2.urlopen('http://zhimaruanjian.com/', timeout=10)

Python

import urllib2

# data is assumed to be a urlencoded string prepared as in section 1
response = urllib2.urlopen('http://zhimaruanjian.com/', data, 10)

4. Using the HTTP PUT and DELETE methods

The HTTP protocol defines several request methods, including GET, HEAD, PUT, DELETE, POST, and OPTIONS. Sometimes we need to make a PUT or DELETE request.

PUT: relatively rare, and not supported by HTML forms. In essence PUT and POST are very similar: both send data to the server. The important difference is that PUT usually specifies the location of the resource, whereas with POST the server itself decides where to store the data.
DELETE: deletes a resource. Also mostly rare, but some services, such as Amazon's S3 cloud storage, use this method to delete resources.

If you want full HTTP PUT and DELETE support, you have to drop down to the lower-level httplib library. Even so, we can make urllib2 send a PUT or DELETE request with the following trick. It is rarely needed, but worth mentioning here.

Python

import urllib2

# uri and data are assumed to have been defined earlier
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
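For comparison, here is a minimal sketch of the same idea with the lower-level httplib mentioned above; the host, path, and body are placeholders:

Python

import httplib

# Placeholder host, path, and body for illustration only
conn = httplib.HTTPConnection('example.com')
conn.request('PUT', '/resource', 'some payload')
resp = conn.getresponse()
print resp.status, resp.reason
conn.close()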

5. Using Debuglog

You can turn on the debug log as follows, so that the contents of requests and responses are printed to the screen for easy debugging. This is not used often, but it is handy to know:

Python

import urllib2

http_handler = urllib2.HTTPHandler(debuglevel=1)
https_handler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(http_handler, https_handler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')

That covers part of the advanced features; the first three sections are the important ones. Still to come are cookie handling and exception handling. Keep at it, folks!
