[Reproduced] Python Crawler Part 4: Advanced Usage of the Urllib Library

Reposted from: http://cuiqingcai.com/954.html

1. Setting Headers

Some sites do not allow programs to access them directly; if they detect that a request is not coming from a browser, they simply do not respond. To fully simulate the way a browser works, we need to set some header properties.

First, open your browser and press F12 to bring up the developer tools (I use Chrome), then open the Network panel. After logging in to a site, for example, you will notice the page changes and a new interface appears. A page like this is not loaded in one go; it is built up from many requests. Usually the HTML file is requested first, then the JS, CSS and other resources are loaded, and only after many requests do the skeleton and flesh of the page come together into the finished result.

Among all these requests, let's just look at the first one. You can see there is a Request URL and the request headers, followed by the response (the screenshot does not show everything; try it yourself). These headers carry a lot of information: file encoding, compression, the requesting agent, and so on.

The User-Agent is the identity of the request. If no identity is supplied, the server may not respond, so you can set a User-Agent in the headers, as in the example below. This example only shows how to set the headers; pay attention to the format.

import urllib
import urllib2

url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'username': 'CQC', 'password': 'XXXX'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)                 # encode the form values
request = urllib2.Request(url, data, headers)   # headers passed in here
response = urllib2.urlopen(request)
page = response.read()

In this way we build a set of headers and pass them in when constructing the Request. The headers travel with the request, and if the server recognizes it as a request from a browser, it will respond.
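
As a side note that is not part of the original post: urllib2 also lets you attach headers to an already built Request with add_header. A minimal sketch, reusing the same placeholder URL and credentials as above:

import urllib
import urllib2

# Sketch only: the URL and credentials are placeholders, not a real service.
url = 'http://www.server.com/login'
data = urllib.urlencode({'username': 'CQC', 'password': 'XXXX'})
request = urllib2.Request(url, data)
# add_header sets one header at a time after the Request has been created
request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = urllib2.urlopen(request)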

In addition, some servers guard against "hotlinking": the server checks whether the Referer in the headers points to its own site, and if not, some servers will not respond. To deal with this, we can also add a Referer to the headers.

For example, we can build headers like the following:

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer': 'http://www.zhihu.com/articles'}

With the method above, the headers are passed in as a parameter when the Request is built, and the anti-hotlinking check is satisfied.

Among the header properties, the following deserve special attention:

User-Agent: some servers or proxies use this value to decide whether the request was made by a browser.
Content-Type: when calling a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed.
application/xml: used in XML RPC calls, such as RESTful/SOAP
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used when a web form is submitted by the browser
When calling RESTful or SOAP services provided by a server, a wrong Content-Type setting can cause the server to refuse the request.

For the other headers, the approach is to inspect what the browser sends and supply the same values when building the request.
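
To illustrate the Content-Type notes above, here is a minimal sketch, not from the original post, of sending JSON to a hypothetical REST endpoint (the URL and fields are made up for the example):

import json
import urllib2

# Hypothetical endpoint, used only for illustration
url = 'http://www.server.com/api/items'
payload = json.dumps({'name': 'item1', 'price': 10})
headers = {'Content-Type': 'application/json',
           'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, payload, headers)
response = urllib2.urlopen(request)
print response.read()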

2. Proxy Settings

By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy. Some websites check how many times a given IP visits within a certain period, and if there are too many visits they will block you. So you can set up some proxy servers to do the work for you, switching to a different proxy every now and then; the site never knows who is behind the mischief. How satisfying!

The following code illustrates how to use proxy settings:

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)   # from now on, urllib2.urlopen goes through this opener
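
A small follow-up sketch, assuming the same placeholder proxy address as above: if you prefer not to install the opener globally, you can use the opener object directly, so only requests made through it go via the proxy.

import urllib2

proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
opener = urllib2.build_opener(proxy_handler)
# Only this opener uses the proxy; a plain urllib2.urlopen call is unaffected
response = opener.open('http://www.baidu.com')
print response.read()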

3. Timeout Setting

The previous section already introduced the urlopen method, whose third parameter is the timeout setting. It lets you specify how long to wait before giving up, so that sites that respond too slowly do not hold your program up.

For example, in the code below: if the second parameter data is left out, the timeout has to be given as a keyword argument; if data is passed as well, the keyword does not need to be spelled out.

import urllib2
response = urllib2.urlopen('http://www.baidu.com', timeout=10)

import urllib2
# data is the POST body; once it is supplied, the timeout can be passed positionally
response = urllib2.urlopen('http://www.baidu.com', data, 10)
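
A rough sketch, not from the original post: when the timeout fires, urllib2 typically raises urllib2.URLError (or a socket.timeout can surface while reading the body), so in practice the call is usually wrapped like this:

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com', timeout=10)
    page = response.read()
except urllib2.URLError as e:
    # A connect timeout is usually wrapped in URLError (reason is a socket.timeout)
    print 'request failed:', e.reason
except socket.timeout:
    # A timeout while reading the body may surface as socket.timeout directly
    print 'read timed out'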

4. Using the HTTP PUT and DELETE Methods

The HTTP protocol defines a number of request methods, including GET, HEAD, PUT, DELETE, POST and OPTIONS. Sometimes we need to make a PUT or DELETE request.

PUT: this method is relatively rarely used, and HTML forms do not support it. PUT and POST are essentially very similar in that both send data to the server, but there is an important difference: PUT usually specifies the location of the resource, whereas with POST the server itself decides where to store the data.
DELETE: deletes a resource. This is also mostly rare, but some services, such as Amazon's S3 cloud storage, use this method to delete resources.

If you want to use HTTP PUT and DELETE, you normally have to use the lower-level httplib library. Even so, we can make urllib2 send a PUT or DELETE request in the following way; it is rarely needed, but mentioned here for completeness.

import urllib2
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'   # or 'DELETE'
response = urllib2.urlopen(request)

5. Using Debuglog

You can use the following approach to turn on the debug log. The contents of what is sent and received are then printed to the screen, which is convenient for debugging. This is not used very often; it is just mentioned here.

import urllib2
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')

The above covers part of the advanced features; the first three are the important ones. Later on we will also look at cookie handling and exception handling. Keep at it, everyone!
