This section describes some advanced usage of the Python Urllib library, basic knowledge for writing crawlers in Python.

1. Set Headers

Some websites refuse requests made directly with the methods shown so far: if they cannot identify the client, they simply do not respond. Therefore, to fully simulate the work of a browser, we need to set some header attributes.

First, open your browser and press F12 to bring up its developer tools; I use Chrome's network monitor. After logging in to a site, for example, you will find that the page has changed and new requests appear. A page actually contains a lot of content that is not loaded all at once; in essence, many requests are executed. Generally the HTML file is requested first, and then the JS and CSS files are loaded. Only after multiple requests are the skeleton and muscles of the web page complete and the whole page displayed.

Break these requests down and look at just the first one. You can see it has a Request URL and request headers, followed by the response (the capture is incomplete here; try it yourself). The headers carry a lot of information, such as the accepted file encodings and compression, and the identity of the requesting agent.

That agent, the User-Agent, is the identity of the request. If it is missing, the server may not respond at all, so you can set it in the headers. The example below shows only how headers are set; let's look at the format.

import urllib
import urllib2

url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'username' : 'cqc', 'password' : 'XXXX'}
headers = {'User-Agent' : user_agent}
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
page = response.read()

In this way we build a headers dictionary and pass it in when the Request is constructed, so the headers are sent along with the request. If the server believes the request was sent by a browser, it will respond.

In addition, we can deal with anti-hotlinking ("anti-leeching") the same way. To block hotlinking, the server checks whether the Referer in the headers points to its own site; if not, some servers will not respond, so we can add a Referer to the headers.

For example, we can build the following headers

headers = {'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer' : 'http://www.zhihu.com/articles'}

As above, the headers are passed in as a parameter when the Request is built, and the anti-hotlinking check is satisfied.
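Putting the two together, a minimal sketch (the page URL here is a hypothetical placeholder):

import urllib2

headers = {'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer' : 'http://www.zhihu.com/articles'}
# hypothetical page URL, used only for illustration
url = 'http://www.zhihu.com/articles/12345'
# data is None, so this is a GET request; headers go in as the third argument
request = urllib2.Request(url, None, headers)
response = urllib2.urlopen(request)
page = response.read()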

In addition, pay special attention to the following header attributes:

  1. User-Agent: Some servers or proxies use this value to determine whether the request was sent by a browser.
  2. Content-Type: When a REST interface is used, the server checks this value to decide how to parse the content in the HTTP body.
  3. application/xml: used in XML RPC calls, such as RESTful/SOAP
  4. application/json: used in JSON RPC calls
  5. application/x-www-form-urlencoded: used when a browser submits a web form
  6. When using a RESTful or SOAP service provided by the server, a wrong Content-Type setting may cause the server to refuse the request.

For other headers, you may need to inspect what the browser sends and supply the same data when building the request.
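For example, a minimal sketch of calling a REST interface with an explicit Content-Type (the endpoint URL is a hypothetical placeholder):

import json
import urllib2

# hypothetical REST endpoint, used only for illustration
url = 'http://www.server.com/api/items'
body = json.dumps({'name' : 'example'})
headers = {'Content-Type' : 'application/json'}
request = urllib2.Request(url, body, headers)
response = urllib2.urlopen(request)
print response.read()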
2. Proxy Settings

By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy. Some websites monitor how many visits an IP address makes in a certain period of time and will block an address that visits too often. To work around this you can use proxy servers: if you switch to a different proxy every so often, the website cannot tell that the requests all come from you.

The following code describes how to set the proxy.

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)
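Note that install_opener makes the opener global, affecting every later urlopen call. If you would rather not change global state, a sketch that uses the opener directly:

import urllib2

proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
opener = urllib2.build_opener(proxy_handler)
# call the opener directly instead of installing it globally
response = opener.open('http://www.baidu.com')
print response.read()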

3. Timeout settings

The urlopen method was mentioned in the previous section; its third parameter is the timeout, which sets how long to wait before giving up. This solves problems caused by websites that respond slowly.

For example, in the code below: if the second parameter data is not supplied, the timeout must be specified as a keyword argument, because it is the third positional parameter; if data is passed in, the timeout can be given positionally without naming it.

import urllib
import urllib2

# without data, timeout must be passed as a keyword argument
response = urllib2.urlopen('http://www.baidu.com', timeout=10)

# with data, timeout can be passed positionally as the third argument
data = urllib.urlencode({'key' : 'value'})
response = urllib2.urlopen('http://www.baidu.com', data, 10)
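When the timeout expires, urlopen raises an exception rather than returning. A minimal sketch of catching it, assuming the timeout surfaces as a urllib2.URLError wrapping a socket.timeout (as it does in Python 2):

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com', timeout=0.01)
except urllib2.URLError as e:
    # a connection timeout is reported as a URLError whose reason is socket.timeout
    if isinstance(e.reason, socket.timeout):
        print 'request timed out'
    else:
        raise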

4. Use the PUT and DELETE methods of HTTP

HTTP defines several request methods, among them GET, HEAD, POST, PUT, DELETE, and OPTIONS. Sometimes we need to use PUT or DELETE requests.

PUT: This method is fairly rare, and HTML forms do not support it. Essentially, PUT and POST are very similar in that both send data to the server, but there is an important difference between them: PUT usually specifies where the resource should be stored, while for POST the storage location is decided by the server.
DELETE: Deletes a resource. This is also rare, but some services use it; for example, Amazon's S3 cloud service uses it to delete resources.

urllib2 does not support HTTP PUT and DELETE directly; for those you normally have to use the lower-level httplib library. Even so, the following trick makes urllib2 send a PUT or DELETE request. It is rarely needed, but worth mentioning.

import urllib2

# hypothetical target resource and request body, for illustration only
uri = 'http://www.server.com/resource/1'
data = 'new contents'
request = urllib2.Request(uri, data=data)
# override get_method so the request goes out as PUT instead of POST
request.get_method = lambda: 'PUT' # or 'DELETE'
response = urllib2.urlopen(request)
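For comparison, here is a sketch of the same PUT issued through the lower-level httplib mentioned above (host and path are hypothetical placeholders):

import httplib

# httplib lets you name the HTTP method directly
conn = httplib.HTTPConnection('www.server.com')
conn.request('PUT', '/resource/1', 'new contents')
response = conn.getresponse()
print response.status, response.reason
conn.close()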

5. Use DebugLog

You can use the following method to turn on the debug log; the contents of the packets sent and received are then printed to the screen, which is convenient for debugging. It is not used often, but worth mentioning.


import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')

Those are some of the advanced features, the first three of which are the important ones. Cookies and exception handling are still to come. Let's keep at it!
