Python Crawler: URLError Exception Handling

Last Update:2016-02-20 Source: Internet

Author: User


This section covers URLError and HTTPError and how to handle them.

1. URLError

A URLError can be raised for several reasons:

No network connection: the local machine cannot reach the Internet.
The specified server cannot be reached.
The server does not exist.

In code, wrap the request in a try-except statement and catch the corresponding exception. The following is an example.

import urllib2

# Python 2: request a host that does not exist
request = urllib2.Request('http://www.xxxxx.com')
try:
    urllib2.urlopen(request)
except urllib2.URLError as e:
    print e.reason

We used the urlopen method to access a nonexistent URL. The running result is as follows:

[Errno 11004] getaddrinfo failed

The error number is 11004 and the reason is "getaddrinfo failed", which means the host name could not be resolved to an address.
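
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same pattern might look like the sketch below. The helper name try_fetch and the timeout value are illustrative assumptions, not part of the original example.

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=5):
    """Return (True, body) on success, or (False, reason) if a URLError is raised.

    try_fetch is a hypothetical helper; the 5-second timeout is an arbitrary choice.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, response.read()
    except urllib.error.URLError as e:
        # e.reason is either a message string or the underlying exception object
        return False, e.reason
```

Calling try_fetch with a URL whose scheme has no handler, such as the made-up 'nosuchscheme://example', returns False together with a reason that mentions the unknown URL type, and it does so without touching the network.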

2. HTTPError

HTTPError is a subclass of URLError. When you send a request with urlopen, the server returns a response that includes a numeric status code. Some responses urllib2 handles for you: for example, if the response is a redirect that asks the client to fetch the document from a different address, urllib2 follows it.

For responses it cannot handle, urlopen raises an HTTPError that carries the corresponding status code. The HTTP status code describes the state of the response returned over HTTP. The common status codes are summarized below:

100: Continue. The client should send the remaining part of the request, or ignore this response if the request has already been completed.
101: Switching Protocols. After sending the final blank line of this response, the server switches to the protocol named in the Upgrade header. This should be done only when switching to the new protocol is advantageous.
102: Processing. A status code added by the WebDAV extension (RFC 2518), meaning that processing is still in progress.
200: OK. The request succeeded. Handling: read the response content and process it.
201: Created. The request completed and a new resource was created. The URI of the new resource is available in the response.
202: Accepted. The request was accepted, but processing has not finished. Handling: block and wait.
204: No Content. The server fulfilled the request but returned no new content. A user agent does not need to update its document view. Handling: discard.
300: Multiple Choices. HTTP/1.0 applications do not use this code directly; it serves as the default interpretation of 3XX responses. Several versions of the requested resource are available. Handling: process further if the program can, otherwise discard.
301: Moved Permanently. The requested resource has been assigned a permanent URL, which should be used for future access. Handling: redirect to the assigned URL.
302: Found. The requested resource is temporarily available at a different URL. Handling: redirect to the temporary URL.
304: Not Modified. The requested resource has not been updated. Handling: discard.
400: Bad Request. Handling: discard.
401: Unauthorized. Handling: discard.
403: Forbidden. Handling: discard.
404: Not Found. Handling: discard.
500: Internal Server Error. The server encountered an unexpected condition that prevented it from processing the request. This usually happens when there is a bug in the server-side code.
501: Not Implemented. The server does not support a feature required by the request, for example when it does not recognize the request method and cannot support it for any resource.
502: Bad Gateway. A server acting as a gateway or proxy received an invalid response from the upstream server while trying to fulfil the request.
503: Service Unavailable. The server cannot process requests because of temporary maintenance or overload. The condition is temporary and service resumes after some time.
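
Python 3's standard library lists these codes in http.HTTPStatus, so a program does not have to hard-code the table. The following is a minimal sketch; describe_status and the class labels are hypothetical names chosen for illustration.

```python
from http import HTTPStatus

# Labels for the five status classes (wording chosen for this sketch)
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return 'code phrase (class)' for a status code known to HTTPStatus."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    return "%d %s (%s)" % (status.value, status.phrase, STATUS_CLASSES[code // 100])


if __name__ == "__main__":
    for code in (200, 302, 404, 503):
        print(describe_status(code))
```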

An HTTPError instance has a code attribute, which holds the status code sent by the server.
Because urllib2 handles redirects for you (codes in the 300 range) and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.

The following example catches an HTTPError and prints its code attribute, which is the status code. It also prints the reason attribute, which HTTPError inherits from its parent class URLError.

import urllib2

# Python 2: print the status code and reason of a failed request
req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:
    print e.code
    print e.reason

The running result is as follows:

403
Forbidden

The status code is 403 and the reason is Forbidden, which means the server refused access.
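
To reproduce a 403 without depending on an external site, the sketch below starts a throwaway local server that always answers 403 and reads code and reason from the resulting HTTPError. It is written for Python 3, where urllib.request and urllib.error replace urllib2. ForbiddenHandler, fetch_status and demo are hypothetical names, and the empty ProxyHandler is an assumption made so that proxy environment variables do not interfere with the loopback request.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ForbiddenHandler(BaseHTTPRequestHandler):
    """Hypothetical handler for this demo: answers every GET with 403."""

    def do_GET(self):
        self.send_error(403)

    def log_message(self, *args):
        pass  # suppress the default request logging


def fetch_status(url):
    """Return (status code, reason) whether the request succeeds or fails."""
    # An empty ProxyHandler makes the opener ignore proxy settings.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            return response.status, response.reason
    except urllib.error.HTTPError as e:
        return e.code, e.reason


def demo():
    """Serve 403 on a free loopback port and fetch it once."""
    server = HTTPServer(("127.0.0.1", 0), ForbiddenHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_status("http://127.0.0.1:%d/" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() should yield the same pair the article prints, 403 and Forbidden, as a tuple.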

HTTPError is a subclass of URLError. When both are caught, the except clause for the subclass must come before the clause for the parent class; otherwise the parent clause would catch every HTTPError first. Any error that is not an HTTPError then falls through to the URLError clause, so the code above can be rewritten as follows:

import urllib2

req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:   # subclass first
    print e.code
except urllib2.URLError as e:    # parent class second
    print e.reason
else:
    print "OK"

If an HTTPError is caught, its code is printed and the URLError clause is skipped. If the error is not an HTTPError, the URLError clause catches it and prints the reason. If no exception is raised, the else branch prints "OK".
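
The ordering rule can be checked without any network access by raising the exceptions directly. In this Python 3 sketch, classify is a hypothetical helper, and the example URL and messages are made up.

```python
import urllib.error


def classify(exc):
    """Re-raise exc and report which except clause catches it."""
    try:
        raise exc
    except urllib.error.HTTPError as e:   # subclass first
        return "HTTPError %d" % e.code
    except urllib.error.URLError as e:    # parent class second
        return "URLError %s" % e.reason


if __name__ == "__main__":
    # Arguments are url, code, msg, hdrs, fp; the values are invented.
    http_error = urllib.error.HTTPError(
        "http://example.invalid/", 404, "Not Found", None, None)
    url_error = urllib.error.URLError("connection refused")
    print(classify(http_error))
    print(classify(url_error))
```

If the two except clauses were swapped, the URLError clause would catch the HTTPError as well and the code would never be reported.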

You can also use hasattr to check whether the exception has an attribute before reading it. The code is rewritten as follows:

import urllib2

req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    # Only an HTTPError carries a code; check before printing
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
else:
    print "OK"

Checking the exception's attributes first avoids an AttributeError when printing an attribute that the exception does not have.
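
The attribute check can be tried offline in Python 3 by building the two exception types by hand. This is only a sketch: describe_error is a hypothetical helper, and the URL and messages are invented.

```python
import urllib.error


def describe_error(exc):
    """List whichever of 'code' and 'reason' the exception actually has."""
    parts = []
    if hasattr(exc, "code"):
        parts.append("code=%s" % exc.code)
    if hasattr(exc, "reason"):
        parts.append("reason=%s" % exc.reason)
    return ", ".join(parts)


if __name__ == "__main__":
    print(describe_error(urllib.error.URLError("timed out")))
    print(describe_error(urllib.error.HTTPError(
        "http://example.invalid/", 403, "Forbidden", None, None)))
```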

That concludes the introduction to URLError and HTTPError and how to handle them.
