Python crawler URLError Exception Handling

Source: Internet
Author: User

1. URLError

 

A URLError can have several causes:

  • There is no network connection, that is, the local machine cannot access the Internet.
  • The specified server cannot be reached.
  • The server does not exist.

In the code, we wrap the request in a try-except statement and catch the corresponding exception. The following is an example:

    # coding: utf-8
    import urllib2

    request = urllib2.Request('http://www.!!!!.com')

    try:
        urllib2.urlopen(request)
    except urllib2.URLError as e:
        # reason describes why the request failed
        print(e.reason)

The urlopen method is used to access a nonexistent URL. The running result is as follows:

[Errno 11004] getaddrinfo failed

The error number is 11004, and the reason is that getaddrinfo failed: the host name could not be resolved to an address.
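The same pattern can be sketched for Python 3, where the functionality of urllib2 lives in urllib.request and urllib.error. This is a minimal sketch, not a definitive implementation: fetch, its opener parameter, offline_opener, and the unknown-host.invalid URL are hypothetical names that do not appear in the original. The opener parameter exists only so the error path can be tried without a network connection.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen, timeout=10):
    """Return (body, reason); exactly one of the two is None.

    Hypothetical helper: `opener` defaults to urlopen and can be
    swapped for a stand-in when trying the code offline.
    """
    try:
        response = opener(url, timeout=timeout)
    except URLError as e:
        # Covers no network, unreachable server, and unknown host.
        return None, e.reason
    try:
        return response.read(), None
    finally:
        response.close()


def offline_opener(url, timeout=None):
    # Stand-in that fails the way a failed host lookup would.
    raise URLError("getaddrinfo failed")


body, reason = fetch("http://unknown-host.invalid/", opener=offline_opener)
print(reason)  # prints: getaddrinfo failed
```

Returning the reason instead of printing it lets the caller decide whether to retry, skip, or log the URL.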

2. HTTPError

 

HTTPError is a subclass of URLError. When you send a request with the urlopen method, the server replies with a response that contains a numeric "status code". Some statuses are handled for you by urllib2. For example, if the response is a "redirection" that requires fetching the document from another address, urllib2 follows it automatically.

For responses it cannot handle, urlopen raises an HTTPError that corresponds to the response status. An HTTP status code indicates the status of the response returned over the HTTP protocol. The status codes are summarized as follows:

  • 100: Continue. The client should continue sending the remaining part of the request, or ignore this response if the request has already been completed.
  • 101: Switching protocols. After sending the final blank line of this response, the server switches to the protocol defined in the Upgrade header. This should be done only when switching to the new protocol is more advantageous.
  • 102: Processing. A status code extended by WebDAV (RFC 2518), indicating that processing will continue.
  • 200: The request succeeded. Handling: obtain the response content and process it.
  • 201: The request is complete and a new resource was created. The URI of the new resource is available in the response. Handling: a crawler does not encounter this.
  • 202: The request was accepted, but processing is not complete. Handling: block and wait.
  • 204: The server fulfilled the request but returned no new content. A user agent does not need to update its document view. Handling: discard.
  • 300: Not used directly by HTTP/1.0 applications; it serves as the default interpretation of 3XX responses. Multiple versions of the requested resource are available. Handling: process further if the program can, otherwise discard.
  • 301: The requested resource has been assigned a permanent URL, which should be used to access it in the future. Handling: redirect to the assigned URL.
  • 302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.
  • 304: The requested resource has not been modified. Handling: discard.
  • 400: Bad request. Handling: discard.
  • 401: Unauthorized. Handling: discard.
  • 403: Forbidden. Handling: discard.
  • 404: Not found. Handling: discard.
  • 500: Internal server error. The server encountered an unexpected condition that prevented it from fulfilling the request. This generally happens when the server's code has a bug.
  • 501: Not implemented. The server does not support a feature required by the request, for example when it does not recognize the request method and cannot support it for any resource.
  • 502: Bad gateway. A server acting as a gateway or proxy received an invalid response from the upstream server while trying to fulfill the request.
  • 503: Service unavailable. The server cannot process requests because of temporary maintenance or overload. The condition is temporary and will be resolved after a period of time.
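The handling rules in the summary above could be collected into a lookup table. This is a minimal sketch assuming Python 3.5 or later (for http.HTTPStatus); ACTIONS, crawler_action, describe, and the action labels are hypothetical names chosen for illustration. Codes for which the summary states no handling fall back to "unspecified".

```python
from http import HTTPStatus

# Handling per status code, taken from the summary above.
# The label strings are hypothetical.
ACTIONS = {
    200: "process",
    202: "wait",
    204: "discard",
    300: "process-or-discard",
    301: "redirect",
    302: "redirect",
    304: "discard",
    400: "discard",
    401: "discard",
    403: "discard",
    404: "discard",
}


def crawler_action(status):
    """Return the handling label for a status code."""
    return ACTIONS.get(status, "unspecified")


def describe(status):
    """Format a code with its standard phrase and handling label.

    HTTPStatus(status) raises ValueError for codes it does not know.
    """
    phrase = HTTPStatus(status).phrase
    return "%d %s -> %s" % (status, phrase, crawler_action(status))


print(describe(404))  # prints: 404 Not Found -> discard
```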

An HTTPError instance has a code attribute, which is the status code sent by the server. Because urllib2 handles redirection for you (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
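This behaviour can be observed without touching the Internet by pointing urlopen's machinery at a throwaway local server. The sketch below assumes Python 3 (urllib.request and urllib.error in place of urllib2) and that the loopback address 127.0.0.1 is usable. DemoHandler, run_demo, and the paths /old, /new, and /missing are hypothetical names invented for the demonstration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old":
            # Redirect: the client is expected to follow this itself.
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/new":
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo():
    """Return ((status, body) for /old, error code for /missing)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
    try:
        # The 302 never reaches our code; we only see the final 200.
        with opener.open(base + "/old", timeout=5) as response:
            redirected = (response.status, response.read())
        try:
            opener.open(base + "/missing", timeout=5)
            missing = None
        except HTTPError as e:
            missing = e.code  # the 404 surfaces as an exception
            e.close()
    finally:
        server.shutdown()
        server.server_close()
    return redirected, missing
```

Calling run_demo() is expected to return ((200, b'hello'), 404): the redirect is followed silently, while the 404 is raised as an HTTPError.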

Let's take an example. The caught exception is an HTTPError, which carries a code attribute, that is, the error code. We also print the reason attribute, which is an attribute of its parent class URLError.

    # coding: utf-8
    import urllib2

    request = urllib2.Request('http://www.blog.com/cqcre')

    try:
        urllib2.urlopen(request)
    except urllib2.HTTPError as e:
        print(e.code)    # status code sent by the server
        print(e.reason)  # reason attribute, as on the parent class URLError

The running result is as follows:

502
Bad Gateway

The error code is 502, and the reason is 'Bad Gateway'.

We know that the parent class of HTTPError is URLError. As a general rule, the except clause for a parent class should be written after the clause for its child class, so that anything the child clause does not catch can still be caught by the parent clause. The above code can therefore be rewritten as follows:

    # coding: utf-8
    import urllib2

    request = urllib2.Request('http://blog.csdn.com/cqcre')

    try:
        urllib2.urlopen(request)
    except urllib2.HTTPError as e:
        # child class first
        print(e.code)
    except urllib2.URLError as e:
        # parent class second
        print(e.reason)
    else:
        print('OK')

If an HTTPError is caught, its code is printed and the URLError clause is skipped. If the error is not an HTTPError, the URLError clause catches it and prints the cause of the error.
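The effect of the clause order can be checked offline by raising the two exception types by hand. This is a minimal sketch assuming Python 3 (urllib.error in place of urllib2); classify, raise_http, raise_url, and the example.invalid URL are hypothetical names that are not part of the original code.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def classify(make_request):
    """Run `make_request` and report which clause handled the outcome.

    `make_request` is any callable that performs the request,
    for example: lambda: urlopen(url).
    """
    try:
        make_request()
    except HTTPError as e:   # child class first
        return "HTTPError %d" % e.code
    except URLError as e:    # parent class second
        return "URLError %s" % e.reason
    else:
        return "OK"


def raise_http():
    # Hand-built error: url, code, message, headers, body.
    raise HTTPError("http://example.invalid/", 404, "Not Found",
                    Message(), io.BytesIO(b""))


def raise_url():
    raise URLError("getaddrinfo failed")


print(classify(raise_http))    # prints: HTTPError 404
print(classify(raise_url))     # prints: URLError getaddrinfo failed
print(classify(lambda: None))  # prints: OK
```

If the two except clauses were swapped, the URLError clause would also catch every HTTPError, and the HTTPError clause would never run.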

You can also use hasattr to check that the exception has an attribute before accessing it. The code is rewritten as follows:

    # coding: utf-8
    import urllib2

    request = urllib2.Request('http://blog.csdn.com/cqcre')

    try:
        urllib2.urlopen(request)
    except urllib2.URLError as e:
        # only print reason if the exception actually has it
        if hasattr(e, 'reason'):
            print(e.reason)
    else:
        print('OK')

Checking that the exception has the attribute first avoids an error when printing an attribute that does not exist.
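The same check can be extended to the code attribute, so that a single URLError clause reports whatever the exception carries. This is a minimal sketch assuming Python 3 (urllib.error in place of urllib2); describe_error, the sample errors, and the example.invalid URL are hypothetical and only serve to exercise both cases without a network connection.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def describe_error(e):
    """List the attributes that this particular exception carries."""
    parts = []
    if hasattr(e, "code"):      # present on HTTPError only
        parts.append("code=%s" % e.code)
    if hasattr(e, "reason"):    # present on URLError and HTTPError
        parts.append("reason=%s" % e.reason)
    return ", ".join(parts)


plain = URLError("timed out")
http = HTTPError("http://example.invalid/", 503, "Service Unavailable",
                 Message(), io.BytesIO(b""))

print(describe_error(plain))  # prints: reason=timed out
print(describe_error(http))   # prints: code=503, reason=Service Unavailable
```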
