Python Crawler: URLError Exception Handling
1. URLError
First, the possible causes of a URLError:
- No network connection, that is, the local machine cannot access the Internet
- The specified server cannot be reached.
- The server does not exist.
In the code, we wrap the request in a try-except statement to catch the corresponding exception. The following is an example:
    # coding: utf-8
    # Python 2 (urllib2)
    import urllib2

    request = urllib2.Request('http://www.!!!!.com')

    try:
        urllib2.urlopen(request)
    except urllib2.URLError as e:
        # reason describes why the request failed
        print e.reason
The urlopen method is used to access a nonexistent URL. The running result is as follows:
[Errno 11004] getaddrinfo failed
The error number is 11004 and the message is "getaddrinfo failed", meaning the host name could not be resolved to an address.
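The same pattern can be sketched for Python 3, where urllib2 was split into urllib.request and urllib.error. This is only an illustration: the helper name try_fetch and its opener parameter are hypothetical, added so the error path can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def try_fetch(url, opener=urllib.request.urlopen):
    """Return (body, None) on success, or (None, reason) if a URLError is raised.

    `opener` is a hypothetical hook: any callable that takes a URL and
    returns a response object usable as a context manager.
    """
    try:
        with opener(url) as response:
            return response.read(), None
    except urllib.error.URLError as e:
        # e.reason holds the cause of the failure (a string or an exception)
        return None, e.reason
```

With the default opener, try_fetch performs a real request; passing a stand-in opener that raises URLError lets you check the handling offline.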
2. HTTPError
HTTPError is a subclass of URLError. When you send a request with the urlopen method, the server returns a response that carries a numeric status code. Some responses are handled by urllib2 for you; for example, if the response is a redirect, the document has to be fetched from another address, and urllib2 follows it automatically.
For responses it cannot handle, urlopen raises an HTTPError whose code is the corresponding status. The HTTP status code indicates the status of the response returned over the HTTP protocol. The status codes are summarized as follows:
- 100: The client should continue sending the remaining part of the request, or ignore this response if the request has already been completed.
- 101: After sending the final blank line of the response, the server switches to the protocol defined in the Upgrade header. This should be done only when switching to the new protocol is advantageous.
- 102: Processing. A status code added by the WebDAV extension (RFC 2518), indicating that processing will continue.
- 200: The request succeeded. Handling: obtain the response content and process it.
- 201: The request is complete and a new resource has been created. The URI of the new resource is available in the response. Handling: a crawler will not encounter this.
- 202: The request was accepted, but processing is not complete. Handling: block and wait.
- 204: The server fulfilled the request but returned no new information. If the client is a user agent, it does not need to update its document view. Handling: discard.
- 300: Not used directly by HTTP/1.0 applications; it is only the default interpretation of 3XX responses. Multiple resources are available for the request. Handling: process further if the program can handle it, otherwise discard.
- 301: The requested resource has been assigned a permanent URL, which should be used to access it in the future. Handling: redirect to the assigned URL.
- 302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.
- 304: The requested resource has not been modified. Handling: discard.
- 400: Bad request. Handling: discard.
- 401: Unauthorized. Handling: discard.
- 403: Forbidden. Handling: discard.
- 404: Not found. Handling: discard.
- 500: Internal server error. The server encountered an unexpected condition that prevented it from processing the request. This usually happens when the server's code has a bug.
- 501: Not implemented. The server does not support a feature required by the request, for example when it does not recognize the request method and cannot support it for any resource.
- 502: Bad gateway. A server acting as a gateway or proxy received an invalid response from the upstream server while trying to carry out the request.
- 503: Service unavailable. The server cannot process requests because of temporary maintenance or overload. The condition is temporary and clears after a period of time.
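The handling notes in the list above can be sketched as a lookup table. This is a rough illustration in Python 3, not part of urllib2: the names CRAWLER_ACTIONS, crawler_action and status_phrase are hypothetical, and returning "unspecified" for codes where the list names no handling is an assumption of this sketch. The standard reason phrase comes from http.HTTPStatus in the standard library.

```python
from http import HTTPStatus

# Handling taken from the list above; codes without a stated handling are omitted.
CRAWLER_ACTIONS = {
    200: "process",
    201: "not encountered",
    202: "wait",
    204: "discard",
    300: "process or discard",
    301: "redirect",
    302: "redirect",
    304: "discard",
    400: "discard",
    401: "discard",
    403: "discard",
    404: "discard",
}


def crawler_action(code):
    """Return the handling named in the list, or 'unspecified' (assumption)."""
    return CRAWLER_ACTIONS.get(code, "unspecified")


def status_phrase(code):
    """Return the standard reason phrase for a status code."""
    return HTTPStatus(code).phrase
```

For example, crawler_action(404) gives "discard", while the 5XX codes fall through to "unspecified" because the list describes them without prescribing a handling.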
Every HTTPError instance has a code attribute, which is the status code sent by the server.
Because urllib2 handles redirects for you (codes beginning with 3), and codes in the 100-299 range indicate success, you will normally see only error codes in the 400-599 range.
Let's take an example. The exception caught is an HTTPError, which carries a code attribute holding the error code. We also print the reason attribute, which is inherited from its parent class URLError.
    # coding: utf-8
    import urllib2

    request = urllib2.Request('http://www.blog.com/cqcre')

    try:
        urllib2.urlopen(request)
    except urllib2.HTTPError as e:
        print e.code    # numeric status code sent by the server
        print e.reason  # textual reason, inherited from URLError
The running result is as follows:
502
Bad Gateway
The error code is 502 and the reason is "Bad Gateway".
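To look at these two attributes without depending on a live server, an HTTPError can be constructed by hand. The sketch below uses Python 3's urllib.error, which is an assumption since the article's code targets Python 2; the helper make_http_error and the example.invalid URL are hypothetical.

```python
import io
import urllib.error
from email.message import Message


def make_http_error(code, reason, url="http://example.invalid/"):
    """Build an HTTPError with empty headers and an empty body (illustration only)."""
    return urllib.error.HTTPError(url, code, reason, Message(), io.BytesIO(b""))


error = make_http_error(502, "Bad Gateway")

# code is specific to HTTPError; reason is also available on plain URLError
summary = (error.code, error.reason, isinstance(error, urllib.error.URLError))
```

Here summary holds the code, the reason, and a flag confirming that the HTTPError is also a URLError.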
We know that the parent class of HTTPError is URLError. By convention, the except clause for a parent class should be written after the clause for its child class, so that anything the child clause does not catch can still be caught by the parent clause. The code above can therefore be rewritten as follows:
    # coding: utf-8
    import urllib2

    request = urllib2.Request('http://blog.csdn.com/cqcre')

    try:
        urllib2.urlopen(request)
    except urllib2.HTTPError as e:
        # child class first: the server answered with an error status
        print e.code
    except urllib2.URLError as e:
        # parent class second: the request never got a valid response
        print e.reason
    else:
        print 'OK'
If an HTTPError is caught, its code is printed and the URLError clause is skipped. If the error is not an HTTPError, the URLError clause catches it and prints the cause of the error.
You can also use hasattr to check that an attribute exists before accessing it. The code is rewritten as follows:
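The effect of the clause order can be checked offline with a small Python 3 sketch. The function classify and the two raise_ helpers are hypothetical names, and the errors are raised by hand rather than by a real request.

```python
import io
import urllib.error
from email.message import Message


def classify(action):
    """Run `action` and report which branch handled it."""
    try:
        action()
    except urllib.error.HTTPError as e:
        # must come first: HTTPError is a subclass of URLError
        return "HTTPError %d" % e.code
    except urllib.error.URLError as e:
        return "URLError %s" % e.reason
    else:
        return "OK"


def raise_http_error():
    raise urllib.error.HTTPError(
        "http://example.invalid/", 404, "Not Found", Message(), io.BytesIO(b"")
    )


def raise_url_error():
    raise urllib.error.URLError("getaddrinfo failed")
```

If the two except clauses were swapped, the URLError clause would catch every HTTPError as well, and the HTTPError branch would never run.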
    # coding: utf-8
    import urllib2

    request = urllib2.Request('http://blog.csdn.com/cqcre')

    try:
        urllib2.urlopen(request)
    except urllib2.URLError as e:
        # check that the attribute exists before printing it
        if hasattr(e, 'reason'):
            print e.reason
    else:
        print 'OK'
Checking the exception's attributes first avoids an error when printing an attribute that is not there.
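As a hedged illustration of the hasattr approach in Python 3, the sketch below checks for both code and reason with a single URLError handler in mind. The function describe is a hypothetical name, and checking code as well as reason is an addition of this sketch, not something the code above does.

```python
import io
import urllib.error
from email.message import Message


def describe(error):
    """List the attributes the exception actually has, without assuming its class."""
    parts = []
    if hasattr(error, "code"):
        # only HTTPError carries a status code
        parts.append("code: %s" % error.code)
    if hasattr(error, "reason"):
        parts.append("reason: %s" % error.reason)
    return parts


plain = urllib.error.URLError("no route to host")
http = urllib.error.HTTPError(
    "http://example.invalid/", 404, "Not Found", Message(), io.BytesIO(b"")
)
```

Calling describe on the plain URLError reports only a reason, while the HTTPError reports both a code and a reason.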