[Python] web crawler (iii): Exception handling and classification of HTTP status codes

Last Update:2016-08-08 Source: Internet

Author: User



First, the problem of HTTP exception handling.
When urlopen cannot handle a response, it raises URLError.
As usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised.
HTTPError is a subclass of URLError, raised in the specific case of HTTP URLs.

1.URLError
Typically, URLError is raised when there is no network connection (no route to the specified server) or when the server does not exist.

In this case the exception has a "reason" attribute, which is usually a tuple (think of it as an immutable array) containing an error number and an error message.

Let's create urllib2_test06.py to try out the exception handling:

    import urllib2

    # This host is expected not to resolve, so urlopen raises URLError.
    req = urllib2.Request('http://www.baibai.com')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError as e:
        print e.reason

By pressing F5, you can see that the printed content is:

[Errno 11001] getaddrinfo failed

In other words, the error number is 11001 and the error message is "getaddrinfo failed".
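As a minimal sketch of reading that attribute without touching the network, the snippet below builds a URLError by hand from the error shown above and splits its reason into a number and a message. The helper name split_reason is hypothetical, and the import fallback is an assumption so that the snippet runs on both Python 2 (urllib2) and Python 3 (urllib.error). Depending on the Python version, reason may be an exception object or a plain string rather than a bare tuple, so the helper allows for each case.

```python
import socket

try:
    from urllib2 import URLError          # Python 2
except ImportError:
    from urllib.error import URLError     # Python 3 (assumed equivalent)


def split_reason(err):
    """Return (error_number, message) from a URLError's reason attribute."""
    reason = err.reason
    # Socket-level failures carry (errno, message) in .args; a bare tuple works too.
    args = getattr(reason, 'args', reason)
    if isinstance(args, tuple) and len(args) == 2:
        return args[0], args[1]
    # Otherwise the reason is just a message with no error number.
    return None, str(reason)


# Hand-built error that mimics the failed DNS lookup printed above.
fake = URLError(socket.gaierror(11001, 'getaddrinfo failed'))
number, message = split_reason(fake)
print(number)
print(message)
```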

2.HTTPError
Every HTTP response from the server contains a numeric "status code".

Sometimes the status code indicates that the server cannot fulfill the request. The default handlers deal with some of these responses for you.

For example, if the response is a "redirect" that asks the client to fetch the document from a different address, urllib2 handles it for you.

For responses it cannot handle, urlopen raises an HTTPError.

Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

The HTTP status code represents the status of the response returned by the HTTP protocol.

For example, the client sends a request to the server, and if it succeeds in obtaining the requested resource, the returned status code is 200, indicating a successful response.

If the requested resource does not exist, a 404 error is typically returned.

HTTP status codes are divided into 5 classes. Each code is a 3-digit integer whose first digit, from 1 to 5, identifies the class:

------------------------------------------------------------------------------------------------

200: The request succeeded. Handling: read the response content and process it.

201: The request was fulfilled and a new resource was created. The URI of the new resource is available in the response entity. Handling: a crawler will not encounter this.

202: The request has been accepted, but processing has not completed. Handling: block and wait.

204: The server has fulfilled the request but has no new information to return. If the client is a user agent, it does not need to update its document view. Handling: discard.

300: This status code is not used directly by HTTP/1.0 applications; it serves only as the default interpretation of 3XX responses. Multiple resources are available for the request. Handling: process further if the program can handle it, otherwise discard.

301: The requested resource has been assigned a permanent URL and should be accessed through that URL in future. Handling: redirect to the assigned URL.

302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.

304: The requested resource has not been modified. Handling: discard.

400: Bad request. Handling: discard.

401: Unauthorized. Handling: discard.

403: Forbidden. Handling: discard.

404: Not found. Handling: discard.

5XX: A status code beginning with "5" means the server has detected an error on its own side and cannot continue processing the request. Handling: discard.

------------------------------------------------------------------------------------------------
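The handling modes listed above could be encoded in a crawler roughly as follows. This is only a sketch: the names HANDLING, status_class and crawl_action and the action labels are made up for illustration, and codes not in the list are simply reported as "unlisted".

```python
# Handling modes taken from the list above; the action labels are made up.
HANDLING = {
    200: 'process',
    201: 'not expected',
    202: 'wait',
    204: 'discard',
    300: 'process if possible, else discard',
    301: 'redirect',
    302: 'redirect',
    304: 'discard',
    400: 'discard',
    401: 'discard',
    403: 'discard',
    404: 'discard',
}


def status_class(code):
    """Return the leading digit (1-5) that identifies the status class."""
    return code // 100


def crawl_action(code):
    """Look up the handling mode for a status code."""
    if status_class(code) == 5:
        return 'discard'          # every 5XX code is discarded
    return HANDLING.get(code, 'unlisted')


for code in (200, 302, 404, 503):
    print('%d -> class %d, %s' % (code, status_class(code), crawl_action(code)))
```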

An HTTPError instance has an integer "code" attribute, which is the error code sent by the server.

Error codes
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by the HTTP protocol.
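A short sketch of looking codes up in that dictionary is given here. The import fallback is an assumption so that it also runs on Python 3, where the class lives in http.server; each entry is a pair of a short and a long message.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # Python 3 (assumed equivalent)

# Each entry maps a status code to a (short message, long message) pair.
responses = BaseHTTPRequestHandler.responses

for code in (404, 403, 401):
    short_message, long_message = responses[code]
    print('%d: %s (%s)' % (code, short_message, long_message))
```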

When an error is raised, the server responds with an HTTP error code and an error page.

You can use the HTTPError instance as the response object for the page returned.

This means that, as well as the code attribute, it also has the read, geturl, and info methods.
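To see those response-like methods without a network request, the sketch below builds an HTTPError by hand. The URL, headers and body are invented for illustration, describe_http_error is a hypothetical helper, and the import fallback is an assumption so that the snippet runs on Python 2 and Python 3 alike.

```python
import io

try:
    from urllib2 import HTTPError          # Python 2
except ImportError:
    from urllib.error import HTTPError     # Python 3 (assumed equivalent)


def describe_http_error(err):
    """Collect the status code, URL, headers and body carried by an HTTPError."""
    return {
        'code': err.code,
        'url': err.geturl(),
        'headers': err.info(),
        'body': err.read(),
    }


# Hand-built stand-in for what urlopen would raise; all values are made up.
fake_error = HTTPError(
    'http://example.com/missing', 404, 'Not Found',
    {'Content-Type': 'text/html'}, io.BytesIO(b'<h1>Not Found</h1>'))
details = describe_http_error(fake_error)
print(details['code'])
print(details['url'])
print(details['body'])
```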

Let's create urllib2_test07.py to try this out:

    import urllib2

    req = urllib2.Request('http://bbs.csdn.net/callmewhy')
    try:
        urllib2.urlopen(req)
    # Catch HTTPError here: a plain URLError has no code attribute.
    except urllib2.HTTPError as e:
        print e.code
        # print e.read()

Press F5 and the output is the error code 404, which means the page could not be found.

3.Wrapping

So if you want to be prepared for HTTPError or URLError, there are two basic approaches. The second is recommended.

Let's create urllib2_test08.py to demonstrate the first approach:

    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://bbs.csdn.net/callmewhy')
    try:
        response = urlopen(req)
    except HTTPError as e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError as e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        print 'No exception was raised.'
        # everything is fine

As in other languages, try catches the exception and the handler prints its contents.

One thing to note here is that except HTTPError must come first, otherwise except URLError will also catch the HTTPError.
Because HTTPError is a subclass of URLError, an except URLError clause placed first catches every URLError (including HTTPError).
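The effect of the clause order can be checked offline with a hand-built error. In this sketch the two function names and the URL are hypothetical, and the import fallback is an assumption so that it runs on both Python 2 and Python 3.

```python
try:
    from urllib2 import URLError, HTTPError          # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError     # Python 3 (assumed equivalent)


def http_error_first(error):
    """Raise `error` with the HTTPError clause listed first."""
    try:
        raise error
    except HTTPError:
        return 'HTTPError clause'
    except URLError:
        return 'URLError clause'


def url_error_first(error):
    """Raise `error` with the URLError clause listed first."""
    try:
        raise error
    except URLError:
        return 'URLError clause'
    except HTTPError:          # never reached: the clause above already matches
        return 'HTTPError clause'


# Hand-built 404 error; the URL is made up.
not_found = HTTPError('http://example.com/missing', 404, 'Not Found', {}, None)
print(issubclass(HTTPError, URLError))
print(http_error_first(not_found))
print(url_error_first(not_found))
```

With the URLError clause first, the 404 error is reported by the wrong handler, which is exactly the mistake the ordering rule avoids.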

Let's build a urllib2_test09.py to demonstrate the second exception handling scenario:

    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://bbs.csdn.net/callmewhy')
    try:
        response = urlopen(req)
    except URLError as e:
        # HTTPError carries a code; a plain URLError only has a reason.
        if hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
    else:
        print 'No exception was raised.'
        # everything is fine
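The attribute checks of this second approach can be exercised on hand-built errors, without a network connection. This is a sketch under assumptions: explain is a hypothetical helper, the URL and messages are invented, and the import fallback is there so the snippet runs on both Python 2 and Python 3. On interpreters where HTTPError also exposes a reason attribute (as in Python 3), testing for code first is what keeps the two cases apart.

```python
try:
    from urllib2 import URLError, HTTPError          # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError     # Python 3 (assumed equivalent)


def explain(error):
    """Describe a URLError using the attribute checks from the second approach."""
    if hasattr(error, 'code'):
        # Only HTTPError has a code: the server answered, but with an error.
        return 'server error, code %d' % error.code
    elif hasattr(error, 'reason'):
        # A plain URLError: the server could not be reached at all.
        return 'unreachable, reason: %s' % (error.reason,)
    return 'unknown'


# Hand-built errors; the URL and messages are made up.
not_found = HTTPError('http://example.com/missing', 404, 'Not Found', {}, None)
no_server = URLError('getaddrinfo failed')
print(explain(not_found))
print(explain(no_server))
```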




