[Python] web crawler (iii): Exception handling and classification of HTTP status codes

First, the problem of HTTP exception handling.
When urlopen cannot handle a response, it raises a URLError (the usual built-in exceptions, such as ValueError and TypeError, may of course be raised at the same time as well).
HTTPError is a subclass of URLError, and is raised in the case of specific HTTP URLs.

1.URLError
Typically, URLError is raised when there is no network connection (no route to the given server) or when the target server does not exist.

In this case, the exception also has a "reason" attribute, which is a tuple (think of it as an immutable array) containing an error number and an error message.

Let's write urllib2_test06.py to get a feel for handling this exception:

[Python]

    import urllib2

    req = urllib2.Request('http://www.baibai.com')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError, e:
        print e.reason


Press F5 to run it, and the printed output is:

[Errno 11001] getaddrinfo failed

In other words, the error number is 11001 and the message is "getaddrinfo failed".
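If you need the error number and the message separately, the reason attribute (a socket error in this case) exposes them. A minimal sketch building on the example above; the attribute access assumes the failure is a network-level error:

[Python]

    import urllib2

    req = urllib2.Request('http://www.baibai.com')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError, e:
        # For network failures, e.reason is a socket.error that carries an
        # errno and a strerror in addition to printing as a pair.
        print e.reason           # [Errno 11001] getaddrinfo failed
        print e.reason.errno     # 11001
        print e.reason.strerror  # getaddrinfo failed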


2.HTTPError
Every HTTP response the server returns carries a numeric "status code".

Sometimes the status code indicates that the server cannot fulfill the request. The default handlers will deal with some of these responses for you.

For example, if the response is a "redirect" that asks the client to fetch the document from a different address, urllib2 handles it for you.
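As a quick illustration of that behaviour, here is a small sketch; the URL is a placeholder and should be replaced with one that actually issues a redirect:

[Python]

    import urllib2

    # Placeholder URL: substitute an address known to return a 301/302.
    response = urllib2.urlopen('http://example.com/some-redirecting-url')

    # geturl() returns the URL that was finally fetched; it differs from
    # the requested URL when the default handler has followed a redirect.
    print response.geturl()
    print response.getcode()   # 200 once the redirect has been followed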

For responses it cannot handle, urlopen raises an HTTPError.

Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

An HTTP status code indicates the status of the response returned by the server.

For example, when the client sends a request to the server, if the requested resource is obtained successfully, the returned status code is 200, indicating a successful response.

If the requested resource does not exist, a 404 error is typically returned.

HTTP status codes are usually divided into five classes; each code is a three-digit integer whose first digit (1 through 5) identifies the class. The codes a crawler most often needs to handle are listed below (a sketch of a simple dispatcher based on these rules follows the list):

------------------------------------------------------------------------------------------------

200: Request successful. Handling: get the content of the response and process it.

201: Request complete; a new resource was created, and its URI is returned in the response entity. Handling: a crawler will not normally encounter this.

202: Request accepted, but processing is not yet complete. Handling: block and wait.

204: The server has fulfilled the request but returns no new information; if the client is a user agent, it need not update its own document view. Handling: discard.

300: Not used directly by HTTP/1.0 applications; it serves only as the default interpretation for 3xx responses. Multiple versions of the requested resource are available. Handling: process further if the program can, otherwise discard.

301: The requested resource has been assigned a permanent new URL and should be accessed through that URL in the future. Handling: redirect to the assigned URL.

302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.

304: The requested resource has not been modified. Handling: discard.

400: Bad (illegal) request. Handling: discard.

401: Unauthorized. Handling: discard.

403: Forbidden. Handling: discard.

404: Not found. Handling: discard.

5xx: Codes starting with "5" indicate that the server has encountered an error and cannot continue to process the request. Handling: discard.

------------------------------------------------------------------------------------------------
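Here is the dispatcher sketch mentioned above. It only illustrates the handling rules from the list; the function name and the action labels are made up for the example and are not part of urllib2 (the 300 case is simplified to "discard"):

[Python]

    def classify_status(code):
        # Map an HTTP status code to the crawler action described above.
        if code == 200:
            return 'process'    # fetch and process the response body
        if code == 201:
            return 'ignore'     # a crawler will not normally see this
        if code == 202:
            return 'wait'       # accepted but not finished: block and wait
        if code in (301, 302):
            return 'redirect'   # follow the URL the server supplies
        if code in (204, 300, 304) or 400 <= code < 600:
            return 'discard'
        return 'discard'        # anything unrecognised: discard as well

    print classify_status(404)  # discard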

When an HTTPError is raised, the instance carries an integer "code" attribute, which is the error number sent by the server.

Error codes

Because the default handlers take care of redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will normally only see error codes in the 400-599 range.

BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that lists all the response codes used by the HTTP protocol.
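For example, a quick lookup in that dictionary (Python 2 only; the module was renamed in Python 3):

[Python]

    from BaseHTTPServer import BaseHTTPRequestHandler

    # responses maps each status code to a (short message, explanation) pair.
    short, explanation = BaseHTTPRequestHandler.responses[404]
    print short          # Not Found
    print explanation    # Nothing matches the given URI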

When an error code is raised, the server responds with an HTTP error code and an error page.

You can use the HTTPError instance as a response object for the page that was returned.

This means that, in addition to the code attribute, it also has the read, geturl, and info methods.

Let's write urllib2_test07.py to try it out:

[Python]

    import urllib2

    req = urllib2.Request('http://bbs.csdn.net/callmewhy')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError, e:
        print e.code
        #print e.read()


Press F5 and you will see the output error code 404, meaning the page could not be found.
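Since the HTTPError instance doubles as a response object, you can also inspect the error page itself. A small sketch along the same lines as urllib2_test07.py:

[Python]

    import urllib2

    req = urllib2.Request('http://bbs.csdn.net/callmewhy')
    try:
        urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print e.code       # the numeric status code, e.g. 404
        print e.geturl()   # the URL that produced the error
        print e.info()     # the response headers
        # e.read() would return the body of the error page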


3.Wrapping

So if you want to be prepared for HTTPError or URLError, there are two basic approaches. The second is recommended.

Let's write urllib2_test08.py to demonstrate the first approach to exception handling:

[Python]

    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://bbs.csdn.net/callmewhy')
    try:
        response = urlopen(req)
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        print 'No exception was raised.'
        # everything is fine


As in other languages, try catches the exception and we print its contents.

One thing to note here is that except HTTPError must come first; otherwise except URLError would also catch the HTTPError.
That is because HTTPError is a subclass of URLError: if URLError came first, it would catch every URLError, including HTTPError.



Let's write urllib2_test09.py to demonstrate the second approach to exception handling:

[Python]

    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://bbs.csdn.net/callmewhy')
    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
    else:
        print 'No exception was raised.'
        # everything is fine

The above covers [Python] web crawler (iii): exception handling and classification of HTTP status codes. I hope it is helpful to readers interested in the topic.
