[Python] web crawler (3): exception handling and HTTP status code classification

Source: Internet
Author: User
Tags: response code

Let's talk about HTTP exception handling.
When urlopen cannot handle a response, it raises URLError.
As usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised.
HTTPError is a subclass of URLError, raised in the specific case of HTTP URLs.

1. URLError
Generally, URLError is raised when there is no network connection (no route to the specified server) or the server does not exist.

In this case, the exception also carries a "reason" attribute, which is a tuple (think of it as an immutable array) containing an error code and an error message.

Let's create a urllib2_test06.py to handle the exception:

    # Python 2 (urllib2)
    import urllib2

    req = urllib2.Request('http://www.baibai.com')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError as e:
        # reason holds the error code and the error message
        print e.reason


Press F5 to see the printed content:

[Errno 11001] getaddrinfo failed

That is, the error code is 11001 and the message is "getaddrinfo failed".
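The reason attribute can be inspected without a network connection. The sketch below assumes Python 3, where the urllib2 exceptions live in urllib.error, and builds a URLError by hand around an OSError that carries the code and message shown above. The helper name describe_reason is made up for illustration. Note that in Python 3 the reason is a string or another exception instance rather than a plain tuple.

```python
from urllib.error import URLError


def describe_reason(exc):
    """Return (code, message) from a URLError's reason, if it has them."""
    reason = exc.reason
    # In Python 3 the reason is either a plain string or another exception
    # (typically an OSError subclass) carrying errno and strerror.
    if isinstance(reason, OSError):
        return reason.errno, reason.strerror
    return None, str(reason)


# Hand-built stand-in for the failure shown above (no network needed).
simulated = URLError(OSError(11001, "getaddrinfo failed"))
print(simulated.reason)            # [Errno 11001] getaddrinfo failed
print(describe_reason(simulated))  # (11001, 'getaddrinfo failed')
```

A real lookup failure raised by urlopen can be passed to the same helper in place of the simulated one.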


2. HTTPError
Every HTTP response from the server contains a numeric "status code".

Sometimes the status code indicates that the server cannot fulfill the request. The default handlers will deal with some of these responses for you.

For example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib2 handles that for you.

For responses it cannot handle, urlopen raises an HTTPError.

Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
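To watch both behaviors side by side without touching a real site, the following sketch starts a throwaway server on localhost and requests a redirecting path and a missing path. It assumes Python 3 (urllib.request and urllib.error in place of urllib2). DemoHandler, fetch_status, run_demo and the paths /old, /new and /missing are all hypothetical names chosen for the demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /old redirects to /new, unknown paths give 404."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/new":
            body = b"moved here"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def fetch_status(url):
    """Return (status code, final URL); HTTP errors are reported, not raised."""
    opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
    try:
        response = opener.open(url, timeout=5)
    except HTTPError as e:
        return e.code, e.geturl()
    with response:
        return response.getcode(), response.geturl()


def run_demo():
    """Start the local server, fetch a redirect and a missing page, stop it."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    try:
        return fetch_status(base + "/old"), fetch_status(base + "/missing")
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() returns two pairs: the first shows status 200 with a final URL ending in /new, because the redirect was followed silently; the second shows 404, which arrived as an HTTPError.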

The HTTP status code indicates the status of the response returned over the HTTP protocol.

For example, when a client sends a request to the server and the requested resource is successfully obtained, the returned status code is 200, indicating success.

If the requested resource does not exist, a 404 error is usually returned.

HTTP status codes are three-digit integers, generally divided into five classes according to the first digit, 1 through 5:

------------------------------------------------------------------------------------------------

200: The request succeeded. Handling: read the response content and process it.

201: The request is complete and a new resource has been created. The URI of the newly created resource can be obtained from the response.

202: The request has been accepted, but processing is not yet complete. Handling: block and wait.

204: The server has fulfilled the request but has no new information to return. If the client is a user agent, it does not need to update its document view. Handling: discard.

300: This status code is not used directly by HTTP/1.0 applications; it only serves as the default interpretation for 3XX responses. Multiple versions of the requested resource are available. Handling: process further if the program can handle it; otherwise discard.

301: The requested resource has been assigned a permanent URL, which should be used to access it in the future. Handling: redirect to the assigned URL.

302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.

304: The requested resource has not been modified. Handling: discard.

400: Bad request. Handling: discard.

401: Unauthorized. Handling: discard.

403: Forbidden. Handling: discard.

404: Not found. Handling: discard.

5XX: Status codes beginning with "5" indicate that the server encountered an error and cannot continue processing the request. Handling: discard.

------------------------------------------------------------------------------------------------
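The classification by first digit is easy to express in code. The sketch below is one possible reading of the table above; the names STATUS_CLASSES, classify_status and should_discard are invented for illustration, and the class labels are the usual names for the five ranges.

```python
# First digit of the status code -> name of its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def classify_status(code):
    """Map a three-digit status code to its class via the first digit."""
    if not 100 <= code <= 599:
        raise ValueError("not an HTTP status code: %r" % (code,))
    return STATUS_CLASSES[code // 100]


def should_discard(code):
    """Rough reading of the table: 204, 304, 4XX and 5XX are discarded."""
    if code in (204, 304):
        return True
    return classify_status(code) in ("client error", "server error")


for code in (200, 302, 404, 503):
    print(code, classify_status(code), should_discard(code))
```

A crawler would typically call should_discard on each response code and only parse the body when it returns False.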

An HTTPError instance has an integer 'code' attribute that corresponds to the error code sent by the server.

Error Codes
Because the default handlers take care of redirects (codes in the 300 range) and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that lists all the response codes used by the HTTP protocol.
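As a quick illustration, the dictionary can be queried directly. This sketch assumes Python 3, where the class moved from BaseHTTPServer to http.server; the helper short_message is a made-up name.

```python
from http.server import BaseHTTPRequestHandler

# Maps each status code to a (short message, long message) pair.
responses = BaseHTTPRequestHandler.responses


def short_message(code):
    """Return the short message for code, or 'Unknown' if it is not listed."""
    return responses.get(code, ("Unknown", ""))[0]


for code in (200, 404, 500):
    print(code, short_message(code))
```

This is handy for turning the bare number in e.code into a readable log message.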

When an error occurs, the server responds with an HTTP error code and an error page.

You can use the HTTPError instance as the response object for the returned page.

This means that, in addition to the code attribute, it also has the read, geturl, and info methods.
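The response-like behavior can be tried offline by constructing an HTTPError by hand. The sketch assumes Python 3 (urllib.error instead of urllib2); the URL, headers and body are invented stand-ins for what a server would send.

```python
import io
from email.message import Message
from urllib.error import HTTPError

# Hand-built stand-in for what urlopen would raise on a missing page.
headers = Message()
headers["Content-Type"] = "text/html"
error = HTTPError(
    "http://example.invalid/missing",  # hypothetical URL
    404,
    "Not Found",
    headers,
    io.BytesIO(b"<html>error page</html>"),
)

body = error.read()                  # body of the error page, as bytes
print(error.code)                    # 404
print(error.geturl())                # the URL that failed
print(error.info()["Content-Type"])  # text/html
print(body)
```

In a real crawler the same calls work on the exception caught from urlopen, so the error page can be logged or parsed like any other response.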

Let's create a urllib2_test07.py to try it out:

    # Python 2 (urllib2)
    import urllib2

    req = urllib2.Request('http://bbs.csdn.net/callmewhy')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError as e:
        # e.code is only present when the error is an HTTPError
        print e.code
        # print e.read()

Press F5 to see the error code 404 output, that is, the page is not found.


3. Wrapping It Up

So if you want to be prepared for HTTPError or URLError, there are two basic approaches. The second is recommended.

Let's create a urllib2_test08.py to demonstrate the first approach:

    # Python 2 (urllib2)
    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://bbs.csdn.net/callmewhy')
    try:
        response = urlopen(req)
    except HTTPError as e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError as e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        print 'No exception was raised.'
        # everything is fine


As in other languages, try catches the exception and we print its content.

Note that except HTTPError must come first; otherwise except URLError will also catch the HTTPError.
Because HTTPError is a subclass of URLError, an except URLError clause placed first would catch every URLError, including HTTPError.
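The ordering rule can be checked without a network by raising the two exceptions by hand. This sketch assumes Python 3 (urllib.error in place of urllib2); handle, raise_http_error and raise_url_error are hypothetical helpers, and the URL is a placeholder.

```python
import io
from urllib.error import HTTPError, URLError


def handle(action):
    """Run action() and report which except clause, if any, caught it."""
    try:
        action()
    except HTTPError as e:  # subclass first
        return "HTTPError %d" % e.code
    except URLError as e:   # then the general case
        return "URLError %s" % e.reason
    else:
        return "ok"


def raise_http_error():
    raise HTTPError("http://example.invalid/", 404, "Not Found", None, io.BytesIO())


def raise_url_error():
    raise URLError("getaddrinfo failed")


print(issubclass(HTTPError, URLError))  # True
print(handle(raise_http_error))
print(handle(raise_url_error))
print(handle(lambda: None))
```

Swapping the two except clauses would send the 404 case into the URLError branch, which is exactly the pitfall described above.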



Let's create a urllib2_test09.py to demonstrate the second approach:

    # Python 2 (urllib2)
    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request('http://bbs.csdn.net/callmewhy')
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
    else:
        print 'No exception was raised.'
        # everything is fine
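The attribute test used in the second approach can also be exercised offline. The sketch below assumes Python 3 (urllib.error in place of urllib2) and uses a made-up helper, describe, with hand-built exceptions. One detail worth knowing: in Python 3 an HTTPError carries both 'code' and 'reason', so the check for 'code' has to come first.

```python
import io
from urllib.error import HTTPError, URLError


def describe(exc):
    """Classify a URLError by the attributes it carries."""
    # Check 'code' first: in Python 3 an HTTPError has both attributes.
    if hasattr(exc, "code"):
        return "server could not fulfill the request, code %d" % exc.code
    if hasattr(exc, "reason"):
        return "failed to reach a server: %s" % exc.reason
    return "unrecognized error"


# Hand-built exceptions standing in for real failures (placeholder URL).
http_err = HTTPError("http://example.invalid/", 403, "Forbidden", None, io.BytesIO())
url_err = URLError("no route to host")

print(describe(http_err))
print(describe(url_err))
```

With a single except URLError clause wrapping urlopen, describe(e) gives the same two-way split as the listing above.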
