Python crawler URLError Exception Handling

Source: Internet
Author: User

This section covers URLError and HTTPError and how to handle them.

1. URLError

First, the possible causes of a URLError:

  • There is no network connection, that is, the local machine cannot access the Internet.
  • The specified server cannot be reached.
  • The server does not exist.

In the code, we wrap the call in a try-except statement and catch the corresponding exception. The following is an example.

    import urllib2

    # Request a URL whose host does not exist
    request = urllib2.Request('http://www.xxxxx.com')
    try:
        urllib2.urlopen(request)
    except urllib2.URLError as e:
        # reason describes why the request failed
        print e.reason

We used the urlopen method to access a nonexistent URL. The running result is as follows:

[Errno 11004] getaddrinfo failed

The output shows error number 11004 with the reason "getaddrinfo failed", meaning the host name could not be resolved to an address.
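For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same pattern can be sketched as below. This is only an illustration: the fetch_reason and offline_opener names and the opener parameter are hypothetical and are not part of the original article; the stand-in opener simulates a failed lookup so the example runs without a network connection.

```python
import urllib.error
import urllib.request


def fetch_reason(url, opener=urllib.request.urlopen):
    """Return None on success, or the URLError reason on failure.

    `opener` is a hypothetical injection point (not from the article)
    so the function can be exercised without touching the network.
    """
    try:
        opener(url)
    except urllib.error.URLError as e:
        return e.reason
    return None


def offline_opener(url):
    # Stand-in for urlopen that simulates a failed DNS lookup.
    raise urllib.error.URLError("getaddrinfo failed")


print(fetch_reason("http://www.xxxxx.com", opener=offline_opener))
# prints: getaddrinfo failed
```

Calling fetch_reason(url) without the opener argument would use the real urlopen instead of the stand-in.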

2. HTTPError

HTTPError is a subclass of URLError. When you send a request with the urlopen method, the server replies with a response that carries a numeric "status code". For example, if the response is a "redirection", the client has to fetch the document from another address; urllib2 handles this for you.

For responses it cannot handle, urlopen raises an HTTPError whose code corresponds to the response status. The HTTP status code indicates the status of the response returned over the HTTP protocol. The status codes are summarized as follows:

  • 100: continue. The client should send the remaining part of the request, or ignore this response if the request has already been completed.
  • 101: switching protocols. After sending the final blank line of this response, the server switches to the protocol defined in the Upgrade header. This should be done only when switching to the new protocol is advantageous.
  • 102: processing. A status code added by the WebDAV extension (RFC 2518); it means that processing is continuing.
  • 200: the request succeeded. Handling method: obtain the response content and process it.
  • 201: the request is complete and a new resource has been created. The URI of the new resource can be obtained from the response.
  • 202: the request has been accepted, but processing has not been completed. Handling method: block and wait.
  • 204: the server has fulfilled the request but returns no new content. A user agent does not need to update its document view. Handling method: discard.
  • 300: this status code is not used directly by HTTP/1.0 applications; it serves only as the default interpretation of 3XX responses. Several representations of the requested resource are available. Handling method: process further if the program can handle it; otherwise discard.
  • 301: the requested resource has been assigned a permanent URL, which should be used to access it in the future. Handling method: redirect to the assigned URL.
  • 302: the requested resource temporarily resides at a different URL. Handling method: redirect to the temporary URL.
  • 304: the requested resource has not been modified. Handling method: discard.
  • 400: bad request. Handling method: discard.
  • 401: unauthorized. Handling method: discard.
  • 403: forbidden. Handling method: discard.
  • 404: not found. Handling method: discard.
  • 500: internal server error. The server encountered an unexpected condition that prevented it from processing the request. This usually happens when there is an error in the server-side code.
  • 501: not implemented. The server does not support a function required by the current request, for example when it does not recognize the request method and cannot support it for any resource.
  • 502: bad gateway. A server acting as a gateway or proxy received an invalid response from the upstream server while trying to carry out the request.
  • 503: service unavailable. The server cannot process the request because of temporary maintenance or overload. The condition is temporary and is expected to clear after some time.
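The handling methods in the list above could be captured in a small lookup table, roughly as sketched below. The HANDLING table and handling_for function are hypothetical names introduced for illustration; only codes for which the list states a handling method are included, and every other code falls back to a placeholder string.

```python
# Handling methods taken from the status-code list above. Codes for
# which the list states no handling method are deliberately left out.
HANDLING = {
    200: "process the response content",
    202: "block and wait",
    204: "discard",
    300: "process further if possible, otherwise discard",
    301: "redirect to the assigned URL",
    302: "redirect to the temporary URL",
    304: "discard",
    400: "discard",
    401: "discard",
    403: "discard",
    404: "discard",
}


def handling_for(status_code):
    """Return the listed handling method, or a placeholder if none is listed."""
    return HANDLING.get(status_code, "no handling method listed")


print(handling_for(302))  # prints: redirect to the temporary URL
print(handling_for(503))  # prints: no handling method listed
```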

Every HTTPError instance has a code attribute, which is the error code sent by the server.
Because urllib2 handles redirection for you (codes starting with 3) and codes in the 2XX range indicate success, you will normally see only the error codes.

The following example catches an HTTPError and prints its code attribute, which is the error code. It also prints the reason attribute, which is inherited from the parent class URLError.

    import urllib2

    req = urllib2.Request('http://blog.csdn.net/cqcre')
    try:
        urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        print e.code    # error code sent by the server
        print e.reason  # attribute inherited from URLError

The running result is as follows:

403
Forbidden

The error code is 403 and the reason is Forbidden, which means the server refused access.
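The relationship between the two classes can also be checked without sending a request. The sketch below uses Python 3's urllib.error (the successor to the exception classes in urllib2) and builds an HTTPError by hand; the URL is a placeholder and the empty headers and body are assumptions made only to construct the object.

```python
import io
import urllib.error
from email.message import Message

# Build an HTTPError by hand; no request is sent. The URL, empty
# headers and empty body are placeholders, not values from the article.
err = urllib.error.HTTPError(
    "http://example.invalid/", 403, "Forbidden", Message(), io.BytesIO(b"")
)

print(isinstance(err, urllib.error.URLError))  # prints: True
print(err.code)                                # prints: 403
print(err.reason)                              # prints: Forbidden
```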

HTTPError's parent class is URLError. As a rule, the except clause for a parent class should come after the clause for its child class, so that anything the child clause does not catch can still be caught by the parent clause. The code above can therefore be rewritten as follows:

    import urllib2

    req = urllib2.Request('http://blog.csdn.net/cqcre')
    try:
        urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        # child class first
        print e.code
    except urllib2.URLError as e:
        # parent class second
        print e.reason
    else:
        print "OK"

If an HTTPError is caught, its code is printed and the URLError clause is skipped. If the error is not an HTTPError, the URLError clause catches it and prints the reason.
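To see why the order of the clauses matters, the sketch below compares both orderings using Python 3's urllib.error. The child_first and parent_first functions are hypothetical helpers written for this illustration, and the exceptions are built by hand rather than produced by a real request.

```python
import io
import urllib.error
from email.message import Message


def child_first(exc):
    # Recommended order: HTTPError (child) before URLError (parent).
    try:
        raise exc
    except urllib.error.HTTPError as e:
        return "HTTPError %d" % e.code
    except urllib.error.URLError as e:
        return "URLError %s" % e.reason


def parent_first(exc):
    # Wrong order: the URLError clause also matches HTTPError,
    # so the HTTPError clause below is never reached.
    try:
        raise exc
    except urllib.error.URLError as e:
        return "URLError %s" % e.reason
    except urllib.error.HTTPError as e:
        return "HTTPError %d" % e.code


# Hand-built exceptions; the URL is a placeholder.
http_err = urllib.error.HTTPError(
    "http://example.invalid/", 403, "Forbidden", Message(), io.BytesIO(b"")
)
url_err = urllib.error.URLError("getaddrinfo failed")

print(child_first(http_err))   # prints: HTTPError 403
print(parent_first(http_err))  # prints: URLError Forbidden
print(child_first(url_err))    # prints: URLError getaddrinfo failed
```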

You can also use the hasattr function to check for each attribute before reading it. The code is rewritten as follows:

    import urllib2

    req = urllib2.Request('http://blog.csdn.net/cqcre')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError as e:
        # Only HTTPError carries a code attribute
        if hasattr(e, "code"):
            print e.code
        if hasattr(e, "reason"):
            print e.reason
    else:
        print "OK"

Checking which attributes the exception has before printing them avoids attribute errors.
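The hasattr approach can be tried on both exception types without a network connection. The sketch below uses Python 3's urllib.error; the describe function is a hypothetical helper, and both exceptions are constructed by hand with placeholder values.

```python
import io
import urllib.error
from email.message import Message


def describe(exc):
    """Collect whichever of code and reason the exception provides."""
    parts = []
    if hasattr(exc, "code"):
        parts.append("code=%s" % exc.code)
    if hasattr(exc, "reason"):
        parts.append("reason=%s" % exc.reason)
    return ", ".join(parts)


# Hand-built exceptions; the URL is a placeholder.
http_err = urllib.error.HTTPError(
    "http://example.invalid/", 403, "Forbidden", Message(), io.BytesIO(b"")
)
url_err = urllib.error.URLError("getaddrinfo failed")

print(describe(http_err))  # prints: code=403, reason=Forbidden
print(describe(url_err))   # prints: reason=getaddrinfo failed
```

A plain URLError has no code attribute, so only its reason is reported.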

That concludes the introduction to URLError and HTTPError and the corresponding error handling methods.
