Python Crawlers from Scratch: HTTP Exception Handling
Let's talk about HTTP exception handling.
When urlopen cannot handle a response, it raises URLError.
(Ordinary Python API exceptions such as ValueError and TypeError may also be raised.)
HTTPError is a subclass of URLError, raised in the specific case of HTTP URLs.
1. URLError
Typically, URLError is raised when there is no network connection (no route to the specified server) or the specified server does not exist.
In this case, the exception carries a "reason" attribute, which is a tuple (think of it as an immutable array) containing an error code and an error message.
Let's create a urllib2_test06.py to handle the exception:
The code is as follows:
import urllib2

# This host does not resolve, so urlopen raises URLError
req = urllib2.Request('http://www.baibai.com')
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    print e.reason
Press F5 to see the printed content:
[Errno 11001] getaddrinfo failed
In other words, the error code is 11001 and the error message is "getaddrinfo failed".
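The parts of reason can also be inspected without a network connection. The sketch below assumes Python 3, where URLError lives in urllib.error rather than urllib2 and reason is usually an exception object instead of a plain tuple. The error is built by hand from the code and message in the output above; a real urlopen failure would supply its own values.

```python
import socket
from urllib.error import URLError

# Assumption: the error is constructed manually instead of calling urlopen,
# reusing the code and message from the output shown above.
err = URLError(socket.gaierror(11001, 'getaddrinfo failed'))

try:
    raise err
except URLError as e:
    reason = e.reason          # keep a reference; 'e' is cleared after the block
    print(reason)              # the reason object itself
    print(reason.errno)        # numeric error code
    print(reason.strerror)     # text of the error message
```

Reading errno and strerror separately is handy when a crawler needs to decide whether a failure is worth retrying.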
2. HTTPError
Every HTTP response from the server contains a numeric "status code".
Sometimes the status code indicates that the server cannot fulfill the request. The default handlers will take care of some of these responses for you.
For example, if the response is a "redirect" that asks the client to fetch the document from a different address, urllib2 handles that for you.
For the responses it cannot handle, urlopen raises an HTTPError.
Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).
The HTTP status code indicates the status of the response returned by the HTTP protocol.
For example, a client sends a request to the server. If the requested resource is successfully obtained, the returned status code is 200, indicating that the response is successful.
If the requested resource does not exist, a 404 error is usually returned.
HTTP status codes are usually divided into five classes, beginning with the digits 1 through 5, and each code consists of three digits:
------------------------------------------------------------------------------------------------
200: The request succeeded. Handling: read the response content and process it.
201: The request completed and a new resource was created. Handling: the URI of the new resource can be obtained from the response entity.
202: The request was accepted but processing is not yet complete. Handling: block and wait.
204: The server fulfilled the request but has no new information to return. A user agent does not need to update its document view. Handling: discard.
300: Not used directly by HTTP/1.0 applications; it serves only as the default interpretation of 3XX responses. Multiple versions of the requested resource are available. Handling: process further if the program can handle it, otherwise discard.
301: The requested resource has been assigned a permanent URL, which should be used to access it in the future. Handling: redirect to the assigned URL.
302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.
304: The requested resource has not been modified. Handling: discard.
400: Bad request. Handling: discard.
401: Unauthorized. Handling: discard.
403: Forbidden. Handling: discard.
404: Not found. Handling: discard.
5XX: Status codes beginning with "5" indicate that the server encountered an error and cannot continue processing the request. Handling: discard.
------------------------------------------------------------------------------------------------
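The handling rules in the list above can be sketched as a small lookup. This is only an illustration: status_class and handling_for are hypothetical helper names, the action labels are made up for this example, and treating every unlisted code as "discard" is an assumption rather than something the list states.

```python
def status_class(code):
    """Return the class (1 to 5) of a three-digit HTTP status code."""
    if not 100 <= code <= 599:
        raise ValueError('not an HTTP status code: %r' % (code,))
    return code // 100


def handling_for(code):
    """Map a status code to the handling suggested in the list above."""
    if code in (200, 201):
        return 'process'
    if code == 202:
        return 'wait'
    if code == 300:
        return 'process-or-discard'
    if code in (301, 302):
        return 'redirect'
    # 204, 304, 4XX and 5XX are discarded in the list above;
    # unlisted codes fall through to the same action (an assumption).
    return 'discard'


for code in (200, 302, 404, 503):
    print(code, status_class(code), handling_for(code))
```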
An HTTPError instance has an integer 'code' attribute, which holds the error code sent by the server.
Error Codes
Because the default handlers take care of redirects (codes in the 300 range) and codes in the 100-299 range indicate success, you will usually see only error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by the HTTP protocol.
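As a quick illustration of that dictionary, the sketch below assumes Python 3, where the BaseHTTPServer module was renamed to http.server. Each entry maps a status code to a pair of a short message and a longer explanation.

```python
from http.server import BaseHTTPRequestHandler

# Python 3 location of the dictionary described above (assumption: Python 3)
responses = BaseHTTPRequestHandler.responses

# Each value is a (short message, long message) pair
short_message, long_message = responses[404]
print(404, short_message)
print(long_message)
```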
When an error occurs, the server responds with an HTTP error code and an error page.
You can use the HTTPError instance as the response object for the returned page.
This means that, in addition to the code attribute, it also has the read, geturl, and info methods.
Let's create a urllib2_test07.py to try it out:
The code is as follows:
import urllib2

req = urllib2.Request('http://www.bkjia.com/callmewhy')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Only HTTPError is guaranteed to carry a code attribute
    print e.code
    # print e.read()
Press F5 to run it. The output is the error code 404, meaning the page was not found.
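To see how an HTTPError doubles as a response object without contacting a server, here is a minimal offline sketch. It assumes Python 3 (urllib.error instead of urllib2), and the URL, header, and page body are made-up stand-ins for what a real server would send.

```python
import io
from email.message import Message
from urllib.error import HTTPError

# Assumption: a hand-built 404 with a made-up URL and body stands in for a
# real server reply, so the example runs offline.
headers = Message()
headers['Content-Type'] = 'text/html'
error = HTTPError('http://example.com/missing', 404, 'Not Found',
                  headers, io.BytesIO(b'<h1>Not Found</h1>'))

code = error.code                             # integer status code
url = error.geturl()                          # URL the error belongs to
content_type = error.info()['Content-Type']   # headers of the error response
body = error.read()                           # the error page itself
error.close()

print(code, url, content_type, body)
```

In a crawler, reading the body this way is often useful for logging what the server actually said when a request failed.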
3. Wrapping Up
If you want to be prepared for HTTPError or URLError, there are two basic approaches. The second is recommended.
We will build a urllib2_test08.py to demonstrate the first exception handling solution:
The code is as follows:
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://www.bkjia.com/callmewhy')
try:
    response = urlopen(req)
except HTTPError as e:
    # The server was reached but returned an error status
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError as e:
    # The server could not be reached at all
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    print 'No exception was raised.'
    # everything is fine
As in other languages, the try block catches the exception and its details are printed.
Note that except HTTPError must come first; otherwise except URLError will also catch the HTTPError.
Because HTTPError is a subclass of URLError, an except URLError clause catches every URLError, including HTTPError.
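The effect of the clause order can be checked offline. The sketch below assumes Python 3 (urllib.error in place of urllib2) and raises hand-built exceptions instead of calling urlopen; the function names and the example URL are hypothetical.

```python
from urllib.error import HTTPError, URLError


def handle_http_first(exc):
    """Correct order: the subclass is tested before its parent."""
    try:
        raise exc
    except HTTPError:
        return 'http'
    except URLError:
        return 'url'


def handle_url_first(exc):
    """Wrong order: URLError also matches HTTPError."""
    try:
        raise exc
    except URLError:
        return 'url'       # catches HTTPError as well
    except HTTPError:
        return 'http'      # never reached


# Assumption: made-up errors stand in for real network failures
http_error = HTTPError('http://example.com/missing', 404, 'Not Found', None, None)
url_error = URLError('no route to host')

print(handle_http_first(http_error))   # takes the HTTPError branch
print(handle_url_first(http_error))    # takes the URLError branch instead
```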
We will build a urllib2_test09.py to demonstrate the second exception handling solution:
The code is as follows:
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request('http://www.bkjia.com/callmewhy')
try:
    response = urlopen(req)
except URLError as e:
    # A single clause handles both cases by inspecting the attributes
    if hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    print 'No exception was raised.'
    # everything is fine
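The attribute checks used in the second approach can be pulled into a small helper and exercised without a network. This sketch assumes Python 3 (urllib.error in place of urllib2); describe is a hypothetical name and the errors are built by hand. The 'code' check comes first because in Python 3 an HTTPError also exposes a reason attribute.

```python
from urllib.error import HTTPError, URLError


def describe(error):
    """Summarize a URLError using the attribute checks shown above."""
    if hasattr(error, 'code'):
        # Only HTTPError carries a status code
        return 'HTTP error %d' % error.code
    if hasattr(error, 'reason'):
        return 'Unreachable: %s' % error.reason
    return 'Unknown error'


# Assumption: made-up errors stand in for real urlopen failures
forbidden = HTTPError('http://example.com/private', 403, 'Forbidden', None, None)
unreachable = URLError('timed out')

print(describe(forbidden))
print(describe(unreachable))
```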
How to write crawler programs in Python
A detailed introduction is available here:
Blog.csdn.net/column/details/why-bug.html