URLError Exception Handling
Hello everyone, this section mainly covers URLError and HTTPError and how to handle them.
1. URLError
First, the possible causes of a URLError:
The network is not connected, i.e. the machine cannot reach the Internet
It cannot connect to the specified server
The server does not exist
In code, we need to wrap the call in a try-except statement and catch the corresponding exception. Here is an example to get a feel for it:
import urllib2
request = urllib2.Request('http://www.xxxxx.com')
try:
    urllib2.urlopen(request)
except urllib2.URLError, e:
    print e.reason
We used the urlopen method to access a nonexistent URL; the result is as follows:
[Errno 11004] getaddrinfo failed
It shows that the error number is 11004 and the reason is getaddrinfo failed.
2. HTTPError
HTTPError is a subclass of URLError. When you make a request with the urlopen method, the server returns a response object, which carries a numeric "status code". For example, if the response is a "redirect", the document has to be fetched from a different address, and urllib2 handles that for you.
For the cases it cannot handle, urlopen raises an HTTPError with the corresponding status. The HTTP status code represents the status of the response returned by the HTTP protocol; the common status codes are summarized below (a small dispatch sketch follows the list):
100: Continue. The client should continue sending the rest of the request, or ignore this response if the request has already been completed.
101: Switching Protocols. After sending the final empty line of this response, the server switches to the protocols defined in the Upgrade header. This should only be done when switching to the new protocol is clearly advantageous.
102: Processing. A status code extended by WebDAV (RFC 2518), indicating that processing will continue.
200: OK. The request succeeded. Handling: read the response content and process it.
201: Created. The request is complete and a new resource was created; the URI of the new resource is available in the response entity. Handling: a crawler will rarely encounter this.
202: Accepted. The request was accepted but processing has not finished. Handling: wait (blocking).
204: No Content. The server fulfilled the request but returned no new information; if the client is a user agent, it need not update its document view. Handling: discard.
300: Multiple Choices. Not used directly by HTTP/1.0 applications, only as the default interpretation of 3xx responses; several versions of the requested resource are available. Handling: process further if the program can, otherwise discard.
301: Moved Permanently. The requested resource has been assigned a permanent URL and should be accessed through it in the future. Handling: redirect to the assigned URL.
302: Found. The requested resource is temporarily available at a different URL. Handling: redirect to the temporary URL.
304: Not Modified. The requested resource has not been updated. Handling: discard.
400: Bad Request. Illegal request. Handling: discard.
401: Unauthorized. Handling: discard.
403: Forbidden. Handling: discard.
404: Not Found. Handling: discard.
500: Internal Server Error. The server encountered an unexpected condition that prevented it from completing the request; this usually means an error in the server-side code.
501: Not Implemented. The server does not support a feature required by the current request, e.g. it does not recognize the request method and cannot support it for any resource.
502: Bad Gateway. Acting as a gateway or proxy, the server received an invalid response from the upstream server while trying to fulfil the request.
503: Service Unavailable. The server is currently unable to handle the request because of temporary maintenance or overload; the condition is temporary and service will resume after some time.
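To make the "handling" hints above concrete, here is a minimal sketch (not part of the original tutorial, written in the same Python 2 / urllib2 style; the fetch helper and the discard/retry policy are only illustrative assumptions) of how a crawler might branch on the code carried by an HTTPError:
import urllib2
def fetch(url):
    # try to download the page; return None when the crawler should give up on this URL
    try:
        return urllib2.urlopen(url).read()
    except urllib2.HTTPError, e:
        if e.code in (400, 401, 403, 404):
            # client-side errors: discard this URL
            return None
        elif e.code in (502, 503):
            # temporary server-side problems: a caller could retry later instead
            return None
        else:
            # anything else is unexpected here, so re-raise it
            raise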
Every HTTPError instance has a code attribute, which is the error number sent by the server.
Because urllib2 handles redirects for you (codes beginning with 3 are processed automatically) and numbers in the 100-299 range indicate success, you will normally only see error numbers in the 400-599 range.
Let's write an example to get a feel for it. The caught exception is an HTTPError: it carries a code attribute, the error number, and we also print the reason attribute, which it inherits from its parent class URLError.
import urllib2
req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
    print e.reason
The output is as follows:
403
Forbidden
The error code is 403 and the reason is Forbidden, meaning the server refused access.
We know that the parent class of HTTPError is URLError. As a general programming rule, the except clause for a parent-class exception should come after the one for its subclass: if the subclass exception is not caught, the parent-class clause can still catch it. The code above can therefore be rewritten like this:
import urllib2
req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
except urllib2.URLError, e:
    print e.reason
else:
    print "OK"
If an HTTPError is caught, its code is printed and the URLError branch is skipped. If the exception is not an HTTPError, the URLError branch catches it and prints the reason for the error.
Alternatively, you can use hasattr to check for an attribute before using it; the code can be rewritten as follows:
import urllib2
req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
else:
    print "OK"
Checking for the attribute first avoids errors when printing an attribute that may not exist.
That's the introduction to URLError and HTTPError and the corresponding ways of handling them. Keep it up, everyone!
Python Crawler Primer (6): Use of cookies
Hello everyone. In the last section we studied exception handling for crawlers; in this section let's look at the use of cookies.
Why use cookies?
A cookie is data (usually encrypted) that certain websites store on the user's local machine in order to identify the user and perform session tracking.
For example, some sites require you to log in before a page can be accessed; before logging in, crawling that page's content is not allowed. We can use the urllib2 library to save the cookies obtained when we log in, and then reuse them when crawling other pages, which achieves our goal.
Before we do that, we must first introduce the concept of an opener.
1. Opener
When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector). So far we have been using the default opener through urlopen, which is a special opener (you can think of it as a special opener instance) whose parameters are just url, data and timeout.
If we need to work with cookies, this default opener is not enough, so we have to create a more general opener in order to set up cookie handling.
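As a minimal sketch of the idea (not from the original tutorial), the following builds a plain opener with the default handlers and uses it in place of urlopen; the call to install_opener is optional and simply makes this opener the global default used by urlopen:
import urllib2
# build an OpenerDirector with the default handlers only
opener = urllib2.build_opener()
# optional: make urlopen use this opener from now on
urllib2.install_opener(opener)
# use the opener directly, just like urlopen
response = opener.open('http://www.baidu.com')
print response.getcode()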
2. cookielib
The primary role of the cookielib module is to provide objects for storing cookies, so that it can be used together with the urllib2 module to access Internet resources. The cookielib module is very powerful: with an object of its CookieJar class we can capture cookies and resend them on subsequent requests, which is what makes a simulated login possible. The main objects of the module are CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar.
Their relationship: CookieJar --derives--> FileCookieJar --derives--> MozillaCookieJar and LWPCookieJar
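To make that relationship concrete, here is a small illustrative sketch (the file names are made-up assumptions): MozillaCookieJar and LWPCookieJar are both FileCookieJar subclasses, so both offer the same save()/load() interface and can be handed to HTTPCookieProcessor in the same way; they differ only in the on-disk file format they read and write.
import cookielib
mozilla_jar = cookielib.MozillaCookieJar('cookie_mozilla.txt')  # Netscape/Mozilla cookies.txt format
lwp_jar = cookielib.LWPCookieJar('cookie_lwp.txt')              # libwww-perl Set-Cookie3 format
# either jar could be passed to urllib2.HTTPCookieProcessor(...) exactly as shown below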
1) Save the cookies in a variable
First, let's use a CookieJar object to capture cookies and store them in a variable; here is an example to get a feel for it:
import urllib2
import cookielib
# declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib2.build_opener(handler)
# the open method here works like urllib2's urlopen; you can also pass in a Request
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
With the method above we save the cookies in a variable and then print their values; the result is as follows:
Name = BAIDUID
Value = b07b663b645729f11f659c02aae65b4c:fg=1
Name = BAIDUPSID
Value = b07b663b645729f11f659c02aae65b4c
Name = H_PS_PSSID
Value = 12527_11076_1438_10633
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0
2) Save cookies to a file
In the method above we kept the cookies in a variable. What if we want to save them to a file? That's where the FileCookieJar object comes in; here we use its subclass MozillaCookieJar to save the cookies.
import cookielib
import urllib2
# the file to save cookies to; cookie.txt sits in the same directory
filename = 'cookie.txt'
# declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib2.build_opener(handler)
# make a request; same principle as urllib2's urlopen
response = opener.open("http://www.baidu.com")
# save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)
A note on the two parameters of the final save method:
The official explanations are as follows:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have already expired; the file is overwritten if it already exists.
So ignore_discard means a cookie is saved even though it is marked to be discarded, and ignore_expires means expired cookies are written too, overwriting the file if it already exists; here we set both to True. After the program runs, the cookies are saved to the cookie.txt file, which you can open to inspect.
3) Read the cookies from the file and access a site
Now that the cookies are saved to a file, if we want to use them later we can read them back and use them to visit a website, like this:
import cookielib
import urllib2
# create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
# read the cookies from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# create the Request
req = urllib2.Request("http://www.baidu.com")
# use urllib2's build_opener method to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()
Imagine: if cookie.txt held the cookies of someone logged in to Baidu, we could read the contents of that cookie file and use the method above to simulate that person's logged-in session on Baidu.
4) Use cookies to simulate a website login
Below we take my school's educational administration system as an example and use cookies to simulate the login, saving the cookie information to a text file. Time to experience the power of cookies!
Note: I changed the password, so don't try to sneak into the course-selection system o(╯-╰)o
import urllib
import urllib2
import cookielib
filename = 'cookie.txt'
# declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({
    'stuid': '201200131012',
    'pwd': '23342321'
})
# the login URL of the educational administration system
loginUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bks_login2.login'
# simulate the login and capture the cookies in the jar
result = opener.open(loginUrl, postdata)
# save the cookies to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)
# use the cookies to request another URL, the grade-query page
gradeUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bkscjcx.curscopre'
# request the grade-query URL
result = opener.open(gradeUrl)
print result.read()
The principle of the program above is as follows:
Create an opener with a CookieJar attached; when the login URL is accessed, the cookies produced by logging in are saved, and then that same opener (carrying the cookies) is used to access other URLs.