URLError Exception Handling
Hello everyone, this section mainly covers URLError and HTTPError and how to handle them.
1. URLError
First, the possible causes of a URLError:
The network is not connected, i.e. the machine cannot reach the Internet
It cannot connect to the specified server
The server does not exist
In code, we need to wrap the call in a try-except statement and catch the corresponding exception. Here is an example to get a feel for it:
import urllib2
request = urllib2.Request('http://www.xxxxx.com')
try:
    urllib2.urlopen(request)
except urllib2.URLError, e:
    print e.reason
We used the urlopen method to access a nonexistent URL; the result is as follows:
[Errno 11004] getaddrinfo failed
It shows that the error number is 11004 and the reason is getaddrinfo failed.
2. HTTPError
HTTPError is a subclass of URLError. When you make a request with the urlopen method, the server returns a response object, which carries a numeric "status code". For example, if the response is a "redirect", the document has to be fetched from a different address, and urllib2 handles that for you.
For the cases it cannot handle, urlopen raises an HTTPError with the corresponding status. The HTTP status code represents the status of the response returned by the HTTP protocol; the common status codes are summarized below (a small dispatch sketch follows the list):
100: Continue. The client should continue sending the rest of the request, or ignore this response if the request has already been completed.
101: Switching Protocols. After sending the final empty line of this response, the server switches to the protocols defined in the Upgrade header. This should only be done when switching to the new protocol is clearly advantageous.
102: Processing. A status code extended by WebDAV (RFC 2518), indicating that processing will continue.
200: OK. The request succeeded. Handling: read the response content and process it.
201: Created. The request is complete and a new resource was created; the URI of the new resource is available in the response entity. Handling: a crawler will rarely encounter this.
202: Accepted. The request was accepted but processing has not finished. Handling: wait (blocking).
204: No Content. The server fulfilled the request but returned no new information; if the client is a user agent, it need not update its document view. Handling: discard.
300: Multiple Choices. Not used directly by HTTP/1.0 applications, only as the default interpretation of 3xx responses; several versions of the requested resource are available. Handling: process further if the program can, otherwise discard.
301: Moved Permanently. The requested resource has been assigned a permanent URL and should be accessed through it in the future. Handling: redirect to the assigned URL.
302: Found. The requested resource is temporarily available at a different URL. Handling: redirect to the temporary URL.
304: Not Modified. The requested resource has not been updated. Handling: discard.
400: Bad Request. Illegal request. Handling: discard.
401: Unauthorized. Handling: discard.
403: Forbidden. Handling: discard.
404: Not Found. Handling: discard.
500: Internal Server Error. The server encountered an unexpected condition that prevented it from completing the request; this usually means an error in the server-side code.
501: Not Implemented. The server does not support a feature required by the current request, e.g. it does not recognize the request method and cannot support it for any resource.
502: Bad Gateway. Acting as a gateway or proxy, the server received an invalid response from the upstream server while trying to fulfil the request.
503: Service Unavailable. The server is currently unable to handle the request because of temporary maintenance or overload; the condition is temporary and service will resume after some time.
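To make the "handling" hints above concrete, here is a minimal sketch (not part of the original tutorial, written in the same Python 2 / urllib2 style; the fetch helper and the discard/retry policy are only illustrative assumptions) of how a crawler might branch on the code carried by an HTTPError:
import urllib2
def fetch(url):
    # try to download the page; return None when the crawler should give up on this URL
    try:
        return urllib2.urlopen(url).read()
    except urllib2.HTTPError, e:
        if e.code in (400, 401, 403, 404):
            # client-side errors: discard this URL
            return None
        elif e.code in (502, 503):
            # temporary server-side problems: a caller could retry later instead
            return None
        else:
            # anything else is unexpected here, so re-raise it
            raise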
Every HTTPError instance has a code attribute, which is the error number sent by the server.
Because urllib2 handles redirects for you (codes beginning with 3 are processed automatically) and numbers in the 100-299 range indicate success, you will normally only see error numbers in the 400-599 range.
Let's write an example to get a feel for it. The caught exception is an HTTPError: it carries a code attribute, the error number, and we also print the reason attribute, which it inherits from its parent class URLError.
import urllib2
req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
    print e.reason
The output is as follows:
403
Forbidden
The error code is 403 and the reason is Forbidden, meaning the server refused access.
We know that the parent class of HTTPError is URLError. As a general programming rule, the except clause for a parent-class exception should come after the one for its subclass: if the subclass exception is not caught, the parent-class clause can still catch it. The code above can therefore be rewritten like this:
import urllib2
req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
except urllib2.URLError, e:
    print e.reason
else:
    print "OK"
If an HTTPError is caught, its code is printed and the URLError branch is skipped. If the exception is not an HTTPError, the URLError branch catches it and prints the reason for the error.
Alternatively, you can use hasattr to check for an attribute before using it; the code can be rewritten as follows:
import urllib2
req = urllib2.Request('http://blog.csdn.net/cqcre')
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
else:
    print "OK"
Checking for the attribute first avoids errors when printing an attribute that may not exist.
That's the introduction to URLError and HTTPError and the corresponding ways of handling them. Keep it up, everyone!
Python Crawler Primer (6): Use of cookies
Hello everyone. In the last section we studied exception handling for crawlers; in this section let's look at the use of cookies.
Why use cookies?
A cookie is data (usually encrypted) that certain websites store on the user's local machine in order to identify the user and perform session tracking.
For example, some sites require you to log in before a page can be accessed; before logging in, crawling that page's content is not allowed. We can use the urllib2 library to save the cookies obtained when we log in, and then reuse them when crawling other pages, which achieves our goal.
Before we do that, we must first introduce the concept of an opener.
1. Opener
When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector). So far we have been using the default opener through urlopen, which is a special opener (you can think of it as a special opener instance) whose parameters are just url, data and timeout.
If we need to work with cookies, this default opener is not enough, so we have to create a more general opener in order to set up cookie handling.
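As a minimal sketch of the idea (not from the original tutorial), the following builds a plain opener with the default handlers and uses it in place of urlopen; the call to install_opener is optional and simply makes this opener the global default used by urlopen:
import urllib2
# build an OpenerDirector with the default handlers only
opener = urllib2.build_opener()
# optional: make urlopen use this opener from now on
urllib2.install_opener(opener)
# use the opener directly, just like urlopen
response = opener.open('http://www.baidu.com')
print response.getcode()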
2. cookielib
The primary role of the cookielib module is to provide objects for storing cookies, so that it can be used together with the urllib2 module to access Internet resources. The cookielib module is very powerful: with an object of its CookieJar class we can capture cookies and resend them on subsequent requests, which is what makes a simulated login possible. The main objects of the module are CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar.
Their relationship: CookieJar --derives--> FileCookieJar --derives--> MozillaCookieJar and LWPCookieJar
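To make that relationship concrete, here is a small illustrative sketch (the file names are made-up assumptions): MozillaCookieJar and LWPCookieJar are both FileCookieJar subclasses, so both offer the same save()/load() interface and can be handed to HTTPCookieProcessor in the same way; they differ only in the on-disk file format they read and write.
import cookielib
mozilla_jar = cookielib.MozillaCookieJar('cookie_mozilla.txt')  # Netscape/Mozilla cookies.txt format
lwp_jar = cookielib.LWPCookieJar('cookie_lwp.txt')              # libwww-perl Set-Cookie3 format
# either jar could be passed to urllib2.HTTPCookieProcessor(...) exactly as shown below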
1) Save the cookies in a variable
First, let's use a CookieJar object to capture cookies and store them in a variable; here is an example to get a feel for it:
import urllib2
import cookielib
# declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib2.build_opener(handler)
# the open method here works like urllib2's urlopen; you can also pass in a Request
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
With the method above we save the cookies in a variable and then print their values; the result is as follows:
Name = BAIDUID
Value = b07b663b645729f11f659c02aae65b4c:fg=1
Name = BAIDUPSID
Value = b07b663b645729f11f659c02aae65b4c
Name = H_PS_PSSID
Value = 12527_11076_1438_10633
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0
2) Save cookies to a file
In the method above we kept the cookies in a variable. What if we want to save them to a file? That's where the FileCookieJar object comes in; here we use its subclass MozillaCookieJar to save the cookies.
import cookielib
import urllib2
# the file to save cookies to; cookie.txt sits in the same directory
filename = 'cookie.txt'
# declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib2.build_opener(handler)
# make a request; same principle as urllib2's urlopen
response = opener.open("http://www.baidu.com")
# save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)
A note on the two parameters of the final save method:
The official explanations are as follows:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have already expired; the file is overwritten if it already exists.
So ignore_discard means a cookie is saved even though it is marked to be discarded, and ignore_expires means expired cookies are written too, overwriting the file if it already exists; here we set both to True. After the program runs, the cookies are saved to the cookie.txt file, which you can open to inspect.
3) Read the cookies from the file and access a site
Now that the cookies are saved to a file, if we want to use them later we can read them back and use them to visit a website, like this:
import cookielib
import urllib2
# create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
# read the cookies from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# create the Request
req = urllib2.Request("http://www.baidu.com")
# use urllib2's build_opener method to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()
Imagine: if cookie.txt held the cookies of someone logged in to Baidu, we could read the contents of that cookie file and use the method above to simulate that person's logged-in session on Baidu.
4) Use cookies to simulate a website login
Below we take my school's educational administration system as an example and use cookies to simulate the login, saving the cookie information to a text file. Time to experience the power of cookies!
Note: I changed the password, so don't try to sneak into the course-selection system o(╯-╰)o
import urllib
import urllib2
import cookielib
filename = 'cookie.txt'
# declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({
    'stuid': '201200131012',
    'pwd': '23342321'
})
# the login URL of the educational administration system
loginUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bks_login2.login'
# simulate the login and capture the cookies in the jar
result = opener.open(loginUrl, postdata)
# save the cookies to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)
# use the cookies to request another URL, the grade-query page
gradeUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bkscjcx.curscopre'
# request the grade-query URL
result = opener.open(gradeUrl)
print result.read()
The principle of the program above is as follows:
Create an opener with a CookieJar attached; when the login URL is accessed, the cookies produced by logging in are saved, and then that same opener (carrying the cookies) is used to access other URLs.