Summary of how cookies are used in a Python web crawler

Source: Internet
Author: User
Tags: python, web crawler

When writing a Python crawler, do we think about anything besides exception handling, such as the use of cookies? And when we do use cookies, have you ever wondered why they are needed? Let's take a look.
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session. For example, some sites require you to log in before a page can be accessed, so without logging in you cannot crawl that page's content. We can use the urllib2 library to hold on to the cookies from our login and then reuse them when crawling other pages, which achieves our goal.
Before we do that, we must first introduce the concept of an opener.
1. Opener
When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). So far we have always been using the default opener, through urlopen. urlopen can be understood as a special instance of an opener whose only parameters are url, data, and timeout.
If we need to use cookies, this default opener is not enough, so we have to create a more general opener to set up cookie handling.
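As a quick sketch of what "building a more general opener" means (shown with the Python 3 names, where urllib2 became urllib.request and cookielib became http.cookiejar), an opener is assembled from handler objects:

```python
import http.cookiejar
import urllib.request

# urlopen uses a default, module-level opener; build_opener lets us
# assemble our own from handlers, e.g. one that manages cookies.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# The result is an OpenerDirector, the general form behind urlopen
print(type(opener).__name__)
```

opener.open(url) then behaves like urlopen, except that cookies are recorded and resent automatically.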
2. cookielib
The primary role of the cookielib module is to provide objects that store cookies, so that it can be used together with the urllib2 module to access Internet resources. The module is quite powerful: with an object of its CookieJar class we can capture cookies and resend them on subsequent requests, which is how a simulated login can be implemented. The module's main classes are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.
Their relationship: CookieJar --derived--> FileCookieJar --derived--> MozillaCookieJar and LWPCookieJar
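This derivation chain can be checked directly (Python 3 names shown, where the module was renamed http.cookiejar):

```python
import http.cookiejar as cookiejar

# FileCookieJar derives from CookieJar; MozillaCookieJar and LWPCookieJar
# both derive from FileCookieJar, as described above.
print(issubclass(cookiejar.FileCookieJar, cookiejar.CookieJar))         # True
print(issubclass(cookiejar.MozillaCookieJar, cookiejar.FileCookieJar))  # True
print(issubclass(cookiejar.LWPCookieJar, cookiejar.FileCookieJar))      # True
```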
1) Save the cookies to a variable
First we use a CookieJar object to capture the cookies and store them in a variable; let's get a feel for it:
import urllib2
import cookielib

# Declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# The open method here works like urllib2's urlopen; you can also pass in a Request
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
With the method above we save the cookies in a variable and then print out their values; the result is as follows:
 
Name = BAIDUID
Value = b07b663b645729f11f659c02aae65b4c:FG=1
Name = BAIDUPSID
Value = b07b663b645729f11f659c02aae65b4c
Name = H_PS_PSSID
Value = 12527_11076_1438_10633
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0
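To see this capture step without a live request, here is an offline sketch (Python 3 names; the FakeResponse class and the cookie value are made up for illustration) that feeds a Set-Cookie header into a CookieJar by hand:

```python
import email.message
import http.cookiejar
import urllib.request

class FakeResponse:
    """Minimal stand-in for the response object extract_cookies() expects."""
    def __init__(self, set_cookie):
        self._headers = email.message.Message()
        self._headers['Set-Cookie'] = set_cookie
    def info(self):
        return self._headers

jar = http.cookiejar.CookieJar()
request = urllib.request.Request('http://www.baidu.com/')
response = FakeResponse('BAIDUID=example-value; path=/')
# This is the step the opener performs for us on every real response
jar.extract_cookies(response, request)

for item in jar:
    print('Name = ' + item.name)
    print('Value = ' + item.value)
```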
2) Save the cookies to a file
In the method above we kept the cookies in a variable. What if we want to save them to a file instead? For that we need the FileCookieJar object; here we use its subclass MozillaCookieJar to save the cookies.
import cookielib
import urllib2

# Set the file that will store the cookies; cookie.txt is created in the same directory
filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# Make the request; same principle as urllib2's urlopen
response = opener.open("http://www.baidu.com")
# Save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)
A few words about the two parameters of the final save call.
The official explanation is:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have expired. The file is overwritten if it already exists.
So ignore_discard means a cookie is saved even if it is marked to be discarded, and ignore_expires means a cookie is saved even if it has already expired, with an existing file simply being overwritten. Here we set both to True. After running the script, the cookies are saved to the cookie.txt file, and we can open it to inspect its contents.
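A small offline sketch of these flags (Python 3 names; the cookie is built by hand with made-up values): a session cookie has expires=None and discard=True, so save() would silently skip it without ignore_discard=True, and the same flags are needed again when loading it back:

```python
import http.cookiejar
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
jar = http.cookiejar.MozillaCookieJar(path)
# A hand-built session cookie (hypothetical name/value): no expiry time,
# and marked to be discarded at the end of the session
jar.set_cookie(http.cookiejar.Cookie(
    version=0, name='SESSION_ID', value='abc123', port=None,
    port_specified=False, domain='.example.com', domain_specified=True,
    domain_initial_dot=True, path='/', path_specified=True, secure=False,
    expires=None, discard=True, comment=None, comment_url=None, rest={}))

jar.save(ignore_discard=True, ignore_expires=True)  # kept only thanks to the flag

reloaded = http.cookiejar.MozillaCookieJar()
reloaded.load(path, ignore_discard=True, ignore_expires=True)
print(len(reloaded))
```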

3) Load the cookies from the file and use them for a request
Now that we have saved the cookies to a file, if we want to use them later we can read them back and visit the website like this:
import cookielib
import urllib2

# Create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
# Read the cookies from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# Build the Request
req = urllib2.Request("http://www.baidu.com")
# Use urllib2's build_opener method to create an opener with a cookie processor
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()
 
Imagine that cookie.txt held the cookies of someone logged in to Baidu: by loading the contents of that cookie file, we could use the method above to simulate that person's logged-in session on Baidu.
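The file format itself is easy to inspect: MozillaCookieJar reads and writes the old Netscape cookies.txt format. Here is a sketch (Python 3 names; the cookie value is made up) that writes such a file by hand, loads it, and attaches it to an opener the way the script above does:

```python
import http.cookiejar
import os
import tempfile
import urllib.request

# Each data line has seven tab-separated fields: domain, include-subdomains
# flag, path, secure flag, expiry (Unix time), name, value
path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
with open(path, 'w') as f:
    f.write('# Netscape HTTP Cookie File\n')
    f.write('.baidu.com\tTRUE\t/\tFALSE\t2147483647\tBAIDUID\texample-value\n')

jar = http.cookiejar.MozillaCookieJar()
jar.load(path, ignore_discard=True, ignore_expires=True)
# Every request made through this opener now carries the loaded cookie
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
print([c.name for c in jar])
```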
4) Use cookies to simulate a website login
Below we take my school's educational administration system as an example and use cookies to implement a simulated login, saving the cookie information to a text file. Behold the power of cookies!
Note: I changed the password, so don't try to sneak into the course-selection system o(╯-╰)o
import urllib
import urllib2
import cookielib

filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({
    'stuid': '201200131012',
    'pwd': '23342321'
})
# URL for logging in to the educational administration system
loginUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bks_login2.login'
# Simulate the login; the cookies are captured in the jar
result = opener.open(loginUrl, postdata)
# Save the cookies to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)
# Use the cookies to access another URL: the grade-query page
gradeUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bkscjcx.curscopre'
# Request the grade-query page
result = opener.open(gradeUrl)
print result.read()
The principle of the program above is as follows:
We create an opener with an attached CookieJar. When the login URL is opened, the cookies produced by logging in are saved, and the opener then reuses those cookies to access other URLs,
such as the grade-query page or this semester's class-schedule page. The simulated login is achieved this way; pretty cool, isn't it?
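One porting note on the POST body above: in Python 3, urllib.urlencode moved to urllib.parse.urlencode, and the data passed to opener.open(url, data) must be bytes rather than str. A sketch using the (made-up) credentials from the script:

```python
from urllib.parse import urlencode

# Encode the login form fields; .encode() turns the resulting str into
# the bytes object that opener.open(url, data) requires in Python 3
postdata = urlencode({'stuid': '201200131012', 'pwd': '23342321'}).encode('utf-8')
print(postdata)
```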
All right, keep at it, everyone! We can now fetch a site's pages smoothly; the next step is to extract the useful content from them, and that is what the next section, on regular expressions, is for!

 
