Using Cookies in a Python Crawler
A cookie is data (usually encrypted) that a website stores on the user's local terminal to identify the user and track sessions.
For example, some websites require you to log in before you can access a page, so the page content cannot be crawled until you have logged in. We can use the urllib2 library to save the cookies obtained at login and then crawl the other pages with them.
Before that, I would like to introduce the concept of an opener.
1. Opener
When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector). So far we have used the default opener, which is urlopen. It can be understood as a special instance of an opener whose input parameters are only url, data, and timeout.
If we need to use cookies, this default opener is not enough. We need to create a more general opener that can handle cookies.
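As a minimal sketch of the difference between the default opener and a more general one, the snippet below builds both and inspects their handlers. It assumes Python 3, where urllib2 was renamed urllib.request; the variable names are illustrative only, and no network request is made.

```python
import urllib.request

# An opener built with no extra handlers: comparable to what urlopen uses.
default_opener = urllib.request.build_opener()

# A more general opener: the same defaults plus a cookie handler.
cookie_opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())


def has_cookie_handler(opener):
    """Return True if the opener contains an HTTPCookieProcessor."""
    return any(
        isinstance(h, urllib.request.HTTPCookieProcessor) for h in opener.handlers
    )


print(has_cookie_handler(default_opener))
print(has_cookie_handler(cookie_opener))
```

Both objects are OpenerDirector instances; only the second one stores cookies from responses and sends them back on later requests.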
2. Cookielib
The main purpose of the cookielib module is to provide objects that store cookies, so that it can be used together with urllib2 to access Internet resources. The module is very powerful: we can use its CookieJar class to capture cookies and resend them on subsequent requests, for example to implement a simulated login. The main classes of this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.
Their relationship: CookieJar -- derived --> FileCookieJar -- derived --> MozillaCookieJar and LWPCookieJar
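The inheritance chain can be checked directly. This is a small sketch that assumes Python 3, where the cookielib module is named http.cookiejar; the class names are the same.

```python
from http.cookiejar import CookieJar, FileCookieJar, LWPCookieJar, MozillaCookieJar

# FileCookieJar adds file save/load support on top of CookieJar.
file_jar_is_cookie_jar = issubclass(FileCookieJar, CookieJar)

# The two concrete file formats both derive from FileCookieJar.
mozilla_is_file_jar = issubclass(MozillaCookieJar, FileCookieJar)
lwp_is_file_jar = issubclass(LWPCookieJar, FileCookieJar)

print(file_jar_is_cookie_jar, mozilla_is_file_jar, lwp_is_file_jar)
```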
1) Get the Cookie and save it to the variable
First, use the CookieJar object to obtain the cookie and store it in the variable.
    # coding: utf-8

    import cookielib
    import urllib2

    # Create a CookieJar instance to hold the cookies
    cookie = cookielib.CookieJar()
    # Create a cookie handler with urllib2.HTTPCookieProcessor
    handler = urllib2.HTTPCookieProcessor(cookie)
    # Build an opener from the handler
    opener = urllib2.build_opener(handler)
    # opener.open() accepts a URL or a Request object, just like urllib2.urlopen()
    response = opener.open('http://www.baidu.com')

    for item in cookie:
        print 'Name = ' + item.name
        print 'Value = ' + item.value
The code above saves the cookies in a variable and prints the name and value of each cookie. The output is as follows:
    Name = BAIDUID
    Value = 6E0127B9536DE7EE8A68D8B5AE016CCA:FG=1
    Name = BIDUPSID
    Value = 6E0127B9536DE7EE8A68D8B5AE016CCA
    Name = H_PS_PSSID
    Value = 1465_13550_21110_17001_21672_22158
    Name = PSTM
    Value = 1491037392
    Name = BDSVRTM
    Value = 0
    Name = BD_HOME
    Value = 0
2) Save the Cookie to the file
In the method above, we saved the cookies in the cookie variable. What should we do if we want to save them to a file? In that case we use a FileCookieJar object. Here we use its subclass MozillaCookieJar to save the cookies.
    # coding: utf-8

    import cookielib
    import urllib2

    # File in which to save the cookies (in the current directory)
    file_name = 'cookie.txt'
    # Create a MozillaCookieJar instance that will write the cookies to that file
    cookie = cookielib.MozillaCookieJar(file_name)
    # Create a cookie handler with urllib2.HTTPCookieProcessor
    handler = urllib2.HTTPCookieProcessor(cookie)
    # Build an opener from the handler
    opener = urllib2.build_opener(handler)
    # opener.open() accepts a URL or a Request object, just like urllib2.urlopen()
    response = opener.open('http://www.baidu.com')

    cookie.save(ignore_discard=True, ignore_expires=True)
The two parameters of the final save call are explained in the official documentation as follows:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have expired.
The file is overwritten if it already exists.
In other words, ignore_discard saves cookies even if they are marked to be discarded, and ignore_expires saves cookies even if they have already expired; in either case an existing file is overwritten. Here we set both to True. After the script runs, the cookies are saved to the cookie.txt file. Let's check the content, as shown in the figure below.
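To see the effect of ignore_discard without any network access, here is a minimal sketch that builds two cookies by hand and saves them with and without the flag. It assumes Python 3 (cookielib is named http.cookiejar there); the helpers make_cookie and saved_names, the domain example.com, and the cookie names are hypothetical and chosen only for illustration.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, expires):
    """Hypothetical helper: build a cookie for example.com by hand."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False,
        expires=expires,
        discard=expires is None,  # no expiry time means a session cookie
        comment=None, comment_url=None, rest={},
    )


jar = MozillaCookieJar()
jar.set_cookie(make_cookie("session_id", "abc", None))                    # session cookie
jar.set_cookie(make_cookie("remember_me", "1", int(time.time()) + 3600))  # persistent cookie

path = os.path.join(tempfile.mkdtemp(), "cookie.txt")


def saved_names(**flags):
    """Save the jar with the given flags, then report which cookies reached the file."""
    jar.save(path, **flags)
    loaded = MozillaCookieJar()
    # Load everything that is in the file so only the save flags matter here.
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return sorted(c.name for c in loaded)


default_names = saved_names()
all_names = saved_names(ignore_discard=True, ignore_expires=True)
print(default_names)
print(all_names)
```

With the default flags only the persistent cookie is written; with both flags set to True the session cookie is written as well. Each save call replaces the previous file contents.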
3) Load the Cookie from the file and access the site
We have saved the Cookie to the file. If you want to use it later, you can use the following method to read the cookie and visit the website.
    # coding: utf-8
    import urllib2
    import cookielib

    # Create a MozillaCookieJar instance
    cookie = cookielib.MozillaCookieJar()
    # Read the cookies from the file into the variable
    cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
    # Create a request
    request = urllib2.Request('http://www.baidu.com')
    # Use build_opener to create an opener that sends the loaded cookies
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

    response = opener.open(request)
    print response.read()
Imagine that cookie.txt stores the cookies of someone who has logged in to Baidu. We could then load this cookie file and use the method above to simulate that person's logged-in session on Baidu.
4) Simulate website login using cookies
The following uses my blog as an example (the account and password are fake; if you don't believe it, you can try them). We use cookies to simulate a login and save the cookie information to a text file. Let's see how it works!
    # coding: utf-8

    import urllib
    import urllib2
    import cookielib

    file_name = 'cookie1.txt'
    # Create a MozillaCookieJar instance to hold the cookies and later write them to the file
    cookie = cookielib.MozillaCookieJar(file_name)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    data = urllib.urlencode({
        'username': 'username',
        'pwd': 'Password',
    })
    # Login URL
    login_url = 'https://passport.cnblogs.com/user/signin?ReturnUrl=http%3A%2F%2Fwww.cnblogs.com%2F'
    # Simulate the login; the cookies are stored in the cookie variable
    result = opener.open(login_url, data)
    # Save the cookies to the file
    cookie.save(ignore_discard=True, ignore_expires=True)
    # Use the cookies to request another URL
    select_url = 'http://...'  # URL truncated in the original
    # Request access
    result = opener.open(select_url)
    print result.read()
The principle of the above program is as follows:
Create an opener with a cookie handler, access the login URL so that the cookies set at login are saved, and then use those cookies to access other URLs.
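The same flow can be tried without a real account by pointing the opener at a throwaway local server. The following is only a sketch under stated assumptions: it uses Python 3 module names (urllib.request, http.cookiejar), and the DemoHandler class, the /login and /profile paths, and the session=abc123 cookie are all hypothetical stand-ins for a real website.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: any POST sets a cookie, any GET echoes the Cookie header."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the submitted form data
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = (self.headers.get("Cookie") or "").encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),  # no proxies, so the local server is reached directly
    urllib.request.HTTPCookieProcessor(jar),
)

# Step 1: "log in" with the opener; the cookie from the response lands in the jar.
form = urllib.parse.urlencode({"username": "username", "pwd": "Password"}).encode()
opener.open(base + "/login", form).read()

# Step 2: request another URL with the same opener; the cookie is sent automatically.
echoed = opener.open(base + "/profile").read().decode()
print(echoed)

server.shutdown()
server.server_close()
```

The second request carries the cookie only because it goes through the same opener (and therefore the same cookie jar) as the login request.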