Python crawler cookie usage
In the previous article, we learned about exception handling in crawlers. Now let's take a look at how to use cookies.
Why use cookies?
A cookie is data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track the session.
For example, some websites require you to log in before you can access certain pages; without logging in, you are not allowed to capture the content of those pages. In that case we can use the urllib2 library to save the cookies from our login and then send them when capturing other pages, which achieves our goal.
Before that, we must first introduce the concept of an opener.
1. Opener
When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector). Until now we have been using the default opener through urlopen, which can be understood as a special opener instance; its only input parameters are url, data, and timeout.
If we need to handle cookies, the default opener is not enough. We have to create a more general opener with a cookie handler attached to it.
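As a quick illustration (a minimal sketch, not from the original program): building an opener with no extra handlers gives you something equivalent to urlopen, and urllib2.install_opener can even make a custom opener the global default.

import urllib2

# build_opener() with no arguments returns a plain OpenerDirector,
# which behaves just like urlopen
opener = urllib2.build_opener()
response = opener.open('http://www.baidu.com')
print response.getcode()

# install_opener makes our opener the global default,
# so later calls to urllib2.urlopen will use it too
urllib2.install_opener(opener)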
2. cookielib
The main function of the cookielib module is to provide objects that can store cookies, for use together with the urllib2 module to access Internet resources. The cookielib module is very powerful: we can use its CookieJar class objects to capture cookies and resend them on subsequent requests, which is how simulated login is implemented. The main objects of this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.
Their relationship: CookieJar is the base class; FileCookieJar derives from it, and MozillaCookieJar and LWPCookieJar in turn derive from FileCookieJar.
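If you want to confirm this hierarchy yourself, a quick check in the interpreter does it (a small sketch):

import cookielib

# FileCookieJar derives from CookieJar...
print issubclass(cookielib.FileCookieJar, cookielib.CookieJar)         # True
# ...and MozillaCookieJar and LWPCookieJar both derive from FileCookieJar
print issubclass(cookielib.MozillaCookieJar, cookielib.FileCookieJar)  # True
print issubclass(cookielib.LWPCookieJar, cookielib.FileCookieJar)      # True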
1) Get the Cookie and save it to a variable
First, we use the CookieJar object to obtain the cookie and store it in a variable.
import urllib2
import cookielib

# Declare a CookieJar object instance to save the cookie
cookie = cookielib.CookieJar()
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Use the handler to build an opener
opener = urllib2.build_opener(handler)
# The open method here works like urllib2's urlopen; you can also pass in a Request
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'name = ' + item.name
    print 'value = ' + item.value
With the method above, we save the cookie into the variable and then print the values in the cookie. The running result is as follows:
name = BAIDUID
value = B07B663B645729F11F659C02AAE65B4C:FG=1
name = BAIDUPSID
value = B07B663B645729F11F659C02AAE65B4C
name = H_PS_PSSID
value = 12527_11076_1438_10633
name = BDSVRTM
value = 0
name = BD_HOME
value = 0
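Each item in the CookieJar is in fact a Cookie object carrying more than just a name and a value; attributes such as domain, path, and expires are available too. A small sketch (the extra attributes printed here are standard cookielib.Cookie fields):

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
opener.open('http://www.baidu.com')
for item in cookie:
    # Besides name and value, each Cookie records where it applies and when it expires
    print item.name, item.value, item.domain, item.path, item.expires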
2) Save the Cookie to a file
In the method above, we saved the cookie into a variable. What if we want to save it to a file instead? That is where the FileCookieJar object comes in; here we use its subclass MozillaCookieJar to save cookies.
import cookielib
import urllib2

# Set the file cookie.txt in the current directory to save the cookie
filename = 'cookie.txt'
# Declare a MozillaCookieJar object instance to save the cookie, then write the file
cookie = cookielib.MozillaCookieJar(filename)
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener through the handler
opener = urllib2.build_opener(handler)
# Make the request; the principle is the same as urllib2's urlopen
response = opener.open("http://www.baidu.com")
# Save the cookie to the file
cookie.save(ignore_discard=True, ignore_expires=True)
The two parameters of the last save method are described here:
The official explanation is as follows:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have expired. The file is overwritten if it already exists.
In other words, ignore_discard saves cookies even if they are marked to be discarded, and ignore_expires saves cookies even if they have already expired; if the file already exists, it is overwritten. Here we set both to True. After running the program, the cookies are saved to the cookie.txt file; open it and you will see content like the sample below.
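For reference, MozillaCookieJar writes the classic Netscape cookies.txt format: a short header followed by one tab-separated line per cookie (domain, domain flag, path, secure flag, expiry, name, value). The content should look roughly like this (the cookie lines below are illustrative, based on the run above, with the expiry timestamps omitted):

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	...	BAIDUID	B07B663B645729F11F659C02AAE65B4C:FG=1
.baidu.com	TRUE	/	FALSE	...	BD_HOME	0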
3) Obtain the Cookie from the file and use it to access a site
We have now saved the cookie to a file. If we want to use it later, we can read the cookie from the file with the following method and visit a website with it.
import cookielib
import urllib2

# Create a MozillaCookieJar instance object
cookie = cookielib.MozillaCookieJar()
# Read the cookie content from the file into the variable
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# Create the request
req = urllib2.Request("http://www.baidu.com")
# Use urllib2's build_opener method to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()
Imagine: if the cookie.txt file stores the cookie of someone who has logged in to Baidu, then by loading that cookie file with the method above we can simulate that person's logged-in session on Baidu.
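To make this pattern reusable, here is a hedged sketch of a small helper; the function name make_cookie_opener is my own invention, not a library API:

import cookielib
import urllib2

def make_cookie_opener(filename):
    # Load cookies previously saved by MozillaCookieJar.save()
    cookie = cookielib.MozillaCookieJar()
    cookie.load(filename, ignore_discard=True, ignore_expires=True)
    # Build an opener that sends those cookies with every request
    return urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

opener = make_cookie_opener('cookie.txt')
print opener.open('http://www.baidu.com').read()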
4) Simulate website login using cookies
Next we will take my school's educational administration system as an example, use the cookie to simulate login, and save the cookie information to a text file. Let's look at the code!
Note: I have changed the password in the code, so don't try to log on to the course selection system with it o(∩_∩)o
import urllib
import urllib2
import cookielib

filename = 'cookie.txt'
# Declare a MozillaCookieJar object instance to save the cookie, then write the file
cookie = cookielib.MozillaCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({
    'stuid': '000000',
    'pwd': '000000'
})
# The login URL of the educational administration system
loginUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bks_login2.login'
# Simulate the login and save the cookie to the variable
result = opener.open(loginUrl, postdata)
# Save the cookie to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)
# Use the cookie to visit another URL on the site: the score query URL
gradeUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bkscjcx.curscopre'
# Request the score query URL
result = opener.open(gradeUrl)
print result.read()
The principle of the above program is as follows: create an opener that carries a cookie; when accessing the login URL, save the cookie obtained after login, and then use that cookie to access other URLs.
For example, with the saved cookie you can query this semester's course list, your grades, and other pages that can only be viewed after logging in. It works just like a real login. Isn't that cool?
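One practical addition (my own suggestion, not part of the program above): network requests can fail, so it is worth combining this with the exception handling we covered in the previous article. A sketch:

import urllib
import urllib2
import cookielib

cookie = cookielib.MozillaCookieJar('cookie.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({'stuid': '000000', 'pwd': '000000'})
loginUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bks_login2.login'
try:
    result = opener.open(loginUrl, postdata)
    # Only save the cookie if the login request itself succeeded
    cookie.save(ignore_discard=True, ignore_expires=True)
except urllib2.HTTPError, e:
    # The server answered, but with an error status code
    print 'HTTPError:', e.code
except urllib2.URLError, e:
    # The server could not be reached at all
    print 'URLError:', e.reason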
Okay, folks! Now we can fetch website content smoothly. The next step is to extract the parts we actually want from those pages. In the next article, we will look at regular expressions!
Articles you may be interested in:
- Writing a Python crawler with zero basics: HTTP exception handling
- Writing a Python crawler with zero basics: a guide to using urllib2
- Writing a Python crawler with zero basics: a full record of writing a crawler
- Writing a Python crawler with zero basics: writing crawlers with the Scrapy framework
- Python crawler code sharing
- A simple Python crawler that crawls the links on a page
- A simple Python 3 crawler
- Writing crawler programs in Python
- Making a simple web crawler with Python
- Crawler practice with Python's urllib + urllib2 + cookielib modules