Python crawlers: using cookies to simulate login (with examples)

Source: Internet
Author: User


A cookie is data (usually encrypted) that some websites store on the user's local terminal to identify the user and track sessions.
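To make the definition concrete, here is a minimal sketch of what a cookie looks like when a server sends it in a Set-Cookie header. It assumes Python 3 (the standard http.cookies module); the header value and the helper name parse_set_cookie are made up for illustration.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Parse one Set-Cookie header value into (name, value, attributes)."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    # A Set-Cookie header carries exactly one cookie, so take the first entry.
    name, morsel = next(iter(cookie.items()))
    # Keep only the attributes that were actually present in the header.
    attrs = {key: val for key, val in morsel.items() if val}
    return name, morsel.value, attrs


# Hypothetical header value, as a server might send it after a login.
name, value, attrs = parse_set_cookie("session_id=abc123; Path=/; Max-Age=3600")
print(name, value, attrs)
```

The name/value pair is what identifies the session; the attributes tell the client where and for how long to send it back.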

For example, some websites only show the information you want after you log in. In that case you can use the urllib2 library to save the cookie from a previous login, then load that cookie to fetch the desired page and scrape it. Understanding cookies is mainly useful for quickly simulating a login and capturing the target webpage.

In my previous post, I used the urlopen() function to open web pages for crawling. urlopen() is only a simple default opener: its signature is urlopen(url, data, timeout), and these three parameters are far from enough to obtain the cookies of the target webpage. For that we need a different opener, one built around a CookieJar.

cookielib is another important module for Python crawlers. Combined with urllib2, it lets you crawl content that depends on a session. CookieJar objects from this module capture cookies and re-send them in subsequent requests, which is how we implement the simulated login.
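The capture-and-re-send behaviour can be sketched offline, without contacting any server. The sketch assumes Python 3, where cookielib is named http.cookiejar and urllib2 is named urllib.request. FakeResponse, the example.com URLs, and the cookie value are hypothetical stand-ins; in real use the opener passes actual responses to the jar for you.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response; the jar only calls info()."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


def capture_and_resend():
    jar = http.cookiejar.CookieJar()

    # Capture: the jar reads the Set-Cookie headers of the login response.
    login_request = urllib.request.Request("http://example.com/login")
    jar.extract_cookies(FakeResponse(["session_id=abc123; Path=/"]), login_request)

    # Re-send: the jar adds a Cookie header to a later request to the same host.
    next_request = urllib.request.Request("http://example.com/grades")
    jar.add_cookie_header(next_request)
    return [cookie.name for cookie in jar], next_request.get_header("Cookie")


names, header = capture_and_resend()
print(names, header)
```

An opener built with HTTPCookieProcessor performs these two steps automatically on every request and response.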

cookielib ships with Python 2.7 and does not need to be installed separately. To see the built-in modules, look in the Lib folder under the Python directory, which holds the standard library modules. At first I could not find cookielib in PyCharm, and the quick install failed with this error: Couldn't find index page for 'cookielib' (maybe misspelled?)

Only then did it occur to me that the module might be built in. I checked the Lib folder and there it was, after wasting half an hour ~~

Next we will introduce this module. The main objects of this module include CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.

Their relationship: CookieJar --derived--> FileCookieJar --derived--> MozillaCookieJar and LWPCookieJar. MozillaCookieJar and LWPCookieJar are the ones mainly used, and we will also discuss them below. The urllib2.urlopen() function does not support authentication, cookies, or other advanced HTTP features. To use these features, you must call the build_opener() function to create a custom opener object (this is what lets a Python program simulate browser access).
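The class relationship and the role of build_opener() can be checked directly. This is a small sketch assuming Python 3, where the same classes live in http.cookiejar and the opener machinery in urllib.request; the helper names describe_hierarchy and build_cookie_opener are illustrative, not part of any library.

```python
import http.cookiejar
import urllib.request


def describe_hierarchy():
    """Return the base classes (excluding object) of the two file-backed jars."""
    classes = [http.cookiejar.MozillaCookieJar, http.cookiejar.LWPCookieJar]
    return {cls.__name__: [base.__name__ for base in cls.__mro__[1:-1]]
            for cls in classes}


def build_cookie_opener(jar=None):
    """Build an opener whose requests and responses pass through a cookie jar."""
    if jar is None:
        jar = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(handler), jar


print(describe_hierarchy())
opener, jar = build_cookie_opener()
print([type(h).__name__ for h in opener.handlers])
```

The opener returned here is used through opener.open(url), in the same way as urlopen(url).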

1. First, let's get the cookie of the website.

Example:

    # coding=utf-8
    import cookielib
    import urllib2

    # Declare a CookieJar object to hold the cookies (the class name is case-sensitive)
    mycookie = cookielib.CookieJar()
    # Wrap the jar in urllib2's HTTPCookieProcessor, the handler that manages cookies
    handler = urllib2.HTTPCookieProcessor(mycookie)
    # Build an opener from the handler; opener.open() is used like urlopen()
    opener = urllib2.build_opener(handler)
    # opener.open() returns a response object; the jar is filled along the way
    response = opener.open("http://www.baidu.com")
    for item in mycookie:
        print "name=" + item.name
        print "value=" + item.value

Result:

    name=BAIDUID
    value=73BD718962A6EA0DAD4CB9578A08FDD0:FG=1
    name=BIDUPSID
    value=73BD718962A6EA0DAD4CB9578A08FDD0
    name=H_PS_PSSID
    value=1450_19035_21122_17001_21454_21409_21394_21377_21526_21189_21398
    name=PSTM
    value=1478834132
    name=BDSVRTM
    value=0
    name=BD_HOME
    value=0

In this way, we get the simplest cookie.

2. Save the cookie to a file

We obtained the cookie above. Next we will learn how to save it, using the CookieJar subclass MozillaCookieJar.

Example:

    # coding=utf-8
    import cookielib
    import urllib2

    # Declare a MozillaCookieJar object to hold the cookies (the class name is case-sensitive)
    mycookie = cookielib.MozillaCookieJar()
    # Wrap the jar in urllib2's HTTPCookieProcessor, the handler that manages cookies
    handler = urllib2.HTTPCookieProcessor(mycookie)
    # Build an opener from the handler; opener.open() is used like urlopen()
    opener = urllib2.build_opener(handler)
    response = opener.open("http://www.baidu.com")
    for item in mycookie:
        print "name=" + item.name
        print "value=" + item.value
    filename = 'mycookie.txt'  # name of the file to save to
    mycookie.save(filename, ignore_discard=True, ignore_expires=True)

This example is a simple variation of the previous one, using the CookieJar subclass MozillaCookieJar. Why the subclass? Try replacing MozillaCookieJar with CookieJar: the call to save() fails, because CookieJar has no save() method ~

In the save() method, ignore_discard means the cookie is saved even if it is marked to be discarded (a session cookie), and ignore_expires means the cookie is saved even if it has already expired. Here we set both to True. After the script runs, the cookies are saved to the mycookie.txt file, in the plain-text Netscape/Mozilla cookie format.

In this way, the cookie we want is successfully saved.
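The effect of the two save() flags is easier to see with hand-made cookies. The following is a sketch under stated assumptions: it uses Python 3 (http.cookiejar in place of cookielib), and the cookie names, the example.com domain, and the helpers make_cookie and saved_names are invented for the demonstration. It saves three cookies to a temporary file and reports which ones reached the file.

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(name, value, expires):
    """Build a cookie by hand; expires=None marks a session (discardable) cookie."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=(expires is None),
        comment=None, comment_url=None, rest={})


def saved_names(ignore_discard, ignore_expires):
    """Save three cookies with the given flags and list the names in the file."""
    now = int(time.time())
    jar = http.cookiejar.MozillaCookieJar()
    jar.set_cookie(make_cookie("session", "s1", None))           # session cookie
    jar.set_cookie(make_cookie("persistent", "p1", now + 3600))  # valid for an hour
    jar.set_cookie(make_cookie("stale", "x1", now - 3600))       # already expired
    fd, filename = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        jar.save(filename, ignore_discard=ignore_discard,
                 ignore_expires=ignore_expires)
        # Reload everything that is in the file, whatever its state.
        reloaded = http.cookiejar.MozillaCookieJar()
        reloaded.load(filename, ignore_discard=True, ignore_expires=True)
        return sorted(cookie.name for cookie in reloaded)
    finally:
        os.remove(filename)


for flags in [(False, False), (True, False), (False, True), (True, True)]:
    print(flags, saved_names(*flags))
```

With both flags left at False, only the unexpired persistent cookie is written. Login cookies are often session cookies, which is one reason to pass ignore_discard=True when saving a login.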

3. Obtain the cookie from the file and use it for access

    # coding=utf-8
    import urllib2
    import cookielib
    import urllib

    # Step 1: provide the account, password, and URL used to simulate the login
    postdata = urllib.urlencode({
        'stuid': '000000',
        'pwd': 'xxxxxxxxxxx',  # the real password is not shown here
    })
    # Login URL of the educational administration system
    loginUrl = 'http://ids.xidian.edu.cn/authserver/login?service=http%3A%2F%2f%xt.xidian.edu.cn%2Fcaslogin.jsp'

    # Step 2: simulate the login and save the login cookie
    filename = 'cookie.txt'  # text file that stores the cookie
    # Declare a MozillaCookieJar instance to hold the cookie and later write it to the file
    mycookie = cookielib.MozillaCookieJar(filename)
    # Define the opener; it is bound to the cookie jar
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(mycookie))
    result = opener.open(loginUrl, postdata)
    mycookie.save(ignore_discard=True, ignore_expires=True)  # save the cookie to cookie.txt

    # Step 3: use the cookie to access another URL of the educational administration system
    gradeUrl = 'http://ids.xidian.edu.cn/authserver/login?service'
    # The same opener carries the login cookie, so the score query page can be requested
    result = opener.open(gradeUrl)
    print result.read()

Create an opener with a cookie jar, access the login URL, save the cookie after login, and then use the cookie to access other URLs.

Core idea: create an opener that holds the cookie content. Whenever the opener is used, the stored cookie is sent automatically.
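The listing in this section saves the login cookie but never reads it back. A later run could load the saved file instead of logging in again. The sketch below shows one way this might look; it assumes Python 3 (http.cookiejar and urllib.request), and the file contents, the example.com URL, and the helper names are hypothetical. It works offline by inspecting the Cookie header the jar would attach.

```python
import http.cookiejar
import os
import tempfile
import urllib.request


def opener_from_cookie_file(filename):
    """Load cookies saved in Mozilla format and build an opener that sends them."""
    jar = http.cookiejar.MozillaCookieJar(filename)
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def cookie_header_for(jar, url):
    """Return the Cookie header the jar would attach to a request for url."""
    request = urllib.request.Request(url)
    jar.add_cookie_header(request)
    return request.get_header("Cookie")


def demo():
    # Hypothetical cookie file, standing in for one saved by an earlier login.
    fd, filename = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        handle.write("# Netscape HTTP Cookie File\n")
        handle.write("example.com\tFALSE\t/\tFALSE\t\tsession_id\tabc123\n")
    try:
        opener, jar = opener_from_cookie_file(filename)
        return cookie_header_for(jar, "http://example.com/grades")
    finally:
        os.remove(filename)


print(demo())
```

In real use, opener.open(url) on the returned opener would send the loaded cookies; whether the site still accepts them depends on how long its sessions last.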

Thank you for reading this article. I hope it will help you. Thank you for your support for this site!
