Using Cookies in a Python Crawler

Source: Internet
Author: User

A cookie is data (usually encrypted) that some websites store on the user's local machine in order to identify the user and track sessions.

For example, some websites require you to log in before you can access certain pages, and until you do, you cannot fetch their content. We can use the urllib2 library to keep the cookies set at login and then fetch the other pages with them.

Before that, I would like to introduce the concept of an opener.

1. Opener

 

When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector). So far we have used the default opener through urlopen, which can be thought of as a special, preconfigured opener instance. Its parameters are just url, data, and timeout.

If we need cookies, this default opener is not enough. We have to build a more general opener that has cookie handling configured.
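As a minimal sketch of what "a more general opener" means, the snippet below builds one with cookie support. It assumes Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar; the variable names are only illustrative. Nothing is fetched, so it runs without a network connection.

```python
# Python 3 sketch: urllib2 -> urllib.request, cookielib -> http.cookiejar
import http.cookiejar
import urllib.request

# An in-memory jar that will collect cookies from responses
cookie_jar = http.cookiejar.CookieJar()
# The handler copies Set-Cookie headers into the jar and adds
# Cookie headers to outgoing requests
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)
# build_opener returns an OpenerDirector with the default handlers plus ours
opener = urllib.request.build_opener(cookie_handler)

print(type(opener).__name__)
print(cookie_handler in opener.handlers)
```

The opener can then be used in place of urlopen: opener.open(url) accepts the same url, data, and timeout arguments.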

2. Cookielib

 

The cookielib module provides objects that store cookies, so that they can be used together with urllib2 when accessing Internet resources. We can use a CookieJar object from this module to capture cookies and resend them on subsequent requests, which is how a simulated login is implemented. The main classes in this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.

Their relationship: CookieJar -- derived --> FileCookieJar -- derived --> MozillaCookieJar and LWPCookieJar
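The inheritance chain can be checked directly. This small sketch assumes Python 3, where the same classes live in http.cookiejar:

```python
import http.cookiejar as cookiejar

# FileCookieJar extends CookieJar with load/save support
print(issubclass(cookiejar.FileCookieJar, cookiejar.CookieJar))
# The two concrete file formats both extend FileCookieJar
print(issubclass(cookiejar.MozillaCookieJar, cookiejar.FileCookieJar))
print(issubclass(cookiejar.LWPCookieJar, cookiejar.FileCookieJar))
```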

1) Get the cookie and save it in a variable

First, use a CookieJar object to capture the cookies and store them in a variable.

    # coding: utf-8

    import cookielib
    import urllib2

    # Declare a CookieJar instance to hold the cookies
    cookie = cookielib.CookieJar()
    # Use HTTPCookieProcessor to create a cookie handler
    handle = urllib2.HTTPCookieProcessor(cookie)
    # Build an opener from the handler
    opener = urllib2.build_opener(handle)
    # opener.open works like urllib2.urlopen and also accepts a Request
    response = opener.open('http://www.baidu.com')

    for i in cookie:
        print 'Name = ' + i.name
        print 'Value = ' + i.value

The code above stores the cookies in a variable and prints the name and value of each one. The output is as follows:

    Name = BAIDUID
    Value = 6E0127B9536DE7EE8A68D8B5AE016CCA:FG=1
    Name = BIDUPSID
    Value = 6E0127B9536DE7EE8A68D8B5AE016CCA
    Name = H_PS_PSSID
    Value = 1465_13550_21110_17001_21672_22158
    Name = PSTM
    Value = 1491037392
    Name = BDSVRTM
    Value = 0
    Name = BD_HOME
    Value = 0

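To see the capture step without touching the network, here is a hedged Python 3 sketch. FakeResponse is a hypothetical stand-in for a real HTTP response, and the example.com URL and cookie values are invented for illustration. The call to extract_cookies is the same one HTTPCookieProcessor makes for each real response.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response; the jar only needs info()."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            # Assigning the same header name repeatedly appends a new header
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


jar = http.cookiejar.CookieJar()
request = urllib.request.Request("http://example.com/")
response = FakeResponse(["sid=abc123; Path=/", "theme=dark; Path=/"])

# Parse the Set-Cookie headers and store the accepted cookies in the jar
jar.extract_cookies(response, request)

for item in jar:
    print("Name = " + item.name)
    print("Value = " + item.value)
```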
2) Save the cookie to a file

In the method above, the cookies are kept in the cookie variable. What if we want to save them to a file? For that we use a FileCookieJar object. Here we use its subclass MozillaCookieJar to save the cookies.

    # coding: utf-8

    import cookielib
    import urllib2

    # File in which to save the cookies, in the same directory as the script
    file_name = 'cookie.txt'
    # Declare a MozillaCookieJar instance to hold the cookies and later write them to the file
    cookie = cookielib.MozillaCookieJar(file_name)
    # Use HTTPCookieProcessor to create a cookie handler
    handle = urllib2.HTTPCookieProcessor(cookie)
    # Build an opener from the handler
    opener = urllib2.build_opener(handle)
    # opener.open works like urllib2.urlopen and also accepts a Request
    response = opener.open('http://www.baidu.com')

    cookie.save(ignore_discard=True, ignore_expires=True)

 

The two parameters passed to save on the last line deserve a note. The official documentation explains them as follows:

    ignore_discard: save even cookies set to be discarded.
    ignore_expires: save even cookies that have expired.
    The file is overwritten if it already exists.

In other words, ignore_discard means that cookies are saved even if they are marked to be discarded, and ignore_expires means that cookies are saved even if they have already expired. Separately, save overwrites the file if it already exists. Here we set both parameters to True. After the script runs, the cookies are saved to cookie.txt. Let's check the content, as shown in the figure below.
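The effect of the two flags can be observed offline. The sketch below assumes Python 3 (http.cookiejar) and builds three made-up cookies for example.com by hand; make_cookie and saved_names are hypothetical helpers written for this illustration, not part of the library.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, expires):
    """Build a cookie for example.com; expires=None makes it a session cookie."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=(expires is None),
        comment=None, comment_url=None, rest={},
    )


def saved_names(path):
    """Reload a saved file with both flags on and return the cookie names in it."""
    loaded = http.cookiejar.MozillaCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return sorted(c.name for c in loaded)


jar = http.cookiejar.MozillaCookieJar()
jar.set_cookie(make_cookie("persistent", "1", expires=4102444800))  # year 2100
jar.set_cookie(make_cookie("session", "2", expires=None))           # marked discard
jar.set_cookie(make_cookie("stale", "3", expires=1))                # expired in 1970

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookie.txt")
    # Default flags: session and expired cookies are skipped
    jar.save(path)
    default_names = saved_names(path)
    # Both flags on: everything in the jar is written
    jar.save(path, ignore_discard=True, ignore_expires=True)
    all_names = saved_names(path)

print(default_names)
print(all_names)
```

With the default flags only the persistent cookie should reach the file, while the second save should write all three. Many login cookies are session cookies, which is one reason to pass ignore_discard=True when saving them.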

 

 

3) Load the cookie from the file and use it for access

We have saved the Cookie to the file. If you want to use it later, you can use the following method to read the cookie and visit the website.

    # coding: utf-8
    import urllib2
    import cookielib

    # Create a MozillaCookieJar instance
    cookie = cookielib.MozillaCookieJar()
    # Read the cookies from the file into the variable
    cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
    # Create a request
    request = urllib2.Request('http://www.baidu.com')
    # Use build_opener to create an opener
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

    req = opener.open(request)
    print req.read()

 

Imagine that cookie.txt holds the cookies of someone who has logged in to Baidu. We could then load that file and use the method above to access Baidu as if logged in with that person's account.
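The following Python 3 sketch shows what loading amounts to, without any network access. It writes a small cookie file by hand in the Netscape/Mozilla format that MozillaCookieJar reads, loads it, and attaches the cookie to a request. The domain, cookie name, and value are invented for illustration.

```python
import http.cookiejar
import os
import tempfile
import urllib.request

# One cookie per line, seven tab-separated fields:
# domain, subdomain flag, path, secure flag, expiry (epoch seconds), name, value
COOKIE_FILE_TEXT = (
    "# Netscape HTTP Cookie File\n"
    "example.com\tFALSE\t/\tFALSE\t4102444800\tsid\tabc123\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookie.txt")
    with open(path, "w") as f:
        f.write(COOKIE_FILE_TEXT)
    jar = http.cookiejar.MozillaCookieJar()
    jar.load(path, ignore_discard=True, ignore_expires=True)

request = urllib.request.Request("http://example.com/")
# HTTPCookieProcessor makes this call for every outgoing request
jar.add_cookie_header(request)
print(request.get_header("Cookie"))
```

Only cookies whose domain and path match the request are attached, so the same jar sends nothing to an unrelated host.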

4) Simulate a website login using cookies

The following example uses my blog (the account name and password shown are fake; feel free to try them if you don't believe me). We use cookies to simulate a login and save the cookie information to a text file. Let's see what cookies can do!

    # coding: utf-8

    import urllib
    import urllib2
    import cookielib

    file_name = 'cookie1.txt'
    # Declare a MozillaCookieJar instance to hold the cookies and later write them to the file
    cookie = cookielib.MozillaCookieJar(file_name)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    data = urllib.urlencode({
        'username': 'username',
        'pwd': 'password',
    })
    # Login URL
    login_url = 'https://passport.cnblogs.com/user/signin?ReturnUrl=http%3A%2F%2Fwww.cnblogs.com%2F'
    # Simulate the login; the cookies are stored in the variable
    result = opener.open(login_url, data)
    # Save the cookies to the file
    cookie.save(ignore_discard=True, ignore_expires=True)
    # Use the cookies to request another URL (the address is truncated in the source)
    select_url = 'http://...'
    # Request access
    result = opener.open(select_url)
    print result.read()

 

The program above works as follows:

Create an opener with a cookie jar attached. When the opener accesses the login URL, the cookies set by the login are stored in the jar, and the same opener then sends them when it accesses other URLs.
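For Python 3 the same flow can be sketched as below. This is an assumption-laden outline rather than a port of the author's script: LOGIN_URL is a placeholder, the form field names are copied from the example above, and login_then_fetch is a hypothetical helper that is defined but not called, so nothing is sent over the network. The main difference from Python 2 is that the POST body has to be bytes.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Placeholder values, not a real account or endpoint
LOGIN_URL = "https://example.com/login"
form = {"username": "username", "pwd": "password"}

# In Python 3 the POST body must be bytes, so encode the urlencoded string
body = urllib.parse.urlencode(form).encode("utf-8")
login_request = urllib.request.Request(LOGIN_URL, data=body)

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def login_then_fetch(opener, login_request, target_url):
    """POST the login form, then fetch target_url with the same opener.

    Cookies set by the login response stay in the opener's jar and are
    sent automatically with the second request.
    """
    with opener.open(login_request):
        pass
    with opener.open(target_url) as response:
        return response.read()


# Supplying data turns the request into a POST
print(login_request.get_method())
print(body)
```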

 
