Python crawler Cookie usage, pythoncookie

Last Update:2017-04-02 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Cookie refers to the data (usually encrypted) stored on the user's local terminal by some websites to identify users and track sessions)

For example, some websites need to log on before they can access a page. Before you log on, it is not allowed to capture the content of a page. Then we can use the Urllib2 library to save the cookies we log on to, and then capture other pages to achieve our goal.

Before that, I would like to introduce the concept of an opener.

1. Opener

When you get a URL, you use an opener (an instance of urllib2.OpenerDirector ). Previously, we used the default opener, Which is urlopen. It is a special opener and can be understood as a special instance of opener. The input parameter is only url, data, and timeout.

If we need to use cookies, we cannot achieve the goal by using only opener. Therefore, we need to create a more general opener to set cookies.

2. Cookielib

The main function of the cookielib module is to provide objects that can store cookies for use with the urllib2 module to access Internet resources. The Cookielib module is very powerful. We can use the CookieJar class objects of this module to capture cookies and re-Send them during subsequent connection requests. For example, we can implement the simulated login function. Main objects of this module include CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.

Their relationship: CookieJar -- derived --> FileCookieJar -- derived --> extends illacookiejar and LWPCookieJar

1) Get the Cookie and save it to the variable

First, use the CookieJar object to obtain the cookie and store it in the variable.

1 # coding: UTF8 2 3 import cookielib 4 import urllib2 5 6 # declare a CookieJar object instance to save cookie 7 cookie = cookielib. cookieJar () 8 # Use the HTTPCookieProcessor object to create cookie processor 9 handle = urllib2.HTTPCookieProcessor (cookie) 10 # Use handle to build opener11 opener = urllib2.build _ opener (handle) 12 # this open method and urllib2 urlopen method can be passed in request13 response = opener. open ('HTTP: // www.baidu.com ') 14 15 for I in cookie: 16 print 'name =' + I. name17 print 'value = '+ I. value

Use the preceding method to save the cookie to the variable and print the value in the cookie. The running result is as follows:

 1 Name =BAIDUID 2 Value = 6E0127B9536DE7EE8A68D8B5AE016CCA:FG=1 3 Name =BIDUPSID 4 Value = 6E0127B9536DE7EE8A68D8B5AE016CCA 5 Name =H_PS_PSSID 6 Value = 1465_13550_21110_17001_21672_22158 7 Name =PSTM 8 Value = 1491037392 9 Name =BDSVRTM10 Value = 011 Name =BD_HOME12 Value = 0

2) Save the Cookie to the file

In the above method, we saved the cookie to the cookie variable. What should we do if we want to save the cookie to a file? At this time, we will use

FileCookieJar is an object. Here we use its subclass MozillaCookieJar to save cookies.

1 # coding: UTF8 2 3 import cookielib 4 import urllib2 5 6 # Set to save the cookie file, 7 file_namepolic'cookie.txt '8 # declare a CookieJar object instance in the same directory to save cookie 9 cookie = cookielib. mozillaCookieJar (file_name) 10 # Use the HTTPCookieProcessor object to create cookie processor 11 handle = urllib2.HTTPCookieProcessor (cookie) 12 # Use handle to build opener13 opener = urllib2.build _ opener (handle) 14 # this open method and urllib2 urlopen method can be passed in request15 response = opener. open ('HTTP: // www.baidu.com ') 16 17 cookie. save (ignore_discard = True, ignore_expires = True)

The two parameters of the last save method are described here:

The official explanation is as follows:

gnore_discard: save even cookies set to be discarded. ignore_expires: save even cookies that have expiredThe file is overwritten if it already exists

It can be seen that ignore_discard means that even if cookies are discarded, it will be saved. ignore_expires means that if cookies already exist in the file, it will overwrite the original file. Here, we set both to True. After the upload, cookieswill be saved to the cookie.txt file. Let's check the content, as shown in the figure below.

3) Obtain and access the Cookie from the file

We have saved the Cookie to the file. If you want to use it later, you can use the following method to read the cookie and visit the website.

1 # coding: UTF8 2 import urllib2 3 import cookielib 4 # create an instance object 5 cookie = cookielib. mozillaCookieJar () 6 # Read the cookie content from the file to the variable 7 cookie.load('cookie.txt ', ignore_discard = True, ignore_expires = True) 8 # create a request 9 request = urllib2.Request ('HTTP: // www.baidu.com ') 10 # Use the build_opener method to create an opener11 opener = urllib2.build _ opener (urllib2.HTTPCookieProcessor (cookie) 12 13 req = opener. open (request) 14 print req. read ()

Imagine that if the cookie.txt file stores a cookie that someone logs on to Baidu, we can extract the content of this cookie file and use the above method to simulate the login of this person's account to Baidu.

4) simulate website login using cookies

The following uses my blog as an example (my account and password are fake. If you don't believe it, you can try it). We use cookies to simulate logon and save the cookie information to a text file, let's take a look at the cookie algorithm!

1 # coding: UTF8 2 3 import urllib 4 import urllib2 5 import cookielib 6 7 file_name = 'cookie1.txt '8 # declare an MozillaCookieJar object instance to save the cookie, and then write the file 9 cookie = cookielib. export illacookiejar (file_name) 10 opener = urllib2.build _ opener (urllib2.HTTPCookieProcessor (cookie) 11 data = urllib. urlencode ({12 'username': 'username', 13 'pwd': 'Password', 14}) 15 # log on to URL16 login_url = 'https: // passport.cnblogs.com/user/ Signin? ReturnUrl = http % 3A % 2F % 2Fwww.cnblogs.com % 2F '17 # simulate logon and save the cookie to variable 18 result = opener. open (login_url, data) 19 # Save cookie to file 20 cookie. save (ignore_discard = True, ignore_expires = True) 21 # use cookie requests to access another URL 22 select_url = 'HTTP: // response # request access 24 result = opener. open (select_url) 25 print result. read ()

The principle of the above program is as follows:

Create an opener with a cookie. when accessing the login URL, save the cookie after logon and use the cookie to access other URLs.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Python crawler Cookie usage, pythoncookie

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Python crawler Cookie usage, pythoncookie

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support