Python Crawler Primer: Using Cookies

Source: Internet
Author: User

In this section, let's take a look at the use of cookies.

Why use cookies?

Cookies are data that certain websites store on the user's local terminal (usually encrypted) in order to identify users and track sessions.

For example, some sites require you to log in before you can access a page; until you log in, you are not allowed to crawl that page's content. We can use the urllib2 library to save the cookies from our login and then crawl the other pages with them.

Before we do, we must first introduce the concept of an opener.

1.Opener

When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector). So far we have always used the default opener, through urlopen. It can be understood as a special instance of an opener whose only parameters are url, data, and timeout.

If we need to use cookies, this default opener is not enough, so we need to create a more general opener that handles cookies.
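As a minimal sketch of building such an opener: the article's code targets Python 2 (urllib2 and cookielib), while this example assumes Python 3, where the same classes live in urllib.request and http.cookiejar. It makes no network request; it only shows how the pieces fit together.

```python
import urllib.request
from http.cookiejar import CookieJar

# A cookie jar to hold whatever cookies a server sends back.
cookie_jar = CookieJar()

# HTTPCookieProcessor is the handler that stores Set-Cookie headers in the jar
# and adds matching Cookie headers to later requests.
handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener() returns an OpenerDirector with the default handlers plus ours.
opener = urllib.request.build_opener(handler)

print(type(opener).__name__)  # OpenerDirector
print(handler.cookiejar is cookie_jar)  # True
```

Calling opener.open(url) would then behave like urlopen, except that cookies are kept across requests.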

2.Cookielib

The primary role of the cookielib module is to provide objects that store cookies, for use together with the urllib2 module when accessing Internet resources. The cookielib module is very powerful: we can use an object of its CookieJar class to capture cookies and resend them on subsequent requests, for example to implement a simulated login. The main classes of the module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.

Their relationship: CookieJar --derived--> FileCookieJar --derived--> MozillaCookieJar and LWPCookieJar
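The inheritance chain can be checked directly. This is a small sketch assuming Python 3, where cookielib is named http.cookiejar; the class names are otherwise the same.

```python
from http.cookiejar import CookieJar, FileCookieJar, MozillaCookieJar, LWPCookieJar

for cls in (MozillaCookieJar, LWPCookieJar):
    # Walk the method resolution order, leaving out the base `object` class.
    chain = [c.__name__ for c in cls.__mro__ if c is not object]
    print(" -> ".join(chain))
```

Both file-backed jars inherit load and save from FileCookieJar, and cookie capture and resending from CookieJar.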

   1) Save cookies to a variable

First, we use a CookieJar object to capture the cookies and store them in a variable. Let's try it out:

    import urllib2
    import cookielib

    # Declare a CookieJar instance to hold the cookies
    cookie = cookielib.CookieJar()
    # Use the HTTPCookieProcessor object of the urllib2 library to create a cookie handler
    handler = urllib2.HTTPCookieProcessor(cookie)
    # Build the opener with the handler
    opener = urllib2.build_opener(handler)
    # The open method here works like urllib2's urlopen; it also accepts a Request
    response = opener.open('http://www.baidu.com')
    for item in cookie:
        print 'Name = ' + item.name
        print 'Value = ' + item.value

The code above saves the cookies in a variable and then prints the name and value of each cookie. The result is as follows:

    Name = BAIDUID
    Value = b07b663b645729f11f659c02aae65b4c:fg=1
    Name = BAIDUPSID
    Value = b07b663b645729f11f659c02aae65b4c
    Name = H_PS_PSSID
    Value = 12527_11076_1438_10633
    Name = BDSVRTM
    Value = 0
    Name = BD_HOME
    Value = 0

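The same capture step can be tried without touching a real site. The sketch below assumes Python 3 (urllib.request and http.cookiejar instead of urllib2 and cookielib) and uses a hypothetical local test server, DemoHandler, that sets one made-up cookie; neither the server nor the cookie name comes from the original article.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server that sets one cookie on every response."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Set-Cookie", "demo_id=abc123; Path=/")
        self.end_headers()
        self.wfile.write(b"hello")

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_cookies():
    """Start the local server, fetch one page, and return the captured cookies."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        jar = CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # ignore any system proxy settings
            urllib.request.HTTPCookieProcessor(jar),
        )
        opener.open("http://127.0.0.1:%d/" % server.server_address[1]).read()
        return {item.name: item.value for item in jar}
    finally:
        server.shutdown()
        server.server_close()


for name, value in fetch_cookies().items():
    print("Name = " + name)
    print("Value = " + value)
```

Iterating over the jar yields Cookie objects, which is why item.name and item.value are available in the loop.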
   2) Save cookies to a file

In the method above, we stored the cookies in the cookie variable. What if we want to save them to a file?

For that we need a FileCookieJar object; here we use its subclass MozillaCookieJar to save the cookies.

    import cookielib
    import urllib2

    # Set the file that holds the cookies: cookie.txt in the same directory
    filename = 'cookie.txt'
    # Declare a MozillaCookieJar instance to hold the cookies, which are later written to the file
    cookie = cookielib.MozillaCookieJar(filename)
    # Use the HTTPCookieProcessor object of the urllib2 library to create a cookie handler
    handler = urllib2.HTTPCookieProcessor(cookie)
    # Build the opener with the handler
    opener = urllib2.build_opener(handler)
    # Make a request; this works the same way as urllib2's urlopen
    response = opener.open("http://www.baidu.com")
    # Save the cookies to the file
    cookie.save(ignore_discard=True, ignore_expires=True)

The two parameters of the save method in the last line deserve a note. The official explanations are as follows:

ignore_discard: save even cookies set to be discarded.

ignore_expires: save even cookies that have expired.

The file is overwritten if it already exists.

Thus, ignore_discard means that cookies marked to be discarded are saved anyway, and ignore_expires means that cookies that have already expired are saved as well. In either case, an existing file is overwritten. Here we set both to True. After the program runs, the cookies are saved to the cookie.txt file.
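The effect of the two flags can be seen offline with hand-made cookies. This is a sketch assuming Python 3 (http.cookiejar in place of cookielib); the helper functions, the cookie names, and the domain example.com are invented for illustration.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, expires, discard):
    """Build a test cookie for the made-up domain example.com."""
    return Cookie(
        version=0, name=name, value="x", port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True, secure=False, expires=expires,
        discard=discard, comment=None, comment_url=None, rest={},
    )


def saved_names(**save_flags):
    """Save three cookies with the given flags; return the names that reached the file."""
    jar = MozillaCookieJar()
    jar.set_cookie(make_cookie("session", None, True))  # session cookie, marked discard
    jar.set_cookie(make_cookie("expired", 1, False))    # expiry time far in the past
    jar.set_cookie(make_cookie("live", int(time.time()) + 3600, False))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cookie.txt")
        jar.save(path, **save_flags)
        reloaded = MozillaCookieJar()
        # Load everything that was written, so only save() does the filtering.
        reloaded.load(path, ignore_discard=True, ignore_expires=True)
    return sorted(cookie.name for cookie in reloaded)


print(saved_names())                                          # ['live']
print(saved_names(ignore_discard=True))                       # ['live', 'session']
print(saved_names(ignore_discard=True, ignore_expires=True))  # ['expired', 'live', 'session']
```

Many login cookies are session cookies, so leaving ignore_discard at its default of False can produce a file without the cookie you actually need.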

   3) Load cookies from a file and access a site

Now that the cookies are saved to a file, we can read them back later and use them to visit the website:

    import cookielib
    import urllib2

    # Create a MozillaCookieJar instance
    cookie = cookielib.MozillaCookieJar()
    # Read the cookie contents from the file into the variable
    cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
    # Create the request
    req = urllib2.Request("http://www.baidu.com")
    # Use urllib2's build_opener method to create an opener
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    response = opener.open(req)
    print response.read()

Imagine that our cookie.txt file holds the cookies from someone's Baidu login. By loading that cookie file with the method above, we could simulate that person's logged-in Baidu session.
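To see what loading a cookie file does to a request, the following sketch builds the Cookie header offline. It assumes Python 3 (http.cookiejar and urllib.request); the function cookie_header_from_file, the cookie session_id, and the example.com hosts are made up for illustration, and the script writes its own small cookie file first so that it is self-contained.

```python
import os
import tempfile
import urllib.request
from http.cookiejar import Cookie, MozillaCookieJar


def cookie_header_from_file(path, url):
    """Load cookies from a Mozilla-format file and return the Cookie header for url."""
    jar = MozillaCookieJar()
    jar.load(path, ignore_discard=True, ignore_expires=True)
    request = urllib.request.Request(url)
    # HTTPCookieProcessor performs this same step before each request is sent.
    jar.add_cookie_header(request)
    return request.get_header("Cookie")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookie.txt")
    writer = MozillaCookieJar(path)
    writer.set_cookie(Cookie(
        version=0, name="session_id", value="abc123", port=None,
        port_specified=False, domain="www.example.com", domain_specified=False,
        domain_initial_dot=False, path="/", path_specified=True, secure=False,
        expires=None, discard=True, comment=None, comment_url=None, rest={},
    ))
    writer.save(ignore_discard=True, ignore_expires=True)
    header = cookie_header_from_file(path, "http://www.example.com/")
    other = cookie_header_from_file(path, "http://www.example.org/")

print(header)  # session_id=abc123
print(other)   # None
```

The cookie is attached only to requests for the host it belongs to, so a loaded jar does not leak cookies to other sites.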

   4) Use cookies to simulate a website login

Below we take our school's educational administration system as an example: we use cookies to simulate a login and save the cookie information to a text file. Feel the power of cookies!

Note: I have changed the password, so don't try to sneak into my course selection system o(╯-╰)o

    import urllib
    import urllib2
    import cookielib

    filename = 'cookie.txt'
    # Declare a MozillaCookieJar instance to hold the cookies, which are later written to the file
    cookie = cookielib.MozillaCookieJar(filename)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    postdata = urllib.urlencode({
        'stuid': '201200131012',
        'pwd': '23342321'
    })
    # Login URL of the educational administration system
    loginUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bks_login2.login'
    # Simulate the login; the cookies are stored in the cookie variable
    result = opener.open(loginUrl, postdata)
    # Save the cookies to cookie.txt
    cookie.save(ignore_discard=True, ignore_expires=True)
    # Use the cookies to request another URL: the grade query page
    gradeUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bkscjcx.curscopre'
    # Request the grade query URL
    result = opener.open(gradeUrl)
    print result.read()

The principle of the program above is as follows:

Create an opener with a cookie jar, let it save the login cookies when it accesses the login URL, and then use those cookies to access other URLs.

Pages that are only visible after logging in, such as grade queries or this semester's class schedule, can be fetched this way. That is all there is to a simulated login. Pretty cool, isn't it?
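The whole flow can be rehearsed against a local stand-in for the real site. This sketch assumes Python 3 (urllib.request, urllib.parse, http.cookiejar); the LoginHandler server, its /login and /grades paths, and the demo credentials are hypothetical and only mimic the behavior described above.

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical site: POST /login sets a cookie, GET /grades requires it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        ok = form.get("stuid") == ["demo"] and form.get("pwd") == ["secret"]
        self.send_response(200 if ok else 403)
        if ok:
            self.send_header("Set-Cookie", "session=logged-in; Path=/")
        self.end_headers()
        self.wfile.write(b"welcome" if ok else b"denied")

    def do_GET(self):
        logged_in = "session=logged-in" in self.headers.get("Cookie", "")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"grades: A" if logged_in else b"please log in")

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_grades(login_first):
    """Fetch /grades with one opener, optionally posting the login form first."""
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    try:
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # ignore any system proxy settings
            urllib.request.HTTPCookieProcessor(CookieJar()),
        )
        if login_first:
            postdata = urllib.parse.urlencode({"stuid": "demo", "pwd": "secret"}).encode()
            opener.open(base + "/login", postdata).read()
        return opener.open(base + "/grades").read().decode()
    finally:
        server.shutdown()
        server.server_close()


print(fetch_grades(login_first=False))  # please log in
print(fetch_grades(login_first=True))   # grades: A
```

The second request succeeds only because the same opener is reused: the jar attached to it received the cookie from the login response and sent it back with the next request.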

Reprinted and adapted from: 静觅 » Python Crawler Primer 6: Using Cookies
