A detailed example of simulating login with cookies in a Python crawler

Source: Internet
Author: User
Cookies are data (usually encrypted) stored on the user's local terminal by certain websites in order to identify the user and track the session.

For example, some websites only show the information you want after you log in; without logging in you only get the visitor view. In that case we can use the urllib2 library to save the cookie from a previous login, then load that cookie to fetch the page we want and crawl it. Understanding cookies is mainly preparation for the quick simulated-login crawl demonstrated here.

My previous post used the urlopen() function to open pages for crawling. It is just a simple Python opener that takes only three parameters: urlopen(url, data, timeout). These three parameters are not enough to carry a cookie to the target page. That is when we need another kind of opener, one built around a CookieJar.

cookielib is also an important part of a Python crawler; combined with urllib2 it can crawl the content we want. An object of the module's CookieJar class captures cookies and resends them on subsequent requests, which gives us the simulated login we need.

Note that cookielib ships with Python 2.7, so there is nothing to install. To see the bundled modules, look in the Lib folder under the Python installation directory, which contains all the installed modules. I did not realize this at first: I could not find cookielib in PyCharm, and installing it through the shortcut failed with: Couldn't find index page for 'cookielib' (maybe misspelled?)

Only later did it occur to me that it might be built in. I looked in the Lib folder and there it was. Half an hour wasted on blind fiddling ~~

Now for the module itself. Its main objects are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.

Their relationship: CookieJar --derives--> FileCookieJar --derives--> MozillaCookieJar and LWPCookieJar. Their main usage is covered below.

The urllib2.urlopen() function does not support authentication, cookies, or other advanced HTTP features. To support these features, you must create a custom opener object with the build_opener() function, which lets the Python program simulate a browser when accessing a site.
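The class relationship described above can be checked directly. This is a minimal sketch that assumes Python 3, where cookielib was renamed http.cookiejar; in Python 2.7 the same classes live in cookielib. It also shows which of the classes expose a save() method.

```python
# Python 3 names: cookielib became http.cookiejar.
from http.cookiejar import CookieJar, FileCookieJar, LWPCookieJar, MozillaCookieJar

# Each file-backed jar derives from FileCookieJar, which derives from CookieJar.
hierarchy = {
    "FileCookieJar derives from CookieJar": issubclass(FileCookieJar, CookieJar),
    "MozillaCookieJar derives from FileCookieJar": issubclass(MozillaCookieJar, FileCookieJar),
    "LWPCookieJar derives from FileCookieJar": issubclass(LWPCookieJar, FileCookieJar),
}
for relation, holds in hierarchy.items():
    print(relation, holds)

# Only the file-backed classes expose save(); a plain CookieJar does not.
plain_jar_can_save = hasattr(CookieJar(), "save")
mozilla_jar_can_save = hasattr(MozillaCookieJar(), "save")
print(plain_jar_can_save, mozilla_jar_can_save)
```

A plain CookieJar keeps cookies in memory only, so a file-backed subclass is needed whenever cookies must be written to disk.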

1. First, obtain the website's cookies

Example:

    # coding=utf-8
    import cookielib
    import urllib2

    # Create a CookieJar object to hold the cookies (note the capitalization of CookieJar)
    mycookie = cookielib.CookieJar()
    # Use HTTPCookieProcessor from urllib2 to create a handler that processes cookies
    handler = urllib2.HTTPCookieProcessor(mycookie)
    # Build an opener from the handler; it is used much like urlopen()
    opener = urllib2.build_opener(handler)
    # opener.open() returns a response object
    response = opener.open("http://www.baidu.com")
    for item in mycookie:
        print "name=" + item.name
        print "value=" + item.value

Results:

    name=BAIDUID
    value=73bd718962a6ea0dad4cb9578a08fdd0:fg=1
    name=BIDUPSID
    value=73bd718962a6ea0dad4cb9578a08fdd0
    name=H_PS_PSSID
    value=1450_19035_21122_17001_21454_21409_21394_21377_21526_21189_21398
    name=PSTM
    value=1478834132
    name=BDSVRTM
    value=0
    name=BD_HOME
    value=0

This gives us the simplest cookie.
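The capture-and-resend behaviour can also be tried without touching a real site. The sketch below assumes Python 3 (http.cookiejar and urllib.request in place of cookielib and urllib2). DemoHandler, the session=abc123 cookie, and the local server are made up for illustration and stand in for the remote website.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

received_cookie_headers = []  # Cookie header the server saw on each request


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real site: sets one cookie on every reply."""

    def do_GET(self):
        received_cookie_headers.append(self.headers.get("Cookie"))
        body = b"ok"
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

jar = CookieJar()
# ProxyHandler({}) bypasses any system proxy so the request stays on localhost.
opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(jar))
opener.open(url).read()  # first request: the server sets the cookie
opener.open(url).read()  # second request: the jar sends the cookie back

server.shutdown()
server.server_close()

for cookie in jar:
    print("name=" + cookie.name)
    print("value=" + cookie.value)
print(received_cookie_headers)
```

The first request carries no Cookie header; the second one does, because the jar attached to the opener stored the cookie from the first response.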

2. Save the cookie to a file

We obtained cookies above; now we learn how to save them. Here we use the subclass MozillaCookieJar to save the cookies.

Example:

    # coding=utf-8
    import cookielib
    import urllib2

    # Create a MozillaCookieJar object to hold the cookies (note the capitalization of MozillaCookieJar)
    mycookie = cookielib.MozillaCookieJar()
    # Use HTTPCookieProcessor from urllib2 to create a handler that processes cookies
    handler = urllib2.HTTPCookieProcessor(mycookie)
    # Build an opener from the handler; it is used much like urlopen()
    opener = urllib2.build_opener(handler)
    # opener.open() returns a response object
    response = opener.open("http://www.baidu.com")
    for item in mycookie:
        print "name=" + item.name
        print "value=" + item.value
    filename = 'mycookie.txt'  # name of the file to save to
    mycookie.save(filename, ignore_discard=True, ignore_expires=True)

This example is a small variation on the previous one, using the CookieJar subclass MozillaCookieJar. Why? If we change MozillaCookieJar back to CookieJar, the script fails:

CookieJar has no save attribute ~

About the save() method: ignore_discard means a cookie is saved even if it is marked to be discarded, and ignore_expires means a cookie is saved even if it has already expired. Here we set both to True. After running, the cookies are saved to the mycookie.txt file.

So we can successfully save the cookie we want.
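The effect of the two save() flags can be checked offline. This is a sketch under stated assumptions: it uses Python 3 (http.cookiejar instead of cookielib), and the three cookies, the example.com domain, and the helper names make_cookie and names_saved are made up for illustration. Each variant is written to its own file and then read back.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, expires, discard):
    """Build a cookie by hand for a hypothetical domain (no network needed)."""
    return Cookie(
        version=0, name=name, value="x",
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=expires, discard=discard,
        comment=None, comment_url=None, rest={},
    )


def names_saved(directory, label, **save_flags):
    """Save three sample cookies with the given flags; return the names found on disk."""
    path = os.path.join(directory, label + ".txt")
    jar = MozillaCookieJar(path)
    now = int(time.time())
    jar.set_cookie(make_cookie("persistent", now + 3600, discard=False))
    jar.set_cookie(make_cookie("session_only", None, discard=True))
    jar.set_cookie(make_cookie("expired", now - 3600, discard=False))
    jar.save(**save_flags)
    # Read back everything that was written, without filtering on load.
    on_disk = MozillaCookieJar()
    on_disk.load(path, ignore_discard=True, ignore_expires=True)
    return sorted(cookie.name for cookie in on_disk)


with tempfile.TemporaryDirectory() as tmp:
    default_names = names_saved(tmp, "default")
    keep_session = names_saved(tmp, "keep_session", ignore_discard=True)
    keep_all = names_saved(tmp, "keep_all", ignore_discard=True, ignore_expires=True)

print(default_names)
print(keep_session)
print(keep_all)
```

With the defaults, session cookies and expired cookies are skipped. Login cookies are often session cookies, which is why the examples here pass ignore_discard=True.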

3. Load the cookie from the file and access the site

    # coding=utf-8
    import urllib2
    import cookielib
    import urllib

    # Step 1: prepare the account, password and URL for the simulated login
    postdata = urllib.urlencode({
        'stuid': '1605122162',
        'pwd': 'xxxxxxxxx'  # the password is not disclosed here
    })
    # Login URL of the academic affairs system (grade query)
    loginurl = 'http://ids.xidian.edu.cn/authserver/login?service=http%3A%2F%2Fjwxt.xidian.edu.cn%2Fcaslogin.jsp'

    # Step 2: simulate the login and save the login cookie
    filename = 'cookie.txt'  # text file that will hold the cookie
    # Create a MozillaCookieJar instance to hold the cookie; it is written to the file later
    mycookie = cookielib.MozillaCookieJar(filename)
    # Build an opener that carries the cookie jar
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(mycookie))
    result = opener.open(loginurl, postdata)
    mycookie.save(ignore_discard=True, ignore_expires=True)  # save the cookies to cookie.txt

    # Step 3: use the cookie to request another URL on the academic affairs system
    # Any URL that uses the same account and password will do
    gradeurl = 'http://ids.xidian.edu.cn/authserver/login?service'
    result = opener.open(gradeurl)
    print result.read()

Create an opener with a cookie jar; when accessing the login URL, save the logged-in cookie, then use this cookie to access other URLs.

Core idea: create an opener that contains the cookie. Then, whenever the opener is used, the previously saved cookie is sent automatically.
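To reuse a saved cookie file in a later run, the jar can be filled with load() before the opener is built. The sketch below assumes Python 3 (http.cookiejar and urllib.request in place of cookielib and urllib2). The JSESSIONID cookie, its value, and the example.com URL are made up for illustration. Nothing is sent over the network; add_cookie_header() only shows which Cookie header would be attached.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, MozillaCookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

with tempfile.TemporaryDirectory() as tmp:
    cookie_path = os.path.join(tmp, "cookie.txt")

    # Stand-in for the login step: store one made-up session cookie on disk.
    saved = MozillaCookieJar(cookie_path)
    saved.set_cookie(Cookie(
        version=0, name="JSESSIONID", value="demo-token",
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=int(time.time()) + 3600, discard=False,
        comment=None, comment_url=None, rest={},
    ))
    saved.save()

    # A later run: start from an empty jar and load the file.
    jar = MozillaCookieJar()
    jar.load(cookie_path)

# An opener built on the loaded jar attaches the cookie to matching requests.
opener = build_opener(HTTPCookieProcessor(jar))

# Inspect the header the jar would add, without sending anything.
request = Request("http://example.com/grades")
jar.add_cookie_header(request)
cookie_header = request.get_header("Cookie")
print(cookie_header)
```

If the stored cookies are session cookies, as login cookies often are, pass ignore_discard=True to both save() and load() so they are not skipped.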

Thank you for reading. I hope this helps, and thank you for supporting this site!
