Python web crawler: using Scrapy to log in to a website automatically

Source: Internet
Author: User
Tags: session id, python, web crawler

A previous article introduced how to implement automatic login with requests. This article describes how to implement automatic login with Scrapy, again taking the CSDN website as an example.
Scrapy uses FormRequest to log in and submit data to the server. It works like Request, with an extra formdata parameter used to carry the login form information (username and password). To use this class, the following import is required: from scrapy.http import FormRequest
As for the cookie values used during login, Scrapy handles cookies for us automatically: as long as the login succeeds, it carries the cookies along on subsequent requests just like a browser would.
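To watch this cookie handling in the logs, you can switch on cookie debugging in the project settings. A minimal sketch; COOKIES_ENABLED and COOKIES_DEBUG are standard Scrapy settings, and COOKIES_ENABLED is already True by default:

# settings.py
COOKIES_ENABLED = True   # let the CookiesMiddleware store and resend cookies (the default)
COOKIES_DEBUG = True     # log every Cookie header sent and Set-Cookie header received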
First, the crawler defines start_requests:

def start_requests(self):
    return [Request("http://passport.csdn.net/account/login",
                    meta={'cookiejar': 1},
                    callback=self.post_login,
                    method="POST")]

The Request here is used to access the login site first. The meta attribute is a dictionary in the format {'key': 'value'}; a dictionary is a mutable container that can store objects of any type.

The purpose of the meta parameter in Request is to pass information to the next function; that information can be of any type, such as a number, a string, a list, or a dictionary. The method is to assign the information to a key of the meta dictionary. The key 'cookiejar' used in start_requests above is a special key: when Scrapy sees it in meta, it automatically passes the cookies on to the callback function. Since it is a key, it needs a value; the example uses the number 1, but it could be another value, such as any string.
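As an aside, an ordinary (non-special) meta key simply carries a value into the callback. A minimal sketch, where the key page_type is a made-up name for illustration:

def start_requests(self):
    # attach arbitrary data to the request via meta
    yield Request("http://www.csdn.net/",
                  meta={'page_type': 'home'},
                  callback=self.parse_page)

def parse_page(self, response):
    # the same dictionary is available on the response
    print(response.meta['page_type'])   # prints 'home'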

callback is the function to be called once the login site responds. Let's see how post_login is implemented:

def post_login(self, response):
    html = BeautifulSoup(response.text, "html.parser")
    for input in html.find_all('input'):
        if 'name' in input.attrs and input.attrs['name'] == 'lt':
            lt = input.attrs['value']
        if 'name' in input.attrs and input.attrs['name'] == 'execution':
            e1 = input.attrs['value']
    data = {'username': 'xxxx', 'password': 'xxxxx', 'lt': lt,
            'execution': e1, '_eventId': 'submit'}
    return [FormRequest.from_response(response,
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.header,
                                      formdata=data,
                                      callback=self.after_login)]

The first step is to get the values of the lt and execution hidden fields, which were explained in the earlier post that introduced requests.
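If you would rather not depend on BeautifulSoup, the same hidden fields can be read with Scrapy's built-in selectors. A sketch, assuming the login page contains input elements named lt and execution:

lt = response.xpath("//input[@name='lt']/@value").extract_first()
e1 = response.xpath("//input[@name='execution']/@value").extract_first()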

Then FormRequest.from_response is called. This method constructs the form data from the page returned in response, so the first parameter is response. The response here is the page returned by the request to http://passport.csdn.net/account/login made in start_requests.

The next parameters are meta, headers, formdata, and callback. The callback here points to the function to run after login, after_login.

The implementation of after_login is as follows:

def after_login(self, response):
    print('After login')
    print(response.status)
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
    return [Request("http://my.csdn.net/my/mycsdn",
                    meta={'cookiejar': response.meta['cookiejar']},
                    headers=header,
                    callback=self.parse)]

def parse(self, response):
    # response.text is already unicode; re-encode it for the local console
    print(response.text.encode(self.type))
After the run, look at the log. As the highlighted lines below show, after the login request to http://passport.csdn.net/account/login, Scrapy immediately issues a request to http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2. This request is triggered by FormRequest.from_response. The jsessionid value at the end of the link is the session ID acquired when the login site was first accessed; Scrapy appends it automatically.
2017-10-16 22:17:34 [scrapy] INFO: Spider opened
2017-10-16 22:17:34 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-16 22:17:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (404) <GET http://passport.csdn.net/robots.txt> (referer: None)
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (200) <POST http://passport.csdn.net/account/login> (referer: None)
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (200) <POST http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2> (referer: http://www.csdn.net/)
2017-10-16 22:17:35 [scrapy] DEBUG: Crawled (200) <GET http://my.csdn.net/robots.txt> (referer: None)
2017-10-16 22:17:35 [scrapy] DEBUG: Crawled (200) <GET http://my.csdn.net/my/mycsdn> (referer: http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2)
2017-10-16 22:17:35 [scrapy] INFO: Closing spider (finished)
2017-10-16 22:17:35 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2022,
A capture in Fiddler shows that the jsessionid comes from the headers of the response the site returns when the login page is first accessed; that is, it is the cookie value set by the website.
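You can verify this from inside the spider as well: response.headers holds the raw response headers, so a quick check in post_login might look like this (a sketch for illustration only):

# print every Set-Cookie header returned by the login page,
# which is where the JSESSIONID value comes from
for cookie in response.headers.getlist('Set-Cookie'):
    print(cookie)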
The complete code:
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy import FormRequest

from test2.items import Test2Item
from scrapy.utils.response import open_in_browser
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup
import sys


class TestSpider(Spider):
    name = "test2"
    allowed_domains = ['csdn.net']
    header = {'Host': 'passport.csdn.net',
              'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
              'Referer': 'http://www.csdn.net/'}
    start_urls = ["http://www.csdn.net/"]
    # Python 2 only: make utf-8 the process-wide default encoding
    reload(sys)
    sys.setdefaultencoding('utf-8')
    type = sys.getfilesystemencoding()

    def start_requests(self):
        # visit the login page first; the special 'cookiejar' meta key
        # tells Scrapy to carry cookies through to the callback
        return [Request("http://passport.csdn.net/account/login",
                        meta={'cookiejar': 1},
                        callback=self.post_login,
                        method="POST")]

    def post_login(self, response):
        # pull the hidden form fields lt and execution out of the login page
        html = BeautifulSoup(response.text, "html.parser")
        for input in html.find_all('input'):
            if 'name' in input.attrs and input.attrs['name'] == 'lt':
                lt = input.attrs['value']
            if 'name' in input.attrs and input.attrs['name'] == 'execution':
                e1 = input.attrs['value']
        data = {'username': 'xxxx', 'password': 'xxxxx', 'lt': lt,
                'execution': e1, '_eventId': 'submit'}
        # build the login POST from the form on the page just returned
        return [FormRequest.from_response(response,
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.header,
                                          formdata=data,
                                          callback=self.after_login)]

    def after_login(self, response):
        print('After login')
        print(response.status)
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
        # request a page that requires login, still carrying the cookie jar
        return [Request("http://my.csdn.net/my/mycsdn",
                        meta={'cookiejar': response.meta['cookiejar']},
                        headers=header,
                        callback=self.parse)]

    def parse(self, response):
        # response.text is already unicode; re-encode it for the local console
        print(response.text.encode(self.type))
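Assuming the spider lives in a Scrapy project named test2 (as the test2.items import suggests), it can be run from the project directory with:

scrapy crawl test2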

