Python web crawler: using Scrapy to log in to a website automatically

Source: Internet
Author: User
Tags: session id, python, web crawler

A previous article introduced how to implement automatic login with requests. This article describes how to implement automatic login with Scrapy, again taking the CSDN website as an example.
Scrapy uses FormRequest to log in and submit data to the server. It works like Request, with an extra formdata parameter used to carry the login form information (username and password). To use this class, the following import is required: from scrapy.http import FormRequest
As for the cookie values used during login, Scrapy handles cookies for us automatically: as long as the login succeeds, it carries the cookies along on subsequent requests just like a browser would.
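To watch this cookie handling in the logs, you can switch on cookie debugging in the project settings. A minimal sketch; COOKIES_ENABLED and COOKIES_DEBUG are standard Scrapy settings, and COOKIES_ENABLED is already True by default:

# settings.py
COOKIES_ENABLED = True   # let the CookiesMiddleware store and resend cookies (the default)
COOKIES_DEBUG = True     # log every Cookie header sent and Set-Cookie header received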
First, the crawler defines start_requests:

def start_requests(self):
    return [Request("http://passport.csdn.net/account/login",
                    meta={'cookiejar': 1},
                    callback=self.post_login,
                    method="POST")]

The Request here is used to access the login site first. The meta attribute is a dictionary in the format {'key': 'value'}; a dictionary is a mutable container that can store objects of any type.

The purpose of the meta parameter in Request is to pass information to the next function; that information can be of any type, such as a number, a string, a list, or a dictionary. The method is to assign the information to a key of the meta dictionary. The key 'cookiejar' used in start_requests above is a special key: when Scrapy sees it in meta, it automatically passes the cookies on to the callback function. Since it is a key, it needs a value; the example uses the number 1, but it could be another value, such as any string.
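As an aside, an ordinary (non-special) meta key simply carries a value into the callback. A minimal sketch, where the key page_type is a made-up name for illustration:

def start_requests(self):
    # attach arbitrary data to the request via meta
    yield Request("http://www.csdn.net/",
                  meta={'page_type': 'home'},
                  callback=self.parse_page)

def parse_page(self, response):
    # the same dictionary is available on the response
    print(response.meta['page_type'])   # prints 'home'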

callback is the function to be called once the login site responds. Let's see how post_login is implemented:

def post_login(self, response):
    html = BeautifulSoup(response.text, "html.parser")
    for input in html.find_all('input'):
        if 'name' in input.attrs and input.attrs['name'] == 'lt':
            lt = input.attrs['value']
        if 'name' in input.attrs and input.attrs['name'] == 'execution':
            e1 = input.attrs['value']
    data = {'username': 'xxxx', 'password': 'xxxxx', 'lt': lt,
            'execution': e1, '_eventId': 'submit'}
    return [FormRequest.from_response(response,
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.header,
                                      formdata=data,
                                      callback=self.after_login)]

The first step is to get the values of the lt and execution hidden fields, which were explained in the earlier post that introduced requests.
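If you would rather not depend on BeautifulSoup, the same hidden fields can be read with Scrapy's built-in selectors. A sketch, assuming the login page contains input elements named lt and execution:

lt = response.xpath("//input[@name='lt']/@value").extract_first()
e1 = response.xpath("//input[@name='execution']/@value").extract_first()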

Then FormRequest.from_response is called. This method constructs the form data from the page returned in response, so the first parameter is response. The response here is the page returned by the request to http://passport.csdn.net/account/login made in start_requests.

The next parameters are meta, headers, formdata, and callback. The callback here points to the function to run after login, after_login.

The implementation of after_login is as follows:

def after_login(self, response):
    print('After login')
    print(response.status)
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
    return [Request("http://my.csdn.net/my/mycsdn",
                    meta={'cookiejar': response.meta['cookiejar']},
                    headers=header,
                    callback=self.parse)]

def parse(self, response):
    # response.text is already unicode; re-encode it for the local console
    print(response.text.encode(self.type))
After the run, look at the log. As the highlighted lines below show, after the login request to http://passport.csdn.net/account/login, Scrapy immediately issues a request to http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2. This request is triggered by FormRequest.from_response. The jsessionid value at the end of the link is the session ID acquired when the login site was first accessed; Scrapy appends it automatically.
2017-10-16 22:17:34 [scrapy] INFO: Spider opened
2017-10-16 22:17:34 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-16 22:17:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (404) <GET http://passport.csdn.net/robots.txt> (referer: None)
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (200) <POST http://passport.csdn.net/account/login> (referer: None)
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (200) <POST http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2> (referer: http://www.csdn.net/)
2017-10-16 22:17:35 [scrapy] DEBUG: Crawled (200) <GET http://my.csdn.net/robots.txt> (referer: None)
2017-10-16 22:17:35 [scrapy] DEBUG: Crawled (200) <GET http://my.csdn.net/my/mycsdn> (referer: http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2)
2017-10-16 22:17:35 [scrapy] INFO: Closing spider (finished)
2017-10-16 22:17:35 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2022,
A capture in Fiddler shows that the jsessionid comes from the headers of the response the site returns when the login page is first accessed; that is, it is the cookie value set by the website.
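You can verify this from inside the spider as well: response.headers holds the raw response headers, so a quick check in post_login might look like this (a sketch for illustration only):

# print every Set-Cookie header returned by the login page,
# which is where the JSESSIONID value comes from
for cookie in response.headers.getlist('Set-Cookie'):
    print(cookie)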
The complete code:
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy import FormRequest

from test2.items import Test2Item
from scrapy.utils.response import open_in_browser
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup
import sys


class TestSpider(Spider):
    name = "test2"
    allowed_domains = ['csdn.net']
    header = {'Host': 'passport.csdn.net',
              'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
              'Referer': 'http://www.csdn.net/'}
    start_urls = ["http://www.csdn.net/"]
    # Python 2 only: make utf-8 the process-wide default encoding
    reload(sys)
    sys.setdefaultencoding('utf-8')
    type = sys.getfilesystemencoding()

    def start_requests(self):
        # visit the login page first; the special 'cookiejar' meta key
        # tells Scrapy to carry cookies through to the callback
        return [Request("http://passport.csdn.net/account/login",
                        meta={'cookiejar': 1},
                        callback=self.post_login,
                        method="POST")]

    def post_login(self, response):
        # pull the hidden form fields lt and execution out of the login page
        html = BeautifulSoup(response.text, "html.parser")
        for input in html.find_all('input'):
            if 'name' in input.attrs and input.attrs['name'] == 'lt':
                lt = input.attrs['value']
            if 'name' in input.attrs and input.attrs['name'] == 'execution':
                e1 = input.attrs['value']
        data = {'username': 'xxxx', 'password': 'xxxxx', 'lt': lt,
                'execution': e1, '_eventId': 'submit'}
        # build the login POST from the form on the page just returned
        return [FormRequest.from_response(response,
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.header,
                                          formdata=data,
                                          callback=self.after_login)]

    def after_login(self, response):
        print('After login')
        print(response.status)
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
        # request a page that requires login, still carrying the cookie jar
        return [Request("http://my.csdn.net/my/mycsdn",
                        meta={'cookiejar': response.meta['cookiejar']},
                        headers=header,
                        callback=self.parse)]

    def parse(self, response):
        # response.text is already unicode; re-encode it for the local console
        print(response.text.encode(self.type))
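Assuming the spider lives in a Scrapy project named test2 (as the test2.items import suggests), it can be run from the project directory with:

scrapy crawl test2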

