Scrapy Notes (11) - Simulated Login


Sometimes you need to log in before you can crawl a web site. In Scrapy you can simulate the login, save the cookies, and then crawl the pages you are interested in. Here I demonstrate the whole process by logging in to GitHub and then crawling my own issue list.

Implementing a login requires submitting a form. First open the GitHub login page https://github.com/login in a browser, then use the browser's debugging tools to find out what has to be submitted when you log in.

I use the Chrome developer tools here: press F12, open the Network tab, and tick "Preserve log". Then I intentionally enter a wrong username and password so I can capture the form parameters that get submitted and the URL the POST request goes to.

Looking at the HTML source, you will find a hidden authenticity_token value in the form; it has to be fetched and submitted along with the username and password.
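A quick way to confirm the hidden field is there (my own check, not part of the original write-up) is to open the login page with scrapy shell https://github.com/login and run the same XPath the spider will use later; response is provided by the shell:

    # Inside `scrapy shell https://github.com/login`
    token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
    print(token)  # a long per-session string if the field is present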

Overriding the start_requests Method

To use cookies, the first step is to make sure they are enabled. Scrapy enables the CookiesMiddleware by default; if you disabled it earlier, re-enable it with the following setting:

COOKIES_ENABLED = True
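A related settings sketch (my own addition): COOKIES_DEBUG is an optional Scrapy setting that logs the Cookie and Set-Cookie headers for every request and response, which is handy while debugging a login flow.

    # settings.py (sketch)
    COOKIES_ENABLED = True   # the default; only needed if it was disabled earlier
    COOKIES_DEBUG = True     # optional: log cookies sent and received while debugging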

Let's start by opening the login page to get the authenticity_token value. Here I override the start_requests method:

# Override the spider's start method to issue a custom request,
# then invoke the callback function
def start_requests(self):
    return [Request("https://github.com/login",
                    meta={'cookiejar': 1},
                    callback=self.post_login)]

# FormRequest
def post_login(self, response):
    # Get the hidden form parameter authenticity_token first
    authenticity_token = response.xpath(
        '//input[@name="authenticity_token"]/@value').extract_first()
    logging.info('authenticity_token=' + authenticity_token)
    pass

The start_requests method specifies a callback that fetches the hidden form value authenticity_token. We also set the cookiejar metadata on the request; it is what carries the cookie session identifier through to later requests.
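As an aside (my own sketch, not part of the article's spider), the cookiejar meta key is what lets CookiesMiddleware keep several independent cookie sessions at once, for example one per account. The self.accounts list below is hypothetical:

    # Hypothetical sketch: one independent cookie session per account.
    def start_requests(self):
        for i, account in enumerate(self.accounts):  # self.accounts is assumed to exist
            yield Request("https://github.com/login",
                          meta={'cookiejar': i, 'account': account},
                          callback=self.post_login)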

Using FormRequest

Scrapy provides a FormRequest class specifically for form submission.
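For contrast (my own sketch, with placeholder values, written as if inside a spider callback): a plain FormRequest makes you supply every field yourself, whereas FormRequest.from_response, used below, pre-fills the hidden fields from the page's form.

    # Plain FormRequest: all fields, including the hidden token, must be supplied by hand.
    yield FormRequest("https://github.com/session",
                      formdata={'login': 'someuser',            # placeholder
                                'password': 'somepassword',     # placeholder
                                'authenticity_token': authenticity_token},
                      callback=self.after_login)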

# In order to simulate the browser, we define the HTTP headers
post_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
    "Referer": "https://github.com/",
}

# Submit the form with FormRequest to simulate the login
def post_login(self, response):
    # Get the hidden form parameter authenticity_token first
    authenticity_token = response.xpath(
        '//input[@name="authenticity_token"]/@value').extract_first()
    logging.info('authenticity_token=' + authenticity_token)
    # FormRequest.from_response is a function provided by Scrapy for posting forms.
    # When the login succeeds, the after_login callback is invoked;
    # the url can be omitted if it is the same as the page the form came from.
    return [FormRequest.from_response(response,
                                      url='https://github.com/session',
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.post_headers,  # note the headers here
                                      formdata={
                                          'utf8': '✓',
                                          'login': 'yidao620c',
                                          'password': 'Hu Jintao',
                                          'authenticity_token': authenticity_token
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

def after_login(self, response):
    pass

The FormRequest.from_response() method lets you specify the URL to submit to, the request headers, and the form values; note that we again pass the cookie identifier through meta. It also takes a callback that is invoked after the login succeeds. Let's implement it.

def after_login(self, response):
    # After logging in, start requesting the pages I want to crawl
    for url in self.start_urls:
        # Because we defined rules above, simply generating the initial
        # crawl requests is enough
        yield Request(url, meta={'cookiejar': response.meta['cookiejar']})
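An optional refinement (my own addition, not in the article) is to check whether the login actually succeeded before scheduling the crawl; the failure marker string below is an assumption about what the error page contains.

    def after_login(self, response):
        # Assumed failure marker; adjust it to whatever the site really returns on a bad login.
        if b'Incorrect username or password' in response.body:
            logging.error('Login failed, not starting the crawl')
            return
        for url in self.start_urls:
            yield Request(url, meta={'cookiejar': response.meta['cookiejar']})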

Here I define the start pages through start_urls and then generate the requests; the specific crawl rules and next-page rules are defined in the rules shown earlier. Note that I keep passing cookiejar along, so the cookie information is carried when the initial pages are visited.

Rewriting _requests_to_follow

One problem bothered me for a long time: the spider inherits from CrawlSpider, which automatically follows matching links, but the requests it generates for those links do not automatically carry the cookies. Overriding the _requests_to_follow() method solves this problem.

def _requests_to_follow(self, response):
    """Override to add the cookiejar update"""
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response)
                 if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            # The following line is what I changed
            r.meta.update(rule=n, link_text=link.text,
                          cookiejar=response.meta['cookiejar'])
            yield rule.process_request(r)

Page Processing Methods

Within the rules I defined parse_page as the callback for each matched link; this is where we ultimately extract the information from each issue page.

def parse_page(self, response):
    """Links and the 'next page' are followed automatically by the LinkExtractor"""
    logging.info(u'--------------message split line-----------------')
    logging.info(response.url)
    issue_title = response.xpath(
        '//span[@class="js-issue-title"]/text()').extract_first()
    logging.info(u'issue_title: ' + issue_title.encode('utf-8'))

Complete Source

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: Login crawler
Desc: Simulate logging in to https://github.com and then crawl all of your issues.
Tips: When debugging the POST form in Chrome, tick "Preserve log" and disable cache.
"""
import logging
import re
import sys

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request, FormRequest, HtmlResponse

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    handlers=[logging.StreamHandler(sys.stdout)])


class GithubSpider(CrawlSpider):
    name = "github"
    allowed_domains = ["github.com"]
    start_urls = ['https://github.com/issues', ]
    rules = (
        # Issue list
        Rule(LinkExtractor(allow=('/issues/\d+',),
                           restrict_xpaths='//ul[starts-with(@class, "table-list")]/li/div[2]/a[2]'),
             callback='parse_page'),
        # Next page; if callback is None, follow defaults to True, otherwise it defaults to False
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next_page"]')),
    )
    post_headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
        "Referer": "https://github.com/",
    }

    # Override the spider's start method to issue a custom request,
    # then invoke the callback function
    def start_requests(self):
        return [Request("https://github.com/login",
                        meta={'cookiejar': 1},
                        callback=self.post_login)]

    # FormRequest
    def post_login(self, response):
        # Get the hidden form parameter authenticity_token first
        authenticity_token = response.xpath(
            '//input[@name="authenticity_token"]/@value').extract_first()
        logging.info('authenticity_token=' + authenticity_token)
        # FormRequest.from_response is a function provided by Scrapy for posting forms.
        # When the login succeeds, the after_login callback is invoked;
        # the url can be omitted if it is the same as the page the form came from.
        return [FormRequest.from_response(response,
                                          url='https://github.com/session',
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.post_headers,  # note the headers here
                                          formdata={
                                              'utf8': '✓',
                                              'login': 'yidao620c',
                                              'password': 'Hu Jintao',
                                              'authenticity_token': authenticity_token
                                          },
                                          callback=self.after_login,
                                          dont_filter=True)]

    def after_login(self, response):
        for url in self.start_urls:
            # The rules are already defined, so simply generate the initial crawl requests
            yield Request(url, meta={'cookiejar': response.meta['cookiejar']})

    def parse_page(self, response):
        """Links and the 'next page' are followed automatically by the LinkExtractor"""
        logging.info(u'--------------message split line-----------------')
        logging.info(response.url)
        issue_title = response.xpath(
            '//span[@class="js-issue-title"]/text()').extract_first()
        logging.info(u'issue_title: ' + issue_title.encode('utf-8'))

    def _requests_to_follow(self, response):
        """Override to add the cookiejar update"""
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response)
                     if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                # The following line is what I changed
                r.meta.update(rule=n, link_text=link.text,
                              cookiejar=response.meta['cookiejar'])
                yield rule.process_request(r)
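The article does not show how to run the spider; from the project root the usual way is "scrapy crawl github". A programmatic alternative (my own sketch; the module path below is hypothetical) would be:

    # run.py (sketch): run the spider without the scrapy CLI.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.github_spider import GithubSpider  # hypothetical module path

    process = CrawlerProcess(get_project_settings())
    process.crawl(GithubSpider)
    process.start()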


You can find the full project source code for this article on GitHub; the repository also contains another example that automatically logs in to the ITeye website: https://github.com/yidao620c/core-scrapy
