The previous article described how to crawl the Douban movie Top 250; today we will simulate logging in to GitHub.
1 Environment Configuration
Language: Python 3.6.1; IDE: PyCharm; Browser: Firefox; Packet capture tool: Fiddler; Crawler framework: Scrapy 1.5.0; Operating system: Windows 10 Home (Chinese edition)
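The article assumes this environment is already in place. If not, a minimal way to install the matching Scrapy release with pip would be something like the following (a sketch; adjust for your own Python setup):

pip install scrapy==1.5.0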
2 Pre-crawl analysis
Analyze Login Submission Information
To analyze the login submission I use Fiddler. I will not introduce Fiddler here; you can look it up yourself. First, open the GitHub login page, enter a user name and password, submit, and then check what Fiddler captured. For this first attempt I deliberately entered a wrong password, and the result is shown below:
Login page (https://github.com/login):
After entering the user name and a wrong password, the Fiddler result is:
Here we can also use the Firefox developer tools to see that the form is submitted with an extra authenticity_token parameter, for example:
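If you prefer the command line to the browser tools, the Scrapy shell offers a quick way to confirm that this hidden field exists; a small sketch, using the same XPath the spider will use later:

scrapy shell "https://github.com/login"
>>> response.xpath("//input[@name='authenticity_token']/@value").extract_first()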
3 Start Crawler
3.1 Create project
Go to the folder where you want to store the project and execute the following command:
scrapy startproject githubspider
You can continue to create the spider as prompted, or create it manually yourself, using the following commands:
cd githubspider
scrapy genspider example example.com
For example:
The contents of the project are as follows:
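For reference, the layout generated by scrapy startproject (plus the spider file from genspider) typically looks roughly like this; the spider file name depends on the arguments you pass:

githubspider/
    scrapy.cfg            # deploy configuration
    githubspider/         # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            github.py     # the spider created with genspider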
3.2 Preparation before starting
In the same directory as scrapy.cfg, create a PyCharm debug script named run.py with the following content:
# -*- coding: utf-8 -*-
from scrapy import cmdline

cmdline.execute('scrapy crawl github'.split())
In settings.py, change the ROBOTSTXT_OBEY = True parameter to False. The default True means the crawler obeys robots.txt. robots.txt is a file that follows the robots exclusion protocol and is stored on the web site's server; its job is to tell search engine crawlers which directories and pages of the site the owner does not want crawled or collected. After Scrapy starts, it first requests the site's robots.txt file and then determines the crawl scope of the site from it. You can view a site's robots.txt simply by appending robots.txt to the URL, for example Baidu's: https://www.baidu.com/robots.txt
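The change itself is a single line in the project's settings.py:

# settings.py
# True (the default) makes the crawler obey robots.txt; set it to False for this exercise
ROBOTSTXT_OBEY = False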
If we do not modify this parameter, the result is as follows:
3.3 Get authenticity_token
First, open the login page and get the authenticity_token. The code is as follows:
import scrapy
from scrapy.http import Request


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']

    def start_requests(self):
        urls = ['https://github.com/login']
        for url in urls:
            # Override the start_requests method: pass the special key 'cookiejar'
            # through meta and hand the crawled URL to the callback function
            yield Request(url, meta={'cookiejar': 1}, callback=self.github_login)

    def github_login(self, response):
        # First get authenticity_token; you can use `scrapy shell "url"` to fetch the page
        # and then extract the value of authenticity_token from the source
        authenticity_token = response.xpath(
            "//input[@name='authenticity_token']/@value").extract_first()
        # Print info with Scrapy's built-in logger
        self.logger.info('authenticity_token=' + authenticity_token)
        pass
The results are as follows:
You can see that we have obtained the value of authenticity_token. The key points here are meta, cookiejar, and logger:
meta: metadata in dictionary form that can be passed on to the next function; see the official documentation on meta for details;
cookiejar: a special key of meta; through the cookiejar parameter you can crawl a site with multiple sessions by tagging the cookies with 1, 2, 3, 4, ... so that Scrapy maintains several sessions at once (see the sketch after this list);
logger: the logger Scrapy builds in for each spider instance; for details, refer to the official documentation on logging.
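To make the cookiejar idea more concrete, here is a minimal sketch (not from the original article; the spider name and the second URL are placeholders I made up) of how one cookiejar id per session is created and then carried forward in later requests:

import scrapy
from scrapy.http import Request


class MultiSessionSpider(scrapy.Spider):
    name = 'multisession'
    allowed_domains = ['github.com']

    def start_requests(self):
        # One cookiejar id per session, so Scrapy keeps each set of cookies apart
        for session_id in range(3):
            yield Request('https://github.com/login',
                          meta={'cookiejar': session_id},
                          callback=self.after_first_page,
                          dont_filter=True)

    def after_first_page(self, response):
        # Carry the same cookiejar id forward so this request reuses
        # the cookies that belong to its own session
        yield Request('https://github.com/',
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.check_session,
                      dont_filter=True)

    def check_session(self, response):
        self.logger.info('session %s fetched %s',
                         response.meta['cookiejar'], response.url)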
3.4 FormRequest
Scrapy provides the FormRequest class, an extension of Request designed specifically for submitting forms. We mainly use the FormRequest.from_response() method to simulate a simple login: the form is submitted with FormRequest.from_response() and the response is handled by the callback function. The code is as follows:
def github_login(self, response):
    # First get authenticity_token; you can use `scrapy shell "url"` to fetch the page
    # and then extract the value of authenticity_token from the source
    authenticity_token = response.xpath(
        "//input[@name='authenticity_token']/@value").extract_first()
    self.logger.info('authenticity_token=' + authenticity_token)
    # The URL can be obtained from the Fiddler capture; dont_click=True means the form
    # data is submitted without clicking on any element
    return FormRequest.from_response(response,
                                     url='https://github.com/session',
                                     meta={'cookiejar': response.meta['cookiejar']},
                                     headers=self.headers,
                                     formdata={'utf8': '✓',
                                               'authenticity_token': authenticity_token,
                                               'login': '[email protected]',
                                               'password': 'xxxxxx'},
                                     callback=self.github_after,
                                     dont_click=True,
                                     )
The code for the callback function is as follows:
def github_after(self, response):
    # Get the string 'Browse activity' from the home page shown after login
    text_list = response.xpath("//a[@class='tabnav-tab selected']/text()").extract()
    # If the string is present, log that the login succeeded
    if 'Browse activity' in text_list:
        self.logger.info('I have logged in successfully; this is the keyword I found: Browse activity')
The homepage is as follows:
The complete github.py code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request, FormRequest


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    # Header information copied directly from Fiddler
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://github.com/',
        'Content-Type': 'application/x-www-form-urlencoded',
    }

    def start_requests(self):
        urls = ['https://github.com/login']
        for url in urls:
            # Override start_requests: pass the special key 'cookiejar' through meta
            # and hand the crawled URL to the callback function
            yield Request(url, meta={'cookiejar': 1}, callback=self.github_login)

    def github_login(self, response):
        # First get authenticity_token; you can use `scrapy shell "url"` to fetch the page
        # and then extract the value of authenticity_token from the source
        authenticity_token = response.xpath(
            "//input[@name='authenticity_token']/@value").extract_first()
        self.logger.info('authenticity_token=' + authenticity_token)
        # The URL can be obtained from the Fiddler capture; dont_click=True means the form
        # data is submitted without clicking on any element
        return FormRequest.from_response(response,
                                         url='https://github.com/session',
                                         meta={'cookiejar': response.meta['cookiejar']},
                                         headers=self.headers,
                                         formdata={'utf8': '✓',
                                                   'authenticity_token': authenticity_token,
                                                   # Fill in your own account and password here
                                                   'login': '[email protected]',
                                                   'password': 'xxxxxx'},
                                         callback=self.github_after,
                                         dont_click=True,
                                         )

    def github_after(self, response):
        # Get the string 'Browse activity' from the home page shown after login
        text_list = response.xpath("//a[@class='tabnav-tab selected']/text()").extract()
        # If the string is present, the login succeeded
        if 'Browse activity' in text_list:
            self.logger.info('I have logged in successfully; this is the keyword I found: Browse activity')
        else:
            self.logger.error('Login failed')
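To run the spider, either launch run.py from PyCharm or execute from the project root the same command that the debug script wraps:

scrapy crawl github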
The execution results are as follows:
The login was successful.
4 One more example
Without noticing it, I have already written quite a few articles. After finishing the simulated GitHub login above, I decided to go one step further: simulate logging in to 51CTO, list all the blog posts I have written, and organize them into a table of contents.
Logging in to 51CTO is similar to GitHub, except that the token is a _csrf value instead of authenticity_token. The result captured with Fiddler is as follows:
There is not much more to say here; the code is attached directly. The content of 51cto.py (the spider) is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from githubspider.items import CtospiderItem


class CtoSpider(scrapy.Spider):
    name = '51cto'
    allowed_domains = ['51cto.com']

    def start_requests(self):
        urls = ['http://home.51cto.com/index']
        for url in urls:
            yield scrapy.Request(url, callback=self.cto_login, meta={'cookiejar': 1})

    def cto_login(self, response):
        # Get the _csrf value
        csrf = response.xpath("//input[@name='_csrf']/@value").extract_first()
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
            'Accept-Encoding': 'gzip, deflate, br',
            'Referer': 'http://blog.51cto.com',
            'Content-Type': 'application/x-www-form-urlencoded',
        }
        # Logger output here for debugging
        # self.logger.info("Got the _csrf value: %s" % csrf)
        yield FormRequest.from_response(response,
                                        url='http://blog.51cto.com/linuxliu?type=1',
                                        headers=headers,
                                        meta={'cookiejar': response.meta['cookiejar']},
                                        formdata={
                                            # Note: the 0 here must be quoted, otherwise an error is raised;
                                            # this parameter controls whether to stay logged in automatically for 10 days
                                            'LoginForm[rememberMe]': '0',
                                            'LoginForm[username]': 'xxxx',
                                            'LoginForm[password]': 'xxxx',
                                            '_csrf': csrf,
                                        },
                                        callback=self.after_login,
                                        dont_click=True,
                                        )

    def after_login(self, response):
        # Get the list of articles on the page
        resps = response.css("ul.artical-list li")
        for resp in resps:
            # Define an item instance; the fields are defined in items.py
            item = CtospiderItem()
            # Write the item fields
            item['title_url'] = resp.css("a.tit::attr(href)").extract_first()
            item['title'] = resp.css("a.tit::text").extract_first().strip()
            # fullname has the format "[name](link)", which is a link in Markdown
            # syntax: clicking the name opens the link directly
            item['fullname'] = '[' + item['title'] + ']' + '(' + item['title_url'] + ')'
            # Logger here is also for debugging
            # self.logger.info("The value of title_url is: %s, the value of title is %s" % (item['title_url'], item['title']))
            yield item
        # Get the next page
        next_page = response.css('li.next a::attr(href)').extract_first()
        # self.logger.info("Next page link: %s" % next_page)
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.after_login)
The items.py fields are as follows:
class CtospiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    title_url = scrapy.Field()
    fullname = scrapy.Field()
Execute the following command to write the results to a CSV file:
scrapy crawl 51cto -o cto.csv
The file results are as follows:
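As an aside, Scrapy infers the feed format from the output file extension, so the same spider can just as easily export JSON instead of CSV:

scrapy crawl 51cto -o cto.json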
I will paste the fullname column I need here, and everyone will understand:
Ops Learning Python Crawler Advanced (v): Scrapy crawls the Douban movie Top 250
Ops Learning Python Crawler Advanced (iv): Introduction to Item Pipeline (with code to crawl a site and save images locally)
Ops Learning Python Crawler Advanced (iii): Introduction to Spider and Items
Ops Learning Python Crawler Advanced (ii): A simple crawler with the Scrapy framework
Ops Learning Python Crawler Advanced (i): Getting started with the Scrapy framework
Ops Learning Python Crawler Intermediate (ix): Connecting to MySQL from Python 3
Ops Learning Python Crawler Intermediate (viii): MongoDB
Ops Learning Python Crawler Intermediate (vii): SQLite3
Ops Learning Python Crawler Intermediate (vi): A basic crawler
Ops Learning Python Crawler Intermediate (v): Data storage (no-database version)
Ops Learning Python Crawler Intermediate (iv): Network programming
Ops Learning Python Crawler Intermediate (iii): Distributed processes
Ops Learning Python Crawler Intermediate (ii): Threads and coroutines
Ops Learning Python Crawler Intermediate (i): Processes
Ops Learning Python Crawler Tools (vi): Usage of PyQuery
Ops Learning Python Crawler Tools (v): Usage of Selenium
Ops Learning Python Crawler Tools (iv): Usage of PhantomJS
Ops Learning Python Crawler Tools (iii): XPath syntax and usage of the lxml library
Ops Learning Python Crawler Tools (ii): Usage of Beautiful Soup
Ops Learning Python Crawler Tools (i): Usage of the Requests library
Ops Learning Python Crawler Basics (vii): Crawling object-oriented images from Bole Online
Ops Learning Python Crawler Basics (vi): Crawling Baidu Tieba
Ops Learning Python Crawler Basics (v): Regular expressions
Ops Learning Python Crawler Basics (iv): Cookies
Ops Learning Python Crawler Basics (iii): Advanced usage of the urllib module
Ops Learning Python Crawler Basics (ii): Usage of the urllib module
Ops Learning Python Crawler Basics (i)
5 Final words
This is the last post before the New Year. During the holiday I will post more if I have time, and may not if I don't. Here I wish everyone a happy New Year 2018 in advance, and since Valentine's Day also falls during the festival this year, happy Valentine's Day as well!
Ops Learning Python Crawler Advanced (vi): Scrapy simulated login