No. 333, Web Crawler Explained, Part 2: Scrapy Framework Crawler - Scrapy Simulating a Browser Login
Simulating a browser login
The start_requests() method returns the crawler's starting requests. It plays the same role as start_urls: the requests it returns replace the ones generated from start_urls.
Request() issues a GET request; you can set the URL, cookies, and the callback function on it.
FormRequest.from_response() submits a form by POST. Its first, required parameter is the previous Response object (which carries that response's cookies); other parameters include cookies, the URL, the form content, and so on.
yield Request() hands a new request back to the crawler for execution.
Cookie handling when sending a request:
meta={'cookiejar': 1} means open a cookie record; written in Request().
meta={'cookiejar': response.meta['cookiejar']} means use the cookies from the previous response; written in FormRequest.from_response() for the POST authorization.
meta={'cookiejar': True} means use the authorized cookies to access pages that can only be viewed after logging in.
Getting Scrapy framework cookies
Request cookie:
cookie = response.request.headers.getlist('Cookie')
print(cookie)
Response cookie:
cookie2 = response.headers.getlist('Set-Cookie')
print(cookie2)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request, FormRequest

class PachSpider(scrapy.Spider):  # A spider must inherit from scrapy.Spider
    name = 'pach'  # Spider name
    allowed_domains = ['edu.iqianyue.com']  # Domains allowed to crawl
    # start_urls = ['http://edu.iqianyue.com/index_user_login.html']  # Crawl URL; only usable for requests that need no login, because cookies etc. cannot be set here
    header = {'User-Agent': 'mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) gecko/20100101 firefox/54.0'}  # Browser user agent

    def start_requests(self):  # Replaces start_urls
        """First request the login page: enable the cookiejar and set the callback"""
        return [Request('http://edu.iqianyue.com/index_user_login.html', meta={'cookiejar': 1}, callback=self.parse)]

    def parse(self, response):  # Parse callback
        data = {  # User login fields, obtained by packet capture
            'number': 'adc8868',
            'passwd': '279819',
            'submit': '',
        }
        # Response cookie
        cookie1 = response.headers.getlist('Set-Cookie')  # The cookie the server writes back on the first visit to the login page
        print(cookie1)
        print('Sign in')
        """Second request: a form POST carrying the cookie, browser agent and login fields, to obtain an authorized cookie"""
        return [FormRequest.from_response(response,
                                          url='http://edu.iqianyue.com/index_user_login',  # Real POST address
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.header,
                                          formdata=data,
                                          callback=self.next,
                                          )]

    def next(self, response):
        a = response.body.decode('utf-8')  # Inspect the response after logging in
        # print(a)
        """Request a page that requires login, such as the personal centre, with the authorized cookie"""
        yield Request('http://edu.iqianyue.com/index_user_index.html', meta={'cookiejar': True}, callback=self.next2)

    def next2(self, response):
        # Request cookie
        cookie2 = response.request.headers.getlist('Cookie')
        print(cookie2)
        body = response.body  # Page content, bytes
        unicode_body = response.text  # Page content, str (body_as_unicode() in older Scrapy)
        a = response.xpath('/html/head/title/text()').extract()  # Title of the personal centre page
        print(a)