Python uses the Scrapy framework crawler to simulate login and capture Zhihu content


1. Cookie Principles
HTTP is a stateless protocol, so the Cookie mechanism was introduced to maintain session state across requests.
A Cookie is an attribute carried in the HTTP message header and includes:

  • Cookie name and Cookie value
  • Cookie expiration time (Expires/Max-Age)
  • Cookie path (Path)
  • Cookie domain (Domain)
  • Whether the Cookie is restricted to secure connections (Secure)

The first two attributes are required for any practical use of Cookies. Browsers also limit Cookie size, and the maximum number and size of Cookies differ between browsers.
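For example, a server that starts a session might send a response header like the following (the names and values are purely illustrative):

  Set-Cookie: session_id=abc123; Expires=Wed, 21 Oct 2015 07:28:00 GMT; Path=/; Domain=.example.com; Secure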

2. Simulating Login
The website crawled this time is Zhihu.
Zhihu requires logging in before its content can be crawled. With the Python library used in earlier posts, form submission is easy to implement.
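For comparison, here is a minimal sketch of that earlier approach, assuming it used the requests library; the URL and field names match the Zhihu form analyzed below, and the token and credentials are placeholders:

import requests

session = requests.Session()
headers = {"User-Agent": "Mozilla/5.0"}

# Fetch the login page first so the session picks up its initial cookies.
login_page = session.get("https://www.zhihu.com/login", headers=headers)

# ... extract the _xsrf token from login_page.text here ...
xsrf = "..."  # placeholder for the extracted token

# Submit the login form; the Session object keeps the returned cookies.
session.post("https://www.zhihu.com/login",
             headers=headers,
             data={"_xsrf": xsrf,
                   "email": "000000",      # placeholder credentials
                   "password": "000000"})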

Now let's take a look at how to implement form submission through Scrapy.

First, inspect the form data sent during login. As with the earlier technique, deliberately enter a wrong password so the request can be examined without actually logging in (I use the Network panel in Chrome's developer tools).

The captured form contains four fields:

  • email: the personal login email address.
  • password: the login password.
  • rememberme: whether to remember the account.
  • _xsrf: presumably some kind of verification mechanism.

Of these, only _xsrf is still unknown. If we assume this verification field is delivered along with the login page itself, we can check the current page's source code (right-click and choose "View Page Source", or use the keyboard shortcut).

The guess turns out to be correct: the login page's source contains an input named _xsrf (roughly <input name="_xsrf" value="..."/>) whose value is the token we need.

Now you can write the form login function.

# Override the crawler's start method to send a custom request; after the
# login page is fetched, the callback function is called.
def start_requests(self):
    return [Request("https://www.zhihu.com/login", callback=self.post_login)]

# FormRequest
def post_login(self, response):
    print 'Preparing login'
    # Capture the _xsrf field from the returned login page; it is needed to
    # submit the form successfully.
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print xsrf
    # FormRequest.from_response is a function provided by Scrapy for posting
    # forms. After a successful login it calls the after_login callback.
    return [FormRequest.from_response(response,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': '000000',
                                          'password': '000000'
                                      },
                                      callback=self.after_login)]

The main steps are described in the code comments.
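The after_login callback referenced above simply re-issues the start URLs once the session is authenticated; it appears in the complete spider later in this article and looks like this:

# Once logged in, crawl the start URLs with the authenticated session.
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)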
3. Cookie Storage
To keep crawling a website in the same logged-in state, the cookies must be saved and reused between requests. Scrapy provides a cookie-handling middleware that can be used directly.

CookiesMiddleware:

This middleware keeps track of the cookies sent by web servers and sends them back on subsequent requests, just as a web browser does.
The following code example is provided in the official Scrapy document:

for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
        callback=self.parse_page)

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)
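The 'cookiejar' meta key lets one spider keep several cookie sessions in parallel. If cookie handling ever needs inspection, the following settings (names per the Scrapy documentation) can be added to settings.py; a small sketch:

COOKIES_ENABLED = True   # the cookies middleware is enabled by default
COOKIES_DEBUG = True     # log the Cookie / Set-Cookie headers of every request and response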

Then we can modify the methods in our crawling class so that they can track cookies.

# Override the crawler's start method to send a custom request; after the
# login page is fetched, the callback function is called.
def start_requests(self):
    return [Request("https://www.zhihu.com/login",
                    meta={'cookiejar': 1},    # meta added to start a cookie session
                    callback=self.post_login)]

# FormRequest
def post_login(self, response):
    print 'Preparing login'
    # Capture the _xsrf field from the returned login page; it is needed to
    # submit the form successfully.
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print xsrf
    # FormRequest.from_response is a function provided by Scrapy for posting
    # forms. After a successful login it calls the after_login callback.
    return [FormRequest.from_response(response,
                                      meta={'cookiejar': response.meta['cookiejar']},  # pass the cookiejar along
                                      headers=self.headers,  # note the headers (defined in the next section)
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': '000000',
                                          'password': '000000'
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

4. Disguising Request Headers
Sometimes a website requires the request headers to be disguised at login time, for example adding a Referer to get past anti-hotlinking checks, or a browser-like User-Agent so the login looks like it comes from a normal browser.

To be on the safe side, we can fill in more header fields, as shown below:

headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
    "Connection": "keep-alive",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
    "Referer": "http://www.zhihu.com/"
}

In Scrapy, both Request and FormRequest accept a headers argument when they are constructed, so we can pass our customized headers there.
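Alternatively, default headers can be applied project-wide through Scrapy's settings rather than per request; a hedged sketch for settings.py (setting names per the Scrapy documentation):

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
DEFAULT_REQUEST_HEADERS = {
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
    "Referer": "http://www.zhihu.com/",
}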

The final form-login spider is shown below.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from zhihu.items import ZhihuItem

class ZhihuSpider(CrawlSpider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ["http://www.zhihu.com"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/question/\d+#.*?',)), callback='parse_page', follow=True),
        Rule(SgmlLinkExtractor(allow=('/question/\d+',)), callback='parse_page', follow=True),
    )
    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Referer": "http://www.zhihu.com/"
    }

    # Override the crawler's start method to send a custom request; after the
    # login page is fetched, the callback function is called.
    def start_requests(self):
        return [Request("https://www.zhihu.com/login",
                        meta={'cookiejar': 1},
                        callback=self.post_login)]

    # FormRequest
    def post_login(self, response):
        print 'Preparing login'
        # Capture the _xsrf field from the returned login page; it is needed
        # to submit the form successfully.
        xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
        print xsrf
        # FormRequest.from_response is a function provided by Scrapy for
        # posting forms. After a successful login it calls after_login.
        return [FormRequest.from_response(response,
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.headers,  # note the headers
                                          formdata={
                                              '_xsrf': xsrf,
                                              'email': '2017@qq.com',
                                              'password': '000000'
                                          },
                                          callback=self.after_login,
                                          dont_filter=True)]

    # Once logged in, crawl the start URLs with the authenticated session.
    def after_login(self, response):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def parse_page(self, response):
        problem = Selector(response)
        item = ZhihuItem()
        item['url'] = response.url
        item['name'] = problem.xpath('//span[@class="name"]/text()').extract()
        print item['name']
        item['title'] = problem.xpath('//h2[@class="zm-item-title zm-editable-content"]/text()').extract()
        item['description'] = problem.xpath('//div[@class="zm-editable-content"]/text()').extract()
        item['answer'] = problem.xpath('//div[@class="zm-editable-content clearfix"]/text()').extract()
        return item
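Once the spider is saved in the project, it can be run in the usual Scrapy way from the project directory, for example:

scrapy crawl zhihu -o items.json

where -o exports the scraped items through Scrapy's feed exporter (the output file name here is just an example).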

5. Item Class and Crawl Interval
Complete Zhihu crawler code link

from scrapy.item import Item, Field

class ZhihuItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = Field()          # URL of the captured question
    title = Field()        # title of the captured question
    description = Field()  # description of the captured question
    answer = Field()       # answer to the captured question
    name = Field()         # name of the individual user

Set a crawl interval: if the crawler requests pages too quickly, it triggers the website's anti-crawling mechanism. In settings.py, set:

BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'

DOWNLOAD_DELAY = 0.25   # set the download interval to 250 ms

For more settings, see the official documentation.
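One related setting is RANDOMIZE_DOWNLOAD_DELAY (on by default, per the Scrapy documentation), which waits a random 0.5x to 1.5x multiple of DOWNLOAD_DELAY between requests so the crawl rhythm looks less mechanical:

DOWNLOAD_DELAY = 0.25             # base delay of 250 ms
RANDOMIZE_DOWNLOAD_DELAY = True   # randomize each wait between 0.5x and 1.5x of DOWNLOAD_DELAY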

Crawl results (only a small excerpt is shown):

... 'url': 'http://www.zhihu.com/question/20688855/answer/16577390'}2014-12-19 23:24:15+0800 [zhihu] DEBUG: Crawled (200) <GET http://www.zhihu.com/question/20688855/answer/15861368> (referer: http://www.zhihu.com/question/20688855/answer/19231794)[]2014-12-19 23:24:15+0800 [zhihu] DEBUG: Scraped from <200 http://www.zhihu.com/question/20688855/answer/15861368>  {'answer': [u'\u9009\u4f1a\u8ba1\u8fd9\u4e2a\u4e13\u4e1a\uff0c\u8003CPA\uff0c\u5165\u8d22\u52a1\u8fd9\u4e2a\u884c\u5f53\u3002\u8fd9\u4e00\u8def\u8d70\u4e0b\u6765\uff0c\u6211\u53ef\u4ee5\u5f88\u80af\u5b9a\u7684\u544a\u8bc9\u4f60\uff0c\u6211\u662f\u771f\u7684\u559c\u6b22\u8d22\u52a1\uff0c\u70ed\u7231\u8fd9\u4e2a\u884c\u4e1a\uff0c\u56e0\u6b64\u575a\u5b9a\u4e0d\u79fb\u5730\u5728\u8fd9\u4e2a\u884c\u4e1a\u4e2d\u8d70\u4e0b\u53bb\u3002',        u'\u4e0d\u8fc7\u4f60\u8bf4\u6709\u4eba\u4ece\u5c0f\u5c31\u559c\u6b22\u8d22\u52a1\u5417\uff1f\u6211\u89c9\u5f97\u51e0\u4e4e\u6ca1\u6709\u5427\u3002\u8d22\u52a1\u7684\u9b45\u529b\u5728\u4e8e\u4f60\u771f\u6b63\u61c2\u5f97\u5b83\u4e4b\u540e\u3002',        u'\u901a\u8fc7\u5b83\uff0c\u4f60\u53ef\u4ee5\u5b66\u4e60\u4efb\u4f55\u4e00\u79cd\u5546\u4e1a\u7684\u7ecf\u8425\u8fc7\u7a0b\uff0c\u4e86\u89e3\u5176\u7eb7\u7e41\u5916\u8868\u4e0b\u7684\u5b9e\u7269\u6d41\u3001\u73b0\u91d1\u6d41\uff0c\u751a\u81f3\u4f60\u53ef\u4ee5\u638c\u63e1\u5982\u4f55\u53bb\u7ecf\u8425\u8fd9\u79cd\u5546\u4e1a\u3002',        u'\u5982\u679c\u5bf9\u4f1a\u8ba1\u7684\u8ba4\u8bc6\u4ec5\u4ec5\u505c\u7559\u5728\u505a\u5206\u5f55\u8fd9\u4e2a\u5c42\u9762\uff0c\u5f53\u7136\u4f1a\u89c9\u5f97\u67af\u71e5\u65e0\u5473\u3002\u5f53\u4f60\u5bf9\u5b83\u7684\u8ba4\u8bc6\u8fdb\u5165\u5230\u6df1\u5c42\u6b21\u7684\u65f6\u5019\uff0c\u4f60\u81ea\u7136\u5c31\u4f1a\u559c\u6b22\u4e0a\u5b83\u4e86\u3002\n\n\n'],   'description': [u'\u672c\u4eba\u5b66\u4f1a\u8ba1\u6559\u80b2\u4e13\u4e1a\uff0c\u6df1\u611f\u5176\u67af\u71e5\u4e4f\u5473\u3002\n\u5f53\u521d\u662f\u51b2\u7740\u5e08\u8303\u4e13\u4e1a\u62a5\u7684\uff0c\u56e0\u4e3a\u68a6\u60f3\u662f\u6210\u4e3a\u4e00\u540d\u8001\u5e08\uff0c\u4f46\u662f\u611f\u89c9\u73b0\u5728\u666e\u901a\u521d\u9ad8\u4e2d\u8001\u5e08\u5df2\u7ecf\u8d8b\u4e8e\u9971\u548c\uff0c\u800c\u987a\u6bcd\u4eb2\u5927\u4eba\u7684\u610f\u9009\u4e86\u8fd9\u4e2a\u4e13\u4e1a\u3002\u6211\u559c\u6b22\u4e0a\u6559\u80b2\u5b66\u7684\u8bfe\uff0c\u5e76\u597d\u7814\u7a76\u5404\u79cd\u6559\u80b2\u5fc3\u7406\u5b66\u3002\u4f46\u4f1a\u8ba1\u8bfe\u4f3c\u4e4e\u662f\u4e3b\u6d41\u3001\u54ce\u3002\n\n\u4e00\u76f4\u4e0d\u559c\u6b22\u94b1\u4e0d\u94b1\u7684\u4e13\u4e1a\uff0c\u6240\u4ee5\u5f88\u597d\u5947\u5927\u5bb6\u9009\u4f1a\u8ba1\u4e13\u4e1a\u5230\u5e95\u662f\u51fa\u4e8e\u4ec0\u4e48\u76ee\u7684\u3002\n\n\u6bd4\u5982\u8bf4\u5b66\u4e2d\u6587\u7684\u4f1a\u8bf4\u4ece\u5c0f\u559c\u6b22\u770b\u4e66\uff0c\u4f1a\u6709\u4ece\u5c0f\u559c\u6b22\u4f1a\u8ba1\u501f\u554a\u8d37\u554a\u7684\u7684\u4eba\u5417\uff1f'],   'name': [],   'title': [u'\n\n', u'\n\n'],   'url': 'http://www.zhihu.com/question/20688855/answer/15861368'}...

6. Problems

  • The Rule design does not achieve full-site crawling; it only sets up simple crawling of question pages.
  • The XPath expressions are not rigorous and need to be reconsidered.
  • The Unicode output should be converted to UTF-8 (see the sketch after this list).
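For the third problem, a minimal sketch of an item pipeline that writes items as UTF-8 JSON could look like the following (the class and file names are illustrative, not from the original project, and the pipeline would still need to be registered in ITEM_PIPELINES in settings.py):

# pipelines.py -- illustrative sketch, not part of the original article
import json
import codecs

class ZhihuJsonPipeline(object):
    def open_spider(self, spider):
        # Open the output file with a UTF-8 writer.
        self.file = codecs.open('zhihu_items.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable instead of \uXXXX escapes.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item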
