Scrapy is a fast and powerful web crawling framework.
Scrapy is not a function library but a crawler framework: a collection of software structures and functional components that together implement crawling. A framework is a semi-finished product; it helps users build professional web crawlers without starting from scratch.
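Because Scrapy is a framework rather than a library, you typically generate a project skeleton and then fill in your own spider code. A minimal session might look like this (the project and spider names are illustrative, not from the original text):

```bash
pip install scrapy
scrapy startproject demo               # generates the demo/ project skeleton
cd demo
scrapy genspider example example.com   # creates demo/spiders/example.py
scrapy crawl example                   # runs the spider named "example"
```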
I. Introduction to the Scrapy Framework
- The "5+2" structure: five main modules plus two middleware layers.
(1) Engine: controls the data flow among all modules and triggers events based on conditions. No user modification required.
(2) Downloader: downloads web pages according to requests. No user modification required.
(3) Scheduler: schedules and manages all crawl requests. No user modification required.
(4) Downloader middleware: sits between the Engine, Scheduler, and Downloader and implements user-configurable control: modifying, discarding, or adding requests or responses. Users may write configuration code.
(5) Spider: parses the Response returned by the Downloader, produces scraped items, and generates additional crawl requests (Requests). Requires user-written code (see the minimal spider sketch after this list).
(6) Item Pipelines: process the scraped items generated by the Spider in a pipelined fashion, as a sequence of operations where each operation is an Item Pipeline type. Possible operations include cleaning, validating, and checking the HTML data in a scraped item, and storing the data in a database. Requires user-written code.
(7) Spider middleware: re-processes requests and scraped items: modifying, discarding, or adding requests or scraped items. Users may write configuration code.
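As a sketch of the user-written part (component 5 above), a minimal spider might look like the following; the class name, start URL, and CSS selectors are illustrative assumptions, not from the original text:

```python
import scrapy

class DemoSpider(scrapy.Spider):
    # "name" is how the spider is invoked: scrapy crawl demo
    name = "demo"
    start_urls = ["https://example.com"]  # illustrative URL

    def parse(self, response):
        # Parse the Response returned by the Downloader and
        # yield scraped items...
        for title in response.css("h2::text").getall():
            yield {"title": title}
        # ...and/or new crawl Requests, which go back through
        # the Engine to the Scheduler.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```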
Data flow path 1:
1. The Engine gets a crawl request (Request) from the Spider.
2. The Engine forwards the crawl request to the Scheduler for dispatch.
Data flow path 2:
3. The Engine gets the next request to crawl from the Scheduler.
4. The Engine sends the crawl request to the Downloader through the middleware.
5. After crawling the web page, the Downloader forms a Response and sends it to the Engine through the middleware.
6. The Engine sends the received Response to the Spider through the middleware for processing.
Data flow path 3:
7. After processing a Response, the Spider produces scraped items and new crawl requests (Requests) and sends them to the Engine.
8. The Engine sends scraped items to Item Pipelines (the frame exit; see the pipeline sketch after this list).
9. The Engine sends crawl requests to the Scheduler.
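Step 8 hands scraped items to Item Pipelines, the frame exit. A minimal user-written pipeline (component 6 above) might look like this; the class and field names are illustrative assumptions:

```python
from scrapy.exceptions import DropItem

class CleanupPipeline:
    def process_item(self, item, spider):
        # Each pipeline component receives every scraped item in turn.
        if not item.get("title"):
            raise DropItem("missing title")    # discard incomplete items
        item["title"] = item["title"].strip()  # a simple cleanup step
        return item                            # pass the item downstream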
- The Spider is the entry point of the data flow and Item Pipelines are the exit; these, together with the two middleware layers, are the parts the user needs to configure.
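The user-written pipeline and middleware components are wired in through the project's settings.py. A sketch of a tiny downloader middleware (component 4) plus the settings that activate it and the pipeline above, again with illustrative names under an assumed project called demo:

```python
# demo/middlewares.py -- a minimal downloader middleware (illustrative).
class CustomHeadersMiddleware:
    def process_request(self, request, spider):
        # Modify outgoing requests between the Engine and the Downloader.
        request.headers.setdefault("User-Agent", "demo-bot/0.1")
        return None  # None = continue handling this request normally

# demo/settings.py -- activate user-written components by dotted path.
ITEM_PIPELINES = {
    "demo.pipelines.CleanupPipeline": 300,  # lower number runs earlier
}
DOWNLOADER_MIDDLEWARES = {
    "demo.middlewares.CustomHeadersMiddleware": 543,
}
```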
II. Comparison of the Scrapy Framework and the Requests Library
Similarities:
- Both can issue page requests and crawl pages; they are the two most important technical routes for Python crawlers.
- Both offer good usability and rich documentation, and are easy to get started with.
- Neither handles JavaScript, form submission, or CAPTCHAs out of the box (though both are extensible).
Differences:
- Very small needs: use the Requests library.
- Not-so-small needs: use the Scrapy framework, which can crawl continuously and accumulate results into your own crawl store.
- Highly customized needs (regardless of scale): build your own framework, where Requests > Scrapy.
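For the "very small needs" case, the Requests route really is only a few lines; a minimal sketch (the URL is illustrative):

```python
import requests

# The entire "crawler" for a one-off page fetch.
r = requests.get("https://example.com", timeout=10)
r.raise_for_status()          # fail loudly on HTTP errors
print(r.text[:200])           # first 200 characters of the page HTML
```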