1. Introduction
This article briefly explains the Scrapy architecture. The GooSeeker open-source universal extractor GsExtractor is to be integrated into this architecture, and what matters most for that integration is Scrapy's event-driven, extensible design. Besides Scrapy, the objects of this research include Scrapinghub, import.io, and similar projects, whose advanced ideas and techniques are introduced along the way.
Please note that this article does not intend to restate the original documentation. Rather, it looks for reference points for the development direction of the open-source Python crawler, using more than nine years of web-crawler development experience as a benchmark, so it contains many of the author's subjective comments. If you want to study the official material, please see the architecture page on the Scrapy website.
2. Scrapy Architecture Diagram
(Figure: Scrapy architecture diagram)
A Spider is a content extractor written for a specific target site; it is the most heavily customized part of this otherwise generic web crawler framework. When you create a crawler project with Scrapy, a spider skeleton is generated: you simply fill in your code according to its conventions, and it is integrated into Scrapy's overall data flow. The goal of the GooSeeker open-source crawler is to save programmers more than half of their time, and the key to that is speeding up the definition and testing of spiders (see the article "1-minute fast generation of a web content extractor"), so that the whole Scrapy crawler system can be customized rapidly.
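To make the "fill in the skeleton" idea concrete, here is a minimal sketch of a Scrapy spider. The target site, the CSS selectors, and the item fields are hypothetical placeholders for illustration; they are not part of GooSeeker's extractor or of the original article.

```python
# Minimal Scrapy spider sketch; site, selectors, and fields are made up.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/articles"]

    def parse(self, response):
        # Yield scraped items from the downloaded page.
        for article in response.css("div.article"):
            yield {
                "title": article.css("h2::text").get(),
                "url": article.css("a::attr(href)").get(),
            }
        # Follow the pagination link; Scrapy schedules it as a new Request.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Inside a Scrapy project this would be run with `scrapy crawl example`; everything the spider yields goes back to the engine, which routes items to the item pipelines and Requests to the Scheduler.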
3. Scrapy Data Flow
The data flow in Scrapy is controlled by the execution engine. The quoted passages below are taken from the Scrapy official website; I add my own commentary, based on informed guesswork, to point out further development directions for the GooSeeker open-source crawler:
The Engine gets the first URLs to crawl from the Spider and schedules
them in the Scheduler, as requests.
Who prepares the URLs? It looks like the spider prepares them itself, so we can guess that the Scrapy architecture proper (excluding the spider) mainly does event scheduling and does not care where URLs are stored. This resembles the Crawler Compass in the GooSeeker member center, which prepares a batch of URLs for a target site and holds them ready for a crawl run. So a next goal of this open-source project is to move URL management into a centralized dispatch repository.
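One possible shape of that centralized dispatch, purely as a sketch: the spider's `start_requests` pulls a batch of URLs prepared elsewhere instead of hard-coding `start_urls`. The file name `url_batch.json` and its format are assumptions for illustration; the article does not specify how GooSeeker would store or dispatch the URLs.

```python
# Sketch: feed start URLs from an externally prepared batch (assumed format).
import json

import scrapy


class DispatchedSpider(scrapy.Spider):
    name = "dispatched"

    def start_requests(self):
        # A hypothetical "dispatch repository": a JSON list of URLs prepared
        # outside the spider, e.g. by a member-center style service.
        with open("url_batch.json", encoding="utf-8") as f:
            for url in json.load(f):
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Downloaded %s", response.url)
```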
The Engine asks the Scheduler for the next URLs to crawl.
This step is hard to understand on its own; you need to read a few other documents to make sense of it. After step 1, the engine takes the URLs obtained from the spider, wraps them into Requests, and puts them into the event loop, where the Scheduler receives them and manages their scheduling. For now, think of it simply as queuing the requests. In this step, the engine asks the Scheduler for the next URL to download.
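As a side note on "queuing the requests": the scheduler's queue discipline is configurable through Scrapy settings. The snippet below follows the breadth-first-order recipe from the Scrapy FAQ; treat the exact class paths as something to verify against the Scrapy version in use.

```python
# settings.py sketch: switch the scheduler queues from the default
# LIFO (depth-first) behaviour to FIFO (breadth-first) crawling.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```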
The Scheduler returns the next URLs to crawl to the Engine and the
Engine sends them to the Downloader, passing through the Downloader
Middleware (request direction).
The engine requests a task from the Scheduler and hands the requested task to the Downloader. The downloader middleware that sits between the engine and the Downloader is a necessary highlight for a development framework: it is where developers can add custom extensions.
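As an illustration of that extension point, here is a minimal downloader middleware sketch that tags outgoing requests; the header name is a made-up example, not something the article or GooSeeker prescribes.

```python
# Sketch of a downloader middleware between the Engine and the Downloader.
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Request direction: annotate every outgoing request.
        request.headers.setdefault("X-Crawler", spider.name)
        return None  # None means: continue normal processing

    def process_response(self, request, response, spider):
        # Response direction: could inspect, retry, or rewrite here.
        return response
```

It would be switched on through the `DOWNLOADER_MIDDLEWARES` setting with a priority number, which is Scrapy's standard way of plugging code into the request/response path.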
Once the page finishes downloading the Downloader generates a Response
and sends it to the Engine, passing through the
Downloader Middleware (response direction).
The download completes, producing a Response that is handed back to the engine through the downloader middleware. Note that Response, like the earlier Request, is capitalized. Although I have not yet read the other Scrapy documents, I suspect these are event objects inside the Scrapy framework, which also suggests the engine is asynchronous and event-driven; for a high-performance, low-overhead engine, that is a must.
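That guess is easy to check against Scrapy's public API: Request and Response are ordinary objects in `scrapy.http` that the engine passes around (the asynchronous event loop underneath is provided by Twisted, on which Scrapy is built). A tiny illustration with a made-up URL and body:

```python
# Request and Response are plain objects the engine passes around.
from scrapy.http import Request, Response

req = Request(url="https://example.com/page")      # normally built from a spider's output
resp = Response(url="https://example.com/page",    # normally built by the downloader
                body=b"<html></html>", request=req)
print(req.url, resp.status)  # status defaults to 200
```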
The Engine receives the Response from the Downloader and sends it to
the Spider for processing, passing through the Spider Middleware
(input direction).
Once again a middleware, giving developers plenty of room to extend and customize.
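A minimal sketch of what that room looks like on the spider side; the small-response check is an illustrative assumption, not part of the article.

```python
# Sketch of a spider middleware between the Engine and the Spider.
class TinyResponseLoggerMiddleware:
    def process_spider_input(self, response, spider):
        # Input direction: runs before the response reaches the spider.
        if len(response.body) < 100:
            spider.logger.debug("Suspiciously small page: %s", response.url)
        return None  # let the response through to the spider

    def process_spider_output(self, response, result, spider):
        # Output direction: sees every item and Request the spider yields.
        for element in result:
            yield element
```

Like downloader middlewares, it is enabled through the `SPIDER_MIDDLEWARES` setting.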
The Spider processes the Response and returns scraped items and new
Requests (to follow) to the Engine.
Each spider crawls one page at a time; to crawl another page, it constructs another Request event to start that crawl.
The Engine passes scraped items and new requests returned by a spider
through Spider Middleware (output direction), and then sends processed
items to Item Pipelines and processed requests to the Scheduler.
The engine does the event distribution: scraped items go to the item pipelines, and new requests go back to the Scheduler.
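For completeness, a sketch of the pipeline side of that distribution: an item pipeline receives every scraped item the engine routes to it. Writing items to a JSON-lines file is a made-up example; a real pipeline might validate, deduplicate, or store items in a database instead.

```python
# Sketch of an item pipeline; the output file name is a hypothetical choice.
import json


class JsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline
```

It is registered in the `ITEM_PIPELINES` setting with a priority, in the same way as middlewares.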
The process repeats (from step 1) until there are no more requests
from the Scheduler.
The crawl keeps running continuously in this loop.
4. Next Steps
Next, we will study Scrapy's documentation further, in order to integrate the GooSeeker open-source Python crawler with Scrapy.
5. History of Document Modifications
2016-06-11: v1.0, first release
Originally from: SegmentFault, "Easy to Understand Scrapy Architecture"