Research and Exploration of Scrapy (III): Analysis of the Scrapy Core Architecture and Code Operation


The learning curve always goes like this: first get a "taste" through simple examples, then gradually break through with theory plus practice. Theory is always the foundation; remember, "do not build a high platform on floating sand".


I. Core Framework

The core architecture is clearly described in the official documentation: http://doc.scrapy.org/en/latest/topics/architecture.html.

If English is a barrier, you can read the Chinese translation of the documentation. I have also taken part in translating parts of the Scrapy documentation; my translation is on GitHub at https://github.com/younghz/scrapy_doc_chs, and the source repo is at https://github.com/marchtea/scrapy_doc_chs.

The following reproduces part of that document (address: http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/architecture.html):


Overview

The following diagram shows Scrapy's architecture, including an overview of its components and the data flow that occurs within the system (shown by the green arrows). Below is a brief introduction to each component, with links to more detailed content, followed by a description of the data flow.



[Figure: Scrapy architecture]

Components
Scrapy Engine
The engine is responsible for controlling the flow of data in all components of the system and triggering events when the corresponding action occurs. See the Data Flow section below for more information.

Scheduler
The scheduler accepts requests from the engine and enqueues them, so that they can be supplied to the engine later when the engine asks for them.

Downloader
The downloader is responsible for fetching page data and providing it to the engine, which in turn provides it to the spiders.

Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (that is, the scraped items) or additional URLs to follow. Each spider is responsible for handling one (or several) specific websites. Please see Spiders for more information.
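As a minimal sketch (assuming a recent Scrapy version; the class name, URL, and selector below are placeholders, not from the original article), a spider subclasses scrapy.Spider, declares its start_urls, and implements parse() to extract items:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    """Minimal spider: handles one placeholder site and extracts items."""
    name = "example"
    start_urls = ["http://example.com"]  # placeholder start URL

    def parse(self, response):
        # Extract item values; the CSS selector is illustrative only.
        for title in response.css("h1::text").getall():
            yield {"title": title}
```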

Item Pipeline
The item pipeline is responsible for processing the items extracted by spiders. Typical processing includes cleanup, validation, and persistence (for example, storing the item in a database). See Item Pipeline for more information.
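As an illustration (not from the original article), a pipeline is a class with a process_item() method; the class name and the "title" field below are hypothetical:

```python
from scrapy.exceptions import DropItem


class CleanTitlePipeline:
    """Illustrative pipeline: validates and cleans a hypothetical 'title' field."""

    def process_item(self, item, spider):
        title = item.get("title")
        if not title:
            # Validation: drop items that are missing the required field.
            raise DropItem("missing title")
        # Cleanup: normalize whitespace before the item is persisted.
        item["title"] = title.strip()
        return item
```

The pipeline would then be enabled through the ITEM_PIPELINES setting, keyed by its import path (hypothetical here) and an order number, e.g. {"myproject.pipelines.CleanTitlePipeline": 300}.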

Downloader middlewares
Downloader middlewares are specific hooks between the engine and the downloader that process the responses the downloader passes to the engine. They provide a simple mechanism for extending Scrapy's functionality by inserting custom code. See Downloader Middleware for more information.
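Besides responses, downloader middleware can also hook requests on their way out to the downloader. A minimal sketch, assuming a recent Scrapy version (the class name and header are placeholders):

```python
class DemoDownloaderMiddleware:
    """Illustrative downloader middleware hooking both directions of the flow."""

    def process_request(self, request, spider):
        # Called for each request on its way from the engine to the downloader.
        request.headers.setdefault("X-Demo", "1")  # placeholder header
        return None  # None means: continue handling this request normally

    def process_response(self, request, response, spider):
        # Called for each response on its way from the downloader back to the engine.
        spider.logger.debug("Downloaded %s (%s)", request.url, response.status)
        return response
```

Such a middleware would be enabled through the DOWNLOADER_MIDDLEWARES setting, keyed by its import path and an order number.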

Spider middlewares
Spider middlewares are specific hooks between the engine and the spiders that process spider input (responses) and output (items and requests). They provide a simple mechanism for extending Scrapy's functionality by inserting custom code. See Spider Middleware for more information.
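A minimal sketch of a spider middleware, assuming a recent Scrapy version (the class name is a placeholder); it simply passes everything through unchanged:

```python
class DemoSpiderMiddleware:
    """Illustrative spider middleware inspecting spider input and output."""

    def process_spider_input(self, response, spider):
        # Called for each response before it is handed to the spider.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with whatever the spider yields (items and requests);
        # here everything is passed through unchanged.
        for element in result:
            yield element
```

It would be enabled through the SPIDER_MIDDLEWARES setting, again keyed by import path and order number.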



Data Flow

The data flow in Scrapy is controlled by the execution engine, and the process is as follows:

1. The engine opens a website (opens a domain), finds the spider that handles that site, and asks the spider for the first URL(s) to crawl.
2. The engine obtains the first URL to crawl from the spider and schedules it, as a request, with the scheduler.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl to the engine, and the engine sends it to the downloader through the downloader middlewares (request direction).
5. Once the page has been downloaded, the downloader generates a response for the page and sends it to the engine through the downloader middlewares (response direction).
6. The engine receives the response from the downloader and sends it to the spider for processing through the spider middlewares (input direction).
7. The spider handles the response and returns crawled items and (follow-up) new requests to the engine.
8. The engine passes the crawled items (returned by the spider) to the item pipeline, and passes the requests (returned by the spider) to the scheduler.
9. The process repeats (from step 2) until there are no more requests in the scheduler, and the engine closes the website.
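To see this cycle run end to end, the loop can also be driven programmatically instead of through the scrapy crawl command. A minimal sketch, assuming a recent Scrapy version (the spider, URL, and settings are placeholders):

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class DemoSpider(scrapy.Spider):
    """Hypothetical spider used only to drive the data-flow loop."""
    name = "demo"
    start_urls = ["http://example.com"]  # placeholder start URL (steps 1-2)

    def parse(self, response):
        # Step 7: the spider returns crawled items (and could also return
        # follow-up requests) to the engine.
        yield {"url": response.url, "status": response.status}


# Steps 3-9: the engine drives the scheduler/downloader/spider loop
# until no requests remain in the scheduler.
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(DemoSpider)
process.start()
```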

II. Data Flow and Code Operation Analysis


Here we analyze the Data Flow section in combination with the code, following processes 1-9 above.

(1) Finding the spider: the relevant crawler definition files are located in the spiders folder of the project.

(2) The engine gets the URLs: from the start_urls list of the custom spider.

(3) (4) (5) These steps are the internal implementation of generating a request from the URL and having the downloader generate a response from that request, that is, url -> request -> response.

(6) ...

(7) The default parse() method of the custom spider, or a purpose-written parse_*() method, is invoked to handle the received response, and its results matter in two ways:

First, it extracts item values.

Second, if crawling needs to continue, it returns new requests to the engine. (This is the key to "automatically" crawling multiple web pages; a sketch follows below.)

(8) (9) The engine keeps dispatching until there are no more requests.
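The sketch referenced in point (7): a parse() method that produces both kinds of results, assuming a recent Scrapy version (the site and selectors are placeholders):

```python
import scrapy


class FollowSpider(scrapy.Spider):
    """Illustrative spider whose parse() yields both items and new requests."""
    name = "follow_demo"
    start_urls = ["http://example.com"]  # placeholder

    def parse(self, response):
        # Result 1: extracted item values (the selector is illustrative only).
        for title in response.css("h2::text").getall():
            yield {"title": title}

        # Result 2: a new request handed back to the engine and then to the
        # scheduler; this is what makes multi-page crawling "automatic".
        next_page = response.css("a.next::attr(href)").get()  # placeholder selector
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```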


Going a step further:

The Scrapy architecture presents a star topology, with the engine at the core of the whole architecture, controlling the operation of the entire system.


Original article; when reprinting, please note the source: http://blog.csdn.net/u012150179/article/details/34441655
