Scrapy Research and Exploration (III): Scrapy Core Architecture and Code Execution Analysis [Repost]


Reposted from http://blog.csdn.net/u012150179/article/details/34441655

The learning curve is always like this: get a taste with a simple example first, then work through it slowly with theory plus practice. Theory is always the foundation; remember, "do not build a high platform on shifting sand."


I. Core architecture

The core architecture is clearly described in the official documentation: http://doc.scrapy.org/en/latest/topics/architecture.html.

If English is a barrier, you can read the Chinese translation of the documentation. I have also taken part in translating parts of the Scrapy documentation; my translation is at https://github.com/younghz/scrapy_doc_chs, and the source repo is at https://github.com/marchtea/scrapy_doc_chs.

Part of that documentation is reproduced directly below (source: http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/architecture.html):


The following diagram shows the architecture of Scrapy, including an overview of the components and of the data flow that occurs in the system (shown by the green arrows). Below is a brief description of each component, with links to more detail. The data flow is described after that.



[Figure: Scrapy architecture]

Components
Scrapy Engine
The engine is responsible for controlling the flow of data among all components of the system and for triggering events when certain actions occur. For more information, see the Data Flow section below.

Scheduler
The scheduler accepts requests from the engine and enqueues them so that they can be handed back to the engine later, when the engine asks for them.

Downloader
The downloader is responsible for fetching page data and providing it to the engine, which in turn passes it on to the spiders.

Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (that is, the scraped items) or additional URLs to follow. Each spider is responsible for handling one specific site (or a few sites). For more information, see Spiders.
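As an illustration, a minimal spider sketch might look like the following; the spider name, start URL, and CSS selectors are made up for this example.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Illustrative spider: parses responses and extracts items."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```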

Item Pipeline
The item pipeline is responsible for processing the items extracted by the spiders. Typical processing includes cleaning, validation, and persistence (for example, storing the item in a database). For more information, see Item Pipeline.
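For example, a pipeline that does simple cleanup and validation might look roughly like the sketch below; the field name and the drop rule are illustrative, and the class must still be enabled in the ITEM_PIPELINES setting.

```python
from scrapy.exceptions import DropItem

class CleanupPipeline:
    """Illustrative pipeline: strips whitespace and drops items without text."""

    def process_item(self, item, spider):
        text = item.get("text")
        if not text:
            raise DropItem("Missing 'text' field")
        item["text"] = text.strip()
        return item
```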

Downloader middlewares
The downloader middlewares are specific hooks between the engine and the downloader; they process the requests passed from the engine to the downloader and the responses passed from the downloader back to the engine. They provide a simple mechanism for extending Scrapy functionality by inserting custom code. See Downloader Middleware for more information.
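A minimal downloader middleware sketch is shown below; the header name and the logging are illustrative, and the class would need to be enabled via the DOWNLOADER_MIDDLEWARES setting.

```python
class CustomHeaderMiddleware:
    """Illustrative downloader middleware for requests and responses."""

    def process_request(self, request, spider):
        # Called for each request passing from the engine to the downloader.
        request.headers.setdefault("X-Example", "1")
        return None  # None means: continue normal processing

    def process_response(self, request, response, spider):
        # Called for each response passing from the downloader back to the engine.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response
```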

Spider middlewares
The spider middlewares are specific hooks between the engine and the spiders; they process spider input (responses) and spider output (items and requests). They provide a simple mechanism for extending Scrapy functionality by inserting custom code. See Spider Middleware for more information.
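Likewise, a purely illustrative spider middleware that filters the spider's output might look like this (enabled via the SPIDER_MIDDLEWARES setting):

```python
class DropEmptyItemsMiddleware:
    """Illustrative spider middleware: drops empty dict items from spider output."""

    def process_spider_input(self, response, spider):
        # Called for each response before it reaches the spider; None means continue.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the items and requests the spider produced for this response.
        for entry in result:
            if isinstance(entry, dict) and not entry:
                continue  # skip empty items
            yield entry
```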



Data Flow

The data flow in Scrapy is controlled by the execution engine, with the following process:

1. The engine opens a domain, locates the spider that handles that site, and asks the spider for the first URL(s) to crawl.
2. The engine gets the first URLs to crawl from the spider and schedules them as requests with the scheduler (Scheduler).
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the downloader (Downloader) through the downloader middleware (request direction).
5. Once the page has finished downloading, the downloader generates a response for that page and sends it to the engine through the downloader middleware (response direction).
6. The engine receives the response from the downloader and sends it to the spider for processing through the spider middleware (input direction).
7. The spider processes the response and returns the scraped items and new (follow-up) requests to the engine.
8. The engine passes the items returned by the spider to the item pipeline, and the requests returned by the spider to the scheduler.
9. The process repeats (from step 2) until there are no more requests in the scheduler, and the engine closes the domain.
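This whole loop runs inside the engine once a crawl is started. For example, a spider can be run from a script with CrawlerProcess, which kicks off exactly this cycle; QuotesSpider here refers to the illustrative spider sketched in the Spiders section above.

```python
from scrapy.crawler import CrawlerProcess

# QuotesSpider is the illustrative spider defined earlier in this article.
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(QuotesSpider)
process.start()  # blocks until the scheduler has no more requests
```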

II. Data flow and code execution analysis


This section mainly analyzes the data flow, combined with the code, and corresponds to steps 1-9 above.

(1) Finding the spider: the relevant crawler is found among the spider definitions under the spiders folder.

(2) The engine gets the URLs: they come from the start_urls list of the custom spider.

(3) (4) (5) Steps (3), (4), and (5) are handled internally by the framework: a request is generated from the URL, and the downloader produces a response for that request, i.e. url -> request -> response.

(6) ...

(7) The key point is that the default parse() method, or a custom parse_*() method, of the spider is called to process the received response:

First, extract the item values.

Second, if crawling needs to continue, return new requests to the engine. (This is the key to "automatically" crawling multiple pages; see the sketch after this list.)

(8) (9) The engine continues dispatching until there are no more requests.
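A minimal sketch of step (7), assuming a paginated site and illustrative selectors: the parse() method both yields items and returns a follow-up request for the next page.

```python
import scrapy

class PagedSpider(scrapy.Spider):
    """Illustrative spider: parse() yields items and follow-up requests."""
    name = "paged"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # First: extract the item values from this page.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

        # Second: return a new request to the engine so crawling continues.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```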


Advanced:

The Scrapy architecture is a star topology: the engine sits at the core of the whole architecture and controls the operation of the entire system.

