Scrapy Research and Exploration (III): Scrapy Core Architecture and Code Execution Analysis [Repost]


Reposted from http://blog.csdn.net/u012150179/article/details/34441655

The learning curve is always like this: get a taste with a simple example first, then work through it slowly with theory plus practice. Theory is always the foundation; remember, "do not build a high platform on shifting sand."


I. Core architecture

The core architecture is clearly described in the official documentation: http://doc.scrapy.org/en/latest/topics/architecture.html.

If English is a barrier, you can read the Chinese translation of the documentation. I have also taken part in translating parts of the Scrapy documentation; my translation is at https://github.com/younghz/scrapy_doc_chs, and the source repo is at https://github.com/marchtea/scrapy_doc_chs.

Part of that documentation is reproduced directly below (source: http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/architecture.html):


The following diagram shows the architecture of Scrapy, including an overview of the components and of the data flow that occurs in the system (shown by the green arrows). Below is a brief description of each component, with links to more detail. The data flow is described after that.



[Figure: Scrapy architecture]

Components
Scrapy Engine
The engine is responsible for controlling the flow of data among all components of the system and for triggering events when certain actions occur. For more information, see the Data Flow section below.

Scheduler
The scheduler accepts requests from the engine and enqueues them so that they can be handed back to the engine later, when the engine asks for them.

Downloader
The downloader is responsible for fetching page data and providing it to the engine, which in turn passes it on to the spiders.

Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (that is, the scraped items) or additional URLs to follow. Each spider is responsible for handling one specific site (or a few sites). For more information, see Spiders.
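As an illustration, a minimal spider sketch might look like the following; the spider name, start URL, and CSS selectors are made up for this example.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Illustrative spider: parses responses and extracts items."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```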

Item Pipeline
The item pipeline is responsible for processing the items extracted by the spiders. Typical processing includes cleaning, validation, and persistence (for example, storing the item in a database). For more information, see Item Pipeline.
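For example, a pipeline that does simple cleanup and validation might look roughly like the sketch below; the field name and the drop rule are illustrative, and the class must still be enabled in the ITEM_PIPELINES setting.

```python
from scrapy.exceptions import DropItem

class CleanupPipeline:
    """Illustrative pipeline: strips whitespace and drops items without text."""

    def process_item(self, item, spider):
        text = item.get("text")
        if not text:
            raise DropItem("Missing 'text' field")
        item["text"] = text.strip()
        return item
```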

Downloader middlewares
The downloader middlewares are specific hooks between the engine and the downloader; they process the requests passed from the engine to the downloader and the responses passed from the downloader back to the engine. They provide a simple mechanism for extending Scrapy functionality by inserting custom code. See Downloader Middleware for more information.
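A minimal downloader middleware sketch is shown below; the header name and the logging are illustrative, and the class would need to be enabled via the DOWNLOADER_MIDDLEWARES setting.

```python
class CustomHeaderMiddleware:
    """Illustrative downloader middleware for requests and responses."""

    def process_request(self, request, spider):
        # Called for each request passing from the engine to the downloader.
        request.headers.setdefault("X-Example", "1")
        return None  # None means: continue normal processing

    def process_response(self, request, response, spider):
        # Called for each response passing from the downloader back to the engine.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response
```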

Spider middlewares
The spider middlewares are specific hooks between the engine and the spiders; they process spider input (responses) and spider output (items and requests). They provide a simple mechanism for extending Scrapy functionality by inserting custom code. See Spider Middleware for more information.
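Likewise, a purely illustrative spider middleware that filters the spider's output might look like this (enabled via the SPIDER_MIDDLEWARES setting):

```python
class DropEmptyItemsMiddleware:
    """Illustrative spider middleware: drops empty dict items from spider output."""

    def process_spider_input(self, response, spider):
        # Called for each response before it reaches the spider; None means continue.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the items and requests the spider produced for this response.
        for entry in result:
            if isinstance(entry, dict) and not entry:
                continue  # skip empty items
            yield entry
```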



Data Flow

The data flow in Scrapy is controlled by the execution engine, with the following process:

1. The engine opens a domain, locates the spider that handles that site, and asks the spider for the first URL(s) to crawl.
2. The engine gets the first URLs to crawl from the spider and schedules them as requests with the scheduler (Scheduler).
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the downloader (Downloader) through the downloader middleware (request direction).
5. Once the page has finished downloading, the downloader generates a response for that page and sends it to the engine through the downloader middleware (response direction).
6. The engine receives the response from the downloader and sends it to the spider for processing through the spider middleware (input direction).
7. The spider processes the response and returns the scraped items and new (follow-up) requests to the engine.
8. The engine passes the items returned by the spider to the item pipeline, and the requests returned by the spider to the scheduler.
9. The process repeats (from step 2) until there are no more requests in the scheduler, and the engine closes the domain.
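This whole loop runs inside the engine once a crawl is started. For example, a spider can be run from a script with CrawlerProcess, which kicks off exactly this cycle; QuotesSpider here refers to the illustrative spider sketched in the Spiders section above.

```python
from scrapy.crawler import CrawlerProcess

# QuotesSpider is the illustrative spider defined earlier in this article.
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(QuotesSpider)
process.start()  # blocks until the scheduler has no more requests
```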

II. Data flow and code execution analysis


This section mainly analyzes the data flow, combined with the code, and corresponds to steps 1-9 above.

(1) Finding the spider: the relevant crawler is found among the spider definitions under the spiders folder.

(2) The engine gets the URLs: they come from the start_urls list of the custom spider.

(3) (4) (5) Steps (3), (4), and (5) are handled internally by the framework: a request is generated from the URL, and the downloader produces a response for that request, i.e. url -> request -> response.

(6) ...

(7) The key point is that the default parse() method, or a custom parse_*() method, of the spider is called to process the received response:

First, extract the item values.

Second, if crawling needs to continue, return new requests to the engine. (This is the key to "automatically" crawling multiple pages; see the sketch after this list.)

(8) (9) The engine continues dispatching until there are no more requests.
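A minimal sketch of step (7), assuming a paginated site and illustrative selectors: the parse() method both yields items and returns a follow-up request for the next page.

```python
import scrapy

class PagedSpider(scrapy.Spider):
    """Illustrative spider: parse() yields items and follow-up requests."""
    name = "paged"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # First: extract the item values from this page.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

        # Second: return a new request to the engine so crawling continues.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```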


Advanced:

The Scrapy architecture is a star topology: the engine sits at the core of the whole architecture and controls the operation of the entire system.

