Scrapy tutorial (III) -- Scrapy core architecture and code execution analysis


The learning curve always works like this: a simple example gives a first taste, and the subject is then broken down step by step through theory plus practice. Theory is always the foundation; remember not to build a high platform on shifting sand.


I. Core Architecture

The core architecture is clearly described in the official documentation at http://doc.scrapy.org/en/latest/topics/architecture.html.

If English is a problem, you can read the Chinese translation of the documentation. I have also taken part in translating some of the Scrapy documentation. My GitHub repo is https://github.com/younghz/scrapy_doc_chs, and the source repo is https://github.com/marchtea/scrapy_doc_chs.

The following is reproduced directly from that document (address: http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/architecture.html):


The following figure shows the Scrapy architecture, including its components and an overview of the data flow in the system (indicated by the green arrows). Below is a brief introduction to each component, with links to more detail. The data flow is also described below.



[Figure: Scrapy architecture]

Components
Scrapy Engine
The engine controls the flow of data among all components of the system and triggers events when certain actions occur. For details, see the Data Flow section below.

Scheduler
The scheduler receives requests from the engine and queues them so that they can be provided to the engine when the engine requests them.

Downloader
The downloader is responsible for fetching page data and providing it to the engine, which in turn provides it to the spider.

Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (i.e., the scraped data) or additional URLs to follow. Each spider is responsible for handling one specific website (or a few websites). For more information, see Spiders.
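For a concrete feel, here is a minimal sketch of such a spider; the spider name, start URL, and CSS selector are hypothetical placeholders, not taken from the tutorial:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    # Unique spider name, used to run it with "scrapy crawl example".
    name = "example"
    # URLs the engine asks the spider for when the crawl starts.
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Extract an "item" from the downloaded page (here a plain dict).
        yield {"title": response.css("title::text").get()}
```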

Item Pipeline
The Item Pipeline is responsible for processing the items extracted by the spiders. Typical operations include cleaning, validation, and persistence (for example, storing items in a database). For more information, see Item Pipeline.
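As a sketch, a pipeline that cleans and validates a hypothetical "title" field could look like this (the field name and the drop rule are illustrative assumptions):

```python
from scrapy.exceptions import DropItem

class CleanTitlePipeline:
    def process_item(self, item, spider):
        # Validation: discard items missing the (hypothetical) title field.
        title = item.get("title")
        if not title:
            raise DropItem("missing title")
        # Cleaning: normalize whitespace before the item is persisted.
        item["title"] = title.strip()
        return item
```

To take effect, such a class would be registered in the ITEM_PIPELINES setting of the project.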

Downloader middleware (Downloader middlewares)
Downloader middleware consists of specific hooks between the engine and the downloader. It processes the responses that the downloader passes to the engine (and the requests that the engine passes to the downloader). It provides a simple mechanism to extend Scrapy by inserting custom code. For more information, see Downloader Middleware.
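A minimal downloader middleware sketch, assuming we just want to tag outgoing requests with a custom header and log response status codes (both purely illustrative choices):

```python
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request passing from the engine to the downloader.
        request.headers.setdefault("X-Example", "1")
        return None  # None means: continue handling this request normally

    def process_response(self, request, response, spider):
        # Called for every response passing from the downloader back to the engine.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response
```

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting.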

Spider middleware (Spider middlewares)
Spider middleware consists of specific hooks between the engine and the spiders, processing spider input (responses) and output (items and requests). It provides a simple mechanism to extend Scrapy by inserting custom code. For more information, see Spider Middleware.
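As a sketch, a spider middleware that filters the spider's output might look like this (the "short title" rule is purely an illustrative assumption):

```python
class DropShortTitleMiddleware:
    def process_spider_output(self, response, result, spider):
        # 'result' is the iterable of items/requests produced by the spider
        # for this response, on its way back to the engine.
        for element in result:
            if isinstance(element, dict) and len(element.get("title", "")) < 3:
                continue  # drop items whose (hypothetical) title is too short
            yield element
```

Such a class would be enabled through the SPIDER_MIDDLEWARES setting.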



Data Flow

The Data Flow in Scrapy is controlled by the execution engine. The process is as follows:

1. The engine opens a website (opens a domain), finds the spider that handles that website, and requests the first URL(s) to crawl from that spider.
2. The engine obtains the first URLs to crawl from the spider and schedules them as Requests in the Scheduler.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the Downloader through the downloader middleware (request direction).
5. Once the page is downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware (response direction).
6. The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware (input direction).
7. The spider processes the Response and returns the scraped Items and (follow-up) new Requests to the engine.
8. The engine passes the Items returned by the spider to the Item Pipeline and sends the Requests returned by the spider to the scheduler.
9. The process repeats (from step 2) until there are no more requests in the scheduler, at which point the engine closes the website.

II. Data Flow and Code Execution Analysis


Here we analyze the data flow in combination with the code, following steps 1-9 above.

(1) Find the spider -- look up the relevant spider definition in the project's spiders folder.

(2) The engine obtains the URLs -- they come from the start_urls list in the custom spider.

(3) ...

(4) ...

(5) Through steps (3), (4), and (5), a Request is generated internally from the URL, and the downloader generates a Response from that Request. That is, URL -> Request -> Response.

(6) ...

(7) In the custom spider, the default parse() method or a specified parse_*() callback is called to process the received Response. The result of this processing matters in two ways:

First, extracted Item values are returned.

Second, if crawling needs to continue, new Requests are returned to the engine. (This is the key to "automatically" crawling multiple web pages; see the sketch after this walkthrough.)

(8)(9) The engine keeps scheduling until there are no more requests.
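The sketch below illustrates step (7): a parse() method that both yields items (sent on to the Item Pipeline) and yields a follow-up Request (sent back to the scheduler), which keeps the loop in steps (8)(9) running. The selectors and the "next page" link are hypothetical:

```python
import scrapy

class FollowSpider(scrapy.Spider):
    name = "follow_example"
    start_urls = ["http://example.com/page/1"]

    def parse(self, response):
        # First: extract item values from the current page.
        for entry in response.css("div.entry"):
            yield {"text": entry.css("::text").get()}

        # Second: return a new Request so the engine keeps crawling.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```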


Advanced:

The Scrapy architecture is a star topology: the engine sits at the center, coordinating and controlling the operation of the whole system.
