Scrapy tutorial (III) -- Scrapy core architecture and code execution analysis


The learning curve always works like this: a simple example gives a first taste, and the subject is then broken down step by step through theory plus practice. Theory is always the foundation; remember not to build a high platform on shifting sand.


I. Core Architecture

The core architecture is clearly described in the official documentation at http://doc.scrapy.org/en/latest/topics/architecture.html.

If English is a problem, you can read the Chinese translation of the documentation. I have also taken part in translating some of the Scrapy documentation. My GitHub repo is https://github.com/younghz/scrapy_doc_chs, and the source repo is https://github.com/marchtea/scrapy_doc_chs.

The following is reproduced directly from that document (address: http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/architecture.html):


The following figure shows the Scrapy architecture, including its components and an overview of the data flow in the system (indicated by the green arrows). Below is a brief introduction to each component, with links to more detail. The data flow is also described below.



[Figure: Scrapy architecture]

Components
Scrapy Engine
The engine controls the flow of data among all components of the system and triggers events when certain actions occur. For details, see the Data Flow section below.

Scheduler
The scheduler receives requests from the engine and queues them so that they can be provided to the engine when the engine requests them.

Downloader
The downloader is responsible for fetching page data and providing it to the engine, which in turn provides it to the spider.

Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (i.e., the scraped data) or additional URLs to follow. Each spider is responsible for handling one specific website (or a few websites). For more information, see Spiders.
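For a concrete feel, here is a minimal sketch of such a spider; the spider name, start URL, and CSS selector are hypothetical placeholders, not taken from the tutorial:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    # Unique spider name, used to run it with "scrapy crawl example".
    name = "example"
    # URLs the engine asks the spider for when the crawl starts.
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Extract an "item" from the downloaded page (here a plain dict).
        yield {"title": response.css("title::text").get()}
```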

Item Pipeline
The Item Pipeline is responsible for processing the items extracted by the spiders. Typical operations include cleaning, validation, and persistence (for example, storing items in a database). For more information, see Item Pipeline.
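As a sketch, a pipeline that cleans and validates a hypothetical "title" field could look like this (the field name and the drop rule are illustrative assumptions):

```python
from scrapy.exceptions import DropItem

class CleanTitlePipeline:
    def process_item(self, item, spider):
        # Validation: discard items missing the (hypothetical) title field.
        title = item.get("title")
        if not title:
            raise DropItem("missing title")
        # Cleaning: normalize whitespace before the item is persisted.
        item["title"] = title.strip()
        return item
```

To take effect, such a class would be registered in the ITEM_PIPELINES setting of the project.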

Downloader middleware (Downloader middlewares)
Downloader middleware consists of specific hooks between the engine and the downloader. It processes the responses that the downloader passes to the engine (and the requests that the engine passes to the downloader). It provides a simple mechanism to extend Scrapy by inserting custom code. For more information, see Downloader Middleware.
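A minimal downloader middleware sketch, assuming we just want to tag outgoing requests with a custom header and log response status codes (both purely illustrative choices):

```python
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request passing from the engine to the downloader.
        request.headers.setdefault("X-Example", "1")
        return None  # None means: continue handling this request normally

    def process_response(self, request, response, spider):
        # Called for every response passing from the downloader back to the engine.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response
```

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting.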

Spider middleware (Spider middlewares)
Spider middleware consists of specific hooks between the engine and the spiders, processing spider input (responses) and output (items and requests). It provides a simple mechanism to extend Scrapy by inserting custom code. For more information, see Spider Middleware.
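As a sketch, a spider middleware that filters the spider's output might look like this (the "short title" rule is purely an illustrative assumption):

```python
class DropShortTitleMiddleware:
    def process_spider_output(self, response, result, spider):
        # 'result' is the iterable of items/requests produced by the spider
        # for this response, on its way back to the engine.
        for element in result:
            if isinstance(element, dict) and len(element.get("title", "")) < 3:
                continue  # drop items whose (hypothetical) title is too short
            yield element
```

Such a class would be enabled through the SPIDER_MIDDLEWARES setting.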



Data Flow

The Data Flow in Scrapy is controlled by the execution engine. The process is as follows:

1. The engine opens a website (opens a domain), finds the spider that handles that website, and requests the first URL(s) to crawl from that spider.
2. The engine obtains the first URLs to crawl from the spider and schedules them as Requests in the Scheduler.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the Downloader through the downloader middleware (request direction).
5. Once the page is downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware (response direction).
6. The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware (input direction).
7. The spider processes the Response and returns the scraped Items and (follow-up) new Requests to the engine.
8. The engine passes the Items returned by the spider to the Item Pipeline and sends the Requests returned by the spider to the scheduler.
9. The process repeats (from step 2) until there are no more requests in the scheduler, at which point the engine closes the website.

II. Data Flow and Code Execution Analysis


Here we analyze the data flow in combination with the code, following steps 1-9 above.

(1) Find the spider -- look up the relevant spider definition in the project's spiders folder.

(2) The engine obtains the URLs -- they come from the start_urls list in the custom spider.

(3) ...

(4) ...

(5) Through steps (3), (4), and (5), a Request is generated internally from the URL, and the downloader generates a Response from that Request. That is, URL -> Request -> Response.

(6) ...

(7) In the custom spider, the default parse() method or a specified parse_*() callback is called to process the received Response. The result of this processing matters in two ways:

First, extracted Item values are returned.

Second, if crawling needs to continue, new Requests are returned to the engine. (This is the key to "automatically" crawling multiple web pages; see the sketch after this walkthrough.)

(8)(9) The engine keeps scheduling until there are no more requests.
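The sketch below illustrates step (7): a parse() method that both yields items (sent on to the Item Pipeline) and yields a follow-up Request (sent back to the scheduler), which keeps the loop in steps (8)(9) running. The selectors and the "next page" link are hypothetical:

```python
import scrapy

class FollowSpider(scrapy.Spider):
    name = "follow_example"
    start_urls = ["http://example.com/page/1"]

    def parse(self, response):
        # First: extract item values from the current page.
        for entry in response.css("div.entry"):
            yield {"text": entry.css("::text").get()}

        # Second: return a new Request so the engine keeps crawling.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```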


Advanced:

The Scrapy architecture is a star topology: the engine sits at the center, coordinating and controlling the operation of the whole system.
