On the architecture of Scrapy


Scrapy is a web crawling framework developed in Python.


1, Introduction

The goal of Python's instant web crawler project is to turn the Internet into a big database. Pure open source code is not the whole of open source; the core of open source is an "open mind": aggregating the best ideas, technologies, and people. We will therefore refer to a number of leading products, such as Scrapy, Scrapinghub, Import.io, and so on.


This article briefly explains the architecture of Scrapy. Yes, the universal extractor GsExtractor is meant to be integrated into the Scrapy architecture.

Please note that this article does not intend to retell the original documentation. Its purpose is to find reference points for the development direction of the open source Python crawler, using more than 9 years of web crawler development experience as a benchmark, so it contains many subjective comments. If you want to read the official text, please see the Architecture page on the Scrapy official website.

2, Scrapy framework composition

[Figure: Scrapy architecture diagram (https://pic1.zhimg.com/8c591d54457bb033812a2b0364011e9c_b.png)]

Spiders are the content extractors written for a specific target site, and they are the most heavily customized part of this generic web crawler framework. When you create a crawler project with Scrapy, a spider scaffold is generated; you simply fill in your extraction code, follow its expected conventions, and it is integrated into Scrapy's overall data flow (a minimal sketch of such a spider follows). The goal of the Python instant web crawler open source project is to save more than half of a programmer's time; the key is to speed up defining and testing spiders (see "1 minute to quickly generate a web content extractor"), so that the whole Scrapy crawler system achieves rapid customization.
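As a hedged illustration, here is a minimal sketch roughly matching the scaffold that `scrapy genspider` produces; the spider name, domain, and CSS selector are placeholders for illustration, not part of the original article.

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Unique name used on the command line, e.g. `scrapy crawl example`.
    name = "example"
    allowed_domains = ["example.com"]          # placeholder domain
    start_urls = ["https://example.com/"]      # placeholder start page

    def parse(self, response):
        # The site-specific extraction logic goes here; this selector is
        # illustrative only.
        for title in response.css("h1::text").getall():
            yield {"title": title}
```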


3, Scrapy data flow


The data flow in Scrapy is controlled by the execution engine. The following points are excerpted from the Scrapy official website; I add comments based on my own speculation, as references for the further development direction of the GooSeeker open source crawler:

    • The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as requests.

Who prepares the URLs? It looks like the spider prepares them itself, so we can guess that the Scrapy core (excluding the spider) mainly does event scheduling and does not care where the URLs are stored. This resembles the crawler compass in the GooSeeker member center, which prepares a batch of URLs for a target site and places them in the compass, ready for a crawling run. So the next goal of this open source project is to put URL management into a centralized dispatch repository.
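As a rough sketch of the "who prepares the URLs" question (my own assumption, not from the article): a spider can either list URLs in start_urls or generate Requests itself in start_requests(), for example by reading them from an external store. The file name below is a made-up placeholder.

```python
import scrapy


class SeededSpider(scrapy.Spider):
    name = "seeded"

    def start_requests(self):
        # Illustrative only: the URLs could come from any centralized store.
        # Here a plain text file named "urls.txt" (one URL per line) is assumed.
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```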

    • The Engine asks the Scheduler for the next URLs to crawl.

This point is hard to understand on its own; you need to read a few other documents. Following point 1, after the engine takes the URLs from the spider, it wraps them into Request objects and hands them to the event loop; the Scheduler receives them and manages them. For the moment, think of it as queuing the requests. The engine then asks the Scheduler for the next page address to download.

    • The Scheduler returns the next URLs to crawl to the Engine, and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).

The engine requests a task from the Scheduler and hands the requested task over to the Downloader. Between the Downloader and the Engine sits the downloader middleware, a highlight that a development framework must have: developers can plug custom extensions in there, as sketched below.
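As a minimal sketch of such a custom extension (the class name, header, and priority below are assumptions for illustration, not Scrapy defaults), a downloader middleware can hook both the request and the response direction:

```python
class CustomHeaderDownloaderMiddleware:
    """Hypothetical downloader middleware: tags outgoing requests and
    logs incoming responses."""

    def process_request(self, request, spider):
        # Request direction: called for every request the Engine sends
        # to the Downloader.
        request.headers.setdefault("X-Crawler", "gooseeker-demo")
        return None  # None means: continue normal processing

    def process_response(self, request, response, spider):
        # Response direction: called for every response the Downloader
        # returns to the Engine.
        spider.logger.debug("Fetched %s (status %s)", response.url, response.status)
        return response
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, for example {"myproject.middlewares.CustomHeaderDownloaderMiddleware": 543}; the module path and priority are placeholders.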

    • Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).

The download is complete, producing a Response that is handed to the Engine via the downloader middleware. Note that Response, like the earlier Request, is capitalized; although I have not yet read the other Scrapy documents, I suspect these are event objects within the Scrapy framework. One can further infer an asynchronous, event-driven engine, similar to the three-level event loop of the DS scraping machine. This is necessary for a high-performance, low-overhead engine.

    • The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).

Here, once again, there is a middleware that gives developers plenty of room to extend; a sketch follows.
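Again as a hedged sketch (class and logic invented for illustration), a spider middleware can observe both directions around the spider:

```python
class LoggingSpiderMiddleware:
    """Hypothetical spider middleware: watches responses on the input
    direction and the spider's yields on the output direction."""

    def process_spider_input(self, response, spider):
        # Input direction: called before the response reaches the spider callback.
        spider.logger.debug("Spider %s receiving %s", spider.name, response.url)
        return None  # continue processing

    def process_spider_output(self, response, result, spider):
        # Output direction: called with whatever the callback yields
        # (items and new Requests); here everything passes through unchanged.
        for element in result:
            yield element
```

It would be enabled through the SPIDER_MIDDLEWARES setting, with a placeholder path and priority just like the downloader middleware above.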

    • The Spider processes the Response and returns scraped items and new requests (to follow) to the Engine.

Each spider crawls one page at a time, in sequence, and then constructs another Request to start crawling the next page; a typical parse callback that does this is sketched below.
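As a sketch of that pattern (the selectors and URL are placeholders), a parse callback typically yields both scraped items and new Requests for the pages to follow:

```python
import scrapy


class FollowSpider(scrapy.Spider):
    name = "follow_demo"
    start_urls = ["https://example.com/list"]   # placeholder URL

    def parse(self, response):
        # Yield scraped items from the current page (illustrative selectors).
        for row in response.css("div.item"):
            yield {
                "title": row.css("a::text").get(),
                "link": response.urljoin(row.css("a::attr(href)").get("")),
            }
        # Yield a new Request that follows the "next page" link, if any.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```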

    • The Engine passes scraped items and new requests returned by a spider through Spider Middleware (output direction), and then sends processed items to Item Pipelines and processed requests to the Scheduler.

The Engine acts as the event dispatcher here: items go to the item pipelines, and new requests go back to the Scheduler. A minimal item pipeline is sketched below.
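A minimal sketch of where the dispatched items end up, an item pipeline (the class name and validation rule are made up for illustration):

```python
from scrapy.exceptions import DropItem


class RequireTitlePipeline:
    """Hypothetical item pipeline: the Engine's processed items arrive here;
    items without a 'title' field are dropped."""

    def process_item(self, item, spider):
        if not item.get("title"):
            raise DropItem(f"Missing title in {item!r}")
        return item
```

It would be enabled through the ITEM_PIPELINES setting, for example {"myproject.pipelines.RequireTitlePipeline": 300}; the path and priority are placeholders.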

    • The process repeats (from step 1) until there are no more requests from the Scheduler.

The loop keeps running continuously until the Scheduler is empty.

4, Next steps

Next, we will study Scrapy's documentation further in order to integrate the Python instant web crawler with Scrapy.


5, Document modification history

2016-06-11: v1.0, first release


This article is from the "Fullerhua blog"; reprinting is declined.
