On the architecture of Scrapy


Scrapy is a web crawling framework developed in Python.


1, Introduction

The goal of Python's instant web crawler project is to turn the Internet into a big database. Pure open source code is not the whole of open source; the core of open source is an "open mind": aggregating the best ideas, technologies, and people. We will therefore refer to a number of leading products, such as Scrapy, Scrapinghub, Import.io, and so on.


This article briefly explains the architecture of Scrapy. Yes, the universal extractor GsExtractor is meant to be integrated into the Scrapy architecture.

Please note that this article does not intend to retell the original documentation. Its purpose is to find reference points for the development direction of the open source Python crawler, using more than 9 years of web crawler development experience as a benchmark, so it contains many subjective comments. If you want to read the official text, please see the Architecture page on the Scrapy official website.

2, Scrapy framework composition

[Figure: Scrapy architecture diagram (https://pic1.zhimg.com/8c591d54457bb033812a2b0364011e9c_b.png)]

Spiders are the content extractors written for a specific target site, and they are the most heavily customized part of this generic web crawler framework. When you create a crawler project with Scrapy, a spider scaffold is generated; you simply fill in your extraction code, follow its expected conventions, and it is integrated into Scrapy's overall data flow (a minimal sketch of such a spider follows). The goal of the Python instant web crawler open source project is to save more than half of a programmer's time; the key is to speed up defining and testing spiders (see "1 minute to quickly generate a web content extractor"), so that the whole Scrapy crawler system achieves rapid customization.
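As a hedged illustration, here is a minimal sketch roughly matching the scaffold that `scrapy genspider` produces; the spider name, domain, and CSS selector are placeholders for illustration, not part of the original article.

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Unique name used on the command line, e.g. `scrapy crawl example`.
    name = "example"
    allowed_domains = ["example.com"]          # placeholder domain
    start_urls = ["https://example.com/"]      # placeholder start page

    def parse(self, response):
        # The site-specific extraction logic goes here; this selector is
        # illustrative only.
        for title in response.css("h1::text").getall():
            yield {"title": title}
```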


3, Scrapy data flow


The data flow in Scrapy is controlled by the execution engine. The following points are excerpted from the Scrapy official website; I add comments based on my own speculation, as references for the further development direction of the GooSeeker open source crawler:

    • The Engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler, as requests.

Who prepares the URLs? It looks like the spider prepares them itself, so we can guess that the Scrapy core (excluding the spider) mainly does event scheduling and does not care where the URLs are stored. This resembles the crawler compass in the GooSeeker member center, which prepares a batch of URLs for a target site and places them in the compass, ready for a crawling run. So the next goal of this open source project is to put URL management into a centralized dispatch repository.
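As a rough sketch of the "who prepares the URLs" question (my own assumption, not from the article): a spider can either list URLs in start_urls or generate Requests itself in start_requests(), for example by reading them from an external store. The file name below is a made-up placeholder.

```python
import scrapy


class SeededSpider(scrapy.Spider):
    name = "seeded"

    def start_requests(self):
        # Illustrative only: the URLs could come from any centralized store.
        # Here a plain text file named "urls.txt" (one URL per line) is assumed.
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```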

    • The Engine asks the Scheduler for the next URLs to crawl.

This point is hard to understand on its own; you need to read a few other documents. Following point 1, after the engine takes the URLs from the spider, it wraps them into Request objects and hands them to the event loop; the Scheduler receives them and manages them. For the moment, think of it as queuing the requests. The engine then asks the Scheduler for the next page address to download.

    • The Scheduler returns the next URLs to crawl to the Engine, and the Engine sends them to the Downloader, passing through the Downloader Middleware (request direction).

The engine requests a task from the Scheduler and hands the requested task over to the Downloader. Between the Downloader and the Engine sits the downloader middleware, a highlight that a development framework must have: developers can plug custom extensions in there, as sketched below.
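As a minimal sketch of such a custom extension (the class name, header, and priority below are assumptions for illustration, not Scrapy defaults), a downloader middleware can hook both the request and the response direction:

```python
class CustomHeaderDownloaderMiddleware:
    """Hypothetical downloader middleware: tags outgoing requests and
    logs incoming responses."""

    def process_request(self, request, spider):
        # Request direction: called for every request the Engine sends
        # to the Downloader.
        request.headers.setdefault("X-Crawler", "gooseeker-demo")
        return None  # None means: continue normal processing

    def process_response(self, request, response, spider):
        # Response direction: called for every response the Downloader
        # returns to the Engine.
        spider.logger.debug("Fetched %s (status %s)", response.url, response.status)
        return response
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, for example {"myproject.middlewares.CustomHeaderDownloaderMiddleware": 543}; the module path and priority are placeholders.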

    • Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middleware (response direction).

The download is complete, producing a Response that is handed to the Engine via the downloader middleware. Note that Response, like the earlier Request, is capitalized; although I have not yet read the other Scrapy documents, I suspect these are event objects within the Scrapy framework. One can further infer an asynchronous, event-driven engine, similar to the three-level event loop of the DS scraping machine. This is necessary for a high-performance, low-overhead engine.

    • The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (input direction).

Here, once again, there is a middleware that gives developers plenty of room to extend; a sketch follows.
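Again as a hedged sketch (class and logic invented for illustration), a spider middleware can observe both directions around the spider:

```python
class LoggingSpiderMiddleware:
    """Hypothetical spider middleware: watches responses on the input
    direction and the spider's yields on the output direction."""

    def process_spider_input(self, response, spider):
        # Input direction: called before the response reaches the spider callback.
        spider.logger.debug("Spider %s receiving %s", spider.name, response.url)
        return None  # continue processing

    def process_spider_output(self, response, result, spider):
        # Output direction: called with whatever the callback yields
        # (items and new Requests); here everything passes through unchanged.
        for element in result:
            yield element
```

It would be enabled through the SPIDER_MIDDLEWARES setting, with a placeholder path and priority just like the downloader middleware above.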

    • The Spider processes the Response and returns scraped items and new requests (to follow) to the Engine.

Each spider crawls one page at a time, in sequence, and then constructs another Request to start crawling the next page; a typical parse callback that does this is sketched below.
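As a sketch of that pattern (the selectors and URL are placeholders), a parse callback typically yields both scraped items and new Requests for the pages to follow:

```python
import scrapy


class FollowSpider(scrapy.Spider):
    name = "follow_demo"
    start_urls = ["https://example.com/list"]   # placeholder URL

    def parse(self, response):
        # Yield scraped items from the current page (illustrative selectors).
        for row in response.css("div.item"):
            yield {
                "title": row.css("a::text").get(),
                "link": response.urljoin(row.css("a::attr(href)").get("")),
            }
        # Yield a new Request that follows the "next page" link, if any.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```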

    • The Engine passes scraped items and new requests returned by a spider through Spider Middleware (output direction), and then sends processed items to Item Pipelines and processed requests to the Scheduler.

The Engine acts as the event dispatcher here: items go to the item pipelines, and new requests go back to the Scheduler. A minimal item pipeline is sketched below.
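A minimal sketch of where the dispatched items end up, an item pipeline (the class name and validation rule are made up for illustration):

```python
from scrapy.exceptions import DropItem


class RequireTitlePipeline:
    """Hypothetical item pipeline: the Engine's processed items arrive here;
    items without a 'title' field are dropped."""

    def process_item(self, item, spider):
        if not item.get("title"):
            raise DropItem(f"Missing title in {item!r}")
        return item
```

It would be enabled through the ITEM_PIPELINES setting, for example {"myproject.pipelines.RequireTitlePipeline": 300}; the path and priority are placeholders.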

    • The process repeats (from step 1) until there are no more requests from the Scheduler.

The loop keeps running continuously until the Scheduler is empty.

4, Next steps

Next, we will study Scrapy's documentation further in order to integrate the Python instant web crawler with Scrapy.


5, Document modification history

2016-06-11: v1.0, first release


This article is from the "Fullerhua blog"; reprinting is declined.
