Scrapy: a web crawling framework developed in Python.
1, Introduction
The goal of the Python instant web crawler project is to turn the Internet into one big database. Publishing open source code is not the whole of open source; its core is an "open mind", aggregating the best ideas, technologies, and people. To that end we will study a number of leading products, such as Scrapy, Scrapinghub, Import.io, and so on.
This article briefly explains the architecture of Scrapy. The aim is to integrate the universal extractor GsExtractor into the Scrapy architecture.
Please note that this article does not intend to retell the original content; rather, it looks for reference points for the development direction of this open-source Python crawler, using more than 9 years of web crawler development experience as a benchmark. The article therefore contains many subjective comments. If you want to read the official Scrapy material, please see the architecture page on the Scrapy website.
2, Scrapy framework composition
[Figure: Scrapy architecture diagram, https://pic1.zhimg.com/8c591d54457bb033812a2b0364011e9c_b.png]
Spiders are content extractors written for a specific target website, and they are the most heavily customized part of this general-purpose crawler framework. When you create a crawler project with Scrapy, a spider skeleton is generated; you simply fill in the extraction code and set its run mode, and the spider plugs into Scrapy's overall data flow. The goal of the Python instant web crawler open-source project is to save programmers more than half of their time, and the key is to speed up the definition and testing of spiders (see the article "1-minute quick generation of a web content extractor"), so that the whole Scrapy crawler system can be customized quickly.
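For reference, a newly created spider looks roughly like the skeleton below. This is a sketch in the style of what the scrapy genspider command produces; the spider name, domain, and URL are placeholders, not part of the original project. The site-specific extraction code is all that has to be filled in:

import scrapy

class ExampleSpider(scrapy.Spider):
    # Skeleton in the style generated by "scrapy genspider example example.com";
    # all names here are placeholders.
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Only this site-specific extraction logic needs to be written;
        # scheduling, downloading, and retries are handled by the framework.
        pass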
3, Scrapy data flow
The data flow in Scrapy is controlled by the execution engine. The points below are digested from the Scrapy official website, with my own comments added based on some speculation, as references for the further development direction of the GooSeeker open-source crawler:
Who prepares the URLs? It looks like the spider prepares them itself, so we can infer that the Scrapy architecture (excluding the spider) mainly does event scheduling and does not care where URLs are stored. This is like the Crawler Compass in the GooSeeker member center, which prepares a batch of URLs for the target website and places them in the compass, ready for the crawling run. So, a next goal of this open-source project is to move URL management into a centralized dispatch repository.
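As a rough sketch of that direction (the fetch_pending_urls helper below is hypothetical and stands in for whatever centralized repository is chosen), a spider can override start_requests() so that URLs come from an external store instead of being hard-coded in the spider:

import scrapy

def fetch_pending_urls():
    # Hypothetical helper: in practice this might query a database, a Redis
    # queue, or the GooSeeker crawler compass for the batch of pending URLs.
    return ["https://example.com/page/1", "https://example.com/page/2"]

class CentralizedUrlSpider(scrapy.Spider):
    name = "centralized_urls"  # placeholder name

    def start_requests(self):
        # URLs are pulled from a central dispatch repository rather than
        # hard-coded in start_urls, as discussed above.
        for url in fetch_pending_urls():
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").extract_first()}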
This step is harder to understand at first; it helps to read a few other documents. After step 1, the engine takes the URLs obtained from the spider, wraps them into Requests, and hands them to the event loop, where the scheduler receives them for scheduling management. For the moment, think of it as putting the Requests into a queue. The engine then asks the scheduler for the address of the next page to download.
The engine requests the next task from the scheduler and hands the Request over to the downloader. Between the downloader and the engine sits the downloader middleware, which is a necessary highlight of a development framework: a place where developers can hook in custom extensions.
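As a minimal sketch of such a custom extension (the class name, header value, and log message are illustrative assumptions), a downloader middleware sees every Request on its way to the downloader and every Response on its way back:

class CustomHeaderMiddleware:
    # Minimal downloader middleware sketch; names and values are illustrative.
    def process_request(self, request, spider):
        # Called for every request before it reaches the downloader.
        request.headers.setdefault("User-Agent", "my-crawler/0.1")
        return None  # returning None lets normal processing continue

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the engine.
        spider.logger.debug("Downloaded %s (%d)", response.url, response.status)
        return response

It would be switched on through the DOWNLOADER_MIDDLEWARES setting in settings.py; the module path and priority number used there are project-specific.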
When the download is complete, a Response is produced and handed to the engine through the downloader middleware. Note that Response, like the earlier Request, is capitalized. Although I have not read the other Scrapy documents yet, I suspect these are event objects inside the Scrapy framework, from which one can also infer an asynchronous, event-driven engine, similar to the three-level event loop in GooSeeker's DS DataScraper. This is necessary for a high-performance, low-overhead engine.
Once again there is a middleware here, the spider middleware between the engine and the spider, which gives developers plenty of room to extend.
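For illustration only (the class name and the filtering rule are assumptions), a spider middleware can post-process whatever the spider yields before it reaches the engine:

class DropEmptyTitlesMiddleware:
    # Minimal spider middleware sketch; the filtering rule is an assumption.
    def process_spider_output(self, response, result, spider):
        # Called with the items and requests the spider yields for a response.
        for entry in result:
            if isinstance(entry, dict) and not entry.get("title"):
                continue  # drop items without a title
            yield entry

It is enabled through the SPIDER_MIDDLEWARES setting, in the same way as the downloader middleware above.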
The spider crawls pages one by one in sequence; after processing one page, it constructs another Request event to start the crawl of the next page.
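In code, that looks roughly like the parse() method below (the selectors and paging structure are assumptions): extracted data is yielded as items, and newly discovered page addresses are yielded as new Requests, which drive the next round of the data flow:

import scrapy

class ListSpider(scrapy.Spider):
    # Illustrative sketch; the site structure and selectors are assumptions.
    name = "list"
    start_urls = ["https://example.com/list?page=1"]

    def parse(self, response):
        # Extracted data goes back to the engine as items ...
        for title in response.css("h2.entry a::text").extract():
            yield {"title": title}
        # ... and the next page goes back as a new Request event.
        next_href = response.css("a.next::attr(href)").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)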
The engine does the event distribution: scraped items are handed on for further processing (the item pipelines), and new Requests are handed back to the scheduler.
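A minimal item pipeline sketch (the class name and cleaning rule are assumptions) shows where the distributed items end up:

class StripWhitespacePipeline:
    # Minimal item pipeline sketch; the cleaning rule is an assumption.
    def process_item(self, item, spider):
        # Called once for every item the engine hands to the pipelines.
        if isinstance(item, dict) and item.get("title"):
            item["title"] = item["title"].strip()
        return item

It would be registered through the ITEM_PIPELINES setting in settings.py.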
The process keeps running like this, round after round, until there are no more requests to crawl.
4, Next steps
Next, we will study Scrapy's documentation further, in preparation for integrating the Python instant web crawler with Scrapy.
5, Document revision history
2016-06-11: V1.0, first release
This article is from the "Fullerhua blog"; reprinting is declined.