Review
After covering the basics of crawlers, there are two paths we can take.
One is to keep studying in depth: learn more about design patterns, strengthen our grasp of Python, build our own wheels, and keep extending our crawlers with features such as distributed crawling and multi-threading. The other is to learn some excellent frameworks: first get comfortable enough with them to handle basic crawling tasks, solving the bread-and-butter problem so to speak, and then study their source code to deepen our understanding further.
Personally, I see the former path as building your own wheels. Predecessors have already written some fairly good frameworks that can be used directly, but doing it all yourself lets you study more deeply and gain a more complete understanding of crawlers. The latter path is to take a well-written framework someone else has built, learn to use it well, make sure you can finish the tasks you need to finish, and only then dig into it further. On the first path, the more you explore, the more thorough your knowledge of crawlers becomes. On the second, using someone else's work is convenient, but you may never get around to studying the framework deeply, and your thinking may be constrained by it.
That said, I personally lean towards the latter. Building wheels is nice, but even when you build wheels, aren't you still building on top of the standard library? Use what is there to be used: the point of learning a framework is to make sure it can cover real crawling needs, which is the most basic bread-and-butter problem. If you keep building wheels, end up creating nothing, and after all that study still cannot write the crawler someone asks you for, isn't that more trouble than it is worth? So for advanced crawling I still recommend learning a framework or two and keeping them as weapons of your own. At the very least, it is like going to the battlefield with a gun: you can at least hit the enemy, which beats sharpening your knife forever, doesn't it?
Framework Overview
I have worked with a few crawler frameworks, and the most useful ones are Scrapy and pyspider. Personally, pyspider is simpler and easier to operate because it adds a web interface, lets you write crawlers quickly, and integrates PhantomJS, so it can be used to crawl JavaScript-rendered pages. Scrapy is highly customizable and lower-level than pyspider; it suits learning and research, requires more background knowledge, but is a very good base for studying things like distributed crawling and multi-threading.
Here I will share my own learning experience with you. I hope you like it and that it gives you some help.
Pyspider
pyspider is an open-source crawler framework implemented by binux. Its main functional requirements are:
- Crawl and update scheduled specific pages across multiple sites
- Extract structured information from pages
- Be flexible and extensible, with stable and reliable monitoring
This is also what the vast majority of Python crawlers need: targeted crawling and structured parsing. But when facing sites with very different structures, a single crawling mode will not always suffice, so flexible crawl control is a must. To achieve this, a plain configuration file is often not flexible enough, so controlling the crawl through scripts becomes the final choice.
Meanwhile, deduplication, scheduling, queuing, fetching, exception handling, monitoring and other functions are provided to the crawl scripts by the framework, which also preserves flexibility. Finally, with a web-based editing and debugging environment plus web task monitoring, it becomes this framework.
pyspider is designed around a crawl-loop model driven by Python scripts:
- Python scripts extract structured information and control follow-link scheduling, giving maximum flexibility
- Scripts are written and debugged in a web-based environment, and scheduling status is shown on the web
- The crawl-loop model is mature and stable; the modules are independent of each other and connected through message queues, so the system can scale flexibly from a single process to a multi-machine distributed deployment
The pyspider architecture is mainly divided into the scheduler, the fetcher, and the processor (script execution):
- The components are connected through message queues; apart from the scheduler, which is a single point, both the fetcher and the processor can be deployed as multiple distributed instances. The scheduler is responsible for overall scheduling control
- Tasks are scheduled by the scheduler; the fetcher fetches the web content and the processor executes the pre-written Python script, which outputs results or generates new follow-up tasks (sent back to the scheduler), forming a closed loop
- Each script can flexibly parse pages with any Python library, use the framework API to control the next crawl action, and set callbacks to control how parsing proceeds
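To make this concrete, here is a minimal sketch of a pyspider handler script in the style of the official quick-start example; the seed URL and the CSS selectors are placeholders you would replace with your own:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)          # re-run the entry point once a day
    def on_start(self):
        # self.crawl schedules a fetch and names the callback that will parse it
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)   # consider index results valid for 10 days
    def index_page(self, response):
        # follow every outgoing link and parse each one with detail_page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # return structured data; the dict we return is the crawl result
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }
```

Each `self.crawl` call creates a new task that goes back to the scheduler, which is exactly the closed loop described above.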
Scrapy
Scrapy is an application framework written to crawl website data and extract structured data. It can be used in a range of programs including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is roughly as follows.
Scrapy mainly includes the following components:
- Engine (Scrapy Engine): handles the data flow of the whole system and triggers events (the core of the framework)
- Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the URLs of pages or links to crawl); it decides which URL to crawl next and removes duplicate URLs
- Downloader: downloads web content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model)
- Spiders: do the main work of extracting the information they need from specific web pages, the so-called items (Item). They can also extract links from a page so that Scrapy continues on to the next page
- Item Pipeline: responsible for processing the items the spiders extract from web pages; its main jobs are persisting items, validating them, and discarding unwanted data. After a page is parsed by a spider, its items are sent to the item pipeline and processed by several components in a specific order (see the pipeline sketch after this list)
- Downloader middlewares: hooks between the Scrapy engine and the downloader that mainly process the requests and responses passed between them
- Spider middlewares: hooks between the Scrapy engine and the spiders whose main job is to process the spiders' response input and request output
- Scheduler middlewares: middleware between the Scrapy engine and the scheduler that processes the requests and responses sent between them
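To give a rough idea of how items and the item pipeline fit together, here is a minimal sketch assuming a reasonably recent Scrapy version; the `ArticleItem` fields and the drop-items-without-a-title rule are made-up examples, not part of any real project:

```python
import scrapy
from scrapy.exceptions import DropItem


class ArticleItem(scrapy.Item):
    # hypothetical structured fields a spider would fill in
    title = scrapy.Field()
    url = scrapy.Field()


class ValidateAndStorePipeline:
    """Example item pipeline: validates items and keeps them in memory."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # validate; raising DropItem removes the item from further processing
        if not item.get('title'):
            raise DropItem('missing title in %r' % item)
        self.items.append(dict(item))
        return item
```

A pipeline like this is switched on through the project's `ITEM_PIPELINES` setting, which also defines the order in which pipelines run.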
The Scrapy workflow is roughly as follows:
- First, the engine takes a link (URL) from the scheduler for the next crawl
- The engine wraps the URL in a request (Request) and passes it to the downloader, which downloads the resource and wraps it in a response packet (Response)
- The spider then parses the Response
- If items (Item) are parsed out, they are handed to the item pipeline for further processing
- If links (URL) are parsed out, the URLs are handed to the scheduler to wait for crawling; a minimal spider showing this loop is sketched below
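To make the loop concrete, here is a minimal sketch of a Scrapy spider, again assuming a reasonably recent Scrapy version; the spider name, start URL, and CSS selectors are placeholders, not taken from any particular site:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # name and start_urls are illustrative placeholders
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # parsed data is yielded as an item and flows into the item pipeline
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # extracted links are yielded as new requests and go back to the scheduler
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
```

Running `scrapy crawl example` inside a Scrapy project would start this spider, with the engine driving the request/response loop described above.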
Conclusion
That is the basic introduction to these two frameworks. Next I will cover how to install and use each of them; I hope it will be helpful to everyone.
Reprinted from: Quiet Find » Python Crawler Advanced Part One: Crawler Framework Overview