[Repost] Python exercises: the Scrapy web crawler framework
I. Overview
The figure (not reproduced here) shows the general architecture of Scrapy, including its main components and the system's data flow (indicated by green arrows in the original figure). The functions of each component and the data processing flow are described below.
II. Components
1. Scrapy Engine
The Scrapy engine controls the data flow of the entire system and triggers events as actions occur. For details, see the data processing flow below.
2. Scheduler
The scheduler receives requests from the Scrapy engine, queues and orders them, and returns them to the engine when it asks for the next request to crawl.
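To make the scheduler's role concrete, here is an illustrative-only sketch (class and method names are my own, not Scrapy's implementation): it enqueues requests by priority, filters duplicates, and hands requests back one at a time. Scrapy's real scheduler de-duplicates using request fingerprints; a plain set of URLs stands in for that here.

```python
import heapq

class ToySchedulerSketch:
    """Toy stand-in for a crawler scheduler: priority queue + dedup filter."""

    def __init__(self):
        self._heap = []      # pending requests, ordered by priority
        self._seen = set()   # crude duplicate filter (real Scrapy fingerprints requests)
        self._n = 0          # insertion counter to break priority ties FIFO

    def enqueue_request(self, url, priority=0):
        if url in self._seen:
            return False  # duplicate request, silently dropped
        self._seen.add(url)
        # heapq is a min-heap, so negate priority to pop higher priorities first
        heapq.heappush(self._heap, (-priority, self._n, url))
        self._n += 1
        return True

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

s = ToySchedulerSketch()
s.enqueue_request("http://example.com/a")
s.enqueue_request("http://example.com/b", priority=10)
s.enqueue_request("http://example.com/a")  # duplicate, ignored
print(s.next_request())  # http://example.com/b (higher priority pops first)
```

The engine would call `next_request` in a loop, feeding each returned request to the downloader.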
3. Downloader
The downloader is responsible for fetching web pages and returning their content to the spiders.
4. Spiders
A spider is a class defined by the Scrapy user to parse web pages and extract content from the responses returned for its URLs. Each spider can handle one domain name or a group of domain names; in other words, it defines the crawling and parsing rules for a specific website.
The entire crawling cycle of a spider is roughly as follows:
1. It starts by generating requests for its initial URLs and specifies a callback to handle their responses.
2. In the callback, it parses the downloaded response and yields extracted items and/or new requests to follow.
3. Each new request goes through the same cycle, until no requests remain.
5. Item Pipeline
The item pipeline is responsible for processing the items extracted from web pages by the spiders. Its main tasks are to cleanse, validate, and store the data. After a page is parsed by a spider, the extracted items are sent to the item pipeline and processed by several components in a specific order. Each item pipeline component is a Python class with a simple method: it receives an item, runs its method on it, and decides whether to pass the item on to the next component in the pipeline or drop it without further processing.
An item pipeline typically performs tasks such as cleansing HTML data, validating the scraped data, checking for duplicates, and storing the item in a database.
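As a sketch of this idea (class and field names are my own, not from the article), a pipeline component is just a Python class exposing a `process_item` method. Real Scrapy code would raise `scrapy.exceptions.DropItem` to discard an item; a plain `ValueError` stands in for it here so the example needs no Scrapy import:

```python
class PricePipeline:
    """Validate and normalize a 'price' field; drop items without one."""

    def process_item(self, item, spider):
        if not item.get("price"):
            # Real Scrapy code: raise scrapy.exceptions.DropItem(...)
            raise ValueError("missing price, item dropped")
        item["price"] = round(float(item["price"]), 2)
        return item  # returning the item passes it to the next pipeline component

pipeline = PricePipeline()
print(pipeline.process_item({"name": "book", "price": "9.999"}, spider=None))
# {'name': 'book', 'price': 10.0}
```

Returning the item continues the chain; raising the drop exception short-circuits it, exactly the "continue or discard" decision described above.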
6. Downloader Middlewares
Downloader middleware is a hook framework that sits between the Scrapy engine and the downloader. It processes the requests and responses passing between them, and provides a way to extend Scrapy's functionality with custom code. It is a lightweight, low-level system that gives global control over Scrapy's requests and responses.
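A sketch of the hook shape (the class name is illustrative, and plain dicts stand in for Scrapy's `Request`/`Response` objects; `process_request` and `process_response` are the real hook names Scrapy calls):

```python
class UserAgentMiddleware:
    """Set a custom User-Agent on every outgoing request."""

    def __init__(self, user_agent="my-crawler/1.0"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Called before the request reaches the downloader. Mutating the
        # request and returning None lets processing continue normally.
        request["headers"]["User-Agent"] = self.user_agent
        return None

    def process_response(self, request, response, spider):
        # Called on the way back; may inspect, replace, or retry the response.
        return response

mw = UserAgentMiddleware()
req = {"headers": {}}  # stand-in for a scrapy.Request
mw.process_request(req, spider=None)
print(req["headers"])  # {'User-Agent': 'my-crawler/1.0'}
```

This "request out, response back" symmetry is what makes the middleware a natural place for global concerns such as headers, proxies, or caching.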
7. Spider Middlewares
Spider middleware is a hook framework that sits between the Scrapy engine and the spiders. It processes the spiders' response input and request output, and provides a way to extend Scrapy's functionality with custom code. By inserting custom code, you can process the responses sent to spiders as well as the requests and items that spiders return.
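For instance, the real Scrapy hook `process_spider_output` sees everything a spider yields and may filter or transform it. In this sketch (class name and field are my own; plain dicts stand in for items), items lacking a title are silently dropped:

```python
class DropEmptyTitlesMiddleware:
    """Spider middleware sketch: filter out items with an empty 'title'."""

    def process_spider_output(self, response, result, spider):
        # 'result' is the iterable of items/requests the spider yielded.
        for obj in result:
            if isinstance(obj, dict) and not obj.get("title"):
                continue  # drop this item; requests would pass through untouched
            yield obj

mw = DropEmptyTitlesMiddleware()
scraped = [{"title": "A"}, {"title": ""}, {"title": "B"}]
print(list(mw.process_spider_output(None, iter(scraped), None)))
# [{'title': 'A'}, {'title': 'B'}]
```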
8. Scheduler Middlewares
Scheduler middleware sits between the Scrapy engine and the scheduler. It processes the requests and responses passing between them, and provides a way to extend Scrapy's functionality with custom code.
III. Data Processing Flow
The entire data processing flow of Scrapy is controlled by the Scrapy engine. It operates roughly as follows:
1. The engine opens a domain, locates the spider that handles it, and asks the spider for the first URLs to crawl.
2. The engine gets those first URLs from the spider and schedules them as requests with the scheduler.
3. The engine asks the scheduler for the next request to crawl.
4. The scheduler returns the next request, and the engine sends it to the downloader through the downloader middleware.
5. Once the page is downloaded, the downloader sends the response back to the engine through the downloader middleware.
6. The engine passes the response to the spider through the spider middleware.
7. The spider processes the response and returns scraped items and new requests to the engine.
8. The engine sends the items to the item pipeline and the new requests to the scheduler.
9. The process repeats from step 3 until no requests remain, and the engine then closes the domain.
IV. Drive
Scrapy is built on top of Twisted, a popular event-driven Python networking framework. It therefore uses non-blocking, asynchronous processing to handle many requests concurrently.
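The benefit of non-blocking processing is that many downloads can be "in flight" at once instead of waiting one after another. Twisted itself may not be on hand, so this standard-library `asyncio` sketch (not Scrapy code) illustrates the same principle with simulated downloads:

```python
import asyncio
import time

async def fake_download(url, delay):
    # asyncio.sleep yields control to the event loop instead of blocking
    # the thread, standing in for network I/O latency.
    await asyncio.sleep(delay)
    return url

async def main():
    start = time.monotonic()
    # Five "downloads" of 0.1s each run concurrently, not sequentially.
    pages = await asyncio.gather(*(fake_download(f"page{i}", 0.1) for i in range(5)))
    return pages, time.monotonic() - start

pages, elapsed = asyncio.run(main())
print(len(pages), f"{elapsed:.2f}s")  # total time is ~0.1s, not 0.5s
```

The event loop interleaves the waits, which is exactly how Scrapy's Twisted reactor keeps the downloader busy while responses are pending.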