Python exercises: the web crawler framework Scrapy

Source: Internet
Author: User


I. Overview

The figure below shows the general architecture of Scrapy, including its main components and the system's data flow (indicated by the green arrows). The functions of each component and the data processing flow are described below.

[Figure: Scrapy architecture and data flow]

II. Components

1. Scrapy Engine

The Scrapy engine controls the data processing flow of the entire system and triggers transactions. For details, see the data processing process below.

2. Scheduler

The scheduler accepts requests from the Scrapy engine, puts them into a queue, and returns them to the engine when the engine asks for the next request.

3. Downloader

The downloader's main responsibility is to fetch web pages and return their content to the spiders.

4. Spiders

Spiders are user-defined Scrapy classes that parse web pages and extract content from the responses returned for the crawled URLs. Each spider can handle one domain name or a group of domain names; in other words, a spider defines the crawling and parsing rules for a specific website.

The entire crawling process (cycle) of a spider is roughly as follows: the spider starts from a set of initial URLs, and the engine fetches the first responses; the spider's callback parses each response, yielding extracted items and/or new requests to follow; the new requests are scheduled and downloaded in turn, with their responses handled by the spider's callbacks; and the items returned along the way are sent to the item pipeline. A minimal spider sketch is shown below.
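The sketch below is only an illustration of this cycle; the quotes site, CSS selectors, and field names (text, author) are assumptions made for the example and do not come from the original article.

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    """Minimal spider: crawls one domain and yields parsed items."""
    name = "example"
    allowed_domains = ["quotes.toscrape.com"]        # the domain this spider handles
    start_urls = ["https://quotes.toscrape.com/"]    # initial URLs of the crawl cycle

    def parse(self, response):
        # Extract data from the page; yielded dicts are sent to the item pipeline.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link; its response comes back to this same callback.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```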

5. Item Pipeline

The item pipeline's main responsibility is to process the items that spiders extract from web pages; its main tasks are cleaning, validating, and storing data. After a page is parsed by a spider, the extracted items are sent to the item pipeline and processed by several components in a specific order. Each item pipeline component is a Python class implementing a simple method: it receives an item, runs its processing, and decides whether to pass the item on to the next component in the pipeline or drop it and stop processing.

The item pipeline typically performs steps such as cleaning HTML data, validating the scraped data, checking for (and dropping) duplicates, and storing the item in a database or other backend. A minimal pipeline sketch is shown below.
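The sketch below assumes the text and author fields from the earlier spider example; the cleaning and validation rules are illustrative, not taken from the article.

```python
from scrapy.exceptions import DropItem


class CleanAndValidatePipeline:
    """Cleans and validates items; drops those that fail validation."""

    def process_item(self, item, spider):
        # Clean: strip surrounding whitespace from the text field.
        if item.get("text"):
            item["text"] = item["text"].strip()
        # Validate: discard items missing the required author field.
        if not item.get("author"):
            raise DropItem("Missing author in %r" % item)
        # Returning the item passes it on to the next pipeline component.
        return item
```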

6. Downloader Middleware

Downloader middleware is a hook framework that sits between the Scrapy engine and the downloader. It processes the requests and responses passing between them and provides a way to extend Scrapy's functionality with custom code; it is a lightweight, low-level system that gives global control over requests and responses. A sketch of a custom downloader middleware is shown below.
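The sketch below only illustrates the two hooks; the header name and value it sets are assumptions for the example.

```python
class CustomHeaderDownloaderMiddleware:
    """Hooks into requests on their way to the downloader and
    responses on their way back to the engine."""

    def process_request(self, request, spider):
        # Runs for every request the engine sends toward the downloader.
        request.headers.setdefault("User-Agent", "my-scrapy-bot/1.0")
        return None  # None means: continue processing this request normally

    def process_response(self, request, response, spider):
        # Runs for every response the downloader returns to the engine.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response
```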

7. Spider Middleware

Spider middleware is a hook framework between the Scrapy engine and the spiders. It processes the responses flowing into the spiders and the requests and items flowing out of them, and it provides a way to extend Scrapy's functionality with custom code: you can insert code that runs on the responses delivered to the spiders and on the items and requests that the spiders return. A sketch is shown below.
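The sketch below is a minimal pass-through spider middleware; the class name and behavior are assumptions chosen only to show where the two hooks sit.

```python
class PassThroughSpiderMiddleware:
    """Sits between the engine and the spiders: sees responses going in
    and items/requests coming out."""

    def process_spider_input(self, response, spider):
        # Called for each response before it reaches the spider callback.
        return None  # None means: pass the response on to the spider

    def process_spider_output(self, response, result, spider):
        # Called with everything the spider callback returned;
        # here each item/request is passed through unchanged.
        for item_or_request in result:
            yield item_or_request
```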

8. Scheduler Middleware

Scheduler middleware sits between the Scrapy engine and the scheduler. It processes the requests passing between the engine and the scheduler and provides a way to extend Scrapy's functionality with custom code.

III. Data Processing Process

The entire data processing process of Scrapy is controlled by the Scrapy engine. It mainly operates as follows:

1. The engine opens a domain, locates the spider that handles it, and asks the spider for the first URLs to crawl.
2. The engine gets the first URLs from the spider and schedules them as requests with the scheduler.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next request, and the engine sends it to the downloader through the downloader middleware.
5. The downloader fetches the page and sends the response back to the engine through the downloader middleware.
6. The engine passes the response to the spider through the spider middleware.
7. The spider parses the response and returns scraped items and new requests to the engine.
8. The engine sends the items to the item pipeline and the new requests to the scheduler.
9. The process repeats from step 3 until there are no more requests in the scheduler, at which point the engine closes the domain.
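The article does not show how the custom components above are plugged into this flow, but a typical way is to register them in the project's settings.py. In the sketch below, the module paths and priority numbers are assumptions made for the example.

```python
# settings.py (excerpt) -- component paths are hypothetical examples
ITEM_PIPELINES = {
    "myproject.pipelines.CleanAndValidatePipeline": 300,   # lower number = runs earlier
}

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomHeaderDownloaderMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.PassThroughSpiderMiddleware": 600,
}
```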

IV. Driver

Scrapy is written on top of Twisted, a popular event-driven networking framework for Python, so it uses non-blocking (asynchronous) processing. A minimal sketch of running a spider inside the Twisted reactor is shown below.
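The sketch uses Scrapy's CrawlerProcess, which starts the Twisted reactor for you; the spider, URL, and selector are placeholders chosen for the example.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class DemoSpider(scrapy.Spider):
    """Tiny spider just to demonstrate running a crawl in-process."""
    name = "demo"
    start_urls = ["https://quotes.toscrape.com/"]  # placeholder URL

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(DemoSpider)
process.start()  # starts the Twisted reactor; blocks until the crawl finishes
```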
