Python crawling framework Scrapy: architecture

I recently learned how to scrape data with Python and came across Scrapy, a very popular Python crawling framework. Below I take a look at the Scrapy architecture; the tool itself is easy to use.

I. Overview

The architecture diagram shows the general structure of Scrapy, including its main components and how data flows through the system (the green arrows indicate the data flow). The following sections describe the function of each component and the data processing flow.

II. Components

1. Scrapy Engine

The Scrapy engine controls the data processing of the entire system and triggers transactions (events). For more details, see the data processing flow below.

2. Scheduler

The scheduler accepts requests from the Scrapy engine, sorts them into a queue, and returns them to the engine when the engine asks for them.

3. Downloader

The downloader's main responsibility is to fetch web pages and return the page content to the spiders.

4. Spiders

A spider is a class defined by the Scrapy user to parse web pages and extract content from the responses returned for a set of URLs. Each spider can handle one domain name or a group of domain names; in other words, it defines the crawling and parsing rules for a specific website.

The entire crawling cycle of a spider is as follows:

1). First, initial requests are made for the first URLs, and a callback function is registered to be called when each response comes back. The first requests are made by calling the start_requests() method, which by default generates a request for each URL in start_urls and registers the parse method as the callback.
2). In the callback function, you parse the response and return item objects, request objects, or an iterable of both. These requests also carry a callback; Scrapy downloads them and their responses are handled by the specified callback.
3). In the callback function, you typically parse the page content with XPath selectors (though you can also use BeautifulSoup, lxml, or any other tool you like) and generate items from the parsed data.
4). Finally, the items returned from the spider are usually sent to the item pipeline.
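
A minimal sketch of such a spider is shown below. The start URL, XPath expressions, and field names are assumptions made up for illustration; they are not part of the original article.

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # Hypothetical spider: the URL, XPath expressions, and field names
        # below are illustrative assumptions.
        name = "example"
        start_urls = ["https://example.com/articles/"]

        def parse(self, response):
            # Steps 2) and 3) of the cycle: parse the response with XPath
            # selectors and yield items.
            for entry in response.xpath("//div[@class='entry']"):
                yield {
                    "title": entry.xpath(".//h2/text()").get(),
                    "link": entry.xpath(".//a/@href").get(),
                }
            # Yielding a request continues the cycle: Scrapy downloads the
            # next page and calls the given callback with its response.
            next_page = response.xpath("//a[@rel='next']/@href").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)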

5. Item Pipeline

The item pipeline's main responsibility is to process the items extracted from web pages by the spiders; its main tasks are cleaning, validating, and storing data. After a page is parsed by a spider, the resulting items are sent to the item pipeline and processed by its components in a specific order. Each item pipeline component is a plain Python class with a simple method: it receives an item, acts on it, and decides whether the item should continue to the next component in the pipeline or be dropped and not processed further.

The item pipeline typically performs the following steps:

1). Clean HTML data
2). Validate the parsed data (check that the item contains the required fields)
3). Check for duplicates (drop the item if it is a duplicate)
4). Store the parsed data in the database
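
A minimal sketch of a pipeline component covering steps 2) and 3) might look like the following; the class name and the "title" field are assumptions for illustration.

    from scrapy.exceptions import DropItem

    class ValidateAndDeduplicatePipeline:
        # Hypothetical pipeline component; the "title" field is an assumption.
        def __init__(self):
            self.seen_titles = set()

        def process_item(self, item, spider):
            # Validate: the item must contain the required field.
            if not item.get("title"):
                raise DropItem("missing title")
            # Deduplicate: drop items whose title has been seen before.
            if item["title"] in self.seen_titles:
                raise DropItem("duplicate item")
            self.seen_titles.add(item["title"])
            # Returning the item passes it on to the next pipeline component.
            return item

Such a component is enabled by listing it in the ITEM_PIPELINES setting together with an integer that determines the order in which the pipeline components run.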

6. Downloader Middlewares

Downloader middleware is a hook framework that sits between the Scrapy engine and the downloader. It processes the requests and responses that pass between them and provides a way to extend Scrapy's behavior with custom code. It is a lightweight, low-level system that gives you global control over the requests Scrapy sends and the responses it receives.
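
As a rough sketch, a downloader middleware is a class with process_request and/or process_response methods; the header value and log message below are illustrative assumptions.

    class CustomHeadersDownloaderMiddleware:
        # Hypothetical downloader middleware; the header value is an assumption.
        def process_request(self, request, spider):
            # Called for every request the engine passes to the downloader.
            request.headers.setdefault("User-Agent", "my-crawler/0.1")
            return None  # None means: continue processing this request normally.

        def process_response(self, request, response, spider):
            # Called for every response on its way back to the engine.
            spider.logger.debug("downloaded %s with status %s", request.url, response.status)
            return response

A middleware like this is enabled through the DOWNLOADER_MIDDLEWARES setting.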

7. Spider Middlewares

Spider middleware is a hook framework between the Scrapy engine and the spiders. It processes the spiders' input (responses) and output (requests and items) and provides a way to extend Scrapy's behavior with custom code. Spider middleware attaches to Scrapy's spider processing mechanism: you can insert custom code that runs on the responses sent to the spiders and on the requests and items the spiders return.
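
A small sketch of a spider middleware that post-processes the spider's output is shown below; the "source_url" key is an assumption for illustration.

    class AnnotateSourceSpiderMiddleware:
        # Hypothetical spider middleware; the "source_url" key is an assumption.
        def process_spider_output(self, response, result, spider):
            # "result" is the iterable of items and requests the spider
            # returned for this response; everything yielded here continues
            # downstream to the engine.
            for element in result:
                if isinstance(element, dict):
                    element.setdefault("source_url", response.url)
                yield element

A middleware like this is enabled through the SPIDER_MIDDLEWARES setting.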

8. Scheduler Middlewares

Scheduler middleware sits between the Scrapy engine and the scheduler. It processes the requests passed from the engine to the scheduler and back, and provides a way to extend Scrapy's behavior with custom code.

III. Data Processing Flow

The entire data processing flow of Scrapy is controlled by the Scrapy engine. It mainly operates as follows:

1). The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URLs to crawl.
2). The engine gets the first URLs to crawl from the spider and schedules them as requests with the scheduler.
3). The engine asks the scheduler for the next URLs to crawl.
4). The scheduler returns the next URLs to crawl to the engine, and the engine sends them to the downloader through the downloader middleware.
5). Once the pages are downloaded, the downloader sends the responses back to the engine through the downloader middleware.
6). The engine receives the responses from the downloader and sends them to the spiders through the spider middleware for processing.
7). The spiders process the responses and return the crawled items and new requests to the engine.
8). The engine sends the crawled items to the item pipeline and sends the new requests to the scheduler.
9). The process repeats from the second step until there are no more requests in the scheduler, at which point the engine closes the connection to the domain.
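
To see this cycle in action, a crawl can be started from a plain script; a minimal sketch, reusing the hypothetical ExampleSpider defined earlier, might look like this.

    from scrapy.crawler import CrawlerProcess

    # Assumes the hypothetical ExampleSpider from the earlier sketch is
    # defined in the same file or imported.
    process = CrawlerProcess()
    process.crawl(ExampleSpider)
    process.start()  # Blocks until the scheduler has no more requests.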

IV. Driver

Scrapy is written on top of Twisted, a popular event-driven networking framework for Python, so it uses non-blocking (asynchronous) processing.

That is all for this article. I hope it is helpful for your learning.
