Python's Web Crawler Framework Scrapy

A web crawler is a program that harvests data from the web: you use it to fetch specific pages and extract data from their HTML. You can build a crawler with a few standalone libraries, but using a framework greatly improves efficiency and shortens development time. Scrapy is written in Python; it is lightweight, simple, and easy to use.


I. Overview


The following figure shows Scrapy's overall architecture, including its main components and the system's data flow (indicated by the green arrows). Each component's role and the way data moves through the system are described below (note: the picture is from the Internet).




II. Components

1. Scrapy Engine

The Scrapy engine controls the data flow of the whole system and triggers events as each action occurs. More detail is given in the data processing flow below.

2. Scheduler

The scheduler accepts requests from the Scrapy engine, enqueues them, and returns them to the engine when the engine asks for them.

3. Downloader

The downloader's main responsibility is to fetch web pages and return their content to the spiders.

4. Spiders

Spiders are user-defined classes that Scrapy uses to parse web pages, extract scraped items, and generate additional URLs to follow. Each spider can handle one domain or a group of domains; in other words, a spider defines the crawling and parsing rules for a particular website.

A spider's whole crawl cycle looks like this (a minimal sketch follows this overview):

It starts with initial requests for the first URLs, each carrying a callback to invoke when the response comes back. The first requests are produced by calling the start_requests() method, which by default generates a Request for each URL in start_urls and uses the parse() method as the callback.

In the callback, you parse the response and return item objects, Request objects, or an iterable of both. Any returned requests also carry a callback; Scrapy downloads them and hands the responses to that callback.

In the callback you typically parse the page content with XPath selectors (though you can also use BeautifulSoup, lxml, or any other parser you prefer) and generate items from the parsed data.

Finally, the items returned from the spider are usually persisted through the item pipeline.
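
For example, a minimal spider following this cycle might look like the sketch below. The target site, field names, and XPath expressions are illustrative assumptions, not part of the original article.

import scrapy

class QuotesSpider(scrapy.Spider):
    # A minimal spider: start_urls feeds the default start_requests(),
    # parse() is the default callback, and it yields both items and
    # follow-up requests.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]  # example site

    def parse(self, response):
        # Extract items with XPath selectors.
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                "text": quote.xpath('span[@class="text"]/text()').get(),
                "author": quote.xpath('.//small[@class="author"]/text()').get(),
            }
        # Yield a new request; Scrapy downloads it and calls parse() again.
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)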


5. Item Pipeline

The item pipeline's main responsibility is to process the items the spider extracts from web pages; its main tasks are cleaning, validating, and storing the data. When a page has been parsed by a spider, the extracted items are sent to the item pipeline and passed through its components in a defined order. Each pipeline component is a plain Python class with a simple interface: it receives an item, runs its logic, and decides whether the item continues to the next pipeline stage or is dropped.

An item pipeline typically performs these steps (a minimal sketch follows the list):

Clean HTML data
Validate the parsed data (check that the item contains the required fields)
Check for duplicates (and drop repeated items)
Store the parsed data in a database
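
As a sketch, a validation component of such a pipeline might look like this. The required field names are hypothetical examples, not Scrapy defaults.

from scrapy.exceptions import DropItem

class ValidationPipeline:
    # Drops items that are missing required fields; otherwise passes them on.
    required_fields = ("text", "author")  # hypothetical field names

    def process_item(self, item, spider):
        for field in self.required_fields:
            if not item.get(field):
                # Discard the item instead of sending it further down the pipeline.
                raise DropItem("Missing field %r in %r" % (field, item))
        return item  # hand the item to the next pipeline component

To activate a component like this, you register its class in the project's ITEM_PIPELINES setting with a priority number, as shown in the settings sketch later in this article.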


6. Downloader Middlewares

The downloader middleware is a hook framework that sits between the Scrapy engine and the downloader; it processes the requests and responses that pass between them and provides a simple way to extend Scrapy by plugging in custom code. It is a lightweight, low-level system for globally altering Scrapy's requests and responses.
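
As a hedged sketch, a downloader middleware that tags every outgoing request with a custom header could look like this; the header and class names are illustrative assumptions.

class CustomHeadersMiddleware:
    # Downloader middleware: runs for every request on its way to the downloader.

    def process_request(self, request, spider):
        # Add a header before the request is downloaded; returning None
        # tells Scrapy to keep processing the request as usual.
        request.headers.setdefault("X-Crawled-By", spider.name)
        return None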

7. Spider Middlewares

The spider middleware is a hook framework between the Scrapy engine and the spiders; its main task is to process the responses going into the spiders and the requests and items coming out of them. It provides a way to extend Scrapy with custom code that can inspect or modify what spiders receive and what they return.
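
A comparable sketch of a spider middleware that filters incomplete items out of a spider's output; the field name checked here is a hypothetical example.

class ItemFilterMiddleware:
    # Spider middleware: post-processes everything a spider callback yields.

    def process_spider_output(self, response, result, spider):
        # 'result' is the iterable of items and requests returned by the callback.
        for element in result:
            if isinstance(element, dict) and not element.get("text"):
                continue  # silently drop incomplete items
            yield element  # pass items and requests through unchanged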

8. Scheduler Middlewares

The scheduler middleware sits between the Scrapy engine and the scheduler; its job is to process the requests passing from the engine to the scheduler and back. It provides another place to plug in custom code to extend Scrapy.

III. Data Processing Flow

Scrapy's entire data flow is controlled by the Scrapy engine. It typically works like this (a configuration sketch follows the list):

1. The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URLs to crawl.

2. The engine gets the first URLs to crawl from the spider and schedules them as requests in the scheduler.

3. The engine asks the scheduler for the next URLs to crawl.

4. The scheduler returns the next URLs to crawl to the engine, and the engine sends them to the downloader through the downloader middleware.

5. Once a page has been downloaded, the downloader sends the response to the engine through the downloader middleware.

6. The engine receives the response from the downloader and sends it to the spider for processing through the spider middleware.

7. The spider processes the response and returns the scraped items, along with any new requests, to the engine.

8. The engine sends the scraped items to the item pipeline and the new requests to the scheduler.

9. The process repeats from step 2 until there are no more requests in the scheduler, at which point the engine closes the domain.
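
To wire these components into a real project, the pipelines and middlewares sketched above are registered in the project settings. The module paths and priority numbers below are hypothetical placeholders, not prescribed values.

# Hypothetical fragment of a project's settings.py; module paths are placeholders.
BOT_NAME = "example_bot"

# Items returned by spiders pass through these pipelines, lowest number first.
ITEM_PIPELINES = {
    "example_bot.pipelines.ValidationPipeline": 300,
}

# Requests and responses between the engine and the downloader pass through these.
DOWNLOADER_MIDDLEWARES = {
    "example_bot.middlewares.CustomHeadersMiddleware": 543,
}

# Responses and spider output between the engine and the spiders pass through these.
SPIDER_MIDDLEWARES = {
    "example_bot.middlewares.ItemFilterMiddleware": 543,
}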

IV. Driven by Twisted

Scrapy is built on Twisted, a popular event-driven networking framework for Python, so it uses non-blocking (asynchronous) processing. For more information about asynchronous programming and Twisted, see the Twisted project's documentation.
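
To give a feel for the non-blocking style Scrapy inherits from Twisted, here is a tiny self-contained sketch using a Deferred; the one-second delay stands in for real network I/O and is purely illustrative.

from twisted.internet import reactor, task

def fake_download(url):
    # Return a Deferred that fires with a result after one second,
    # instead of blocking the thread while "downloading".
    return task.deferLater(reactor, 1.0, lambda: "<html>%s</html>" % url)

def on_response(body, url):
    print("downloaded %s (%d bytes)" % (url, len(body)))

if __name__ == "__main__":
    d = fake_download("http://example.com")
    d.addCallback(on_response, "http://example.com")
    d.addCallback(lambda _: reactor.stop())  # stop the event loop when done
    reactor.run()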

