Python crawler framework: Scrapy

There are many Python crawler frameworks, but only a few come up in everyday use. Today we will talk about Scrapy, a fast, high-level, lightweight screen-scraping and web-crawling framework for Python that is used primarily to crawl specific websites and extract structured data from their pages.

The Scrapy framework allows developers to modify it to suit their needs, so they can build crawlers that better fit their projects. In addition, Scrapy provides a variety of base spider classes, including BaseSpider, sitemap spiders, and so on; the latest version also adds support for crawling Web 2.0 sites. Let's take a detailed look at Scrapy.
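To give a concrete feel for the framework, here is a minimal spider sketch; the class name, URL, and CSS selectors are hypothetical placeholders rather than anything prescribed by this article:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal sketch: crawl a hypothetical page and yield structured items."""

    name = "quotes"  # unique spider name, used when launching the crawl
    start_urls = ["https://example.com/quotes"]  # hypothetical entry URL

    def parse(self, response):
        # Extract structured data from the downloaded page with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Inside a Scrapy project, a spider like this would typically be run with scrapy crawl quotes -o quotes.json, which writes the yielded items to a JSON file.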

Uses of Scrapy

Scrapy has a wide range of applications. Besides crawling websites and extracting structured data from pages, it can also be used for data mining, monitoring, automated testing, information processing, and packaging historical data.

Components of Scrapy

1. Engine: handles the data flow across the whole system and triggers events.

2. Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again.

3. Downloader: downloads web page content and returns it to the spiders.

4. Spiders: where the main work is done; they define the parsing rules for specific domains or web pages.

5. Item pipeline: responsible for processing the items that spiders extract from web pages; its main tasks are cleansing, validating, and storing the data. After a page is parsed by a spider, its items are sent to the item pipeline and pass through several stages in a specific order (a minimal pipeline sketch follows this list).

6. Downloader middleware: hooks between the Scrapy engine and the downloader that mainly process the requests and responses passing between them.

7. Spider middleware: hooks between the Scrapy engine and the spiders whose main job is to process the spiders' response input and request output.

8. Scheduler middleware: hooks between the Scrapy engine and the scheduler that process the requests and responses sent between them.
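To make the item pipeline's role more concrete, here is a minimal sketch; the class name and the price field are illustrative assumptions, not part of the original description:

```python
# pipelines.py -- hypothetical pipeline: cleanse and validate items before storage
from scrapy.exceptions import DropItem


class CleanAndValidatePipeline:
    def process_item(self, item, spider):
        # Cleansing step: normalize a hypothetical "price" field.
        price = item.get("price")
        if price:
            item["price"] = price.strip()
            return item
        # Validation step: discard items that are missing required data.
        raise DropItem(f"Missing price in {item!r}")
```

A pipeline like this would be enabled in the project's settings.py via ITEM_PIPELINES = {"myproject.pipelines.CleanAndValidatePipeline": 300}, where the number determines the order in which multiple pipelines run (lower runs first).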

Scrapy data processing flow

Data processing in Scrapy is controlled by the Scrapy engine. The flow is as follows:

1. The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URL(s) to crawl.

2. The engine gets the first URL to crawl from the spider and schedules it as a request in the scheduler.

3. The engine asks the scheduler for the next URLs to crawl.

4. The scheduler returns the next URLs to crawl to the engine, and the engine sends them to the downloader through the downloader middleware.

5. Once a page has finished downloading, the downloader sends the response to the engine through the downloader middleware.

6. The engine receives the response from the downloader and sends it through the spider middleware to the spider for processing.

7. The spider processes the response and returns scraped items, along with new requests, to the engine (a code sketch of this step follows the list).

8. The engine sends the scraped items to the item pipeline and the new requests to the scheduler.

9. The process repeats from step 2 until there are no more requests in the scheduler, at which point the engine disconnects from the domain.
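This flow is easiest to see in a spider's parse method, which can return both scraped items (routed to the item pipeline) and new requests (routed back to the scheduler); the selectors and URLs below are hypothetical:

```python
import scrapy


class ArticleSpider(scrapy.Spider):
    """Sketch of steps 7-9: yield items for the pipeline and new requests for the scheduler."""

    name = "articles"
    start_urls = ["https://example.com/articles"]  # hypothetical listing page

    def parse(self, response):
        # Scraped items are returned to the engine and routed to the item pipeline (step 8).
        for article in response.css("article"):
            yield {
                "title": article.css("h2::text").get(),
                "link": response.urljoin(article.css("a::attr(href)").get() or ""),
            }
        # New requests are returned to the engine and routed to the scheduler,
        # and the cycle repeats (step 9) until no requests remain.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```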

Scrapy is a concise and efficient Python crawler framework that makes collecting data from the web convenient. Wheat Academy will soon launch a Scrapy framework video tutorial with an in-depth look at how the framework is applied; readers who want to keep up with the latest Scrapy material should stay tuned.
