Python crawler framework: Scrapy

There are many Python crawler frameworks, but only a few come up in everyday use. Today we will talk about Scrapy, a fast, high-level, lightweight screen-scraping and web-crawling framework for Python that is used primarily to crawl specific websites and extract structured data from their pages.

The Scrapy framework allows developers to modify it to suit their needs, so they can build crawlers that better fit their projects. In addition, Scrapy provides a variety of base spider classes, including BaseSpider, sitemap spiders, and so on; the latest version also adds support for crawling Web 2.0 sites. Let's take a detailed look at Scrapy.
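To give a concrete feel for the framework, here is a minimal spider sketch; the class name, URL, and CSS selectors are hypothetical placeholders rather than anything prescribed by this article:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal sketch: crawl a hypothetical page and yield structured items."""

    name = "quotes"  # unique spider name, used when launching the crawl
    start_urls = ["https://example.com/quotes"]  # hypothetical entry URL

    def parse(self, response):
        # Extract structured data from the downloaded page with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Inside a Scrapy project, a spider like this would typically be run with scrapy crawl quotes -o quotes.json, which writes the yielded items to a JSON file.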

Uses of Scrapy

Scrapy has a wide range of applications. Besides crawling websites and extracting structured data from pages, it can also be used for data mining, monitoring, automated testing, information processing, and packaging historical data.

Components of Scrapy

1. Engine: handles the data flow across the whole system and triggers events.

2. Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again.

3. Downloader: downloads web page content and returns it to the spiders.

4. Spiders: where the main work is done; they define the parsing rules for specific domains or web pages.

5. Item pipeline: responsible for processing the items that spiders extract from web pages; its main tasks are cleansing, validating, and storing the data. After a page is parsed by a spider, its items are sent to the item pipeline and pass through several stages in a specific order (a minimal pipeline sketch follows this list).

6. Downloader middleware: hooks between the Scrapy engine and the downloader that mainly process the requests and responses passing between them.

7. Spider middleware: hooks between the Scrapy engine and the spiders whose main job is to process the spiders' response input and request output.

8. Scheduler middleware: hooks between the Scrapy engine and the scheduler that process the requests and responses sent between them.
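To make the item pipeline's role more concrete, here is a minimal sketch; the class name and the price field are illustrative assumptions, not part of the original description:

```python
# pipelines.py -- hypothetical pipeline: cleanse and validate items before storage
from scrapy.exceptions import DropItem


class CleanAndValidatePipeline:
    def process_item(self, item, spider):
        # Cleansing step: normalize a hypothetical "price" field.
        price = item.get("price")
        if price:
            item["price"] = price.strip()
            return item
        # Validation step: discard items that are missing required data.
        raise DropItem(f"Missing price in {item!r}")
```

A pipeline like this would be enabled in the project's settings.py via ITEM_PIPELINES = {"myproject.pipelines.CleanAndValidatePipeline": 300}, where the number determines the order in which multiple pipelines run (lower runs first).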

Scrapy data processing flow

Data processing in Scrapy is controlled by the Scrapy engine. The flow is as follows:

1. The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URL(s) to crawl.

2. The engine gets the first URL to crawl from the spider and schedules it as a request in the scheduler.

3. The engine asks the scheduler for the next URLs to crawl.

4. The scheduler returns the next URLs to crawl to the engine, and the engine sends them to the downloader through the downloader middleware.

5. Once a page has finished downloading, the downloader sends the response to the engine through the downloader middleware.

6. The engine receives the response from the downloader and sends it through the spider middleware to the spider for processing.

7. The spider processes the response and returns scraped items, along with new requests, to the engine (a code sketch of this step follows the list).

8. The engine sends the scraped items to the item pipeline and the new requests to the scheduler.

9. The process repeats from step 2 until there are no more requests in the scheduler, at which point the engine disconnects from the domain.
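This flow is easiest to see in a spider's parse method, which can return both scraped items (routed to the item pipeline) and new requests (routed back to the scheduler); the selectors and URLs below are hypothetical:

```python
import scrapy


class ArticleSpider(scrapy.Spider):
    """Sketch of steps 7-9: yield items for the pipeline and new requests for the scheduler."""

    name = "articles"
    start_urls = ["https://example.com/articles"]  # hypothetical listing page

    def parse(self, response):
        # Scraped items are returned to the engine and routed to the item pipeline (step 8).
        for article in response.css("article"):
            yield {
                "title": article.css("h2::text").get(),
                "link": response.urljoin(article.css("a::attr(href)").get() or ""),
            }
        # New requests are returned to the engine and routed to the scheduler,
        # and the cycle repeats (step 9) until no requests remain.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```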

Scrapy is a concise and efficient Python crawler framework that makes collecting data from the web convenient. Wheat Academy will soon launch a Scrapy framework video tutorial with an in-depth look at how the framework is applied; readers who want to keep up with the latest Scrapy material should stay tuned.
