Scrapy is a fast and powerful web crawling framework.
Scrapy is not a function library but a crawler framework: a collection of software structures and functional components that together implement crawling. A framework is a semi-finished product; it helps users build professional web crawlers without starting from scratch.
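Because Scrapy is a framework rather than a library, you typically generate a project skeleton and then fill in your own spider code. A minimal session might look like this (the project and spider names are illustrative, not from the original text):

```bash
pip install scrapy
scrapy startproject demo               # generates the demo/ project skeleton
cd demo
scrapy genspider example example.com   # creates demo/spiders/example.py
scrapy crawl example                   # runs the spider named "example"
```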
I. Introduction to the Scrapy Framework
- The "5+2" structure: five main modules plus two middleware layers.
(1) Engine: controls the data flow among all modules and triggers events based on conditions. No user modification required.
(2) Downloader: downloads web pages according to requests. No user modification required.
(3) Scheduler: schedules and manages all crawl requests. No user modification required.
(4) Downloader middleware: sits between the Engine, Scheduler, and Downloader and implements user-configurable control: modifying, discarding, or adding requests or responses. Users may write configuration code.
(5) Spider: parses the Response returned by the Downloader, produces scraped items, and generates additional crawl requests (Requests). Requires user-written code (see the minimal spider sketch after this list).
(6) Item Pipelines: process the scraped items generated by the Spider in a pipelined fashion, as a sequence of operations where each operation is an Item Pipeline type. Possible operations include cleaning, validating, and checking the HTML data in a scraped item, and storing the data in a database. Requires user-written code.
(7) Spider middleware: re-processes requests and scraped items: modifying, discarding, or adding requests or scraped items. Users may write configuration code.
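As a sketch of the user-written part (component 5 above), a minimal spider might look like the following; the class name, start URL, and CSS selectors are illustrative assumptions, not from the original text:

```python
import scrapy

class DemoSpider(scrapy.Spider):
    # "name" is how the spider is invoked: scrapy crawl demo
    name = "demo"
    start_urls = ["https://example.com"]  # illustrative URL

    def parse(self, response):
        # Parse the Response returned by the Downloader and
        # yield scraped items...
        for title in response.css("h2::text").getall():
            yield {"title": title}
        # ...and/or new crawl Requests, which go back through
        # the Engine to the Scheduler.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```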
Data flow path 1:
1. The Engine gets a crawl request (Request) from the Spider.
2. The Engine forwards the crawl request to the Scheduler for dispatch.
Data flow path 2:
3. The Engine gets the next request to crawl from the Scheduler.
4. The Engine sends the crawl request to the Downloader through the middleware.
5. After crawling the web page, the Downloader forms a Response and sends it to the Engine through the middleware.
6. The Engine sends the received Response to the Spider through the middleware for processing.
Data flow path 3:
7. After processing a Response, the Spider produces scraped items and new crawl requests (Requests) and sends them to the Engine.
8. The Engine sends scraped items to Item Pipelines (the frame exit; see the pipeline sketch after this list).
9. The Engine sends crawl requests to the Scheduler.
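Step 8 hands scraped items to Item Pipelines, the frame exit. A minimal user-written pipeline (component 6 above) might look like this; the class and field names are illustrative assumptions:

```python
from scrapy.exceptions import DropItem

class CleanupPipeline:
    def process_item(self, item, spider):
        # Each pipeline component receives every scraped item in turn.
        if not item.get("title"):
            raise DropItem("missing title")    # discard incomplete items
        item["title"] = item["title"].strip()  # a simple cleanup step
        return item                            # pass the item downstream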
- The Spider is the entry point of the data flow and Item Pipelines are the exit; these, together with the two middleware layers, are the parts the user needs to configure.
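The user-written pipeline and middleware components are wired in through the project's settings.py. A sketch of a tiny downloader middleware (component 4) plus the settings that activate it and the pipeline above, again with illustrative names under an assumed project called demo:

```python
# demo/middlewares.py -- a minimal downloader middleware (illustrative).
class CustomHeadersMiddleware:
    def process_request(self, request, spider):
        # Modify outgoing requests between the Engine and the Downloader.
        request.headers.setdefault("User-Agent", "demo-bot/0.1")
        return None  # None = continue handling this request normally

# demo/settings.py -- activate user-written components by dotted path.
ITEM_PIPELINES = {
    "demo.pipelines.CleanupPipeline": 300,  # lower number runs earlier
}
DOWNLOADER_MIDDLEWARES = {
    "demo.middlewares.CustomHeadersMiddleware": 543,
}
```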
II. Comparison of the Scrapy Framework and the Requests Library
Similarities:
- Both can issue page requests and crawl pages; they are the two most important technical routes for Python crawlers.
- Both offer good usability and rich documentation, and are easy to get started with.
- Neither handles JavaScript, form submission, or CAPTCHAs out of the box (though both are extensible).
Differences:
- Very small needs: use the Requests library.
- Not-so-small needs: use the Scrapy framework, which can crawl continuously and accumulate results into your own crawl store.
- Highly customized needs (regardless of scale): build your own framework, where Requests > Scrapy.
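For the "very small needs" case, the Requests route really is only a few lines; a minimal sketch (the URL is illustrative):

```python
import requests

# The entire "crawler" for a one-off page fetch.
r = requests.get("https://example.com", timeout=10)
r.raise_for_status()          # fail loudly on HTTP errors
print(r.text[:200])           # first 200 characters of the page HTML
```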