Python Crawler: The Scrapy Crawler Framework

Source: Internet
Author: User

2017-07-29 17:50:29

Scrapy is a fast and powerful web crawler framework.

Scrapy is not a function library but a crawler framework. A crawler framework is a collection of software structures and functional components that together implement crawling. It is a semi-finished product that helps users build professional web crawlers.

I. Introduction to the Scrapy Framework

    • "5+2" structure: 5 main modules plus 2 kinds of middleware.

(1) Engine: controls the data flow between all modules and triggers events based on conditions. No user modification required.

(2) Downloader: downloads web pages according to requests. No user modification required.

(3) Scheduler: schedules and manages all crawl requests. No user modification required.

(4) Downloader middleware: provides user-configurable control between the engine and the downloader; it can modify, discard, or add requests or responses. The user may write configuration code.

(5) Spider: parses the response (Response) returned by the downloader, generates scraped items (scraped Item), and generates additional crawl requests (Request). Requires the user to write configuration code.

(6) Item Pipelines: process the scraped items generated by the spider in a pipelined fashion. A pipeline is an ordered sequence of operations, and each operation is an Item Pipeline class. Possible operations include cleaning, validating, and checking for duplicates in the HTML data of scraped items, and storing the data in a database. Requires the user to write configuration code.

(7) Spider middleware: re-processes requests and scraped items; it can modify, discard, or add requests or scraped items. The user may write configuration code.
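
As a rough illustration of item (6): in Scrapy, an Item Pipeline is an ordinary Python class that exposes a `process_item(self, item, spider)` method. The sketch below assumes the scraped items are plain dicts with a hypothetical `title` field; the class name `CleanTitlePipeline` is made up for this example and does not come from the original article.

```python
class CleanTitlePipeline:
    """Hypothetical pipeline step: tidy the 'title' field of each scraped item."""

    def process_item(self, item, spider):
        # Scrapy calls process_item once per scraped item; the returned item
        # is handed on to the next pipeline step.
        title = item.get("title", "")
        # Collapse runs of whitespace left over from the HTML source.
        item["title"] = " ".join(title.split())
        return item


# Stand-alone check outside Scrapy; the spider argument is unused here.
cleaned = CleanTitlePipeline().process_item(
    {"title": "  Hello \n  Scrapy  "}, spider=None
)
print(cleaned)  # {'title': 'Hello Scrapy'}
```

Several such classes can be chained, which is what "handled in a pipelined manner" means: each step receives the item returned by the previous one.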

    • Overview of the data flow

The three data flow paths, path 1:

1 The engine gets crawl requests from the spider.
2 The engine forwards the crawl requests to the scheduler for scheduling.

The three data flow paths, path 2:

3 The engine gets the next request to crawl from the scheduler.
4 The engine sends the crawl request through middleware to the downloader.
5 After fetching the web page, the downloader forms a response (Response) and sends it to the engine via middleware.
6 The engine sends the received response via middleware to the spider for processing.

The three data flow paths, path 3:

7 After processing the response, the spider produces scraped items (scraped Item) and new crawl requests (Requests) and sends them to the engine.
8 The engine sends the scraped items to the Item Pipeline (the framework's exit).
9 The engine sends the new crawl requests to the scheduler.
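
The nine steps can be sketched as a toy, synchronous loop in plain Python. This is not Scrapy's real API: Scrapy runs asynchronously and routes traffic through the two kinds of middleware, both of which are omitted here. The names `FAKE_WEB`, `downloader`, `spider_parse`, and `run_engine` are hypothetical stand-ins, and the duplicate check is a simplification added so that the loop terminates.

```python
from collections import deque

# Hypothetical "web": url -> (page text, links found on that page).
FAKE_WEB = {
    "page1": ("first page", ["page2"]),
    "page2": ("second page", []),
}


def downloader(request):
    # Steps 4-5: turn a request into a response.
    text, links = FAKE_WEB[request]
    return {"url": request, "text": text, "links": links}


def spider_parse(response):
    # Step 7: produce one scraped item plus any follow-up requests.
    item = {"url": response["url"], "length": len(response["text"])}
    return [item], response["links"]


def run_engine(start_requests):
    scheduler = deque()  # pending crawl requests
    seen = set()         # requests already scheduled
    stored = []          # stands in for the Item Pipeline (the exit)

    def schedule(request):
        if request not in seen:
            seen.add(request)
            scheduler.append(request)

    # Path 1 (steps 1-2): the spider's initial requests go to the scheduler.
    for request in start_requests:
        schedule(request)

    while scheduler:
        request = scheduler.popleft()                  # step 3
        response = downloader(request)                 # steps 4-5
        items, new_requests = spider_parse(response)   # steps 6-7
        stored.extend(items)                           # step 8
        for new_request in new_requests:               # step 9
            schedule(new_request)
    return stored


results = run_engine(["page1"])
print(results)
# [{'url': 'page1', 'length': 10}, {'url': 'page2', 'length': 11}]
```

The loop ends when the scheduler has no requests left, which mirrors the engine repeatedly pulling requests from the scheduler until none remain.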

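    • Entry and exit of the data flow, and the parts the user needs to configure: the spider's initial crawl requests are the entry and the Item Pipelines are the exit. The user writes the Spider and the Item Pipelines, and may optionally write the two kinds of middleware.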
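
In a Scrapy project, user-written components are typically switched on in the project's `settings.py`. The fragment below is only a sketch: `ITEM_PIPELINES`, `DOWNLOADER_MIDDLEWARES`, and `SPIDER_MIDDLEWARES` are Scrapy setting names, but the project name `myproject` and every class path are hypothetical placeholders.

```python
# settings.py (fragment). Every "myproject..." path is a made-up placeholder.

# Item Pipelines: the integer sets the order; lower values run first.
ITEM_PIPELINES = {
    "myproject.pipelines.ExamplePipeline": 300,
}

# Optional: user-written downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ExampleDownloaderMiddleware": 543,
}

# Optional: user-written spider middleware.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.ExampleSpiderMiddleware": 543,
}
```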

II. Comparison of Scrapy and the requests Library

Similarities:

    • Both can request and crawl pages; they are the two main technical routes for Python crawlers.
    • Both are easy to use, well documented, and easy to get started with.
    • Neither provides JavaScript handling, form submission, or CAPTCHA handling out of the box (both can be extended).

Differences in when to use each:

    • For very small tasks, use the requests library.
    • For tasks that are not so small, use the Scrapy framework, which can crawl continuously and accumulate the results into your own crawl repository.
    • For highly customized requirements (regardless of scale), build your own framework; in that case requests is preferable to Scrapy.
