Python_scrapy_01: Introduction to the Scrapy architecture and workflow

1, Overview

    • Scrapy is an application framework written in pure Python for crawling web sites and extracting structured data, and it has a very wide range of uses.

    • The framework is powerful: users only need to customize a few modules to easily implement a crawler that scrapes web content and all kinds of images, which is very convenient.

    • Scrapy uses Twisted (pronounced ['twɪstɪd]; its main rival is Tornado), an asynchronous networking framework, to handle network communication. This speeds up our downloads without requiring us to implement an asynchronous framework ourselves, and it includes various middleware interfaces that can flexibly fulfil all kinds of requirements.

2, Framework composition

Request flow: in fact, requests are not sent by the engine itself. The engine issues an instruction to the scheduler, the scheduler sends requests to the downloader, and the downloader downloads the corresponding data and hands it to the spider for processing. After processing the data, the spider gives the useful data to the pipeline; if follow-up requests are produced (for example, a request for the second page), the spider sends those requests back through the engine to the scheduler, which puts them into the queue and later hands them to the downloader to download again. A minimal spider illustrating this round trip follows.
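Below is a sketch of such a spider against the public practice site quotes.toscrape.com; the spider name and the CSS selectors are only illustrative, while scrapy.Spider, parse(), and response.follow() are the real Scrapy API:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"  # illustrative spider name
        start_urls = ["https://quotes.toscrape.com/"]  # public practice site

        def parse(self, response):
            # The engine hands each downloaded response to parse() by default.
            for quote in response.css("div.quote"):
                # Yielded items travel through the engine to the pipeline.
                yield {"text": quote.css("span.text::text").get()}
            # Yielded requests travel through the engine back to the scheduler,
            # e.g. the request for the second page mentioned above.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)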

The function of each module:

    • Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler. (All signals are issued by the engine.)

    • Scheduler: responsible for accepting requests sent by the engine, arranging them into the queue in a certain order, and returning them when the engine asks for them. The scheduler has two important jobs: 1. Deduplication: if an identical request was made before, the scheduler recognizes it and does not send it again. 2. Dispatching: requests enter the dispatch queue and are later dequeued and handed to the downloader.

    • Downloader: responsible for downloading all the requests sent by the Scrapy Engine and returning the responses it obtains to the engine, which hands them to the Spider for processing.

    • Spider (crawler): handles all responses, extracts data from them, obtains the data needed for the item fields and passes it to the pipeline, and submits the URLs that need to be followed up (as new requests) to the engine, which puts them into the Scheduler again.

    • Item Pipeline (pipeline): responsible for handling the items obtained by the Spider and performing post-processing (detailed analysis, filtering, storage, etc.); a minimal pipeline sketch appears after this list.

    • Downloader Middlewares (download middleware: can customize requests): a component for customizing and extending the download functionality; a middleware sketch also appears after this list.

    • Spider Middlewares (spider middleware: handles communication between the engine and the Spider): a functional component for customizing and extending the intermediate communication between the engine and the Spider (e.g. responses going into the Spider, and requests coming out of the Spider).
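To make the Item Pipeline bullet concrete, here is a minimal sketch of a pipeline that writes every item to a JSON Lines file. The class name JsonWriterPipeline and the file name items.jl are made up for this example; open_spider, close_spider, and process_item are the standard Scrapy pipeline hooks:

    # pipelines.py
    import json

    class JsonWriterPipeline:
        def open_spider(self, spider):
            # Called once when the spider starts.
            self.file = open("items.jl", "w", encoding="utf-8")

        def close_spider(self, spider):
            # Called once when the spider finishes.
            self.file.close()

        def process_item(self, item, spider):
            # Called for every item the spider yields: store it, then pass it on.
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

A pipeline only runs if it is registered under ITEM_PIPELINES in settings.py.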

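Similarly, for the Downloader Middlewares bullet: a sketch of a middleware that customizes each outgoing request by rotating the User-Agent header. RandomUserAgentMiddleware and the user-agent strings are illustrative; process_request is the real downloader-middleware hook, and the middleware must be enabled under DOWNLOADER_MIDDLEWARES in settings.py:

    # middlewares.py
    import random

    class RandomUserAgentMiddleware:
        USER_AGENTS = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        ]

        def process_request(self, request, spider):
            # Modify the request before the downloader sends it out.
            request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
            return None  # None means: continue normal handling of this request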
3, Operation Process

Once the code is written, the program begins to run ...

1. Engine: Hi! Spider, which website do you want to work on?
2. Spider: The boss wants me to handle xxxx.com.
3. Engine: Give me the first URL that needs to be processed.
4. Spider: Here you are; the first URL is xxxxxxx.com.
5. Engine: Hi! Scheduler, I have a request here; please help me put it in the queue.
6. Scheduler: OK, I am handling it. Wait a moment.
7. Engine: Hi! Scheduler, give me a request you have handled.
8. Scheduler: Here you are; this is a request I have handled.
9. Engine: Hi! Downloader, please download this request according to the boss's downloader-middleware settings.
10. Downloader: OK! Here you are, the downloaded content. (If it fails: Sorry, this request failed to download. The engine then tells the scheduler: this request failed to download, record it, and we will download it again later.)
11. Engine: Hi! Spider, here is the downloaded content, already processed according to the boss's downloader middleware. Handle it yourself. (Note: by default, responses are handed to the def parse() function for processing.)
12. Spider: (after processing the data, for URLs that need follow-up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the item data I obtained.
13. Engine: Hi! Pipeline, I have an item here; please handle it for me! Scheduler! Here is a URL that needs follow-up; please handle it. Then the loop starts again from step 4, until the boss has all the information needed.
14. Pipeline & Scheduler: OK, doing it right now!

Note: The whole program stops only when the scheduler has no requests left to process. (URLs that failed to download will also be downloaded again by Scrapy.)
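The re-download behaviour mentioned in the note is handled by Scrapy's built-in RetryMiddleware. The setting names below are real Scrapy settings; the values shown are just examples, close to the defaults:

    # settings.py
    RETRY_ENABLED = True   # retry failed downloads
    RETRY_TIMES = 2        # extra attempts per failed request
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]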

4, The four steps of writing a Scrapy crawler
    • Step 1: Create a new project (scrapy startproject XXX): create a new crawler project.
    • Step 2: Define the targets (write items.py): identify the data you want to crawl (a sketch follows this list).
    • Step 3: Make the spider (spiders/xxspider.py): write the spider and start crawling the pages.
    • Step 4: Store the content (pipelines.py): design the pipeline that stores the crawled content.
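Steps 1 and 2 might look like the sketch below; xxx, ArticleItem, and the field names are placeholders, while scrapy.Item and scrapy.Field are the real API. Step 3 is the spider itself (see the sketch in section 2), and step 4 is the pipeline sketch shown earlier:

    # Step 1, on the command line (xxx is a placeholder project name):
    #   scrapy startproject xxx

    # Step 2, items.py -- declare the fields you plan to crawl.
    import scrapy

    class ArticleItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
        body = scrapy.Field()

The spider is then run with "scrapy crawl <spider name>" from inside the project directory.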
