Python_scrapy_01: Introduction to the Scrapy architecture and workflow

1, Overview

    • Scrapy is an application framework written in pure Python for crawling web sites and extracting structured data, and it has a very wide range of uses.

    • The framework is powerful: users only need to customize a few modules to easily implement a crawler that scrapes web content and all kinds of images, which is very convenient.

    • Scrapy uses Twisted (pronounced ['twɪstɪd]; its main rival is Tornado), an asynchronous networking framework, to handle network communication. This speeds up our downloads without requiring us to implement an asynchronous framework ourselves, and it includes various middleware interfaces that can flexibly fulfil all kinds of requirements.

2, Framework composition

Request flow: in fact, requests are not sent by the engine itself. The engine issues an instruction to the scheduler, the scheduler sends requests to the downloader, and the downloader downloads the corresponding data and hands it to the spider for processing. After processing the data, the spider gives the useful data to the pipeline; if follow-up requests are produced (for example, a request for the second page), the spider sends those requests back through the engine to the scheduler, which puts them into the queue and later hands them to the downloader to download again. A minimal spider illustrating this round trip follows.
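Below is a sketch of such a spider against the public practice site quotes.toscrape.com; the spider name and the CSS selectors are only illustrative, while scrapy.Spider, parse(), and response.follow() are the real Scrapy API:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"  # illustrative spider name
        start_urls = ["https://quotes.toscrape.com/"]  # public practice site

        def parse(self, response):
            # The engine hands each downloaded response to parse() by default.
            for quote in response.css("div.quote"):
                # Yielded items travel through the engine to the pipeline.
                yield {"text": quote.css("span.text::text").get()}
            # Yielded requests travel through the engine back to the scheduler,
            # e.g. the request for the second page mentioned above.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)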

The function of each module:

    • Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler. (All signals are issued by the engine.)

    • Scheduler: responsible for accepting requests sent by the engine, arranging them into the queue in a certain order, and returning them when the engine asks for them. The scheduler has two important jobs: 1. Deduplication: if an identical request was made before, the scheduler recognizes it and does not send it again. 2. Dispatching: requests enter the dispatch queue and are later dequeued and handed to the downloader.

    • Downloader: responsible for downloading all the requests sent by the Scrapy Engine and returning the responses it obtains to the engine, which hands them to the Spider for processing.

    • Spider (crawler): handles all responses, extracts data from them, obtains the data needed for the item fields and passes it to the pipeline, and submits the URLs that need to be followed up (as new requests) to the engine, which puts them into the Scheduler again.

    • Item Pipeline (pipeline): responsible for handling the items obtained by the Spider and performing post-processing (detailed analysis, filtering, storage, etc.); a minimal pipeline sketch appears after this list.

    • Downloader Middlewares (download middleware: can customize requests): a component for customizing and extending the download functionality; a middleware sketch also appears after this list.

    • Spider Middlewares (spider middleware: handles communication between the engine and the Spider): a functional component for customizing and extending the intermediate communication between the engine and the Spider (e.g. responses going into the Spider, and requests coming out of the Spider).
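To make the Item Pipeline bullet concrete, here is a minimal sketch of a pipeline that writes every item to a JSON Lines file. The class name JsonWriterPipeline and the file name items.jl are made up for this example; open_spider, close_spider, and process_item are the standard Scrapy pipeline hooks:

    # pipelines.py
    import json

    class JsonWriterPipeline:
        def open_spider(self, spider):
            # Called once when the spider starts.
            self.file = open("items.jl", "w", encoding="utf-8")

        def close_spider(self, spider):
            # Called once when the spider finishes.
            self.file.close()

        def process_item(self, item, spider):
            # Called for every item the spider yields: store it, then pass it on.
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

A pipeline only runs if it is registered under ITEM_PIPELINES in settings.py.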

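Similarly, for the Downloader Middlewares bullet: a sketch of a middleware that customizes each outgoing request by rotating the User-Agent header. RandomUserAgentMiddleware and the user-agent strings are illustrative; process_request is the real downloader-middleware hook, and the middleware must be enabled under DOWNLOADER_MIDDLEWARES in settings.py:

    # middlewares.py
    import random

    class RandomUserAgentMiddleware:
        USER_AGENTS = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        ]

        def process_request(self, request, spider):
            # Modify the request before the downloader sends it out.
            request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
            return None  # None means: continue normal handling of this request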
3, Operation Process

Once the code is written, the program begins to run ...

1. Engine: Hi! Spider, which website do you want to work on?
2. Spider: The boss wants me to handle xxxx.com.
3. Engine: Give me the first URL that needs to be processed.
4. Spider: Here you are; the first URL is xxxxxxx.com.
5. Engine: Hi! Scheduler, I have a request here; please help me put it in the queue.
6. Scheduler: OK, I am handling it. Wait a moment.
7. Engine: Hi! Scheduler, give me a request you have handled.
8. Scheduler: Here you are; this is a request I have handled.
9. Engine: Hi! Downloader, please download this request according to the boss's downloader-middleware settings.
10. Downloader: OK! Here you are, the downloaded content. (If it fails: Sorry, this request failed to download. The engine then tells the scheduler: this request failed to download, record it, and we will download it again later.)
11. Engine: Hi! Spider, here is the downloaded content, already processed according to the boss's downloader middleware. Handle it yourself. (Note: by default, responses are handed to the def parse() function for processing.)
12. Spider: (after processing the data, for URLs that need follow-up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the item data I obtained.
13. Engine: Hi! Pipeline, I have an item here; please handle it for me! Scheduler! Here is a URL that needs follow-up; please handle it. Then the loop starts again from step 4, until the boss has all the information needed.
14. Pipeline & Scheduler: OK, doing it right now!

Note: The whole program stops only when the scheduler has no requests left to process. (URLs that failed to download will also be downloaded again by Scrapy.)
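The re-download behaviour mentioned in the note is handled by Scrapy's built-in RetryMiddleware. The setting names below are real Scrapy settings; the values shown are just examples, close to the defaults:

    # settings.py
    RETRY_ENABLED = True   # retry failed downloads
    RETRY_TIMES = 2        # extra attempts per failed request
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]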

4, The four steps of writing a Scrapy crawler
    • Step 1: Create a new project (scrapy startproject XXX): create a new crawler project.
    • Step 2: Define the targets (write items.py): identify the data you want to crawl (a sketch follows this list).
    • Step 3: Make the spider (spiders/xxspider.py): write the spider and start crawling the pages.
    • Step 4: Store the content (pipelines.py): design the pipeline that stores the crawled content.
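Steps 1 and 2 might look like the sketch below; xxx, ArticleItem, and the field names are placeholders, while scrapy.Item and scrapy.Field are the real API. Step 3 is the spider itself (see the sketch in section 2), and step 4 is the pipeline sketch shown earlier:

    # Step 1, on the command line (xxx is a placeholder project name):
    #   scrapy startproject xxx

    # Step 2, items.py -- declare the fields you plan to crawl.
    import scrapy

    class ArticleItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
        body = scrapy.Field()

The spider is then run with "scrapy crawl <spider name>" from inside the project directory.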
