Scrapy Getting Started

Scrapy mainly consists of the following components:

Engine: processes the data flow of the whole system and triggers events.

Scheduler: accepts the requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again.

Downloader: downloads web content and returns the page contents to the spider.

Spider: the main workhorse; it defines the parsing rules for a specific domain or set of pages.

Item Pipeline: responsible for processing the items the spider extracts from pages; its main tasks are cleaning, validating and storing the data. After a page is parsed by the spider, the resulting items are sent to the pipeline and pass through its stages in a fixed order.

Downloader middleware: a hook framework between the Scrapy engine and the downloader; it mainly processes the requests and responses passing between them.

Spider middleware: a hook framework between the Scrapy engine and the spider; it mainly processes the spider's response input and request output.

Scheduler middleware: middleware between the Scrapy engine and the scheduler; it handles the requests and responses passing between the engine and the scheduler.
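Pipelines and middlewares like these are enabled in the project's settings.py. Below is a minimal sketch, assuming hypothetical class names (TutorialPipeline, TutorialDownloaderMiddleware, TutorialSpiderMiddleware); the numbers are priorities that control ordering:

# settings.py (sketch) -- the class paths below are hypothetical placeholders;
# point them at the pipeline/middleware classes defined in your own project.
ITEM_PIPELINES = {
    "tutorial.pipelines.TutorialPipeline": 300,
}
DOWNLOADER_MIDDLEWARES = {
    "tutorial.middlewares.TutorialDownloaderMiddleware": 543,
}
SPIDER_MIDDLEWARES = {
    "tutorial.middlewares.TutorialSpiderMiddleware": 543,
}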


Create a Scrapy project:

Before you start crawling, you must create a new Scrapy project.

Enter the directory where you intend to store the code and run the following command:

scrapy startproject tutorial

The command will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

scrapy.cfg: the project's configuration file.

tutorial/: the project's Python module. You will add your code here later.

tutorial/items.py: the items file of the project.

tutorial/pipelines.py: the pipelines file of the project.

tutorial/settings.py: the settings file of the project.

tutorial/spiders/: the directory where the spider code is placed.
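For reference, scrapy.cfg is a small INI-style file that mainly tells Scrapy which settings module the project uses; the file generated by startproject typically looks roughly like this (details may vary between Scrapy versions):

[settings]
default = tutorial.settings

[deploy]
project = tutorial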


Define Item

An Item is a container for the scraped data. It works much like a Python dictionary, but additionally protects against the errors that typos would otherwise cause by refusing to populate undefined fields.

• Much as you would with an ORM, you define an Item by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field, as in the sketch below.
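A minimal sketch of such an Item class; the field names (title, link, desc) are illustrative placeholders for whatever data you actually want to scrape:

# items.py -- minimal Item sketch; the field names are illustrative placeholders.
import scrapy

class TutorialItem(scrapy.Item):
    title = scrapy.Field()  # e.g. page title
    link = scrapy.Field()   # e.g. page URL
    desc = scrapy.Field()   # e.g. short description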


[Figure: Scrapy architecture diagram]

Function of each component:

Scrapy Engine: the engine is responsible for the communication, signalling and data transfer between the Spiders, Item Pipeline, Downloader and Scheduler.

Scheduler: it accepts the requests sent by the engine, arranges them in a certain order, enqueues them, and hands them back when the Scrapy engine asks for them. Simply put, it accepts requests from the engine, puts them into the queue, and returns them when the engine requests them.

Downloader: it downloads all the requests sent by the Scrapy engine, fetches the page data and returns the responses it obtains to the Scrapy engine, which then hands them to the Spiders for processing.

Spiders: they handle all responses, extract data from them to fill the item fields, and submit the URLs that need to be followed up to the engine, which puts them into the Scheduler again.

Item Pipeline: it is responsible for processing the items obtained from the Spiders, e.g. deduplication and persistent storage (saving to a database, writing to files; in short, saving the data). A minimal pipeline sketch follows this list.

Downloader Middlewares: can be seen as a component for customizing and extending the download functionality.

Spider Middlewares: can be understood as a functional component for customizing and extending the 'communication' between the engine and the Spiders (for example, the responses going into the Spiders and the requests coming out of the Spiders).
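As a concrete illustration of the Item Pipeline role described above, here is a minimal pipeline sketch that appends each item to a JSON Lines file; the class name JsonWriterPipeline and the file name items.jl are made up for the example, and the pipeline still has to be enabled in ITEM_PIPELINES as shown earlier:

# pipelines.py -- minimal pipeline sketch; class and file names are illustrative.
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider yields; must return the item
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item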


The flow of data through Scrapy:

  Plain-language version:

When the program runs:

Engine: Hi! Spider, which website do you want to work on?

Spiders: I need to handle 23wx.com.

Engine: Give me the first URL that needs to be processed.

Spiders: Here you are, the first URL is xxxxxxx.com.

Engine: Hi! Scheduler, I have a request here; please help me queue it up.

Scheduler: OK, I'm on it. Wait a moment.

Engine: Hi! Scheduler, give me the request you have queued.

Scheduler: Here you are, this is the request I have queued.

Engine: Hi! Downloader, please download this request for me according to the downloader middleware settings.

Downloader: OK! Here you are, this is the downloaded result. (If it fails: Sorry, this request failed to download. The engine then tells the Scheduler: this request failed to download, record it and we will download it again later.)

Engine: Hi! Spiders, this is the downloaded result, and it has already been processed by the spider middleware. You handle it. (Note: by default the responses here are handed to the def parse function for processing.)

Spiders: (after processing the data and finding URLs that need to be followed up) Hi! Engine, these are the URLs I need to follow up; their responses should be handled by the function def xxxx(self, response). And these are the items I extracted.

Engine: Hi! Item Pipeline, I have an item here; please process it for me! Scheduler, these are the URLs I need you to queue for me. The loop then restarts from the fourth step until all the needed information has been obtained.

Note: the whole program stops only when there are no more requests in the scheduler (URLs that failed to download will be downloaded again by Scrapy).

In summary, the process is:

1. The core engine gets the initial URLs from the spider and generates request tasks that go into the scheduler's queue.

2. The engine asks the scheduler for the next request to crawl and forwards it to the downloader.

3. The downloader fetches the page and returns a response to the engine.

4. The engine forwards the response to the spider, which extracts data and looks for new follow-up URLs.

5. The engine dispatches the results: extracted data goes to the Item Pipeline, and new follow-up requests go to the scheduler.

6. The process loops back to step 2 and repeats until the scheduler has no more tasks to process.
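The flow above maps onto a minimal spider like the following sketch; the spider name, start URL and CSS selectors are hypothetical placeholders:

# spiders/example.py -- minimal spider sketch; name, URL and selectors are placeholders.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # extract data from the downloaded page (step 4)
        for entry in response.css("div.entry"):  # hypothetical selector
            yield {
                "title": entry.css("a::text").get(),
                "link": entry.css("a::attr(href)").get(),
            }
        # hand follow-up URLs back to the engine, which queues them in the scheduler
        next_page = response.css("a.next::attr(href)").get()  # hypothetical selector
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running scrapy crawl example from the project directory then drives the loop described above until the scheduler runs out of requests.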


This article is from the "Road Crazy" blog; please keep this source: http://adonislxf.blog.51cto.com/11770740/1882393
