Scrapy Crawler Framework Tutorial (i)--Introduction to Scrapy


Blog post address: Scrapy Crawler Framework Tutorial (i) – Scrapy Introduction

I have been a Python programmer for three months now, and in that time I have used the Scrapy crawler framework to write more than 200 crawlers. I can't claim to have mastered Scrapy, but I am reasonably familiar with it. I plan to write a series of Scrapy crawler tutorials: on the one hand, writing them out will help me consolidate and organize what I have learned; on the other hand, I have benefited from other people's blog tutorials myself, and I hope this series can likewise help people who want to learn Scrapy.
Scrapy Introduction

Scrapy is an application framework for crawling websites and extracting structured data. It can be used in a wide range of applications, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

Architecture Overview

The Role of Each Component

Scrapy Engine

The engine is responsible for controlling the flow of data through all components of the system and for triggering events when certain actions occur. See the Data Flow section below for details.

This component is the "brain" of the crawler, the dispatch center of the entire crawler.

Scheduler

The scheduler accepts requests from the engine and enqueues them, so that it can supply them back to the engine later when the engine asks for them.

The initial crawl URLs, as well as subsequent URLs extracted from pages, are placed in the scheduler and wait to be crawled. At the same time, the scheduler automatically filters out duplicate URLs (if a particular URL should not be deduplicated, for example the URL of a POST request, this can be achieved through a request setting, as sketched below).
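As a minimal sketch of that setting (not from the original post; the spider name and URLs are placeholders), Scrapy requests accept a dont_filter argument that tells the scheduler's duplicate filter to let a request through:

import scrapy


class DontFilterExampleSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate dont_filter
    name = 'dont_filter_example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Ask the scheduler not to drop this request as a duplicate,
        # e.g. when the same URL must be fetched again.
        yield scrapy.Request('https://example.com/form',
                             callback=self.parse_form,
                             dont_filter=True)

    def parse_form(self, response):
        pass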

Downloader

The downloader is responsible for fetching page data and providing it to the engine, which then provides it to the spider.

Spiders

A spider is a class written by a Scrapy user to parse responses and extract items (that is, the scraped data) or additional URLs to follow. Each spider is responsible for handling one specific website (or several).

Item Pipeline

The item pipeline is responsible for processing the items extracted by the spider. Typical tasks are cleanup, validation, and persistence (for example, storing the item in a database).

Once the data the crawler needs has been extracted from a page into an item, the item is sent to the item pipeline, where it is processed by several components in a specific order and then stored in a local file or a database.
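To make that concrete, here is a minimal, hypothetical pipeline sketch (not from the original post); it assumes the item carries a title field and simply cleans and validates it before passing it on. It would be enabled through the ITEM_PIPELINES setting in settings.py.

from scrapy.exceptions import DropItem


class TitleCleanupPipeline(object):
    # Hypothetical pipeline: strips whitespace from 'title' and drops empty items.

    def process_item(self, item, spider):
        title = item.get('title', '')
        if not title.strip():
            # Discard items that carry no useful data
            raise DropItem('Missing title in %s' % item)
        item['title'] = title.strip()
        return item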

Downloader Middleware (Downloader Middlewares)

The downloader middleware consists of specific hooks that sit between the engine and the downloader, processing the requests the engine passes to the downloader and the responses the downloader passes back to the engine. It provides an easy mechanism to extend Scrapy's functionality by inserting custom code.

By setting up downloader middleware, the crawler can, for example, automatically rotate the User-Agent, switch proxy IPs, and so on, as sketched below.
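Here is a minimal, hypothetical downloader middleware sketch (not from the original post) that picks a random User-Agent for each outgoing request; the agent strings are placeholders, and the middleware would be enabled through the DOWNLOADER_MIDDLEWARES setting.

import random


class RandomUserAgentMiddleware(object):
    # Hypothetical middleware: sets a random User-Agent on every request.

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request reaches the downloader
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # returning None lets the request continue through the chain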

Spider Middleware (Spider Middlewares)

The spider middleware consists of specific hooks that sit between the engine and the spider, processing the spider's input (responses) and output (items and requests). It provides an easy mechanism to extend Scrapy's functionality by inserting custom code.
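For completeness, a minimal, hypothetical spider middleware sketch (not part of the original post) could filter the spider's output like this; it assumes items are yielded as dicts with a title key and would be enabled through the SPIDER_MIDDLEWARES setting.

class DropEmptyTitleMiddleware(object):
    # Hypothetical spider middleware: filters the items a spider yields.

    def process_spider_output(self, response, result, spider):
        for element in result:
            # 'result' mixes extracted items and follow-up requests;
            # only inspect dict items and pass everything else through untouched.
            if isinstance(element, dict) and not element.get('title'):
                continue
            yield element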

Data Flow

1. The engine opens a website (opens a domain), finds the spider that handles that site, and asks that spider for the first URLs to crawl.

2. The engine gets the first URLs to crawl from the spider and schedules them with the Scheduler.

3. The engine asks the scheduler for the next URL to crawl.

4. The scheduler returns the next URL to crawl to the engine, and the engine sends it to the Downloader through the downloader middleware (request direction).

5. Once the page has finished downloading, the downloader generates a Response for the page and sends it to the engine through the downloader middleware (response direction).

6. The engine receives the response from the downloader and sends it to the spider for processing through the spider middleware (input direction).

7. The spider processes the response and returns the scraped items and (follow-up) new requests to the engine.

8. The engine passes the items (returned by the spider) to the item pipeline and the requests (returned by the spider) to the scheduler.

9. The process repeats (from step 2) until there are no more requests in the scheduler, and the engine closes the website.

Creating a Scrapy Crawler Project

Create a project

Before you start scraping, you must first create a new Scrapy project. Taking a crawl of my blog as the example, go to the directory where you intend to store your code and run the following command:

scrapy startproject scrapyspider

The command will create a scrapyspider directory that contains the following:

scrapyspider/
    scrapy.cfg
    scrapyspider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

scrapy.cfg: the project's configuration file.
scrapyspider/: the project's Python module; you will add your code here.
scrapyspider/items.py: the project's item definitions.
scrapyspider/pipelines.py: the project's pipelines.
scrapyspider/settings.py: the project's settings file.
scrapyspider/spiders/: the directory where spider code is placed.
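Since items.py is mentioned above, here is a hedged sketch of what a minimal item definition for this blog crawl might look like (the class and field names are assumptions, not part of the generated template):

import scrapy


class BlogPostItem(scrapy.Item):
    # One field per piece of data we want to collect from the blog
    title = scrapy.Field()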

Write the First Spider

A spider is a class that the user writes to scrape data from a single website (or from several websites).

It defines the initial URLs to download, how to follow links in the pages, and how to parse page content to extract items.

To create a spider, you must subclass scrapy.Spider and define the following three attributes:

name: used to identify the spider. The name must be unique; you may not give different spiders the same name.
start_urls: a list of URLs the spider starts crawling from. The first pages fetched will come from this list; subsequent URLs are extracted from the data in those initial responses.
parse(): a method of the spider. When called, the Response object generated for each initial URL, once its download completes, is passed in as the only argument. The method is responsible for parsing the returned data (the response), extracting data (generating items), and generating Request objects for URLs that need further processing.

Below is our first spider's code, saved as blog_spider.py in the scrapyspider/spiders directory:

from scrapy.spiders import Spider


class BlogSpider(Spider):
    name = 'woodenrobot'
    start_urls = ['https://woodenrobot.me']

    def parse(self, response):
        # Extract the text of every post title link on the page
        titles = response.xpath("//a[@class='post-title-link']/text()").extract()
        for title in titles:
            print(title.strip())

Start the crawler

Open a terminal (cmd) in the project folder and run the following command:

scrapy crawl woodenrobot

Once the crawler starts, you will see the titles of all the articles on the current page printed to the console.
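As a small, hedged extension connecting this back to the Item Pipeline section above (not part of the original post), the same parse() logic could yield dict items instead of printing them, so that the scraped titles flow through the pipeline; the spider name and dict key here are assumptions:

from scrapy.spiders import Spider


class BlogItemSpider(Spider):
    # Hypothetical variant of the spider above that yields items
    name = 'woodenrobot_items'
    start_urls = ['https://woodenrobot.me']

    def parse(self, response):
        # Yield one item per post title; yielded items are handed to the item pipeline
        for title in response.xpath("//a[@class='post-title-link']/text()").extract():
            yield {'title': title.strip()}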

PS: This tutorial covers only this simple introduction; there is still a lot I haven't worked out yet. Look forward to the more substantial posts that follow.

Reference Articles

Scrapy official documentation (Chinese)
