Scrapy Study Notes


A web crawler is, loosely speaking, a program that fetches data from around the Internet, either broadly or in a targeted way. A more precise description is that it captures the HTML of a specific website's pages. Because a site has many pages and we cannot know all of their URLs in advance, making sure every HTML page of the site gets crawled takes some thought. The usual approach is to define an entry page; since a page normally contains links to other pages, the URLs found on the current page are added to the crawler's crawl queue, and after entering each new page the same operation is performed recursively. In effect, it is the same as a depth-first or breadth-first traversal.
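As a rough sketch of that queue-based traversal (illustrative only, not Scrapy code; fetch_page and extract_links are hypothetical helpers you would supply, for example with an HTML parser):

from collections import deque

def crawl(entry_url, fetch_page, extract_links):
    # Breadth-first traversal of the site, starting from the entry page.
    seen = set([entry_url])
    queue = deque([entry_url])
    while queue:
        url = queue.popleft()                  # next page in the crawl queue
        html = fetch_page(url)                 # download its HTML
        for link in extract_links(html, url):  # URLs found on the current page
            if link not in seen:               # skip pages already queued
                seen.add(link)
                queue.append(link)
    return seen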

Scrapy is a crawler framework written in pure Python on top of Twisted. Users can implement a crawler simply by customizing a few modules to capture web page content and images of all kinds. It is very convenient ~

Scrapy uses the asynchronous networking library Twisted to handle network communication. The architecture is clear and includes various middleware interfaces to flexibly meet all kinds of requirements. The figure below shows the overall architecture:

The green lines show the data flow. Starting from the initial URLs, the Scheduler hands them to the Downloader to download; after downloading, the pages are handed to the Spider for analysis. The Spider produces two kinds of results: one is links that need to be crawled further, such as the "next page" links mentioned earlier, which are sent back to the Scheduler; the other is the data to be saved, which is sent to the Item Pipeline, where it is post-processed (detailed analysis, filtering, storage, and so on). In addition, various middleware can be installed along the data flow channels to perform whatever processing is needed.
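For example, a pipeline is simply a class with a process_item method; the minimal sketch below (class name and comments are placeholders) shows where that filtering and storage would happen. It also has to be enabled through the ITEM_PIPELINES setting in settings.py before Scrapy will use it.

class ExamplePipeline(object):
    # Called once for every item the spider produces.
    def process_item(self, item, spider):
        # Detailed analysis / filtering of the item would go here,
        # followed by storage (writing to a file, database, etc.).
        return item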

1. Create a Scrapy project: scrapy startproject <project name>

 

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

T:\>scrapy startproject tutorial
T:\>

 

This command will create a new directory tutorial in the current directory. Its structure is as follows:

 

T:\tutorial>tree /f
Folder PATH listing
Volume serial number is 0006EFCF C86A:7C52
T:.
│  scrapy.cfg
│
└─tutorial
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py

 

These files are mainly:

 

  • scrapy.cfg: the project configuration file
  • tutorial/: the project's Python module; the code will be imported from here
  • tutorial/items.py: the project's items file
  • tutorial/pipelines.py: the project's pipelines file
  • tutorial/settings.py: the project's settings file
  • tutorial/spiders/: the directory where spiders are stored

Define Item

Items are the containers for the scraped data. They work like dictionaries in Python, but provide extra protection, such as raising an error when an undeclared field is filled in, which prevents typos.

An item is declared by creating a class that subclasses scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).

In the items.py file, define a scrapy.item.Item subclass whose fields are of type scrapy.item.Field:

from scrapy.item import Item, Field

class D1Item(Item):
    title = Field()
    link = Field()
    desc = Field()
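Once defined, the item is used like a dictionary; assigning to a field that was not declared raises a KeyError, which is the extra protection mentioned above. A quick illustrative use (the values are placeholders):

item = D1Item()
item['title'] = 'Some page title'      # declared field: allowed
item['link'] = 'http://example.com/'   # placeholder URL
# item['titel'] = '...'                # undeclared field: raises KeyError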

Define Spider

Spiders define the initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.

To create a spider, subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:

  • name: identifies the crawler. It must be unique; different crawlers must be given different names.
  • start_urls: the list of URLs the crawler starts crawling from. The first pages downloaded will be these URLs, and other URLs are derived from these starting ones.
  • parse(): the crawler's callback method. When it is called, the Response object returned for each URL is passed in as its only argument.

This method is responsible for parsing the returned data, matching the scraped data (parsing it into items), and following more URLs.
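Putting the three attributes together, a minimal spider might look like the sketch below (the URL is a placeholder, and the parse body simply saves each downloaded page to a file; a real spider would extract items and follow further links). It would be saved as a file under the spiders/ directory:

from scrapy.spider import BaseSpider

class D1Spider(BaseSpider):
    name = "d1"                    # unique crawler name
    start_urls = [
        "http://example.com/",     # placeholder starting URL
    ]

    def parse(self, response):
        # response holds the downloaded page; here we simply save its body
        filename = response.url.split("/")[-1] or "index.html"
        with open(filename, "wb") as f:
            f.write(response.body)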

 

 

 
