Scrapy Study Notes


A web crawler is, loosely speaking, a program that fetches data from around the Internet, either broadly or in a targeted way. A more precise description is that it captures the HTML of a specific website's pages. Because a site has many pages and we cannot know all of their URLs in advance, making sure every HTML page of the site gets crawled takes some thought. The usual approach is to define an entry page; since a page normally contains links to other pages, the URLs found on the current page are added to the crawler's crawl queue, and after entering each new page the same operation is performed recursively. In effect, it is the same as a depth-first or breadth-first traversal.
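As a rough sketch of that queue-based traversal (illustrative only, not Scrapy code; fetch_page and extract_links are hypothetical helpers you would supply, for example with an HTML parser):

from collections import deque

def crawl(entry_url, fetch_page, extract_links):
    # Breadth-first traversal of the site, starting from the entry page.
    seen = set([entry_url])
    queue = deque([entry_url])
    while queue:
        url = queue.popleft()                  # next page in the crawl queue
        html = fetch_page(url)                 # download its HTML
        for link in extract_links(html, url):  # URLs found on the current page
            if link not in seen:               # skip pages already queued
                seen.add(link)
                queue.append(link)
    return seen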

Scrapy is a crawler framework written in pure Python on top of Twisted. Users can implement a crawler simply by customizing a few modules to capture web page content and images of all kinds. It is very convenient ~

Scrapy uses the asynchronous networking library Twisted to handle network communication. The architecture is clear and includes various middleware interfaces to flexibly meet all kinds of requirements. The figure below shows the overall architecture:

The green lines show the data flow. Starting from the initial URLs, the Scheduler hands them to the Downloader to download; after downloading, the pages are handed to the Spider for analysis. The Spider produces two kinds of results: one is links that need to be crawled further, such as the "next page" links mentioned earlier, which are sent back to the Scheduler; the other is the data to be saved, which is sent to the Item Pipeline, where it is post-processed (detailed analysis, filtering, storage, and so on). In addition, various middleware can be installed along the data flow channels to perform whatever processing is needed.
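For example, a pipeline is simply a class with a process_item method; the minimal sketch below (class name and comments are placeholders) shows where that filtering and storage would happen. It also has to be enabled through the ITEM_PIPELINES setting in settings.py before Scrapy will use it.

class ExamplePipeline(object):
    # Called once for every item the spider produces.
    def process_item(self, item, spider):
        # Detailed analysis / filtering of the item would go here,
        # followed by storage (writing to a file, database, etc.).
        return item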

1. Create a Scrapy project: scrapy startproject <project name>

 

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

T:\>scrapy startproject tutorial
T:\>

 

This command will create a new directory tutorial in the current directory. Its structure is as follows:

 

T:\tutorial>tree /f
Folder PATH listing
Volume serial number is 0006EFCF C86A:7C52
T:.
│  scrapy.cfg
│
└─tutorial
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py

 

These files are mainly:

 

  • scrapy.cfg: the project configuration file
  • tutorial/: the project's Python module; the code will be imported from here
  • tutorial/items.py: the project's items file
  • tutorial/pipelines.py: the project's pipelines file
  • tutorial/settings.py: the project's settings file
  • tutorial/spiders/: the directory where spiders are stored

Define Item

Items are the containers for the scraped data. They work like dictionaries in Python, but provide extra protection, such as raising an error when an undeclared field is filled in, which prevents typos.

An item is declared by creating a class that subclasses scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).

In the items.py file, define a scrapy.item.Item subclass whose fields are of type scrapy.item.Field:

from scrapy.item import Item, Field

class D1Item(Item):
    title = Field()
    link = Field()
    desc = Field()
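Once defined, the item is used like a dictionary; assigning to a field that was not declared raises a KeyError, which is the extra protection mentioned above. A quick illustrative use (the values are placeholders):

item = D1Item()
item['title'] = 'Some page title'      # declared field: allowed
item['link'] = 'http://example.com/'   # placeholder URL
# item['titel'] = '...'                # undeclared field: raises KeyError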

Define Spider

Spiders define the initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.

To create a spider, subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:

  • name: identifies the crawler. It must be unique; different crawlers must be given different names.
  • start_urls: the list of URLs the crawler starts crawling from. The first pages downloaded will be these URLs, and other URLs are derived from these starting ones.
  • parse(): the crawler's callback method. When it is called, the Response object returned for each URL is passed in as its only argument.

This method is responsible for parsing the returned data, matching the scraped data (parsing it into items), and following more URLs.
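Putting the three attributes together, a minimal spider might look like the sketch below (the URL is a placeholder, and the parse body simply saves each downloaded page to a file; a real spider would extract items and follow further links). It would be saved as a file under the spiders/ directory:

from scrapy.spider import BaseSpider

class D1Spider(BaseSpider):
    name = "d1"                    # unique crawler name
    start_urls = [
        "http://example.com/",     # placeholder starting URL
    ]

    def parse(self, response):
        # response holds the downloaded page; here we simply save its body
        filename = response.url.split("/")[-1] or "index.html"
        with open(filename, "wb") as f:
            f.write(response.body)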

 

 

 
