[Python] Web Crawler (12): A First Spider Example with the Scrapy Crawler Framework


Reproduced from: http://blog.csdn.net/pleasecallmewhy/article/details/19642329

(It's recommended that everyone also read the official tutorial on the Scrapy website.)


We'll use the dmoz.org site as a small target to show off a bit of crawling skill.


First, you have to answer a question.

Q: How many steps does it take to crawl a website with a spider?

The answer is simple, just four:

1. New Project (Project): create a new crawler project
2. Clear Objectives (Items): define the targets you want to crawl
3. Make a Spider (Spider): write a spider to start crawling web pages
4. Store Content (Pipeline): design a pipeline to store the crawled content


OK, now that the basic process is clear, we can go through it step by step.


1. New Project (Project)

In an empty directory, hold down the Shift key, right-click, select "Open Command Window Here," and enter the command:

scrapy startproject tutorial

Where tutorial is the project name.

You can see that a tutorial folder will be created with the following directory structure:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...


Here's a brief introduction to the role of each file:

- scrapy.cfg: the project's configuration file
- tutorial/: the project's Python module; you'll import your code from here later
- tutorial/items.py: the project's items file
- tutorial/pipelines.py: the project's pipelines file
- tutorial/settings.py: the project's settings file
- tutorial/spiders/: the directory where spiders are stored
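For reference (this snippet is not from the original post), the generated scrapy.cfg is tiny; it essentially just points Scrapy at the project's settings module, roughly like this (exact contents vary by Scrapy version):

# scrapy.cfg -- tells Scrapy which settings module this project uses
[settings]
default = tutorial.settings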


2. Clear Objectives (Item)

In Scrapy, items are the containers used to hold crawled content. They work somewhat like a dict in Python, but provide some additional protection that reduces errors.

In general, an item is created by subclassing the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects (you can think of this as something like an ORM mapping).

Next, we start to build the item model.

First of all, we want the following fields: name (title), link (URL), and description (desc).


Modify the items.py file in the tutorial directory, and add our own class after the original class.

Because we want to capture content from the dmoz.org website, we can name it DmozItem:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
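As a quick, hypothetical illustration (not part of the original tutorial) of the dict-like behavior described above: field values are written and read like dictionary entries, while assigning to a field that was never declared raises a KeyError, which is the "additional protection" mentioned earlier.

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

# An Item is created and accessed much like a dict
item = DmozItem(title='Open Directory', link='http://www.dmoz.org/')
item['desc'] = 'A human-edited directory of the web.'
print(item['title'])  # Open Directory

# Unlike a plain dict, an undeclared field is rejected
try:
    item['author'] = 'me'
except KeyError as e:
    print(e)  # DmozItem does not support field: author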
