reproduced from: http://blog.csdn.net/pleasecallmewhy/article/details/19642329
(It is recommended that everyone also read the official Scrapy tutorial.)
We'll use the dmoz.org site as a small example to show off our crawling skills.
First you have to answer a question.
Q: How many steps does it take to feed a website to a crawler?
The answer is simple, four steps:
1. New Project: create a new crawler project
2. Clear Objectives (Items): define the targets you want to crawl
3. Make Spider: write a spider and start crawling web pages
4. Store Content (Pipeline): design a pipeline to store the crawled content
OK, now that the basic process is OK, you can do it step-by-step.
1. New Project
In an empty directory, hold down the Shift key, right-click and select "Open Command Window Here", then enter the command:
```
scrapy startproject tutorial
```
Where tutorial is the project name.
You can see that a tutorial folder will be created with the following directory structure:
```
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```
Here's a brief introduction to the role of each file:

- scrapy.cfg: the project's configuration file
- tutorial/: the project's Python module; you'll import your code from here
- tutorial/items.py: the project's items file
- tutorial/pipelines.py: the project's pipelines file
- tutorial/settings.py: the project's settings file
- tutorial/spiders/: the directory where spiders are stored
2. Clear Objectives (Items)
In Scrapy, items are containers used to hold crawled content. They work somewhat like a Python dict, but provide some additional protection that reduces errors, such as rejecting undeclared fields.
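To see why that extra protection matters, here is a minimal sketch of the idea in plain Python (this is illustrative only, not Scrapy's actual implementation): a plain dict silently accepts a misspelled key, while a field-checked container raises an error immediately.

```python
# A plain dict accepts any key, so a typo goes unnoticed:
plain = {}
plain["titel"] = "oops"  # misspelled "title", silently stored

# A field-checked container (hypothetical GuardedDict, not part of Scrapy)
# rejects keys that were not declared up front:
class GuardedDict(dict):
    ALLOWED = {"title", "link", "desc"}  # hypothetical declared fields

    def __setitem__(self, key, value):
        if key not in self.ALLOWED:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

guarded = GuardedDict()
guarded["title"] = "ok"        # declared field: accepted
try:
    guarded["titel"] = "oops"  # same typo: rejected at once
except KeyError as err:
    print("caught:", err)
```

Catching the typo at assignment time, rather than discovering a missing field much later in a pipeline, is the "protection" the text refers to.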
In general, an item is created from the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects (which you can think of as something like an ORM mapping).
Next, we start to build the item model.
First of all, we want the following fields: title, link (url), and description (desc).
Modify the items.py file in the tutorial directory, adding our own class after the original one.
Because we want to capture content from the dmoz.org website, we can name it DmozItem:
```python
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
```
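With Scrapy installed, DmozItem above can be populated and read just like a dict. The sketch below illustrates that usage with a tiny stand-in for scrapy.item.Item and Field (only enough of the API to run the example, not Scrapy's real implementation); with Scrapy installed you would simply `from scrapy.item import Item, Field` and keep the rest unchanged.

```python
# Stand-in classes, just enough to illustrate the declare-then-populate pattern.
# With Scrapy installed, replace these two classes with:
#   from scrapy.item import Item, Field

class Field(dict):
    pass

class Item(dict):
    def __setitem__(self, key, value):
        # Only keys declared as Field() attributes on the class are accepted.
        if not isinstance(getattr(type(self), key, None), Field):
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

# Populate the item field by field, exactly as a spider would:
item = DmozItem()
item["title"] = "Python Resources"
item["link"] = "http://www.dmoz.org/Computers/Programming/Languages/Python/"
item["desc"] = "Links about the Python language"
print("title =", item["title"])
```

Later, the spider will fill in these fields from the pages it crawls, and the pipeline will receive the populated items for storage.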