I. Create the project
scrapy startproject dmoz
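The command generates a project skeleton. For Scrapy releases of this era it looks roughly like the following (exact files can vary slightly by version):

dmoz/
    scrapy.cfg            # deploy configuration
    dmoz/                 # the project's Python module
        __init__.py
        items.py          # item definitions (step III)
        pipelines.py      # item pipelines (step IV)
        settings.py       # project settings
        spiders/          # dmoz_spider.py goes here (step II)
            __init__.py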
II. Create dmoz_spider.py
from scrapy.spider import Spider
from scrapy.selector import Selector

from dmoz.items import DmozItem


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below are a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)
        return items
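The @url and @scrapes lines in the docstring form a spider contract; Scrapy can verify them without running a full crawl:

scrapy check dmoz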
III. Rewrite items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class DmozItem(Item):
    name = Field()
    description = Field()
    url = Field()
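Because DmozItem subclasses Item, it behaves like a dictionary whose allowed keys are the declared fields; this is why the spider can assign item['name'] and so on. Note that .extract() and .re() always return lists, so each field holds a list of strings. A quick illustration, with hypothetical values:

from dmoz.items import DmozItem

item = DmozItem()
item['name'] = [u'Example Book']        # hypothetical; .extract() returns a list
item['url'] = [u'http://example.com/']
# item['author'] = u'X'   # would raise KeyError: DmozItem does not support field: author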
IV. Rewrite pipelines.py
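pipelines.py receives every item the spider returns. A minimal sketch is shown below; the name check is an illustrative assumption, since a pass-through pipeline (simply return item) is all this crawl strictly needs:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exceptions import DropItem


class DmozPipeline(object):
    """Sketch: discard entries without a name, pass the rest through."""

    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem("missing name in %s" % item)
        return item

To activate the pipeline, register the class in settings.py; depending on the Scrapy version this is a list or a dict, e.g. ITEM_PIPELINES = {'dmoz.pipelines.DmozPipeline': 300} (the module path assumes the project layout above).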
V. Run the crawl from the dmoz project root directory
scrapy crawl dmoz -o dmoz.json
This runs the spider and exports the scraped items to dmoz.json.
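Since every field was extracted as a list, the entries in dmoz.json are list-valued. The output has roughly this shape (the values here are placeholders, not real scraped data):

[
    {"name": ["Example Book Title"],
     "url": ["http://example.com/"],
     "description": ["- A short description of the site. "]}
]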