1 Creating a Project
scrapy startproject tutorial
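Running this command creates a project skeleton. The generated layout looks roughly like this (file names follow Scrapy's standard project template):

tutorial/
    scrapy.cfg            # deployment configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (step 2)
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings
        spiders/          # spider code goes here (step 3)
            __init__.py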
2 Defining the Item
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
After the parser extracts data from a page, it is saved into these Item objects, which are then passed on to the item pipeline.
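Items behave like dictionaries. A minimal sketch of how a DmozItem is filled and read back (the values here are placeholders):

item = DmozItem()
item['title'] = 'Example title'
item['link'] = 'http://example.com/'
print(item['title'])   # 'Example title'
print(dict(item))      # convert to a plain dict, e.g. for serialization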
3 Writing the first crawler (spider). Save it as dmoz_spider.py in the tutorial/spiders directory; Scrapy discovers spiders in this directory, and a crawl is started using the name attribute defined in the class (see step 5).
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # each <li> under a <ul> holds one directory entry
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
start_urls sets the list of URLs the spider starts crawling from.
The parse method is called with the response after each page has been crawled; it extracts information from the page and saves it into the item dictionaries. Note that DmozItem is the class defined in step 2.
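Besides items, parse can also yield new requests so the spider follows links to further pages. A minimal sketch; the '//a[@class="next"]/@href' XPath is an illustrative assumption, not something taken from the pages above:

def parse(self, response):
    # ... extract and yield items as shown above ...
    for href in response.xpath('//a[@class="next"]/@href').extract():
        # response.urljoin resolves relative URLs against the current page
        yield scrapy.Request(response.urljoin(href), callback=self.parse)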
4 Pipeline
After an item is collected by the spider, it is passed to the item pipeline, where several components process it in a defined order. That order is configured in settings.py, as shown below.
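A sketch of the relevant settings.py entry, assuming the project is named tutorial and uses the JsonWriterPipeline defined below. The integer (in the 0-1000 range) determines the order; pipelines with lower values run first:

ITEM_PIPELINES = {
    'tutorial.pipelines.JsonWriterPipeline': 800,
}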
While processing an item, each pipeline decides whether to pass the data on to the next pipeline (by returning the item) or to discard it (by raising DropItem).
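For example, a minimal filtering pipeline that drops items with no title; the rule itself is hypothetical, for illustration only:

from scrapy.exceptions import DropItem

class RequireTitlePipeline(object):
    def process_item(self, item, spider):
        if not item.get('title'):
            # stop processing: later pipelines never see this item
            raise DropItem("missing title in %s" % item)
        return item    # pass the item on to the next pipeline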
import json

class JsonWriterPipeline(object):
    def __init__(self):
        # one JSON object per line (JSON Lines format)
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
5 Starting the Crawler
scrapy crawl dmoz
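If you only need the scraped items written to a file, Scrapy's feed exports can do this directly from the command line, without a custom pipeline:

scrapy crawl dmoz -o items.json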