Scrapy:
OS: win7
Python: 2.7
First, install Scrapy. Running easy_install scrapy is itself easy; the hard part is installing its many dependencies. The Windows installation instructions are at http://doc.scrapy.org/en/0.16/intro/install.html
If compiling the dependencies fails, or you don't want to install that much Windows build tooling, download precompiled packages from http://www.lfd.uci.edu/~gohlke/pythonlibs/ and install those instead.
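Once the install finishes, a quick sanity check is to ask the command-line tool that ships with the package for its version:
scrapy version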
Step 1: Create a project
scrapy startproject tutorial
scrapy.cfg: the project configuration file
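For reference, the generated project should look roughly like this (the 0.16 layout; only scrapy.cfg lives outside the package):
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py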
Step 2: Create an item
Create an item in items.py:
# coding: utf8
from scrapy.item import Item, Field

# An Item stores the scraped content, much like a dictionary
class DmozItem(Item):
    """Item model, similar to an ORM model"""
    title = Field()
    link = Field()
    desc = Field()
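A quick sketch of how the item behaves, assuming the class above: fields are read and written exactly like dictionary keys, but only declared fields are allowed.
item = DmozItem()
item["title"] = ["Example Book"]  # values are lists, since extract() returns lists
print item["title"]
# item["author"] = "x" would raise KeyError: "author" is not a declared field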
Step 3: First crawler
The crawler class inherits from scrapy.spider.BaseSpider and defines three things:
name, start_urls, and the parse() method
How do we parse the page content? Use the XPath selector (HtmlXPathSelector) with XPath expressions.
W3School has some XPath learning material.
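A handy way to experiment with XPath before writing the spider is the interactive scrapy shell; a minimal sketch, assuming the first URL from start_urls below (in 0.16 the shell pre-builds an hxs selector for you):
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
# inside the shell:
hxs.select('//title/text()').extract()   # page title, as a list of strings
hxs.select('//ul/li/a/@href').extract()  # every link href under ul/li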
Create a d1_spider.py file in the spiders directory.
#coding=utf8
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    """spider"""
    # name must be unique
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    # override the parse method
    def parse(self, response):
        #filename = response.url.split("/")[-2]
        #open(filename, "wb").write(response.body)
        # use an XPath selector over the response
        hxs = HtmlXPathSelector(response)
        # extract all li children of every ul tag
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item["title"] = site.select('a/text()').extract()
            item["link"] = site.select('a/@href').extract()
            item["desc"] = site.select('text()').extract()
            items.append(item)
        return items
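Since parse() may also be a generator, the same loop can yield each item as it is built instead of collecting a list; a minimal variant of the method above:
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ul/li'):
            item = DmozItem()
            item["title"] = site.select('a/text()').extract()
            item["link"] = site.select('a/@href').extract()
            item["desc"] = site.select('text()').extract()
            yield item  # hand each item to the engine as soon as it is ready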
Then run the crawler from the directory containing scrapy.cfg:
scrapy crawl dmoz -o items.json -t json
An items.json file now appears in the project root; it holds the scraped data.
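To double-check the export, the file is a plain JSON list and can be loaded straight back; a quick sketch, assuming the crawl captured at least one item:
import json

with open("items.json") as f:
    items = json.load(f)

print len(items), "items scraped"
print items[0]["title"]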
This is the simplest form of the crawl-parse-store workflow for a crawler.
Reference: the official Scrapy tutorial (http://doc.scrapy.org/en/0.16/intro/tutorial.html)