Crawling data with XPath parsing is fairly simple; what is troublesome is following and recursing through the URLs. I put this off for a long time. Douban was much nicer: its URLs are so regular. Alas, Amazon's URLs are a mess... perhaps I just don't understand them well enough.
Amazon
├── amazon
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── msic
│   │   ├── __init__.py
│   │   └── pad_urls.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── pad_spider.py
│       └── pad_spider.pyc
├── pad.xml
└── scrapy.cfg
(1) items.py
from scrapy import Item, Field

class PadItem(Item):
    sno = Field()
    price = Field()
(2) pad_spider.py
# -*- coding: utf-8 -*-
from scrapy import Spider, Selector
from amazon.items import PadItem

class PadSpider(Spider):
    name = "pad"
    allowed_domains = ["amazon.com"]
    start_urls = []
    u1 = 'http://www.amazon.cn/s/ref=sr_pg_'
    u2 = '?rh=n%3A2016116051%2Cn%3A!2016117051%2Cn%3A888465051%2Cn%3A106200071&page='
    u3 = '&ie=UTF8&qid=1408641827'
    for i in range(181):
        url = u1 + str(i + 1) + u2 + str(i + 1) + u3
        start_urls.append(url)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="rsltGrid prod celwidget"]')
        items = []
        for site in sites:
            item = PadItem()
            item['sno'] = site.xpath('@name').extract()[0]
            try:
                item['price'] = site.xpath('ul/li/div/a/span/text()').extract()[0]
            # An IndexError means the price sits under a different layout
            # (e.g. a newly listed item), so fall back to the other XPath.
            except IndexError:
                item['price'] = site.xpath('ul/li/a/span/text()').extract()[0]
            items.append(item)
        return items
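The key trick in the spider above is that the page number appears twice in each Amazon URL, once in the path and once in the `page=` query parameter. The loop that builds `start_urls` can be exercised on its own, outside Scrapy, to check this:

```python
# Standalone sketch of the start_urls construction used in pad_spider.py.
u1 = 'http://www.amazon.cn/s/ref=sr_pg_'
u2 = '?rh=n%3A2016116051%2Cn%3A!2016117051%2Cn%3A888465051%2Cn%3A106200071&page='
u3 = '&ie=UTF8&qid=1408641827'

start_urls = []
for i in range(181):
    # The same page number is inserted into both the path and the page= parameter.
    start_urls.append(u1 + str(i + 1) + u2 + str(i + 1) + u3)

print(len(start_urls))   # 181 listing pages in total
print(start_urls[0])
```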
(3) settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for amazon project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'amazon'

SPIDER_MODULES = ['amazon.spiders']
NEWSPIDER_MODULE = 'amazon.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'amazon (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

FEED_URI = 'pad.xml'
FEED_FORMAT = 'xml'
(4) The results, written to pad.xml
<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <sno>B00JWCIJ78</sno>
    <price>¥3199.00</price>
  </item>
  <item>
    <sno>B00E907DKM</sno>
    <price>¥3079.00</price>
  </item>
  <item>
    <sno>B00L8R7HKA</sno>
    <price>¥3679.00</price>
  </item>
  <item>
    <sno>B00IZ8W4F8</sno>
    <price>¥3399.00</price>
  </item>
  <item>
    <sno>B00MJMW4BU</sno>
    <price>¥4399.00</price>
  </item>
  <item>
    <sno>B00HV7KAMI</sno>
    <price>¥3799.00</price>
  </item>
  ...
</items>
(5) Saving the data to a database
...
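The original post leaves this section empty, so as a hypothetical sketch only: saving each scraped item to a database would be done in pipelines.py (already present in the project tree) via a Scrapy item pipeline. The class name, database file, and table name below are my own inventions, not the author's code:

```python
# Hypothetical sketch of pipelines.py: write each item to a SQLite table.
# The names SQLitePipeline, pad.db, and pad are illustrative assumptions.
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts; create the table if needed.
        self.conn = sqlite3.connect("pad.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pad (sno TEXT PRIMARY KEY, price TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields; upsert by product number.
        self.conn.execute(
            "INSERT OR REPLACE INTO pad (sno, price) VALUES (?, ?)",
            (item["sno"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

To activate it, the pipeline would also have to be registered under the ITEM_PIPELINES setting in settings.py, pointing at `amazon.pipelines.SQLitePipeline`.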
--August 22, 2014 04:12:43