(1) Create Scrapy Project
scrapy startproject getblog
(2) Edit items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()
(3) Create blog_spider.py under the spiders folder
!! You need to be familiar with XPath selection, which is similar to jQuery selectors but not as comfortable to use.
W3school Tutorial: http://www.w3school.com.cn/xpath/
# coding=utf-8
from scrapy.spider import Spider
from scrapy.selector import Selector
from getblog.items import BlogItem


class BlogSpider(Spider):
    # identifying name
    name = 'blog'
    # start address
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)  # XPath selector
        # Select the contents of every div whose class attribute is
        # 'post_item', then the 2nd div inside each
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # Select the text content of the a tag under the h3 tag: 'text()'
            item['title'] = site.xpath('h3/a/text()').extract()
            # Likewise, the text content of the p tag: 'text()'
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
(4) Run
scrapy crawl blog
(5) Output file.
The output is configured in settings.py.
# Output file location
FEED_URI = 'blog.xml'
# Output file format; can be json, xml, or csv
FEED_FORMAT = 'xml'
The output location is under the project root folder.
--August 20, 2014 05:51:46