Python crawler (3): the Scrapy framework, a simple application

Source: Internet
Author: User
Tags: xpath

(1) Create Scrapy Project

scrapy startproject getblog

(2) Edit items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()

(3) In the spiders folder, create blog_spider.py

Note: you need to be familiar with XPath selection. It is similar to jQuery selectors, though less comfortable to use.

W3school Tutorial: http://www.w3school.com.cn/xpath/
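To get a feel for XPath before writing the spider, the sketch below runs similar expressions with Python's standard xml.etree.ElementTree module. ElementTree supports only a subset of XPath (for example, it has no text() step, so the .text attribute is read instead), and the sample markup and the extract_posts helper are made up for illustration. They are only modelled on the post_item structure that the spider targets.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample shaped like the listing the spider targets.
SAMPLE = """
<html><body>
  <div class="post_item">
    <div class="digg">12</div>
    <div class="post_item_body">
      <h3><a href="/p/1">First post</a></h3>
      <p class="post_item_summary">Summary of the first post</p>
    </div>
  </div>
  <div class="post_item">
    <div class="digg">3</div>
    <div class="post_item_body">
      <h3><a href="/p/2">Second post</a></h3>
      <p class="post_item_summary">Summary of the second post</p>
    </div>
  </div>
</body></html>
"""


def extract_posts(markup):
    """Return a list of {'title', 'desc'} dicts found in the markup."""
    root = ET.fromstring(markup)
    posts = []
    # 2nd child div of every div whose class attribute is 'post_item'
    for body in root.findall(".//div[@class='post_item']/div[2]"):
        # ElementTree has no text() step; read the .text attribute instead
        title = body.find("h3/a").text
        desc = body.find("p[@class='post_item_summary']").text
        posts.append({"title": title, "desc": desc})
    return posts


posts = extract_posts(SAMPLE)
for post in posts:
    print(post["title"], "-", post["desc"])
```

Scrapy's own selectors accept full XPath and tolerate real-world HTML, so this is only a way to experiment with the path syntax offline.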

# coding=utf-8
from scrapy import Spider
from scrapy.selector import Selector

from getblog.items import BlogItem


class BlogSpider(Spider):
    # Spider name, used by "scrapy crawl"
    name = 'blog'
    # Start address
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)  # XPath selector
        # Select every div whose class attribute is 'post_item',
        # then take all content of its 2nd child div
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # Text content of the a tag under the h3 tag: 'text()'
            item['title'] = site.xpath('h3/a/text()').extract()
            # Likewise, text content of the p tag: 'text()'
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
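Note that extract() returns a list of strings, so each field of an item holds a list rather than a single string, and the text often carries surrounding whitespace. A minimal sketch of one way to tidy such values is shown here. The clean_text helper is hypothetical and not part of Scrapy, and the sample values are invented.

```python
def clean_text(fragments):
    """Join the text fragments returned by extract() into one trimmed string."""
    # Collapse runs of whitespace (including newlines) inside and between fragments
    return " ".join(" ".join(fragments).split())


# Invented example of what one scraped item might contain
raw_item = {
    "title": ["Scrapy basics"],
    "desc": ["\r\n    ", "A short intro to Scrapy ...\r\n    "],
}

clean_item = {key: clean_text(value) for key, value in raw_item.items()}
print(clean_item)
```

Whether to clean values in the spider or in a later processing step is a design choice. Keeping the spider as it is and cleaning afterwards leaves the raw data available.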

(4) Run

scrapy crawl blog  # that is all it takes

(5) Output file.

Configure the output in settings.py.

# Output file location
FEED_URI = 'blog.xml'
# Output file format; can be json, xml, csv
FEED_FORMAT = 'xml'

The output location is under the project root folder.
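Once the crawl has written blog.xml, the file can be read back with the standard library. The sketch below assumes the XML feed wraps each item in an <item> element under an <items> root and writes each entry of a list-valued field as a <value> child. Check this against your own output file, since the layout is an assumption here. The read_feed helper and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of an exported feed; the layout is an assumption.
SAMPLE_FEED = """<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <title><value>First post</value></title>
    <desc><value>Summary of the first post</value></desc>
  </item>
  <item>
    <title><value>Second post</value></title>
    <desc><value>Summary of the second post</value></desc>
  </item>
</items>"""


def read_feed(xml_text):
    """Parse feed XML into a list of dicts mapping field name to a list of strings."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for item in root.findall("item"):
        row = {}
        for field in item:
            values = [v.text or "" for v in field.findall("value")]
            # Fall back to the element's own text when there are no <value> children
            row[field.tag] = values if values else [field.text or ""]
        rows.append(row)
    return rows


rows = read_feed(SAMPLE_FEED)
for row in rows:
    print(row["title"][0], "-", row["desc"][0])
```

For a real run, the same function can be given the contents of blog.xml read from the project root folder.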


                                                                            --August 20, 2014 05:51:46
