基於scrapy實現的簡單蜘蛛採集程式

來源:互聯網
上載者:User
本文執行個體講述了基於scrapy實現的簡單蜘蛛採集程式。分享給大家供大家參考。具體如下:

# Standard Python library imports# 3rd party importsfrom scrapy.contrib.spiders import CrawlSpider, Rulefrom scrapy.contrib.linkextractors.sgml import SgmlLinkExtractorfrom scrapy.selector import HtmlXPathSelector# My importsfrom poetry_analysis.items import PoetryAnalysisItemHTML_FILE_NAME = r'.+\.html'class PoetryParser(object):  """  Provides common parsing method for poems formatted this one specific way.  """  date_pattern = r'(\d{2} \w{3,9} \d{4})'   def parse_poem(self, response):    hxs = HtmlXPathSelector(response)    item = PoetryAnalysisItem()    # All poetry text is in pre tags    text = hxs.select('//pre/text()').extract()    item['text'] = ''.join(text)    item['url'] = response.url    # head/title contains title - a poem by author    title_text = hxs.select('//head/title/text()').extract()[0]    item['title'], item['author'] = title_text.split(' - ')    item['author'] = item['author'].replace('a poem by', '')    for key in ['title', 'author']:      item[key] = item[key].strip()    item['date'] = hxs.select("//p[@class='small']/text()").re(date_pattern)    return itemclass PoetrySpider(CrawlSpider, PoetryParser):  name = 'example.com_poetry'  allowed_domains = ['www.example.com']  root_path = 'someuser/poetry/'  start_urls = ['http://www.example.com/someuser/poetry/recent/',         'http://www.example.com/someuser/poetry/less_recent/']  rules = [Rule(SgmlLinkExtractor(allow=[start_urls[0] + HTML_FILE_NAME]),                  callback='parse_poem'),       Rule(SgmlLinkExtractor(allow=[start_urls[1] + HTML_FILE_NAME]),                  callback='parse_poem')]

希望本文所述對大家的Python程式設計有所協助。

  • 聯繫我們

    該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

    如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

    A Free Trial That Lets You Build Big!

    Start building with 50+ products and up to 12 months usage for Elastic Compute Service

    • Sales Support

      1 on 1 presale consultation

    • After-Sales Support

      24/7 Technical Support 6 Free Tickets per Quarter Faster Response

    • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.