Simple learning notes for Python Scrapy crawler framework

Source: Internet
Author: User
Tags python scrapy
This article introduces the basics of the Python Scrapy crawler framework, from creating a project to using CrawlSpider.

I. Simple configuration -- obtaining the content of a single web page

(1) Create a Scrapy project

scrapy startproject getblog

(2) Edit items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()

(3) Create blog_spider.py in the spiders folder

You need to be familiar with XPath selection. It feels like a jQuery selector, though not as comfortable to use (w3school tutorial: http://www.w3school.com.cn/xpath ).

# coding=utf-8
from scrapy.spider import Spider
from getblog.items import BlogItem
from scrapy.selector import Selector


class BlogSpider(Spider):
    # Spider ID
    name = 'blog'
    # Start address
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)
        # XPath selector:
        # take every div whose class attribute is 'post_item',
        # then select the second div below it
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # Text content 'text()' of the a tag under the h3 tag
            item['title'] = site.xpath('h3/a/text()').extract()
            # Same as above: text content 'text()' of the p tag
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
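To get a feel for this kind of XPath selection without running a crawl, the following minimal sketch applies a similar expression with the standard library's xml.etree.ElementTree, which supports only a small XPath subset (there is no text() step, so findtext() is used instead). The PAGE markup and the extract_posts helper are hypothetical stand-ins, not the real page or part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing page markup.
PAGE = """
<html><body>
  <div class="post_item">
    <div class="digg">12</div>
    <div class="post_item_body">
      <h3><a href="/p/1">First post</a></h3>
      <p class="post_item_summary">Summary of the first post</p>
    </div>
  </div>
  <div class="post_item">
    <div class="digg">3</div>
    <div class="post_item_body">
      <h3><a href="/p/2">Second post</a></h3>
      <p class="post_item_summary">Summary of the second post</p>
    </div>
  </div>
</body></html>
"""


def extract_posts(markup):
    """Return a list of {'title', 'desc'} dicts, one per post block."""
    root = ET.fromstring(markup.strip())
    posts = []
    # Second <div> child of every <div class="post_item">.
    for site in root.findall(".//div[@class='post_item']/div[2]"):
        posts.append({
            # Text of the <a> under the <h3>.
            "title": site.findtext("h3/a"),
            # Text of the summary paragraph.
            "desc": site.findtext("p[@class='post_item_summary']"),
        })
    return posts


posts = extract_posts(PAGE)
print(posts)
```

Real pages are rarely well-formed XML, which is one reason to rely on Scrapy's own selectors in an actual spider.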

(4) Run the spider

scrapy crawl blog

(5) Output file

Configure the output in settings.py.

# Output file location
FEED_URI = 'blog.xml'
# The output file format can be json, xml, or csv
FEED_FORMAT = 'xml'

The output file is written to the project root folder.

II. Basics -- scrapy.spider.Spider

(1) Use the interactive shell

dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"

2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: ..., UserAgentMiddleware, RetryMiddleware, ..., RedirectMiddleware, CookiesMiddleware, ..., DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200)
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   spider
[s] Useful shortcuts:
[s]   shelp()            Shell help (print this help)
[s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
[s]   view(response)     View response in a browser
>>> # response.body holds all the content returned
>>> # response.xpath('//ul/li') can be used to test any XPath expression

More importantly, if you type response.selector you will access a selector object you can use to query the response, with convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css().

In other words, the interactive shell makes it easy to check whether an XPath expression selects what you expect. I used to pick elements with the Firefox developer tools (F12), but that does not guarantee the content is selected correctly every time.

You can also use:

scrapy shell 'http://scrapy.org' --nolog  # the --nolog option suppresses log output

(2) Example

from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # One item per list entry on the page
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

(3) Save the results to a file

The scraped items can be saved to a file. The format can be json, xml, or csv.

scrapy crawl dmoz -o 'a.json' -t 'json'
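As a rough idea of how the same records differ between the json and csv formats, here is a minimal sketch built only on the standard library. It approximates the shape of the output and is not Scrapy's exporter code. The sample records and the to_json and to_csv helpers are hypothetical.

```python
import csv
import io
import json

# Hypothetical scraped records; extract() returns lists of strings,
# so every field value here is a list.
items = [
    {"title": ["Learning Python"], "link": ["http://example.com/a"]},
    {"title": ["Dive Into Python"], "link": ["http://example.com/b"]},
]


def to_json(records):
    """Serialize all records as one JSON array."""
    return json.dumps(records)


def to_csv(records, fields):
    """Write a header row, then one row per record (list values joined by commas)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        writer.writerow([",".join(record[name]) for name in fields])
    return buffer.getvalue()


print(to_json(items))
print(to_csv(items, ["title", "link"]))
```

JSON keeps the nested list values as they are, while CSV has to flatten each field into a single cell.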

(4) Create a spider using a template

scrapy genspider baidu baidu.com

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass

That's all for this part. I remember there being five points, but now I can only recall four. :-(

Remember to click the Save button, otherwise it will spoil your mood (⊙o⊙)!

III. Advanced -- scrapy.contrib.spiders.CrawlSpider

Example

# coding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
import scrapy


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']
    rules = (
        # A tuple of Rule objects
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('item page: %s' % response.url)
        # A bare scrapy.Item() declares no fields; in a real project use an
        # Item subclass that defines id, name, and description.
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
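The allow and deny patterns decide which links each rule picks up. The following is a simplified model of that decision written with the standard re module, not Scrapy's actual implementation. It assumes that a link must match at least one allow pattern (or allow is empty), that deny takes precedence, and that the first matching rule wins. The link_allowed and match_rule helpers and the RULES list are hypothetical names.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """True if url matches an allow pattern (or allow is empty) and no deny pattern."""
    if allow and not any(re.search(pattern, url) for pattern in allow):
        return False
    if any(re.search(pattern, url) for pattern in deny):
        return False
    return True


# Each entry: (allow patterns, deny patterns, callback name or None).
RULES = [
    ((r'category\.php',), (r'subsection\.php',), None),
    ((r'item\.php',), (), 'parse_item'),
]


def match_rule(url, rules=RULES):
    """Return the index of the first rule that accepts url, or None."""
    for index, (allow, deny, _callback) in enumerate(rules):
        if link_allowed(url, allow, deny):
            return index
    return None


for url in ('http://www.example.com/category.php?id=1',
            'http://www.example.com/category.php/subsection.php',
            'http://www.example.com/item.php?id=7'):
    print(url, '->', match_rule(url))
```

A rule without a callback, like the first one, only serves to follow links, while the second rule hands matching pages to parse_item.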

Other spider classes are also available:

  • class scrapy.contrib.spiders.XMLFeedSpider
  • class scrapy.contrib.spiders.CSVFeedSpider
  • class scrapy.contrib.spiders.SitemapSpider

IV. Selector

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

You can use .css() and .xpath() to quickly select the target data.

Study the Selector's xpath() and css() methods closely, and keep practicing regular expressions.

When selecting by class, prefer css(), and then use xpath() to select the element's attributes.
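For practicing the regular-expression side, the kind of pattern passed to a selector's .re() method in the CrawlSpider example can be tried out with the standard re module first. This is a minimal sketch; the sample strings and the extract_ids helper are hypothetical.

```python
import re


def extract_ids(texts, pattern=r'ID: (\d+)'):
    """Apply pattern to each string and collect every captured group in order."""
    found = []
    for text in texts:
        found.extend(re.findall(pattern, text))
    return found


# Hypothetical strings, as a text() selection might return them.
cells = ['ID: 12345', 'no id here', 'ID: 7 and ID: 42']
ids = extract_ids(cells)
print(ids)
```

The captured values are strings, so convert them with int() if numbers are needed.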

V. Item Pipeline

Typical uses of item pipelines are:
• Cleansing HTML data
• Validating scraped data (checking that the items contain certain fields)
• Checking for duplicates (and dropping them)
• Storing the scraped item in a database

(1) Validate data

from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            # Return the item so that later pipelines receive it
            return item
        else:
            raise DropItem('Missing price in %s' % item)

(2) Write a JSON file

import json


class JsonWriterPipeline(object):
    def __init__(self):
        # Text mode, because json.dumps() returns a str
        self.file = open('json.jl', 'w')

    def process_item(self, item, spider):
        # One JSON object per line
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

(3) Check for duplicates

from scrapy.exceptions import DropItem


class Duplicates(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found: %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item
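To see how such pipelines behave when chained, the following minimal sketch runs a few plain dicts through a price check and a duplicate check in order. It is a simplified model that runs without Scrapy installed: the DropItem class is a local stand-in for scrapy.exceptions.DropItem, and run_pipelines and the sample items are hypothetical.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        raise DropItem('Missing price in %s' % item)


class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found: %s' % item)
        self.ids_seen.add(item['id'])
        return item


def run_pipelines(items, pipelines, spider=None):
    """Pass each item through every pipeline; collect kept items and drop reasons."""
    kept, dropped = [], []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider)
            kept.append(item)
        except DropItem as error:
            dropped.append(str(error))
    return kept, dropped


items = [
    {'id': 1, 'price': 10.0, 'price_excludes_vat': True},
    {'id': 2, 'price': None, 'price_excludes_vat': False},
    {'id': 1, 'price': 4.0, 'price_excludes_vat': False},
]
kept, dropped = run_pipelines(items, [PricePipeline(), DuplicatesPipeline()])
print(kept)
print(dropped)
```

In a real project Scrapy does the chaining itself, and the pipeline classes also have to be listed in the ITEM_PIPELINES setting before they are used.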

Writing data to the database should also be simple. In the process_item function, you can store the item.
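As one possible shape for this, here is a minimal sketch of a pipeline that stores items in SQLite through the standard sqlite3 module. The class name SQLitePipeline, the posts table, and its columns are assumptions made for illustration. open_spider and close_spider are the optional hooks Scrapy calls on a pipeline when a spider starts and finishes, and here they are invoked by hand so the sketch runs without Scrapy.

```python
import sqlite3


class SQLitePipeline(object):
    """Hypothetical pipeline that stores each item in a SQLite table."""

    def __init__(self, path=':memory:'):
        self.path = path
        self.connection = None

    def open_spider(self, spider):
        # Open the database and make sure the table exists.
        self.connection = sqlite3.connect(self.path)
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS posts (title TEXT, summary TEXT)')

    def process_item(self, item, spider):
        # Values produced by extract() are lists and would need joining first;
        # plain strings are assumed here.
        self.connection.execute(
            'INSERT INTO posts (title, summary) VALUES (?, ?)',
            (item['title'], item['desc']))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.connection.close()


pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'First post', 'desc': 'Summary'}, spider=None)
rows = pipeline.connection.execute(
    'SELECT title, summary FROM posts').fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

Using ? placeholders instead of string formatting lets sqlite3 escape the scraped values.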
