Python Scrapy crawler framework: simple learning notes

Source: Internet
Author: User
Tags python scrapy

1. Simple configuration to fetch the content of a single web page
(1) Create a Scrapy project

scrapy startproject getblog

(2) Edit items.py

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    from scrapy.item import Item, Field


    class BlogItem(Item):
        title = Field()
        desc = Field()

(3) Create blog_spider.py in the spiders folder

You need to be familiar with XPath selection first. It feels like a jQuery selector, though not as comfortable to use (w3school tutorial: http://www.w3school.com.cn/xpath ).

    # coding=utf-8
    from scrapy.spider import Spider
    from getblog.items import BlogItem
    from scrapy.selector import Selector


    class BlogSpider(Spider):
        # Spider name, used by "scrapy crawl"
        name = 'blog'
        # Start address
        start_urls = ['http://www.cnblogs.com/']

        def parse(self, response):
            sel = Selector(response)
            # XPath selector: select every div whose class attribute equals
            # 'post_item', then take the 2nd div below each of them
            sites = sel.xpath('//div[@class="post_item"]/div[2]')
            items = []
            for site in sites:
                item = BlogItem()
                # Text content 'text()' of the a tag under the h3 tag
                item['title'] = site.xpath('h3/a/text()').extract()
                # Same as above: text content 'text()' of the p tag
                item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
                items.append(item)
            return items
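To get a feel for what the two XPath expressions in the spider select, here is a minimal sketch that uses only the standard library. It is not Scrapy code: xml.etree.ElementTree stands in for the Scrapy Selector (it supports only a subset of XPath, so text() becomes the .text attribute), and the SAMPLE markup is a made-up, well-formed imitation of the page structure, not the real cnblogs.com HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup that imitates the structure the
# spider targets (assumption: not the real cnblogs.com HTML).
SAMPLE = """
<html><body>
  <div class="post_item">
    <div class="digg">12</div>
    <div class="post_item_body">
      <h3><a href="/p/1">First post</a></h3>
      <p class="post_item_summary">Summary one</p>
    </div>
  </div>
  <div class="post_item">
    <div class="digg">3</div>
    <div class="post_item_body">
      <h3><a href="/p/2">Second post</a></h3>
      <p class="post_item_summary">Summary two</p>
    </div>
  </div>
</body></html>
"""


def extract_posts(markup):
    """Return one dict per post, mirroring the loop in BlogSpider.parse()."""
    root = ET.fromstring(markup.strip())
    posts = []
    # ElementTree equivalent of //div[@class="post_item"]/div[2]
    for site in root.findall(".//div[@class='post_item']/div[2]"):
        title = site.find("h3/a")                          # like h3/a/text()
        desc = site.find("p[@class='post_item_summary']")  # like p[...]/text()
        posts.append({"title": title.text, "desc": desc.text})
    return posts


posts = extract_posts(SAMPLE)
```

Each entry in posts corresponds to one BlogItem in the spider. Note that the spider stores lists of strings (the result of extract()), while this sketch stores plain strings.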

(4) Run

scrapy crawl blog

(5) Output file

Configure the output in settings.py.

    # Output file location
    FEED_URI = 'blog.xml'
    # The output file format can be json, xml, or csv
    FEED_FORMAT = 'xml'

The output location is under the project root folder.

2. Basic -- scrapy.spider.Spider

(1) Use the interactive shell

dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: ..., UserAgentMiddleware, RetryMiddleware, ..., RedirectMiddleware, CookiesMiddleware, ..., DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, ..., DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
[s]   item       {}
[s]   request    <GET http://www.baidu.com/>
[s]   response   <200 http://www.baidu.com/>
[s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
[s]   spider     <Spider 'default' at 0xa78086c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>>
# response.body: all content returned by the request
# response.xpath('//ul/li') can be used to test any XPath expression

More important, if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css()

In other words, the shell makes it easy to check interactively whether an XPath expression selects the right content. I used to pick elements with the Firefox F12 developer tools, but that does not guarantee that the content is selected correctly every time.

You can also use:

scrapy shell 'http://scrapy.org' --nolog    # the --nolog parameter suppresses log output

(2) Example

    from scrapy import Spider
    from scrapy_test.items import DmozItem


    class DmozSpider(Spider):
        name = 'dmoz'
        allowed_domains = ['dmoz.org']
        start_urls = [
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
        ]

        def parse(self, response):
            # Each li under a ul is one directory entry
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item
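Unlike the first spider, which builds a list and returns it, this parse() yields items one at a time. The following is a rough standard-library sketch of that pattern, assuming made-up sample markup (the real dmoz.org pages are not reproduced) and using xml.etree.ElementTree in place of the Scrapy selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup (assumption: not taken from dmoz.org).
SAMPLE = """
<ul>
  <li><a href="http://example.com/a">Book A</a> - first description</li>
  <li><a href="http://example.com/b">Book B</a> - second description</li>
</ul>
"""


def parse(markup):
    """Yield one dict per <li>, mirroring the yield-based parse() above."""
    root = ET.fromstring(markup.strip())
    for li in root.iter("li"):
        link = li.find("a")
        yield {
            "title": link.text,                  # like a/text()
            "link": link.get("href"),            # like a/@href
            "desc": (link.tail or "").strip(),   # text node that follows <a>
        }


items = parse(SAMPLE)   # nothing runs yet: calling a generator function only creates it
first = next(items)     # the first item is produced on demand
rest = list(items)      # the remaining items
```

Because items are produced lazily, the caller can start processing the first item before the rest of the page has been walked.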

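(3) Save the file

The scraped items can be saved to a file. The format can be json, xml, or csv. The crawl command also needs the spider name (here dmoz, from the example above):

scrapy crawl dmoz -o 'a.json' -t 'json'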
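To illustrate what choosing between json and csv means for the same data, here is a small standard-library sketch. It does not use Scrapy's own exporters, and their exact output may differ. The items list and the helper names to_json and to_csv are made up for this illustration.

```python
import csv
import io
import json

# Hypothetical scraped items shaped like the dmoz example; the values are
# lists because .extract() returns a list of strings.
items = [
    {"title": ["Book A"], "link": ["http://example.com/a"]},
    {"title": ["Book B"], "link": ["http://example.com/b"]},
]


def to_json(rows):
    """Serialize all items as a single JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows, fields):
    """Serialize items as CSV, joining each list value with commas."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({f: ",".join(row.get(f, [])) for f in fields})
    return buf.getvalue()


json_text = to_json(items)
csv_text = to_csv(items, ["title", "link"])
```

JSON keeps the nested list structure, while CSV has to flatten every field into a single cell.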

(4) Create a spider using a template

scrapy genspider baidu baidu.com

The generated spider file:

    # -*- coding: utf-8 -*-
    import scrapy


    class BaiduSpider(scrapy.Spider):
        name = "baidu"
        allowed_domains = ["baidu.com"]
        start_urls = (
            'http://www.baidu.com/',
        )

        def parse(self, response):
            pass

That covers the basics. I remember there being five points, but now I can only recall four. :-(

Remember to click the Save button, or it will ruin your mood (⊙o⊙)!

3. Advanced -- scrapy.contrib.spiders.CrawlSpider

Example

    # coding=utf-8
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    import scrapy


    class TestSpider(CrawlSpider):
        name = 'test'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']
        rules = (  # a tuple of rules
            # Follow links matching category.php but not subsection.php
            Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),
            # Parse links matching item.php with parse_item
            Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('item page: %s' % response.url)
            # Note: a bare scrapy.Item() declares no fields; in a real project
            # use an Item subclass that defines id, name, and description.
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
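The allow and deny patterns in the rules are regular expressions matched against each extracted URL, and deny takes precedence over allow. Here is a minimal sketch of that filtering logic using only the re module. The function link_allowed and the sample URLs are made up for illustration; this is not the LinkExtractor implementation.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """Return True if url matches any allow pattern and no deny pattern.

    An empty allow tuple accepts every URL, which mirrors the documented
    LinkExtractor default.
    """
    if allow and not any(re.search(p, url) for p in allow):
        return False
    if any(re.search(p, url) for p in deny):
        return False
    return True


# Hypothetical URLs on the example.com site used above.
urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/category.php?id=1&page=subsection.php",
    "http://www.example.com/item.php?id=42",
    "http://www.example.com/about.html",
]

# First rule: follow category pages, except subsection pages.
followed = [u for u in urls
            if link_allowed(u, allow=(r"category\.php",), deny=(r"subsection\.php",))]

# Second rule: item pages, which the spider hands to parse_item.
item_pages = [u for u in urls if link_allowed(u, allow=(r"item\.php",))]
```

The second URL matches the allow pattern but is still rejected, because it also matches the deny pattern.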

There are other spider classes as well, such as XMLFeedSpider:

  • class scrapy.contrib.spiders.XMLFeedSpider
  • class scrapy.contrib.spiders.CSVFeedSpider
  • class scrapy.contrib.spiders.SitemapSpider

4. Selector

  >>> from scrapy.selector import Selector
  >>> from scrapy.http import HtmlResponse

You can use .css() and .xpath() to quickly select the target data.

Study selectors carefully, both xpath() and css(), and keep getting more familiar with regular expressions.

When selecting by class, prefer css(), and then use xpath() to select the element's attributes.
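One reason to prefer css() for classes: an XPath test such as @class="post_item" compares the whole attribute string, while a CSS class selector such as div.post_item matches if the class is any one of the space-separated tokens. The standard-library sketch below shows the difference, assuming made-up markup; xml.etree.ElementTree stands in for the Scrapy selector, and the token check is written by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second div carries an extra class.
SAMPLE = """
<body>
  <div class="post_item">one</div>
  <div class="post_item featured">two</div>
  <div class="sidebar">three</div>
</body>
"""

root = ET.fromstring(SAMPLE.strip())

# Exact attribute comparison, like the XPath //div[@class="post_item"]
exact = [d.text for d in root.findall(".//div[@class='post_item']")]

# Token-based comparison, which is how a CSS class selector such as
# div.post_item behaves: the class only has to be one of the listed tokens.
by_token = [d.text for d in root.iter("div")
            if "post_item" in d.get("class", "").split()]
```

The exact comparison misses the second div, while the token-based one finds both.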

5. Item Pipeline

Typical uses for item pipelines are:
• Cleansing HTML data
• Validating scraped data (checking that the items contain certain fields)
• Checking for duplicates (and dropping them)
• Storing the scraped item in a database

(1) Validate data

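    from scrapy.exceptions import DropItem


    class PricePipeline(object):
        vat_factor = 1.5

        def process_item(self, item, spider):
            if item['price']:
                # Add VAT if the scraped price does not include it yet
                if item['price_excludes_vat']:
                    item['price'] *= self.vat_factor
                # process_item must return the item so later pipelines receive it
                return item
            else:
                raise DropItem('Missing price in %s' % item)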

(2) Write a JSON file

    import json


    class JsonWriterPipeline(object):
        def __init__(self):
            # Text mode, because json.dumps() returns a string
            self.file = open('json.jl', 'w')

        def process_item(self, item, spider):
            # One JSON object per line
            line = json.dumps(dict(item)) + '\n'
            self.file.write(line)
            return item
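The pipeline above opens the file in __init__ and never closes it. Scrapy pipelines can also define open_spider() and close_spider() hooks, so here is a minimal sketch that uses them to manage the file. The class name JsonLinesPipeline and its path parameter are assumptions made for this illustration, and the hooks are called by hand here, whereas Scrapy would normally call them itself.

```python
import json
import os
import tempfile


class JsonLinesPipeline(object):
    """Write one JSON object per line; path is a hypothetical parameter."""

    def __init__(self, path='items.jl'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: open the output file in text mode.
        self.file = open(self.path, 'w')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        # Called when the spider finishes: flush and release the file.
        self.file.close()


# Drive the hooks by hand; plain dicts stand in for items and None for the spider.
path = os.path.join(tempfile.mkdtemp(), 'items.jl')
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(None)
pipeline.process_item({'title': 'First post'}, None)
pipeline.process_item({'title': 'Second post'}, None)
pipeline.close_spider(None)

with open(path) as handle:
    written = [json.loads(line) for line in handle]
```

Closing the file in close_spider() makes sure the last lines are flushed to disk when the crawl ends.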

(3) Check for duplicates

    from scrapy.exceptions import DropItem


    class Duplicates(object):
        def __init__(self):
            # IDs of the items that have already passed through
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem('Duplicate item found : %s' % item)
            else:
                self.ids_seen.add(item['id'])
                return item
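Each item passes through the enabled pipelines in order, and a DropItem raised by one pipeline stops further processing of that item. The sketch below imitates this with a small hand-written loop. The helper run_pipelines is made up for illustration and is not part of Scrapy, plain dicts stand in for items, and a fallback DropItem class is defined so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class PricePipeline(object):
    vat_factor = 1.5  # value taken from the notes above

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            return item
        raise DropItem('Missing price in %s' % item)


class Duplicates(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        self.ids_seen.add(item['id'])
        return item


def run_pipelines(items, pipelines):
    """Pass each item through every pipeline in order; skip dropped items."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, None)  # spider is unused here
            kept.append(item)
        except DropItem:
            continue
    return kept


items = [
    {'id': 1, 'price': 10.0, 'price_excludes_vat': True},
    {'id': 2, 'price': 0, 'price_excludes_vat': False},    # dropped: no price
    {'id': 1, 'price': 10.0, 'price_excludes_vat': True},  # dropped: duplicate id
]
kept = run_pipelines(items, [PricePipeline(), Duplicates()])
```

Only the first item survives: the second fails the price check before it ever reaches Duplicates, and the third is rejected because its id has already been seen.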

Writing data to a database should be just as simple: store the item inside the process_item function.
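As a rough sketch of that idea, the pipeline below stores items in SQLite using the standard-library sqlite3 module. The class name SQLitePipeline, the posts table, and its columns are all assumptions made for this illustration; the title and desc keys follow the BlogItem fields from section 1, with plain strings as values.

```python
import sqlite3


class SQLitePipeline(object):
    """Store each item in a SQLite table (table and column names are hypothetical)."""

    def __init__(self, database=':memory:'):
        self.database = database
        self.conn = None

    def open_spider(self, spider):
        # Open the connection once per crawl and make sure the table exists.
        self.conn = sqlite3.connect(self.database)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS posts (title TEXT, summary TEXT)')

    def process_item(self, item, spider):
        # Parameter binding (?) keeps scraped text from being read as SQL.
        self.conn.execute(
            'INSERT INTO posts (title, summary) VALUES (?, ?)',
            (item['title'], item['desc']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


# Drive the pipeline by hand; a plain dict stands in for a scraped item.
pipeline = SQLitePipeline()
pipeline.open_spider(None)
pipeline.process_item({'title': 'First post', 'desc': 'Summary one'}, None)
rows = pipeline.conn.execute('SELECT title, summary FROM posts').fetchall()
pipeline.close_spider(None)
```

Committing after every item is simple but slow for large crawls; batching the commits would be a natural refinement.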
