These are simple learning notes on the Python Scrapy crawler framework, covering everything from basic project creation to the use of CrawlSpider.
1. Simple configuration: fetching the content of a single web page
(1) Create a Scrapy project
scrapy startproject getblog
(2) Edit items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()
(3) Create blog_spider.py in the spiders folder
You need to be familiar with XPath selection for this step. It feels like a jQuery selector, but is not as comfortable to use (w3school tutorial: http://www.w3school.com.cn/xpath ).
# coding=utf-8
from scrapy.spider import Spider
from getblog.items import BlogItem
from scrapy.selector import Selector


class BlogSpider(Spider):
    # Spider identifier
    name = 'blog'
    # Start address
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)
        # XPath selector:
        # match every div whose class attribute is 'post_item',
        # then take the 2nd div beneath each of them
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # Text content 'text()' of the a tag under the h3 tag
            item['title'] = site.xpath('h3/a/text()').extract()
            # Likewise, the text content 'text()' of the p tag
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
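The selection pattern used in the spider (match a container by its class, take its second child block, then read the text of nested tags) can be tried without Scrapy. The following is only a minimal sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset: it has no text() step, so the .text attribute is read instead, and it needs well-formed markup. The PAGE snippet and its contents are hypothetical, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet shaped like a blog list page.
PAGE = """
<html><body>
  <div class="post_item">
    <div class="digg">12</div>
    <div class="post_item_body">
      <h3><a href="/p/1">First post</a></h3>
      <p class="post_item_summary">Summary of the first post.</p>
    </div>
  </div>
  <div class="post_item">
    <div class="digg">3</div>
    <div class="post_item_body">
      <h3><a href="/p/2">Second post</a></h3>
      <p class="post_item_summary">Summary of the second post.</p>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

rows = []
# Match each container by class, then take its 2nd div child.
for site in root.findall('.//div[@class="post_item"]/div[2]'):
    # Relative paths are evaluated from the matched node, as in the spider.
    title = site.find('h3/a').text
    desc = site.find('p[@class="post_item_summary"]').text
    rows.append({'title': title, 'desc': desc})

for row in rows:
    print(row)
```

Scrapy's own Selector is more forgiving with real-world HTML, so treat this only as a way to get a feel for the path expressions.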
(4) Run
scrapy crawl blog
(5) Output file
Configure the output in settings.py.
# Output file location
FEED_URI = 'blog.xml'
# Output file format: json, xml, or csv
FEED_FORMAT = 'xml'
The output location is under the project root folder.
2. Basics -- scrapy.spider.Spider
(1) Use the interactive shell
dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: ..., UserAgentMiddleware, RetryMiddleware, ..., RedirectMiddleware, CookiesMiddleware, ..., DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 http://www.baidu.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
# response.body holds all of the returned content
# response.xpath('//ul/li') can be used to test any XPath expression
More importantly, if you type response.selector you get a Selector object that you can use to query the response, along with convenient shortcuts such as response.xpath() and response.css(), which map to response.selector.xpath() and response.selector.css().
In other words, the interactive shell makes it easy to check whether an XPath expression selects what you expect. I used to pick elements with the Firefox F12 developer tools, but that does not guarantee the content is selected correctly every time.
You can also use:
scrapy shell 'http://scrapy.org' --nolog   # the --nolog option suppresses log output
(2) Example
from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # Each li under a ul is treated as one entry
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
(3) Save the file
The results can be saved to a file in json, xml, or csv format.
scrapy crawl dmoz -o 'a.json' -t 'json'
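Because extract() returns a list of matched strings, every field in the exported items is itself a list. The sketch below shows one way such a JSON export might be loaded and flattened to single values. The sample data and the first_or_none helper are hypothetical and only illustrate the shape; nothing here comes from a real crawl.

```python
import json
import os
import tempfile

# Hypothetical sample shaped like a JSON export: each field is a list,
# because extract() returns a list of matched strings.
sample = [
    {"title": ["Book A"], "link": ["http://example.com/a"], "desc": ["\n  A short note\n"]},
    {"title": [], "link": ["http://example.com/b"], "desc": []},
]

# Write the sample to a temporary a.json so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "a.json")
with open(path, "w") as f:
    json.dump(sample, f)


def first_or_none(values):
    """Return the first non-empty stripped string from a list, or None."""
    for value in values:
        value = value.strip()
        if value:
            return value
    return None


with open(path) as f:
    items = json.load(f)

# Collapse every list-valued field to a single value.
flat = [{key: first_or_none(values) for key, values in item.items()}
        for item in items]

for row in flat:
    print(row)
```

Whether to keep the lists or flatten them depends on what consumes the file; flattening is just one possible choice.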
(4) Create a spider from a template
scrapy genspider baidu baidu.com

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass
That is it for this part. I remember there being five points, but now I can only recall four. :-(
Remember to click the Save button, otherwise it will spoil your mood (⊙o⊙)!
3. Advanced -- scrapy.contrib.spiders.CrawlSpider
Example
# coding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
import scrapy


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']
    rules = (
        # A tuple of Rule objects.
        # Follow category pages, but not subsection pages.
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        # Hand item pages to parse_item.
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('item page: %s' % response.url)
        # NOTE: a bare scrapy.Item declares no fields; in practice use an
        # Item subclass that defines id, name and description.
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
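To get a feel for how the allow and deny patterns in the rules decide what happens to a link, here is a rough sketch using only the standard re module. It is a simplification, not LinkExtractor itself: it ignores allowed domains, file extensions and everything else the real extractor checks. The matches and classify helpers and the sample URLs are hypothetical.

```python
import re

# Patterns mirroring the two rules in the example.
ALLOW_FOLLOW = (r'category\.php',)
DENY_FOLLOW = (r'subsection\.php',)
ALLOW_ITEM = (r'item\.php',)


def matches(url, allow=(), deny=()):
    """Simplified allow/deny check: the URL must match at least one allow
    pattern (if any are given) and must not match any deny pattern."""
    if allow and not any(re.search(p, url) for p in allow):
        return False
    if any(re.search(p, url) for p in deny):
        return False
    return True


def classify(url):
    """Report which rule would pick up the URL; the first matching rule wins."""
    if matches(url, ALLOW_FOLLOW, DENY_FOLLOW):
        return 'follow'
    if matches(url, ALLOW_ITEM):
        return 'parse_item'
    return 'ignore'


# Hypothetical URLs for illustration.
urls = [
    'http://www.example.com/category.php?id=1',
    'http://www.example.com/category.php?page=subsection.php',
    'http://www.example.com/item.php?id=7',
    'http://www.example.com/about.html',
]

for url in urls:
    print(url, '->', classify(url))
```

The rule without a callback only follows links, while the rule with callback='parse_item' sends the matching pages to the parsing method.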
There are other spider classes as well:
- class scrapy.contrib.spiders.XMLFeedSpider
- class scrapy.contrib.spiders.CSVFeedSpider
- class scrapy.contrib.spiders.SitemapSpider
4. Selector
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
You can use .css() and .xpath() to quickly select the target data.
Selectors deserve careful study: get comfortable with xpath() and css(), and keep practising regular expressions.
When selecting by class, prefer css() for the class match, and then use xpath() to select within the matched elements.
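The regular-expression practice mentioned above needs nothing beyond the standard re module. The sketch below roughly mimics what a selector's .re() call does with a pattern that has one capture group: apply it to each selected text and collect the captured strings. The re_extract helper and the sample texts are hypothetical.

```python
import re


def re_extract(pattern, texts):
    """Apply the pattern to every text and collect all captured groups."""
    results = []
    for text in texts:
        results.extend(re.findall(pattern, text))
    return results


# Hypothetical text nodes, as if already selected from a page.
texts = ['ID: 12345', 'Name: widget', 'ID: 678']

ids = re_extract(r'ID: (\d+)', texts)
print(ids)
```

Texts that do not match the pattern simply contribute nothing to the result.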
5. Item Pipeline
Typical uses of item pipelines are:
• Cleansing HTML data
• Validating scraped data (checking that the items contain certain fields)
• Checking for duplicates (and dropping them)
• Storing the scraped item in a database
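What these uses have in common is the pipeline contract: each stage's process_item either returns the item (possibly modified) or raises DropItem to discard it. The sketch below illustrates that contract with plain dicts. The two pipeline classes and the run_pipelines driver are hypothetical; the driver is only a simplified stand-in for what Scrapy does internally, and the fallback DropItem class exists solely so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class StripTitlePipeline(object):
    """Hypothetical cleaning step: trim whitespace from the title."""

    def process_item(self, item, spider):
        item['title'] = item['title'].strip()
        return item


class RequireTitlePipeline(object):
    """Hypothetical validation step: drop items with an empty title."""

    def process_item(self, item, spider):
        if not item['title']:
            raise DropItem('Missing title in %s' % item)
        return item


def run_pipelines(items, pipelines, spider=None):
    """Simplified stand-in for Scrapy's pipeline handling: pass each item
    through every stage in order and skip items that a stage drops."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider)
        except DropItem:
            continue
        kept.append(item)
    return kept


items = [{'title': '  Hello  '}, {'title': '   '}]
kept = run_pipelines(items, [StripTitlePipeline(), RequireTitlePipeline()])
print(kept)
```

The order of the stages matters here: the validation step only sees the title after it has been stripped.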
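(1) Validate data
from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            # Add VAT only when the price does not already include it
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            # Pass the item on to the next pipeline stage
            return item
        else:
            # Items without a price are discarded
            raise DropItem('Missing price in %s' % item)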
(2) Write a JSON file
import json


class JsonWriterPipeline(object):
    def __init__(self):
        # 'wb' works on Python 2 (Scrapy 0.24); on Python 3, open in text mode 'w'
        self.file = open('json.jl', 'wb')

    def process_item(self, item, spider):
        # One JSON object per line
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item
(3) Check for duplicates
from scrapy.exceptions import DropItem


class Duplicates(object):
    def __init__(self):
        # IDs of the items seen so far
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item
Writing data to a database should be just as simple: store the item inside the process_item method.
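As a rough illustration of that idea, the sketch below stores items in SQLite using the standard sqlite3 module. The SQLitePipeline class, the blog table and its columns are all hypothetical, and the item is assumed to carry plain strings in title and desc rather than the lists that extract() returns. The open_spider and close_spider methods are the hooks Scrapy calls on a pipeline when a spider starts and finishes.

```python
import sqlite3


class SQLitePipeline(object):
    """Hypothetical pipeline that stores each item's title and desc in SQLite."""

    def __init__(self, db_path=':memory:'):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Open the connection and make sure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS blog (title TEXT, description TEXT)')

    def close_spider(self, spider):
        # Persist pending rows and release the connection.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Parameterized query: values are never interpolated into the SQL.
        self.conn.execute(
            'INSERT INTO blog (title, description) VALUES (?, ?)',
            (item['title'], item['desc']))
        return item


# Drive the pipeline by hand with a plain dict standing in for an item.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'First post', 'desc': 'A summary.'}, spider=None)
rows = pipeline.conn.execute('SELECT title, description FROM blog').fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

In a real project the pipeline would also have to be enabled through the ITEM_PIPELINES setting in settings.py, and db_path would point at a file instead of the in-memory database used here.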