Python Scrapy crawler framework: simple learning notes

Last Update:2016-01-24 Source: Internet

Author: User

Tags python scrapy


1. Simple configuration: scraping the content of a single web page
(1) Create a Scrapy project

scrapy startproject getblog

(2) Edit items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()

(3) Create blog_spider.py in the spiders folder

You need to be familiar with XPath selection first. It feels like a jQuery selector, though it is not as comfortable to use (w3school tutorial: http://www.w3school.com.cn/xpath).

# coding=utf-8
# Import paths are those of Scrapy 0.24; newer releases moved Spider to scrapy.spiders
from scrapy.spider import Spider
from getblog.items import BlogItem
from scrapy.selector import Selector


class BlogSpider(Spider):
    # Spider identifier, used by "scrapy crawl"
    name = 'blog'
    # Start address
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)  # XPath selector
        # Select every div whose class attribute is 'post_item',
        # then take the 2nd div beneath each of them
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # text() selects the text content of the a tag under the h3 tag
            item['title'] = site.xpath('h3/a/text()').extract()
            # Same as above: the text content of the p tag
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
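To get a feel for the two XPath steps used in the spider (pick the second div under each post_item div, then read the title and summary inside it), here is a minimal sketch that uses only the standard library. It is not Scrapy code: xml.etree.ElementTree supports just a subset of XPath (no text(), and './/' instead of a leading '//'), and the SAMPLE markup is made up for illustration, not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that imitates the structure the spider expects.
SAMPLE = """
<html><body>
  <div class="post_item">
    <div class="digg">12</div>
    <div class="post_item_body">
      <h3><a href="/p/1">First post</a></h3>
      <p class="post_item_summary">Summary of the first post</p>
    </div>
  </div>
  <div class="post_item">
    <div class="digg">3</div>
    <div class="post_item_body">
      <h3><a href="/p/2">Second post</a></h3>
      <p class="post_item_summary">Summary of the second post</p>
    </div>
  </div>
</body></html>
"""


def extract_posts(markup):
    """Return a list of {'title', 'desc'} dicts, one per post_item div."""
    root = ET.fromstring(markup)
    posts = []
    # './/' searches all descendants; '[2]' keeps the second div child
    # of every div whose class attribute equals 'post_item'.
    for site in root.findall(".//div[@class='post_item']/div[2]"):
        # findtext() returns the text of the first matching element,
        # which plays the role of 'h3/a/text()' in the spider.
        title = site.findtext("h3/a")
        desc = site.findtext("p[@class='post_item_summary']")
        posts.append({"title": title, "desc": desc})
    return posts


posts = extract_posts(SAMPLE)
for post in posts:
    print(post["title"], "-", post["desc"])
```

One difference worth keeping in mind: in the spider, extract() returns a list of strings for each field, while this sketch returns a single string per field.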

(4) Run the spider

scrapy crawl blog

(5) Output file

Configure the output in settings.py.

# Output file location
FEED_URI = 'blog.xml'
# Output file format: json, xml, or csv
FEED_FORMAT = 'xml'

The output file is written to the project root folder.

2. Basic -- scrapy.spider.Spider

(1) Use the interactive shell

dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"

2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: ..., UserAgentMiddleware, RetryMiddleware, ..., RedirectMiddleware, CookiesMiddleware, ..., DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
[s]   item       {}
[s]   request    <GET http://www.baidu.com/>
[s]   response   <200 http://www.baidu.com/>
[s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
[s]   spider     <Spider 'default' at 0xa78086c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
# response.body holds all of the content returned
# response.xpath('//ul/li') can be used to test any XPath expression

More importantly, if you type response.selector you will access a selector object you can use to query the response, along with convenient shortcuts like response.xpath() and response.css(), which map to response.selector.xpath() and response.selector.css().

In other words, the shell makes it easy to check interactively whether an XPath expression selects what you expect. I used to pick elements with the Firefox F12 tools, but that does not guarantee the content is selected correctly every time.

You can also use:

scrapy shell 'http://scrapy.org' --nolog    # --nolog suppresses log output

(2) Example

from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # One item per list entry on the page
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

(3) Save the file

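The scraped items can be saved to a file. The format can be json, xml, or csv. Pass the spider name (here the dmoz spider from the example above) to scrapy crawl:

scrapy crawl dmoz -o 'a.json' -t 'json'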
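Because extract() returns a list of strings, every field in the exported file is a list rather than a single value. The following is a minimal sketch of post-processing such an export with the standard json module. The SAMPLE_EXPORT text is made up for illustration and only imitates what a file like a.json might contain; flatten_item and load_items are hypothetical helper names, not part of Scrapy.

```python
import json

# Made-up sample of exported items; every field is a list of strings
# because that is what extract() returns in the spider.
SAMPLE_EXPORT = """
[
  {"title": ["Book A"], "link": ["http://example.com/a"], "desc": ["\\n  ", " - first book\\n"]},
  {"title": [], "link": [], "desc": ["\\n"]}
]
"""


def flatten_item(item):
    """Join each field's list of strings into one stripped string."""
    return {key: "".join(values).strip() for key, values in item.items()}


def load_items(text):
    """Parse exported JSON text and drop items that have no title."""
    items = [flatten_item(item) for item in json.loads(text)]
    return [item for item in items if item["title"]]


clean = load_items(SAMPLE_EXPORT)
print(clean)
```

To run it against a real export, read the file first, for example with open('a.json') and pass its contents to load_items.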

(4) Create a spider from a template

scrapy genspider baidu baidu.com

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass

That is all for this part. I remember there being five points, but now I can only recall four. :-(

Remember to click the Save button, otherwise it will ruin your mood (⊙o⊙)!

3. Advanced -- scrapy.contrib.spiders.CrawlSpider

Example

# coding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
import scrapy


class TestItem(scrapy.Item):
    # A bare scrapy.Item() declares no fields and rejects item['id'] = ...,
    # so the fields assigned in parse_item are declared here
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']
    rules = (  # a tuple of rules
        # Follow links matching category.php, but not subsection.php
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),
        # Parse links matching item.php with parse_item
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('item page: %s' % response.url)
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
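The allow and deny arguments are regular expressions matched against each link's URL: a link is kept when it matches at least one allow pattern (or when no allow pattern is given) and matches no deny pattern. A minimal sketch of that filtering with the standard re module follows. It only imitates the idea and is not the LinkExtractor implementation; link_allowed is a hypothetical helper and the URLs are made up.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """Return True if url matches any allow pattern and no deny pattern."""
    # An empty allow tuple means "allow everything".
    if allow and not any(re.search(pattern, url) for pattern in allow):
        return False
    # deny takes precedence over allow.
    if any(re.search(pattern, url) for pattern in deny):
        return False
    return True


# Made-up URLs for illustration.
urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/subsection.php?id=2",
    "http://www.example.com/item.php?id=3",
    "http://www.example.com/about.html",
]

# Mirrors the first rule: follow category pages, skip subsection pages.
followed = [
    u for u in urls
    if link_allowed(u, allow=(r"category\.php",), deny=(r"subsection\.php",))
]
# Mirrors the second rule: item pages go to the callback.
parsed = [u for u in urls if link_allowed(u, allow=(r"item\.php",))]

print(followed)
print(parsed)
```

In the spider itself, the first rule has no callback, so its links are only followed, while the second rule hands matching pages to its callback.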

There are other spider classes as well, such as XMLFeedSpider:

class scrapy.contrib.spiders.XMLFeedSpider
class scrapy.contrib.spiders.CSVFeedSpider
class scrapy.contrib.spiders.SitemapSpider

4. Selector

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

You can use .css() and .xpath() to quickly select the target data.

Study selectors carefully, both xpath() and css(), and keep getting familiar with regular expressions.

When selecting by class, prefer css(), and then use xpath() to select the element's attributes.

5. Item Pipeline

Typical uses of item pipelines are:
• Cleansing HTML data
• Validating scraped data (checking that the items contain certain fields)
• Checking for duplicates (and dropping them)
• Storing the scraped item in a database
(1) Validate data

from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            # Add VAT when the scraped price excludes it
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            # Return the item so that later pipelines receive it
            return item
        else:
            raise DropItem('Missing price in %s' % item)

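(2) Write a JSON file

import json


class JsonWriterPipeline(object):
    def __init__(self):
        # Text mode, because json.dumps() returns a str
        self.file = open('json.jl', 'w')

    def process_item(self, item, spider):
        # One JSON object per line
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()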

(3) Check for duplicates

from scrapy.exceptions import DropItem


class Duplicates(object):
    def __init__(self):
        # ids of the items seen so far
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item

Writing data to a database should be just as simple: store the item inside the process_item function.
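As a rough illustration of that idea, here is a minimal sketch of a pipeline that stores items in SQLite using the standard sqlite3 module. Everything specific in it is an assumption made for the example: the class name SQLitePipeline, the items table, and the id and name fields are hypothetical, and the in-memory database is used only so the sketch can be run outside Scrapy with plain dicts.

```python
import sqlite3


class SQLitePipeline(object):
    """Hypothetical pipeline that stores each item's id and name in SQLite."""

    def __init__(self, db_path=":memory:"):
        # ':memory:' keeps the sketch self-contained; a real project
        # would point this at a file.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called when the spider starts: open the connection, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized query, so scraped text is never interpolated into SQL.
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, name) VALUES (?, ?)",
            (item["id"], item["name"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called when the spider finishes.
        self.conn.close()


# Exercise the pipeline outside Scrapy, with plain dicts standing in for items.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"id": "1", "name": "first"}, spider=None)
pipeline.process_item({"id": "2", "name": "second"}, spider=None)
rows = pipeline.conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

Inside a project, a pipeline only runs once it is listed in the ITEM_PIPELINES setting in settings.py. Note also that fields filled with extract() hold lists, so they would need to be joined into a single string before being stored this way.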
