Python Crawlers: Learning the Scrapy Framework

Last updated: 2018-07-30 Source: Internet

Uploaded by: User

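I. Steps:
Create a project (Project): create a new crawler project
Define the target (Items): decide which data you want to scrape
Write the spider (Spider): write a spider that crawls the pages
Store the content (Pipeline): design a pipeline that stores the scraped content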

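1. Create the project
scrapy startproject quotes
cd quotes
scrapy genspider books quotes.toscrape.com
The first command creates the project skeleton. The second generates a spider template named books for the domain quotes.toscrape.com, matching the spider shown in step 3.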

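2. Define the target
In Scrapy, an Item is the container that holds the scraped data. It behaves much like a Python dict, but adds some protection against mistakes such as misspelled field names.
An item is normally declared by subclassing scrapy.item.Item and defining its attributes with scrapy.item.Field objects (comparable to the field mappings of an ORM).
Next, we build the item model.
The data we want is:
Author (author)
Quote text (text)
Tags (tags)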

3. Write the spider (the most important step)

# -*- coding: utf-8 -*-
import sys

import scrapy

# Only needed when the project package is not already importable;
# the path is specific to the author's machine.
sys.path.append("D:\\pycodes\\quotes")
from quotes.items import quotesItem


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for sel in response.xpath('//div[@class="quote"]'):
            item = quotesItem()
            # extract() returns a list of strings for every field.
            item['text'] = sel.xpath('span[@class="text"]/text()').extract()
            item['author'] = sel.xpath('span/small/text()').extract()
            item['tags'] = sel.xpath('div/a/text()').extract()
            yield item

4. Design the pipeline

The scraped items are processed by an item pipeline.

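class DoubanPipeline(object):
    # Default pipeline: passes every item through unchanged.
    def process_item(self, item, spider):
        return item


class DoubanInfoPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.f = open("result.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file.
        self.f.close()

    def process_item(self, item, spider):
        # Write each item as one line of text.
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception as exc:
            # Log the failure instead of silently swallowing it.
            spider.logger.warning("Could not write item: %s", exc)
        return item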
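
A pipeline class only runs once it is enabled in the project's settings.py. A sketch of that setting is shown below; the module path quotes.pipelines is an assumption based on the project name used in the spider's import, so adjust it to your own project.

```python
# settings.py (fragment)
# Keys are import paths of pipeline classes; values are priorities
# (0-1000), and pipelines with lower numbers run first.
# "quotes.pipelines" is an assumed module path, not taken from the article.
ITEM_PIPELINES = {
    "quotes.pipelines.DoubanInfoPipeline": 300,
}
```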

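II. Usage notes:
1. Using XPath selectors
response.xpath('//div/@href').extract()  # href value of every div that has one
response.xpath('//div[@href]/text()').extract()  # text of every div that has an href attribute
response.xpath('//div[contains(@href, "image")]/@href').extract()  # href values that contain "image"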

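To select p elements under a div that are not its direct children, use
div.xpath('.//p'). Note the leading dot: without it, //p searches the whole document instead of only the div.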

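2. Using .re()
Selectors also have a .re() method for extracting data with regular expressions. Unlike .xpath() or .css(), however, .re() returns a list of strings rather than selectors, so .re() calls cannot be nested.

For example, to extract the image names from link text of the form "Name: My image 1":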

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']

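3. Regular expressions inside XPath: re:test()
When XPath's starts-with() or contains() is not enough, the test() function, which is available in selectors under the re: prefix, can be very useful.

For example, to select the links in list items whose class attribute ends with a digit: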

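>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
['link1.html', 'link2.html', 'link4.html', 'link5.html']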

4. Use enumerate() to loop over links together with their index:
for index, link in enumerate(links):
    print(index, link)
Output:
0 link1
1 link2
…

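5. You do not have to follow all four steps
Sometimes you can leave items.py at its default and yield a plain dict directly from the spider, for example:
yield {
    # field names and values go here
}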

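6. Following links recursively to crawl page after page

Add the following to the parse(self, response) method:

# Put the XPath of the next-page link's href between the quotes.
next_page = response.xpath("").extract_first()
if next_page:
    # Turn a relative link into an absolute URL, then request it
    # and handle the result with this same parse() method.
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)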
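
The urljoin step matters because next-page links are often relative. response.urljoin() resolves a link against the URL of the current response, in the same way as the standard library's urljoin; the sketch below uses the standard library directly so that it runs without a crawl. The page paths are assumed for illustration.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (assumed for illustration).
page_url = "http://quotes.toscrape.com/page/1/"

root_relative = urljoin(page_url, "/page/2/")
dot_relative = urljoin(page_url, "../2/")
already_absolute = urljoin(page_url, "http://example.com/other/")

print(root_relative)     # http://quotes.toscrape.com/page/2/
print(dot_relative)      # http://quotes.toscrape.com/page/2/
print(already_absolute)  # http://example.com/other/
```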

7. Preventing 403 errors:
Edit the project's settings.py file and set USER_AGENT so that requests look like they come from a browser:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
