2017.08.10 Python Crawlers in Practice: Crawler Attack and Defense

Last updated: 2017-08-10 Source: Internet

Uploaded by: User


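1. Building an ordinary crawler: generally speaking, a crawler that makes fewer than 100 requests does not need to worry about anti-crawler measures.

(1) Taking 美劇天堂 (meijutt.com) as an example, the source page is http://www.meijutt.com/new100.html. Project setup: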

scrapy startproject meiju100

F:\Python\PythonWebScraping\PythonScrapyProject>cd meiju100

F:\Python\PythonWebScraping\PythonScrapyProject\meiju100>scrapy genspider meiju100Spider meijutt.com

Project file structure:

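(2) Modify items.py so that it declares the four fields the spider fills in: storyName, storyState, tvStation and updateTime.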

(3) Modify meiju100Spider.py:

First inspect the page source: the tags that begin with <div class="lasted-num fn-left"> sit next to the data we need:

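# -*- coding: utf-8 -*-
import scrapy
from meiju100.items import Meiju100Item


class Meiju100spiderSpider(scrapy.Spider):
    name = 'meiju100Spider'
    allowed_domains = ['meijutt.com']
    # Must be a list: ('url') without a trailing comma is just a string
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        # One match per ranking entry
        subSelector = response.xpath('//li/div[@class="lasted-num fn-left"]')
        items = []
        for sub in subSelector:
            item = Meiju100Item()
            # '..' steps up to the parent <li>, so each lookup stays inside one entry
            item['storyName'] = sub.xpath('../h5/a/text()').extract()[0]
            item['storyState'] = sub.xpath('../span[@class="state1 new100state1"]/text()').extract()[0]
            # Kept as a list; the pipeline handles the empty case
            item['tvStation'] = sub.xpath('../span[@class="mjtv"]/text()').extract()
            # The original .extract()[0] raised "IndexError: list index out of range" here:
            # <div class="lasted-time new100time fn-right"> is not under the parent node above.
            # extract_first(default='') avoids the crash when nothing matches. Note that a
            # path starting with '//' searches the whole page, not just this entry.
            item['updateTime'] = sub.xpath('//div[@class="lasted-time new100time fn-right"]/text()').extract_first(default='')
            items.append(item)

        return items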

(4) Write pipelines.py to save the scraped data to a file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import io
import time


class Meiju100Pipeline(object):
    def process_item(self, item, spider):
        # One output file per day, named like 20170810meiju.txt
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'meiju.txt'
        # io.open with an explicit encoding writes UTF-8 text on Python 2 and 3 alike
        with io.open(fileName, 'a', encoding='utf-8') as fp:
            # One tab-separated line per item
            fp.write(u"%s \t" % item['storyName'])
            fp.write(u"%s \t" % item['storyState'])
            if len(item['tvStation']) == 0:
                fp.write(u"unknown \t")
            else:
                fp.write(u"%s \t" % item['tvStation'][0])
            fp.write(u"%s \n" % item['updateTime'])

        return item

(5) Modify settings.py so that the pipeline is registered in the ITEM_PIPELINES setting.
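
The original settings.py listing is missing, so this is a sketch of the one entry the steps above require. The dotted path follows from the project name meiju100 and the class Meiju100Pipeline used in this article; the priority 300 is an assumption (it is the value Scrapy's project template suggests, and any integer from 0 to 1000 is accepted).

```python
# settings.py (sketch): register the pipeline so Scrapy calls process_item()
ITEM_PIPELINES = {
    # dotted path to the pipeline class: priority (lower numbers run first)
    'meiju100.pipelines.Meiju100Pipeline': 300,
}
```

Without this entry the spider still runs, but no item ever reaches the pipeline and no output file is written.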

(6) From any directory inside the meiju100 project, run the command: scrapy crawl meiju100Spider

Run result:

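2. Defeating request-interval blocking: Scrapy's DOWNLOAD_DELAY setting controls the time between two consecutive requests. If anti-crawler measures were not a concern, the smaller this value the better.

Setting DOWNLOAD_DELAY to 0.1, however, means requesting a page from the site every 0.1 seconds, a rate that no human visitor produces and that a site can easily flag.

So the fix is simply to append a larger DOWNLOAD_DELAY value to the end of settings.py.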
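
The value the author used is not preserved in the text, so the number below is purely an assumption chosen for illustration. DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are both standard Scrapy settings.

```python
# settings.py (sketch): slow the crawler down; 5 seconds is a hypothetical value
DOWNLOAD_DELAY = 5

# Scrapy's default is already True: the actual wait is then a random value
# between 0.5 and 1.5 times DOWNLOAD_DELAY, which looks less mechanical.
RANDOMIZE_DOWNLOAD_DELAY = True
```

A suitable delay depends on the target site; a longer delay lowers the risk of being blocked at the cost of a slower crawl.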

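3. Defeating cookie-based blocking: as everyone knows, websites use cookies to identify users. A Scrapy crawler that sends every request with the same cookies is as easy to spot as one running with DOWNLOAD_DELAY set to 0.1.

Defeating this kind of anti-crawler check is therefore simple: disable cookies by appending COOKIES_ENABLED = False to the end of settings.py.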
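
As a sketch, the appended line would look like this. COOKIES_ENABLED is a standard Scrapy setting; the original listing is not preserved, so the exact form the author used is an assumption.

```python
# settings.py (sketch): stop Scrapy's cookies middleware from storing and
# resending cookies, so requests cannot be linked through a shared cookie
COOKIES_ENABLED = False
```

This only suits pages that do not require a login, since a session that depends on cookies stops working once they are disabled.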
