1. Create a generic crawler: In general, a crawler that makes fewer than 100 requests to a site does not need to worry about anti-crawler measures.
(1) Take crawling the American drama site Meijutt (美剧天堂) as an example. Source page: http://www.meijutt.com/new100.html. Project preparation:
scrapy startproject meiju100
F:\python\pythonwebscraping\pythonscrapyproject>cd meiju100
F:\python\pythonwebscraping\pythonscrapyproject\meiju100>scrapy genspider meiju100Spider meijutt.com
Project file structure:
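The structure itself is not reproduced in the original; what follows is the standard layout that scrapy startproject and scrapy genspider generate for this project (a sketch, not copied from the post):

meiju100/
    scrapy.cfg
    meiju100/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            meiju100Spider.py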
(2) Modify the items.py file:
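The change itself is not reproduced in the original; judging from the fields the spider and pipeline below use (storyName, storyState, tvStation, updateTime), a minimal sketch of items.py looks like this:

# -*- coding: utf-8 -*-
import scrapy

class Meiju100Item(scrapy.Item):
    # one field per piece of data scraped from each list entry
    storyName = scrapy.Field()    # drama title
    storyState = scrapy.Field()   # update state, e.g. latest episode
    tvStation = scrapy.Field()    # broadcasting station
    updateTime = scrapy.Field()   # last update time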
(3) Modify the meiju100spider.py file:
First check the web page source code: the tags beginning with <div class="lasted-num fn-left"> contain the data we need:
# -*- coding: utf-8 -*-
import scrapy
from meiju100.items import Meiju100Item

class Meiju100SpiderSpider(scrapy.Spider):
    name = 'meiju100Spider'
    allowed_domains = ['meijutt.com']
    start_urls = (
        'http://www.meijutt.com/new100.html',
    )

    def parse(self, response):
        subSelector = response.xpath('//li/div[@class="lasted-num fn-left"]')
        items = []
        for sub in subSelector:
            item = Meiju100Item()
            item['storyName'] = sub.xpath('../h5/a/text()').extract()[0]
            item['storyState'] = sub.xpath('../span[@class="state1 new100state1"]/text()').extract()[0]
            item['tvStation'] = sub.xpath('../span[@class="mjtv"]/text()').extract()
            # A relative '../div[...]' path here raises IndexError: list index out of
            # range, because <div class="lasted-time new100time fn-right"> does not sit
            # under the same parent node, so an absolute '//' XPath is used instead.
            item['updateTime'] = sub.xpath('//div[@class="lasted-time new100time fn-right"]/text()').extract()[0]
            items.append(item)
        return items
(4) Write the pipelines.py file to save the crawled data to a file:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time

class Meiju100Pipeline(object):
    def process_item(self, item, spider):
        # name the output file after today's date, e.g. 20170810meiju.txt
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'meiju.txt'
        with open(fileName, 'a') as fp:
            # Python 2 idiom: encode unicode to UTF-8 bytes before writing
            fp.write("%s \t" % (item['storyName'].encode('utf8')))
            fp.write("%s \t" % (item['storyState'].encode('utf8')))
            if len(item['tvStation']) == 0:
                fp.write("unknown \t")
            else:
                fp.write("%s \t" % (item['tvStation'][0].encode('utf8')))
            fp.write("%s \n" % (item['updateTime'].encode('utf8')))
        return item
(5) Modify the settings.py file:
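The change is not shown in the original, but the comment header in pipelines.py above says to register the pipeline in the ITEM_PIPELINES setting; presumably the modification is the standard registration (the priority 300 is the usual Scrapy example value, an assumption here):

ITEM_PIPELINES = {
    'meiju100.pipelines.Meiju100Pipeline': 300,  # priority value assumed
}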
(6) From any directory inside the meiju100 project, run the command: scrapy crawl meiju100Spider
Operation Result:
2. Defeating interval-time blocking: DOWNLOAD_DELAY sets the time Scrapy waits between two requests. If anti-crawler measures were of no concern, the smaller this value the better: setting DOWNLOAD_DELAY to 0.1 means a page is requested every 0.1 seconds. But that pace is exactly what interval-based anti-crawler checks flag, so to get past them the delay must be increased.
Therefore, append this item to the end of settings.py:
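The appended line is not shown in the original; a minimal sketch, with the delay chosen purely for illustration (the original's value is unknown):

# settings.py: wait a few seconds between requests so the crawler
# no longer trips interval-based anti-crawler checks
DOWNLOAD_DELAY = 5  # illustrative value, not from the original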
3. Defeating cookie blocking: It is well known that websites use cookies to identify users, and a Scrapy crawler sends every request with the same cookie while crawling, a practice that gives it away just as surely as a DOWNLOAD_DELAY of 0.1 does.
Cracking this kind of anti-crawler check is therefore very simple: just disable cookies by appending one item to the settings.py file:
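The appended item is likewise not reproduced in the original; disabling cookies in Scrapy is done with the COOKIES_ENABLED setting:

# settings.py: stop sending and receiving cookies so every request
# looks like it comes from a fresh, anonymous visitor
COOKIES_ENABLED = False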
2017.08.10 Python crawler practical guide