2017.08.10 Python Crawler in Practice: Beating Anti-Crawler Measures


1. Build an ordinary crawler: as a rule of thumb, a crawler that makes fewer than 100 requests does not need to worry about anti-crawler measures.

(1) Take crawling American Drama Paradise (meijutt.com) as an example; source page: http://www.meijutt.com/new100.html. Project preparation:

scrapy startproject meiju100

F:\python\pythonwebscraping\pythonscrapyproject>cd meiju100

F:\python\pythonwebscraping\pythonscrapyproject\meiju100>scrapy genspider meiju100Spider meijutt.com

Project file structure:

(2) Modify the items.py file:
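The original post does not show the file contents, so here is a minimal sketch of items.py, assuming only the four fields that the spider code below fills in:

# -*- coding: utf-8 -*-
# items.py -- minimal sketch; field names follow the spider code below
import scrapy


class Meiju100Item(scrapy.Item):
    storyName = scrapy.Field()   # show title
    storyState = scrapy.Field()  # update state (e.g. latest episode)
    tvStation = scrapy.Field()   # broadcasting station
    updateTime = scrapy.Field()  # last update time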

(3) Modify the meiju100Spider.py file:

First inspect the page source: the required data sits under tags beginning with <div class="lasted-num fn-left">:

# -*- coding: utf-8 -*-
import scrapy
from meiju100.items import Meiju100Item


class Meiju100SpiderSpider(scrapy.Spider):
    name = 'meiju100Spider'
    allowed_domains = ['meijutt.com']
    start_urls = (
        'http://www.meijutt.com/new100.html',
    )

    def parse(self, response):
        subSelector = response.xpath('//li/div[@class="lasted-num fn-left"]')
        items = []
        for sub in subSelector:
            item = Meiju100Item()
            item['storyName'] = sub.xpath('../h5/a/text()').extract()[0]
            item['storyState'] = sub.xpath('../span[@class="state1 new100state1"]/text()').extract()[0]
            item['tvStation'] = sub.xpath('../span[@class="mjtv"]/text()').extract()
            # A relative path here raises IndexError: list index out of range, because
            # <div class="lasted-time new100time fn-right"> does not hang off the same
            # parent node, so an absolute XPath is used instead:
            item['updateTime'] = sub.xpath('//div[@class="lasted-time new100time fn-right"]/text()').extract()[0]
            items.append(item)

        return items


  (4) Write the pipelines.py file to save the crawled data to a text file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time


class Meiju100Pipeline(object):
    def process_item(self, item, spider):
        # one text file per day, e.g. 20170810Meiju.txt
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'Meiju.txt'
        with open(fileName, 'a') as fp:
            fp.write("%s \t" % item['storyName'].encode('utf8'))
            fp.write("%s \t" % item['storyState'].encode('utf8'))
            if len(item['tvStation']) == 0:
                fp.write("unknow \t")
            else:
                fp.write("%s \t" % item['tvStation'][0].encode('utf8'))
            fp.write("%s \n" % item['updateTime'].encode('utf8'))

        return item

(5) Modify the settings.py file:
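The source does not show the change itself; presumably the point is to register the pipeline, so here is a sketch of the relevant settings.py excerpt (the priority value 100 is an assumed example, any value in 0-1000 works):

# settings.py (excerpt) -- register the pipeline; 100 is an assumed priority
ITEM_PIPELINES = {
    'meiju100.pipelines.Meiju100Pipeline': 100,
}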

(6) From any directory inside the meiju100 project, run the command: scrapy crawl meiju100Spider

Operation Result:

2. Defeating blocking based on request interval: Scrapy controls the time between two requests with the DOWNLOAD_DELAY setting. If anti-crawler measures are not a concern, the smaller this value the better.

Setting DOWNLOAD_DELAY to 0.1 means a page is requested every 0.1 seconds, a pattern a site can easily spot and block.

Therefore, you need to append this item to the end of settings.py:
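The exact value is not given in the source; an assumed example of the appended line, with the delay raised well above 0.1 seconds:

# settings.py (appended) -- 5 seconds is an assumed example value
DOWNLOAD_DELAY = 5

Scrapy's RANDOMIZE_DOWNLOAD_DELAY option (on by default) additionally varies the actual wait between 0.5x and 1.5x of this value, which makes the request pattern less regular.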

3. Defeating cookie-based blocking: as is well known, websites use cookies to identify users, and a Scrapy crawler sends every request with the same cookie while crawling, which marks it out just as clearly as setting DOWNLOAD_DELAY to 0.1.

Defeating this kind of anti-crawler check is therefore also simple: just disable cookies by appending one item to settings.py:
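The item in question is presumably Scrapy's standard cookie switch:

# settings.py (appended) -- stop sending cookies so requests cannot be tied to one session
COOKIES_ENABLED = False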
