1. Create a generic crawler: In general, a crawler that makes fewer than 100 requests to a site does not need to worry about anti-crawler measures.
(1) Take crawling the American drama site Meijutt (美剧天堂) as an example. Source page: http://www.meijutt.com/new100.html. Project preparation:
scrapy startproject meiju100
F:\python\pythonwebscraping\pythonscrapyproject>cd meiju100
F:\python\pythonwebscraping\pythonscrapyproject\meiju100>scrapy genspider meiju100Spider meijutt.com
Project file structure:
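The structure itself is not reproduced in the original; what follows is the standard layout that scrapy startproject and scrapy genspider generate for this project (a sketch, not copied from the post):

meiju100/
    scrapy.cfg
    meiju100/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            meiju100Spider.py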
(2) Modify the items.py file:
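The change itself is not reproduced in the original; judging from the fields the spider and pipeline below use (storyName, storyState, tvStation, updateTime), a minimal sketch of items.py looks like this:

# -*- coding: utf-8 -*-
import scrapy

class Meiju100Item(scrapy.Item):
    # one field per piece of data scraped from each list entry
    storyName = scrapy.Field()    # drama title
    storyState = scrapy.Field()   # update state, e.g. latest episode
    tvStation = scrapy.Field()    # broadcasting station
    updateTime = scrapy.Field()   # last update time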
(3) Modify the meiju100spider.py file:
First check the web page source code: the tags beginning with <div class="lasted-num fn-left"> contain the data we need:
# -*- coding: utf-8 -*-
import scrapy
from meiju100.items import Meiju100Item

class Meiju100SpiderSpider(scrapy.Spider):
    name = 'meiju100Spider'
    allowed_domains = ['meijutt.com']
    start_urls = (
        'http://www.meijutt.com/new100.html',
    )

    def parse(self, response):
        subSelector = response.xpath('//li/div[@class="lasted-num fn-left"]')
        items = []
        for sub in subSelector:
            item = Meiju100Item()
            item['storyName'] = sub.xpath('../h5/a/text()').extract()[0]
            item['storyState'] = sub.xpath('../span[@class="state1 new100state1"]/text()').extract()[0]
            item['tvStation'] = sub.xpath('../span[@class="mjtv"]/text()').extract()
            # A relative '../div[...]' path here raises IndexError: list index out of
            # range, because <div class="lasted-time new100time fn-right"> does not sit
            # under the same parent node, so an absolute '//' XPath is used instead.
            item['updateTime'] = sub.xpath('//div[@class="lasted-time new100time fn-right"]/text()').extract()[0]
            items.append(item)
        return items
(4) Write the pipelines.py file to save the crawled data to a file:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time

class Meiju100Pipeline(object):
    def process_item(self, item, spider):
        # name the output file after today's date, e.g. 20170810meiju.txt
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'meiju.txt'
        with open(fileName, 'a') as fp:
            # Python 2 idiom: encode unicode to UTF-8 bytes before writing
            fp.write("%s \t" % (item['storyName'].encode('utf8')))
            fp.write("%s \t" % (item['storyState'].encode('utf8')))
            if len(item['tvStation']) == 0:
                fp.write("unknown \t")
            else:
                fp.write("%s \t" % (item['tvStation'][0].encode('utf8')))
            fp.write("%s \n" % (item['updateTime'].encode('utf8')))
        return item
(5) Modify the settings.py file:
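The change is not shown in the original, but the comment header in pipelines.py above says to register the pipeline in the ITEM_PIPELINES setting; presumably the modification is the standard registration (the priority 300 is the usual Scrapy example value, an assumption here):

ITEM_PIPELINES = {
    'meiju100.pipelines.Meiju100Pipeline': 300,  # priority value assumed
}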
(6) From any directory inside the meiju100 project, run the command: scrapy crawl meiju100Spider
Operation Result:
2. Defeating interval-time blocking: DOWNLOAD_DELAY sets the time Scrapy waits between two requests. If anti-crawler measures were of no concern, the smaller this value the better: setting DOWNLOAD_DELAY to 0.1 means a page is requested every 0.1 seconds. But that pace is exactly what interval-based anti-crawler checks flag, so to get past them the delay must be increased.
Therefore, append this item to the end of settings.py:
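The appended line is not shown in the original; a minimal sketch, with the delay chosen purely for illustration (the original's value is unknown):

# settings.py: wait a few seconds between requests so the crawler
# no longer trips interval-based anti-crawler checks
DOWNLOAD_DELAY = 5  # illustrative value, not from the original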
3. Defeating cookie blocking: It is well known that websites use cookies to identify users, and a Scrapy crawler sends every request with the same cookie while crawling, a practice that gives it away just as surely as a DOWNLOAD_DELAY of 0.1 does.
Cracking this kind of anti-crawler check is therefore very simple: just disable cookies by appending one item to the settings.py file:
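The appended item is likewise not reproduced in the original; disabling cookies in Scrapy is done with the COOKIES_ENABLED setting:

# settings.py: stop sending and receiving cookies so every request
# looks like it comes from a fresh, anonymous visitor
COOKIES_ENABLED = False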
2017.08.10 Python crawler practical guide