1. Project preparation. Sites to crawl: http://www.proxy360.cn/Region/China and http://www.xicidaili.com/
2. Create the Scrapy project and generate the first spider:
scrapy startproject getproxy
scrapy genspider proxy360spider proxy360.cn
Project directory structure:
3. Modify items.py:
4. Modify the spider file proxy360spider.py:
(1) First use the scrapy shell command to inspect what the site returns:
scrapy shell http://www.proxy360.cn/Region/China
(2) Look at the response content with response.xpath('/*').extract(); the returned data contains the proxy servers.
(3) Observe that every data block begins with the tag <div class="proxylistitem" name="list_proxy_ip">:
(4) Test it in the scrapy shell:
subSelector = response.xpath('//div[@class="proxylistitem" and @name="list_proxy_ip"]')
subSelector.xpath('.//span[1]/text()').extract()[0]
subSelector.xpath('.//span[2]/text()').extract()[0]
subSelector.xpath('.//span[3]/text()').extract()[0]
subSelector.xpath('.//span[4]/text()').extract()[0]
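The same relative extraction can also be sanity-checked offline with the standard library on a saved snippet. This is only an illustration with made-up sample data (ElementTree supports a subset of XPath and uses `.text` instead of `text()`):

```python
import xml.etree.ElementTree as ET

# Made-up sample mimicking proxy360's markup (real pages are fetched in the shell above)
html = ('<html><body>'
        '<div class="proxylistitem" name="list_proxy_ip">'
        '<span>61.135.217.7</span><span>80</span>'
        '<span>anonymous</span><span>Beijing</span>'
        '</div>'
        '</body></html>')

root = ET.fromstring(html)
for div in root.findall(".//div[@name='list_proxy_ip']"):
    spans = div.findall('.//span')
    # span[1]..span[4] in the shell commands correspond to the first four spans here
    ip, port, ptype, loc = (s.text for s in spans[:4])
    print(ip, port, ptype, loc)  # -> 61.135.217.7 80 anonymous Beijing
```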
(5) Write the spider file proxy360spider.py:
# -*- coding: utf-8 -*-
import scrapy
from getproxy.items import GetproxyItem

class Proxy360spiderSpider(scrapy.Spider):
    name = 'proxy360spider'
    allowed_domains = ['proxy360.cn']
    nations = ['Brazil', 'China', 'Amercia', 'Taiwan', 'Japan', 'Thailand', 'Vietnam', 'Bahrein']
    start_urls = []
    for nation in nations:
        start_urls.append('http://www.proxy360.cn/Region/' + nation)

    def parse(self, response):
        subSelector = response.xpath('//div[@class="proxylistitem" and @name="list_proxy_ip"]')
        items = []
        for sub in subSelector:
            item = GetproxyItem()
            item['ip'] = sub.xpath('.//span[1]/text()').extract()[0]
            item['port'] = sub.xpath('.//span[2]/text()').extract()[0]
            item['type'] = sub.xpath('.//span[3]/text()').extract()[0]
            item['loction'] = sub.xpath('.//span[4]/text()').extract()[0]
            item['protocol'] = 'HTTP'
            item['source'] = 'proxy360'
            items.append(item)
        return items
(6) Modify the pipelines.py file to save the items:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class GetproxyPipeline(object):
    def process_item(self, item, spider):
        fileName = 'proxy.txt'
        with open(fileName, 'a') as fp:
            fp.write(item['ip'].encode('utf8').strip() + '\t')
            fp.write(item['port'].encode('utf8').strip() + '\t')
            fp.write(item['protocol'].encode('utf8').strip() + '\t')
            fp.write(item['type'].encode('utf8').strip() + '\t')
            fp.write(item['loction'].encode('utf8').strip() + '\t')
            fp.write(item['source'].encode('utf8').strip() + '\n')
        return item
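The pipeline writes one tab-separated record per proxy, and the verification script in step 6 relies on exactly this layout when it splits each line. A small round-trip check of the format (plain Python; the sample values are made up):

```python
# Fields in the order the pipeline writes them: ip, port, protocol, type, loction, source
record = '\t'.join(['61.135.217.7', '80', 'HTTP', 'anonymous', 'Beijing', 'proxy360']) + '\n'

fields = record.strip().split('\t')
ip, port, protocol = fields[0], fields[1], fields[2]
# This is the proxy URL the verifier will build later
server = protocol.lower() + '://' + ip + ':' + port
print(server)  # -> http://61.135.217.7:80
```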
(7) Modify settings.py so Scrapy knows which pipeline should process the scraped items:
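The settings.py change is presumably just enabling the pipeline; the module path below assumes the project was created as getproxy:

```python
# settings.py -- enable the pipeline; 300 is an arbitrary priority in the 0-1000 range
ITEM_PIPELINES = {
    'getproxy.pipelines.GetproxyPipeline': 300,
}
```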
(8) Result of execution:
5. Add more spiders: the proxy data gathered by a single spider is not enough:
(1) In the getproxy directory, execute: scrapy genspider xicispider xicidaili.com
(2) Work out how to obtain the data: scrapy shell http://www.xicidaili.com/nn/2
(3) Simply add a USER_AGENT entry to settings.py.
Then test fetching the data again: scrapy shell http://www.xicidaili.com/nn/2
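The site apparently refuses Scrapy's default user agent, hence the USER_AGENT addition. Any common browser string should work; the exact string below is only an illustration:

```python
# settings.py -- impersonate a regular browser so xicidaili.com serves the page
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/59.0.3071.115 Safari/537.36')
```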
(4) View the page source in the browser: the required data rows are <tr class="odd"> (alternating with rows whose class attribute is empty).
(5) Execute the commands in the scrapy shell:
subSelector = response.xpath('//tr[@class=""] | //tr[@class="odd"]')
subSelector[0].xpath('.//td[2]/text()').extract()[0]
subSelector[0].xpath('.//td[3]/text()').extract()[0]
subSelector[0].xpath('.//td[4]/a/text()').extract()[0]
subSelector[0].xpath('.//td[5]/text()').extract()[0]
subSelector[0].xpath('.//td[6]/text()').extract()[0]
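The union in the XPath is needed because xicidaili alternates the row class between "odd" and the empty string. A quick stdlib illustration on a made-up fragment (Scrapy supports full XPath including `|`; ElementTree does not, so the filtering is done in Python here):

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking xicidaili's alternating row classes
html = ('<table>'
        '<tr class="odd"><td></td><td>1.2.3.4</td><td>80</td></tr>'
        '<tr class=""><td></td><td>5.6.7.8</td><td>8080</td></tr>'
        '<tr class="banner"><td>ad row, should be skipped</td></tr>'
        '</table>')

table = ET.fromstring(html)
# Keep only rows whose class is "" or "odd", like //tr[@class=""] | //tr[@class="odd"]
rows = [tr for tr in table.findall('tr') if tr.get('class') in ('', 'odd')]
ips = [tr.findall('td')[1].text for tr in rows]
print(ips)  # -> ['1.2.3.4', '5.6.7.8']
```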
(6) Write xicispider.py:
# -*- coding: utf-8 -*-
import scrapy
from getproxy.items import GetproxyItem

class XicispiderSpider(scrapy.Spider):
    name = 'xicispider'
    allowed_domains = ['xicidaili.com']
    wds = ['nn', 'nt', 'wn', 'wt']
    pages = 20
    start_urls = []
    for type in wds:
        for i in xrange(1, pages + 1):
            start_urls.append('http://www.xicidaili.com/' + type + '/' + str(i))

    def parse(self, response):
        subSelector = response.xpath('//tr[@class=""] | //tr[@class="odd"]')
        items = []
        for sub in subSelector:
            item = GetproxyItem()
            item['ip'] = sub.xpath('.//td[2]/text()').extract()[0]
            item['port'] = sub.xpath('.//td[3]/text()').extract()[0]
            item['type'] = sub.xpath('.//td[5]/text()').extract()[0]
            if sub.xpath('.//td[4]/a/text()'):
                item['loction'] = sub.xpath('.//td[4]/a/text()').extract()[0]
            else:
                item['loction'] = sub.xpath('.//td[4]/text()').extract()[0]
            item['protocol'] = sub.xpath('.//td[6]/text()').extract()[0]
            item['source'] = 'xicidaili'
            items.append(item)
        return items
(7) Execute: scrapy crawl xicispider
Results:
6. Verify that the obtained proxy addresses work: write a separate Python script, testproxy.py, to validate them:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
import re
import threading

class TestProxy(object):
    def __init__(self):
        self.sFile = r'proxy.txt'       # input: proxies collected by the spiders
        self.dFile = r'alive.txt'       # output: proxies that passed the check
        self.URL = r'http://www.baidu.com/'
        self.threads = 10               # number of checks to run concurrently
        self.timeout = 3
        self.regex = re.compile(r'baidu.com')
        self.aliveList = []
        self.run()

    def run(self):
        with open(self.sFile, 'r') as fp:
            lines = fp.readlines()
        threadList = []
        while lines:
            # start up to self.threads checks at a time
            for i in xrange(self.threads):
                if not lines:
                    break
                t = threading.Thread(target=self.linkWithProxy, args=(lines.pop(),))
                t.start()
                threadList.append(t)
        for t in threadList:
            t.join()  # wait for every check to finish before writing the results
        with open(self.dFile, 'w') as fp:
            for line in self.aliveList:
                fp.write(line)

    def linkWithProxy(self, line):
        lineList = line.split('\t')
        protocol = lineList[2].lower()
        server = protocol + r'://' + lineList[0] + ':' + lineList[1]
        opener = urllib2.build_opener(urllib2.ProxyHandler({protocol: server}))
        urllib2.install_opener(opener)
        try:
            response = urllib2.urlopen(self.URL, timeout=self.timeout)
            content = response.read()
        except:
            print('%s connect failed' % server)
            return
        if self.regex.search(content):
            print('%s connect success' % server)
            self.aliveList.append(line)

if __name__ == '__main__':
    TP = TestProxy()
Execute the command: python testproxy.py
Results:
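testproxy.py above is Python 2 (urllib2, xrange). Under Python 3 the same check would use urllib.request instead; a minimal sketch of the per-line logic, where parse_line mirrors the tab-separated format written by the pipeline (function names here are my own, not from the original):

```python
import re
import urllib.request

def parse_line(line):
    """Split one proxy.txt record into (protocol, proxy server URL)."""
    fields = line.split('\t')  # ip, port, protocol, type, loction, source
    protocol = fields[2].lower()
    server = protocol + '://' + fields[0] + ':' + fields[1]
    return protocol, server

def link_with_proxy(line, url='http://www.baidu.com/', timeout=3):
    """Return True if the proxy can fetch `url` and the body mentions baidu.com."""
    protocol, server = parse_line(line)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({protocol: server}))
    try:
        body = opener.open(url, timeout=timeout).read().decode('utf-8', 'replace')
    except Exception:
        return False
    return re.search(r'baidu\.com', body) is not None

if __name__ == '__main__':
    # Network access is needed for link_with_proxy; parse_line can be tried offline
    print(parse_line('61.135.217.7\t80\tHTTP\tanonymous\tBeijing\tproxy360\n'))
```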
2017.08.05 Python web crawler in practice: obtaining proxies