First of all, sorry for keeping you waiting. I originally meant to publish this on May 20th (520), but then I thought: I'm a single dog doing scientific research anyway, and you probably wouldn't be in the mood to read an update that day, so it slipped to today. I spent the day and a half of the 21st and 22nd adding the database layer and fixing some bugs (now someone is bound to say: that really is a single dog).
Well, enough rambling; let's get into today's topic. In the previous two articles on crawling beauty pictures with Scrapy, we covered how to use Scrapy. Recently, though, some enthusiastic readers told me the old program can no longer fetch the pictures; my guess is that Jandan (the "fried egg" site) added an anti-crawler mechanism. So today we break that anti-crawler mechanism with the proxy IPs mentioned before.
One common anti-crawler practice today (there are other checks, of course) is to detect repeated requests from the same IP and use that to decide whether the visitor is a crawler or a human. Proxy IPs can break this blockade. As a member of the student party, I have no money to buy a VPN or an IP pool, so we will use free proxy IPs from the web, which are basically enough for personal use. Next we'll cover crawling free proxy IPs and verifying that they actually work.
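To make this concrete, here is a minimal sketch (my own illustration, not code from the project) of routing a Python 2 urllib2 request through a proxy; the proxy address is a placeholder:

# -*- coding: utf-8 -*-
import urllib2

# Placeholder address -- substitute a proxy crawled from a free proxy site
proxy_handler = urllib2.ProxyHandler({'http': 'http://123.123.123.123:8080'})
opener = urllib2.build_opener(proxy_handler)

# The target server now sees the proxy's IP instead of our real one
response = opener.open('http://www.baidu.com', timeout=10)
print response.getcode()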
There are plenty of proxy IP sites online; this time I chose http://www.xicidaili.com/nn/. You can try other sites as well, and together we can build up a large proxy IP pool.
Did you notice the words "high anonymity"? A high-anonymity proxy means the target server neither knows you are using a proxy nor sees your real IP, so the concealment is very high.
Following our earlier crawler lessons, use Firebug to inspect the element and see how to parse the HTML.
It is really just a table; we parse every row inside it, which is very simple, and BeautifulSoup makes it easy.
Note also that the page number of the IP table corresponds to a parameter in the URL: for example, the first page is http://www.xicidaili.com/nn/1. This saves us the trouble of simulating page turns.
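So the downloader only needs to format the page number into the URL. A minimal sketch of that idea (simplified; the User-Agent header is my own addition, since the site tends to reject the default Python one):

# -*- coding: utf-8 -*-
import urllib2

def download_page(page_num):
    # Page N of the high-anonymity list is simply /nn/N
    url = 'http://www.xicidaili.com/nn/%d' % page_num
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib2.urlopen(request, timeout=10)
    if response.getcode() == 200:
        return response.read()
    return None

# Fetch pages 1-4, matching the "python main.py -c 1 4" example below
for page in range(1, 5):
    html = download_page(page)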
Here is the structure of the program:
db package, db_helper: implements MongoDB insert, delete, update, and query (a sketch follows this list).
detect package, detect_proxy: verifies the availability of proxy IPs.
entity package, proxy_info: the object that carries a proxy's information (also sketched below).
spider package:
    spiderman: implements the crawler logic.
    html_downloader: implements the crawler's HTML downloader.
    html_parser: implements the crawler's HTML parser.
test package: test samples, not involved in the program run.
main.py: implements the command-line argument definitions.
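The parsing and verification code below relies on the proxy_infor entity and the db_helper wrapper. The real definitions are in the GitHub repository; here is a minimal sketch of what they presumably look like, inferred from how the posted code uses them (the database and collection names are my placeholders):

# -*- coding: utf-8 -*-
import pymongo

class proxy_infor(object):
    '''Entity holding one proxy record (shape inferred from the parser code).'''
    def __init__(self):
        self.proxyName = ['ip', 'port']  # field names, indexed by table column
        self.proxy = {}                  # field name -> parsed value

class db_helper(object):
    '''Minimal MongoDB wrapper covering only the calls used below (a sketch).'''
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        self.proxys = client['proxy_db']['proxys']

    def insert(self, query, record):
        # Upsert so that re-crawled proxies do not create duplicates
        self.proxys.update(query, record, upsert=True)

    def delete(self, query):
        self.proxys.remove(query)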
A word about detection: I use http://ip.chinaz.com/getip.aspx as the detection URL. As long as a request through the proxy does not time out and the response code is 200, we count the proxy as a success.
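The verification code posted below relies on urllib's default timeout; to make the timeout explicit, a global socket timeout works with Python 2's urllib.urlopen. A standalone sketch of the same pass/fail rule (the 5-second value is my own choice, not from the original):

# -*- coding: utf-8 -*-
import socket
import urllib

socket.setdefaulttimeout(5)  # count a proxy that hangs for 5s as bad

def is_good_proxy(ip, port, test_url='http://ip.chinaz.com/getip.aspx'):
    '''Return True if the detection URL answers 200 through the proxy.'''
    proxy_host = 'http://' + ip + ':' + port
    try:
        # Python 2 urllib.urlopen accepts a proxies mapping
        response = urllib.urlopen(test_url, proxies={'http': proxy_host})
        return response.getcode() == 200
    except Exception:
        # Timeouts and connection errors both count as failure
        return False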
Next, let's run the program and see the effect:
Switch to the project directory (under Windows here) and run python main.py -h to see the usage instructions and the parameters I defined.
Then run python main.py -c 1 4 (meaning: crawl pages 1 through 4 for IP addresses):
If you then want to verify the correctness of the stored IPs, run python main.py -t db.
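The actual flag definitions live in main.py on GitHub; here is a sketch of argparse definitions consistent with the two commands above (the option names match, everything else is a guess):

# -*- coding: utf-8 -*-
import argparse

parser = argparse.ArgumentParser(description='free proxy IP spider')
# -c START END : crawl pages START through END of the proxy list
parser.add_argument('-c', nargs=2, type=int, metavar=('START', 'END'),
                    help='crawl the given page range of xicidaili')
# -t db : verify every proxy currently stored in MongoDB
parser.add_argument('-t', choices=['db'],
                    help='test the proxies stored in the database')
args = parser.parse_args()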
It seems only a small share of the IPs are actually usable, but for personal use that is enough.
Take a look at the MongoDB database:
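Each stored record should look roughly like {'ip': '123.123.123.123', 'port': '8080'} (a made-up example; the field names come from the parsing code below).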
We can use these IPs the next time we crawl the pictures.
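Since the beauty-picture spider is a Scrapy project, the stored IPs can be fed to Scrapy's built-in HttpProxyMiddleware by setting request.meta['proxy'] in a downloader middleware. A minimal sketch (the database and collection names match the db_helper sketch above; picking a random proxy per request is my own choice):

# -*- coding: utf-8 -*-
import random
import pymongo

class RandomProxyMiddleware(object):
    '''Downloader middleware: route each request through a stored proxy.'''
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        self.proxys = list(client['proxy_db']['proxys'].find())

    def process_request(self, request, spider):
        proxy = random.choice(self.proxys)
        # Scrapy's HttpProxyMiddleware honors request.meta['proxy']
        request.meta['proxy'] = 'http://%s:%s' % (proxy['ip'], proxy['port'])

Enable it through DOWNLOADER_MIDDLEWARES in settings.py.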
Here is the parsing part of the code:
def parser(self, html_cont):
    '''
    Parse proxy entries out of a list page.
    :param html_cont: HTML content of the page
    :return:
    '''
    if html_cont is None:
        return
    # Parse the HTML with the BeautifulSoup module
    soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
    tr_nodes = soup.find_all('tr', class_=True)
    for tr_node in tr_nodes:
        proxy = proxy_infor()
        i = 0
        for th in tr_node.children:
            if th.string is not None and len(th.string.strip()) > 0:
                proxy.proxy[proxy.proxyName[i]] = th.string.strip()
                print 'proxy', th.string.strip()
                i += 1
                if i > 1:  # only the first two columns (ip, port) are needed
                    break
        self.db_helper.insert(
            {proxy.proxyName[0]: proxy.proxy[proxy.proxyName[0]],
             proxy.proxyName[1]: proxy.proxy[proxy.proxyName[1]]},
            proxy.proxy)
And the core of the verification code:
def detect(self):
    '''
    Use http://ip.chinaz.com/getip.aspx as the detection target.
    :return:
    '''
    proxys = self.db_helper.proxys.find()
    badNum = 0
    goodNum = 0
    for proxy in proxys:
        ip = proxy['ip']
        port = proxy['port']
        try:
            proxy_host = "http://" + ip + ':' + port
            # Fetch the detection URL through the proxy
            response = urllib.urlopen(self.url, proxies={"http": proxy_host})
            if response.getcode() != 200:
                self.db_helper.delete({'ip': ip, 'port': port})
                badNum += 1
                print proxy_host, 'bad proxy'
            else:
                goodNum += 1
                print proxy_host, 'success proxy'
        except Exception, e:
            # Any error (timeout, refused connection, ...) counts as bad
            print proxy_host, 'bad proxy'
            self.db_helper.delete({'ip': ip, 'port': port})
            badNum += 1
            continue
    print 'success proxy num : ', goodNum
    print 'bad proxy num : ', badNum
That's all for today's share. If you found it useful, remember to support me. The code has been uploaded to GitHub: https://github.com/qiyeboy/proxySpider_normal
You are also welcome to follow my WeChat public account:
This article is an original work; everyone is welcome to reprint and share it. Please respect the original and credit the source when reprinting: Seven Night's story, http://www.cnblogs.com/qiyeboy/