05 - Access timeout settings
Zheng Yi, 201005. Part of Section 01, Data Capturing.
Set an HTTP or socket access timeout so that the crawler does not spend too long on any single page.
When calling the pycurl library, you can set the timeout like this:
c.setopt(pycurl.CONNECTTIMEOUT, 60)  # connection-phase timeout, in seconds
In Python 2.6, the httplib library has the following constructor:
class HTTPConnection:
    def __init__(self, host, port=None, strict=None,
                 timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
        self.timeout = timeout
So you can write:
>>> h3 = httplib.HTTPConnection('www.cwi.nl', 80, timeout=10)
See document #2452, "timeout is used for all blocking operations":
If a timeout is specified through the constructor of HTTPConnection or HTTPSConnection, blocking operations (such as attempting to establish a connection) will time out after that many seconds. If no timeout is passed, the global default timeout value is used.
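As a rough sketch of the constructor-level timeout described above, here is the Python 3 equivalent, where httplib has been renamed http.client. The helper name make_connection and the 10-second default are illustrative assumptions, not part of the original text. Constructing the object does not open a connection, so nothing is sent until a request is made.

```python
import http.client


def make_connection(host, port=80, timeout=10.0):
    """Build an HTTPConnection whose blocking operations (connect,
    send, receive) give up after `timeout` seconds.

    Hypothetical helper for illustration only.
    """
    return http.client.HTTPConnection(host, port, timeout=timeout)


# No network traffic happens here; the socket is only opened
# when conn.request(...) or conn.connect() is called.
conn = make_connection("www.cwi.nl")
print(conn.host, conn.port, conn.timeout)
```

Passing the timeout per connection keeps the setting local, so other code in the same process that uses sockets is not affected.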
In Python 2.5, the __init__ function of the HTTPConnection class has no timeout parameter, so you have to reach for a function buried one level deeper:
httplib.socket.setdefaulttimeout(3)  # the argument is in seconds
to set the timeout.
Set global timeout
Finally, if the library you are using offers no way to set a timeout for the fetch, you can set the global socket timeout, although this is not ideal:
>>> import socket
>>> socket.setdefaulttimeout(90)
setdefaulttimeout() was a hack to allow setting the timeout when nothing else is available.
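Because the global default affects every socket created afterwards, one way to limit the side effects is to set it only for the duration of a block and then restore the previous value. The sketch below is written for Python 3; the context manager name default_timeout is a made-up helper, not something the socket module provides.

```python
import socket
from contextlib import contextmanager


@contextmanager
def default_timeout(seconds):
    """Temporarily change the global socket timeout (hypothetical helper)."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Always restore the old value, even if the body raised.
        socket.setdefaulttimeout(previous)


before = socket.getdefaulttimeout()
with default_timeout(90):
    # Sockets created inside the block inherit the global default.
    probe = socket.socket()
    inherited = probe.gettimeout()
    probe.close()
after = socket.getdefaulttimeout()

print(inherited, before, after)
```

Note that this is still process-wide while the block runs, so it does not isolate concurrent threads from each other.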
How do you catch timeout exceptions? For example:
from urllib2 import urlopen
import socket
import sys
slowurl = "http://www.wenxuecity.com/"
socket.setdefaulttimeout(1)  # global timeout, in seconds
try:
    data = urlopen(slowurl)
    data.read()
except socket.error:
    # exc_info() returns (exception class, exception instance, traceback)
    exc_type, exc_value = sys.exc_info()[:2]
    if exc_type == socket.timeout:
        print "there was a timeout"
    else:
        print "there was some other socket error"
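A simpler variant is to catch socket.timeout directly instead of inspecting sys.exc_info(). The following Python 3 sketch shows this without touching the network: a connected socket pair stands in for a slow server, and the function name read_with_timeout is an assumption introduced here for illustration.

```python
import socket


def read_with_timeout(sock, nbytes, seconds):
    """Read up to nbytes from sock; return None if nothing arrives
    within `seconds`. Hypothetical helper for illustration."""
    sock.settimeout(seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Raised when the blocking recv() exceeds the timeout.
        return None


# A connected pair of sockets; `right` plays the slow server.
left, right = socket.socketpair()
try:
    silent = read_with_timeout(left, 1024, 0.2)    # peer sends nothing
    right.sendall(b"hello")
    answered = read_with_timeout(left, 1024, 0.2)  # data is available
finally:
    left.close()
    right.close()

print(silent, answered)
```

The same except clause can be placed around a urlopen() call and its read(); depending on where the timeout occurs, urllib may also wrap it in a URLError, so a crawler may want to handle both.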
References:
1. No way to disable socket timeouts in httplib, etc.
2. How to catch socket timeout?