1. Set up a user agent
By default, urllib2 downloads web page content with the user agent Python-urllib/2.7, where 2.7 is the Python version number. Some websites block this default user agent, so to make downloads more reliable we need to control the user agent ourselves. The following code adds a user agent parameter, defaulting to "wswp", to the download function.
    import urllib2

    def download(url, user_agent='wswp', num_retries=2):
        print 'Downloading:', url
        # Send our own User-agent header instead of the urllib2 default
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            # Open the Request object (not the bare URL) so the header is sent
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print 'Download error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5XX HTTP errors
                    return download(url, user_agent, num_retries - 1)
        return html
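The code above targets Python 2, where urllib2 exists. For readers on Python 3, the same idea can be sketched with urllib.request and urllib.error. This is only a rough port, not the original author's code: the opener parameter is a hypothetical addition that lets the function be tried with a fake opener instead of a real network call, and the function returns bytes rather than a str.

```python
import urllib.error
import urllib.request


def download(url, user_agent='wswp', num_retries=2,
             opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the download fails.

    `opener` is a hypothetical hook (not part of the original code) so the
    function can be exercised without network access.
    """
    print('Downloading:', url)
    # Attach our own User-agent header to the request
    request = urllib.request.Request(url, headers={'User-agent': user_agent})
    try:
        return opener(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        # Only HTTPError carries a .code; retry server-side (5xx) errors only
        if num_retries > 0 and 500 <= getattr(e, 'code', 0) < 600:
            return download(url, user_agent, num_retries - 1, opener)
        return None
```

Calling download('http://example.com') sends the "wswp" user agent, and passing user_agent='something-else' overrides it. Client errors such as 404 are not retried, since repeating the request would not be expected to change the result.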