This article starts with the simplest possible crawler and gradually improves it by adding download error detection, a user agent, and proxy support.
First, a note on running the code: it targets Python 2.7 and can be run from the command line or from an editor such as PyCharm. Each step defines a download function, which you then call to fetch a page.
Example:
download1("http://www.baidu.com")
download2("http://www.baidu.com")
download3("http://www.baidu.com")
1. A downloader in three lines of code
import urllib2
import urlparse

def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()
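A quick sanity check, using the same example URL as above: download1 returns the page's HTML as a string. Nothing catches errors yet, so an unreachable server or an error status makes urllib2 raise an exception.

html = download1('http://www.baidu.com')
print len(html)  # length of the downloaded HTML
# an unreachable host or an error status would raise urllib2.URLError / urllib2.HTTPError here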
2. Upgrade: a crawler that catches download errors
def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
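For example, with a hostname that cannot resolve (the .invalid domain below is reserved and never resolves, used here purely for illustration), download2 prints the error and returns None instead of crashing:

html = download2('http://www.example.invalid')  # hypothetical unreachable host
if html is None:
    print 'download failed'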
3. 5xx errors generally indicate a problem on the server side. Add a check to the crawler: when the error code is in the range 500-599, retry the download up to 2 more times.
def download3(url, num_retries=2):
    """Download function that also retries 5xx errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download3(url, num_retries - 1)
    return html
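To see the retry logic in action you need a URL that reliably answers with a 5xx status; the endpoint below is only an assumption for testing (any server that always returns 500 will do). download3 should print the download error three times in total (the first attempt plus two retries) and then give up:

html = download3('http://httpstat.us/500')  # assumed test endpoint that always returns 500
print html  # None, once the retries are exhausted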
4. Set a user agent
By default, urllib2 downloads pages with a Python user agent, which some sites block. Here we set a custom user agent named 'wswp' instead.
def download4(url, user_agent='wswp', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download4(url, user_agent, num_retries - 1)
    return html
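A small check, not part of the original listing, confirms the header is attached to the outgoing request; urllib2.Request exposes the headers it will send through get_header:

request = urllib2.Request('http://www.baidu.com', headers={'User-agent': 'wswp'})
print request.get_header('User-agent')  # prints: wswp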
5. Support proxies
Sometimes we need to access a website through a proxy. For example, Netflix blocks access from most countries outside the United States. Although the requests module makes proxies easier to work with, here we add proxy support using urllib2, consistent with the code above.
import urllib2
import urlparse

def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download5(url, user_agent, proxy, num_retries - 1)
    return html
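Finally, a usage sketch. The proxy address below is purely hypothetical; substitute a proxy you actually have access to:

# without a proxy it behaves exactly like download4
html = download5('http://www.baidu.com')
# with a hypothetical local HTTP proxy
html = download5('http://www.baidu.com', proxy='127.0.0.1:8087')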