Write a web crawler in Python: the first web crawler from scratch (1)


This article starts with the simplest possible crawler and gradually improves it: adding download-error detection, retrying server errors, setting a user agent, and adding proxy support.
The code targets Python 2.7 and can be run from the command line or from an editor such as PyCharm. Each step defines a download function, which you then call with the URL of the page to crawl.
Example: download1("http://www.baidu.com")
         download2("http://www.baidu.com")
         download3("http://www.baidu.com")



1. A downloader in three lines of code

import urllib2
import urlparse


def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()
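
The urllib2 module exists only in Python 2. If you are on Python 3, where urllib2 was split into submodules of urllib, an equivalent sketch uses urllib.request instead:

```python
import urllib.request


def download1(url):
    """Simple downloader (Python 3 sketch)."""
    # urlopen returns a file-like response object; read() yields the raw bytes
    return urllib.request.urlopen(url).read()
```

Note that read() returns bytes, not a str, so you may want to decode() the result before parsing it as text.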

2. An upgraded downloader that catches download errors
def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
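
The same error handling carries over to Python 3, where the exception class lives in urllib.error; a sketch of the equivalent:

```python
import urllib.request
import urllib.error


def download2(url):
    """Download a page, returning None on any download error (Python 3 sketch)."""
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        # URLError covers both network failures and HTTP error statuses
        print('Download error:', e.reason)
        html = None
    return html
```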
3. 5xx errors generally occur on the server side, so they are often worth retrying. Add a check to the crawler: when the error code is at least 500 and below 600, retry the download up to 2 more times.
def download3(url, num_retries=2):
    """Download function that also retries 5xx errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download3(url, num_retries - 1)
    return html
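
The retry condition can be checked in isolation: HTTPError (a subclass of URLError) carries the status in a code attribute, and only codes in [500, 600) should trigger a retry. A small sketch of just that predicate, using a hypothetical FakeHTTPError stand-in so it runs without a network:

```python
def should_retry(error, num_retries):
    """Retry only while attempts remain and the error is a 5xx server error."""
    return num_retries > 0 and hasattr(error, 'code') and 500 <= error.code < 600


class FakeHTTPError:
    """Hypothetical stand-in for HTTPError, carrying only the status code."""
    def __init__(self, code):
        self.code = code
```

With this, a 503 is retried while retries remain, but a 404 (a client error, retrying won't help) never is.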

4. Setting a user agent
By default, some sites block web crawlers based on the user agent they announce. Here we set a custom user-agent string, "wswp", for our crawler.

def download4(url, user_agent='wswp', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download4(url, user_agent, num_retries - 1)
    return html
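
The header handling is the same in Python 3 apart from the module name, and the user-agent string can be inspected on the Request object before anything is sent. A sketch (build_request is a hypothetical helper, not part of the original article):

```python
import urllib.request


def build_request(url, user_agent='wswp'):
    """Build a Request that carries a custom User-agent header."""
    headers = {'User-agent': user_agent}
    return urllib.request.Request(url, headers=headers)


req = build_request('http://www.baidu.com')
print(req.get_header('User-agent'))
```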
5. Proxy support
Sometimes we need to go through a proxy to access a website; Netflix, for example, blocks visitors from most countries outside the United States. Here we add proxy support using urllib2's ProxyHandler.
import urllib2
import urlparse


def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        # map the URL's scheme (http/https) to the proxy address
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors
                html = download5(url, user_agent, proxy, num_retries - 1)
    return html
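
The proxy-opener construction translates directly to Python 3, where ProxyHandler and build_opener live in urllib.request and urlparse became urllib.parse. A sketch of just that piece (build_proxy_opener is a hypothetical helper name):

```python
import urllib.request
import urllib.parse


def build_proxy_opener(url, proxy):
    """Build an opener that routes url's scheme through the given proxy (Python 3 sketch)."""
    # map the URL's scheme (http/https) to the proxy address
    proxy_params = {urllib.parse.urlparse(url).scheme: proxy}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxy_params))
```

Passing a ProxyHandler to build_opener replaces the default, environment-based ProxyHandler, so only the proxy you specify is used.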


