"Python crawler 1" web crawler introduction __python

Source: Internet
Author: User
Tags: connection reset, time interval, python, web crawler, web2py

Research the target website's background: 1 check robots.txt; 2 check the sitemap; 3 estimate the site size; 4 identify the technology used by the site; 5 find the site owner.
The first web crawler: 1 download a web page (retrying downloads, setting a user agent user_agent); 2 crawl the sitemap; 3 traverse the database ID of each page; 4 follow web links.
Advanced features: parsing robots.txt, proxy support, download throttling, avoiding crawler traps, and the final version of the crawler.

1 Research the target website's background

1.1 Check robots.txt

http://example.webscraping.com/robots.txt

# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
Section 1: forbids the user agent BadCrawler from crawling the site, although a malicious crawler would simply ignore the rule. Section 2: requests a crawl delay of 5 seconds between two download requests. /trap is used to catch malicious crawlers; a crawler that follows that link will be banned for 1 minute. Section 3: defines a Sitemap file, which is covered in the next section.

1.2 Check the sitemap

All of the site's links are listed at http://example.webscraping.com/sitemap.xml:

<urlset xmlns= "http://www.sitemaps.org/schemas/sitemap/0.9" >
<url>
<loc>http:// example.webscraping.com/view/afghanistan-1</loc>
</url>
<url>
<loc>
Http://example.webscraping.com/view/Aland-Islands-2
</loc>
</url>
...
<url>
<loc>http://example.webscraping.com/view/Zimbabwe-252</loc>
</url>
</urlset>
1.3 Estimating Site size

Advanced search parameters: http://www.google.com/advanced_search
Google search for site:example.webscraping.com returns 202 pages.
Google search for site:example.webscraping.com/view returns 117 pages.

1.4 Identify the technology used by the site

Use the builtwith module to check what technology a website is built with.
Installation: pip install builtwith

>>> import builtwith
>>> builtwith.parse('http://example.webscraping.com')
{u'javascript-frameworks': [u'jquery', u'Modernizr', u'jquery UI'],
 u'web-frameworks': [u'web2py', u'Twitter Bootstrap'],
 u'programming-languages': [u'Python'],
 u'web-servers': [u'Nginx']}

The example website uses the Python web2py framework and several JavaScript libraries, so its content is probably embedded directly in the HTML. That makes it relatively easy to crawl. Other kinds of sites need different approaches:
- AngularJS: content is loaded dynamically.
- ASP.NET: crawling requires handling session management and form submission (covered in chapters 5 and 6).

1.5 Find the site owner

The WHOIS protocol can be used to query the domain name registrar for the owner of a domain.
Documentation: https://pypi.python.org/pypi/python-whois
Installation: pip install python-whois

>>> import whois
>>> print whois.whois('appspot.com')
{
  ...
  "name_servers": [
    "NS1.Google.com",
    ...
    "ns2.google.com",
    "ns1.google.com"
  ],
  "org": "Google Inc.",
  "creation_date": [
    "2005-03-10 00:00:00",
    "2005-03-09T18:27:55-0800"
  ],
  "emails": [
    "abusecomplaints@markmonitor.com",
    "dns-admin@google.com"
  ]
}

This domain belongs to Google and uses the Google App Engine service. Note: Google often blocks web crawlers, so crawl it with care.

2 The first web crawler

There are many ways to crawl a website, and which method is most appropriate depends on the structure of the target site.
Below we first show how to safely download a web page, then introduce three ways to crawl a website:
- 2.2 crawl the sitemap;
- 2.3 traverse the database ID of each page;
- 2.4 follow web links.

2.1 Download a web page

1.4.1download1.py

# -*- coding: utf-8 -*-

import urllib2

def download1(url):
    """Simple downloader"""
    return urllib2.urlopen(url).read()

if __name__ == '__main__':
    print download1('https://www.baidu.com')

1.4.1download2.py

import urllib2

def download2(url):
    """Download function that catches errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
1. Retry the download

When the server is overloaded it may return a 503 Service Unavailable error, in which case we can retry the download. If the error is 404 Not Found, the page does not exist and there is no point retrying the request.
1.4.1download3.py

import urllib2

def download3(url, num_retries=2):
    """Download function that also retries 5XX errors"""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download3(url, num_retries - 1)
    return html

download = download3

if __name__ == '__main__':
    print download('http://httpstat.us/500')

The Internet Engineering Task Force defines the complete list of HTTP status codes: https://tools.ietf.org/html/rfc7231#section-6
- 4xx: the error lies with the request
- 5xx: the error lies with the server

2. Set the user agent (user_agent)

By default, urllib2 downloads web content with python-urllib/2.7 as its user agent, where 2.7 is the Python version number. Some websites ban this default user agent, because badly behaved Python web crawlers (like the code above) can overload a server. For example, when using the Python default user agent, accessing https://www.meetup.com/ produces:

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/1.Introduction to web crawler$ python 1.4.1download4.py
Downloading: https://www.meetup.com/
Download error: [Errno 104] Connection reset by peer
None
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/1.Introduction to web crawler$ python 1.4.1download4.py
Downloading: https://www.meetup.com/
Download error: Forbidden
None
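The default agent string can be confirmed directly in the interpreter; a quick check, assuming Python 2.7 (the opener's default headers are what urllib2 sends):

>>> import urllib2
>>> urllib2.build_opener().addheaders
[('User-agent', 'Python-urllib/2.7')]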

To make downloads more reliable, we need to control the user agent. The following code sets a custom user agent, wu_being.

import urllib2

def download4(url, user_agent='wu_being', num_retries=2):
    """Download function that includes user agent support"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download4(url, user_agent, num_retries - 1)
    return html
2.2 Crawl the sitemap

We download all the pages listed in the sitemap file, sitemap.xml, that we found in the example site's robots.txt. To parse the sitemap, we use a simple regular expression to extract the URLs from the <loc> tags. In the next chapter we will introduce a more robust parsing method, the CSS selector.

# -*- coding: utf-8 -*-

import re
from common import download

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    #> Downloading: http://example.webscraping.com/sitemap.xml
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
        #> Downloading: http://example.webscraping.com/view/Afghanistan-1
        #> Downloading: http://example.webscraping.com/view/Aland-Islands-2
        #> Downloading: http://example.webscraping.com/view/Albania-3
        #> ...

if __name__ == '__main__':
    crawl_sitemap('http://example.webscraping.com/sitemap.xml')
2.3 Traverse each page's database ID
http://example.webscraping.com/view/Afghanistan-1
http://example.webscraping.com/view/China-47
http://example.webscraping.com/view/Zimbabwe-252

Since these URLs differ only in their suffix, and entering http://example.webscraping.com/view/47 also displays the China page correctly, we can simply iterate over the IDs to download all the country pages.

import itertools
from common import download

def iteration():
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        #url = 'http://example.webscraping.com/view/-{}'.format(page)
        html = download(url)
        if html is None:
            # received an error trying to download this webpage
            # so assume have reached the last country ID and can stop downloading
            break
        else:
            # success - can scrape the result
            # ...
            pass

If some IDs are discontinuous, this crawler exits at the first gap. We can modify it so that the traversal only stops after 5 consecutive download errors.

def iteration():
    max_errors = 5  # maximum number of consecutive download errors allowed
    num_errors = 0  # current number of consecutive download errors
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-{}'.format(page)
        html = download(url)
        if html is None:
            # received an error trying to download this webpage
            num_errors += 1
            if num_errors == max_errors:
                # reached maximum number of errors in a row, so assume we
                # have reached the last country ID and can stop downloading
                break
        else:
            # success - can scrape the result
            # ...
            num_errors = 0

Some sites do not use sequential IDs, or do not use numeric IDs at all, in which case this method is of little use.

2.4 Follow web links

We want the crawler to behave more like an ordinary user, following links to reach the content of interest. However, it is easy to end up downloading many pages we do not need; for example, when crawling user account detail pages from a forum we do not want any other pages, so we use a regular expression to decide which pages to download.

Downloading: http://example.webscraping.com
Downloading: /index/1
Traceback (most recent call last):
  File "1.4.4link_crawler1.py", line ..., in <module>
    link_crawler('http://example.webscraping.com', '/(index|view)')
  File "1.4.4link_crawler1.py", line ..., in link_crawler
    html = download(url)
  ...
  File "/usr/lib/python2.7/urllib2.py", line 283, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /index/1

Since /index/1 is a relative link, a browser can resolve it against the current page, but urllib2 has no such context, so we use the urlparse module to convert it to an absolute link.
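A quick interpreter check shows how urljoin resolves the relative path from the traceback above against the seed URL:

>>> import urlparse
>>> urlparse.urljoin('http://example.webscraping.com', '/index/1')
'http://example.webscraping.com/index/1'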

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):  # match the regular expression
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)

The code above still has a problem: the country pages link to each other. Australia links to Antarctica, and Antarctica links back to Australia, so the crawler would keep downloading the same content. To avoid duplicate downloads, we modify the function above to keep track of the URLs it has already discovered.

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex"""
    crawl_queue = [seed_url]
    seen = set(crawl_queue)  # keep track of which URLs have been seen before
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
Advanced features

1. Parse robots.txt

The robotparser module first loads the robots.txt file, then its can_fetch() function tells us whether a given user agent is allowed to access a web page.

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
>>> user_agent = 'wu_being'
>>> rp.can_fetch(user_agent, url)
True

To integrate this functionality into the crawler, we add the check inside the crawl loop.

while crawl_queue:
    url = crawl_queue.pop()
    # check url passes robots.txt restrictions
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print 'Blocked by robots.txt:', url
2. Support proxies (proxy)

Sometimes we need to access a website through a proxy. Netflix, for example, blocks most countries outside the United States. Supporting proxies with urllib2 is not as easy as it could be (you may prefer the friendlier Python HTTP module requests for this; documentation: http://docs.python-requests.org). The following code adds proxy support to the downloader using urllib2.

import urllib2
import urlparse

def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download5(url, user_agent, proxy, num_retries - 1)
    return html
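For comparison, here is a minimal sketch of the same idea using the requests module mentioned above. The function name download_requests and the proxy URL are placeholders of ours, not code from the original, so treat this as an illustration rather than a drop-in replacement.

import requests  # pip install requests

def download_requests(url, user_agent='wu_being', proxy=None, num_retries=2):
    """Sketch: proxy-aware download using requests instead of urllib2 (illustrative)"""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    # requests takes a plain dict mapping scheme to proxy URL, e.g. {'http': 'http://127.0.0.1:8080'}
    proxies = {'http': proxy, 'https': proxy} if proxy else None
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
        html = response.text
    except requests.exceptions.RequestException as e:
        print 'Download error:', e
        html = None
        status = getattr(e.response, 'status_code', None)
        if num_retries > 0 and status is not None and 500 <= status < 600:
            # retry 5XX HTTP errors
            html = download_requests(url, user_agent, proxy, num_retries - 1)
    return html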
3. Download speed limit

If we crawl a website too fast, we risk being banned or overloading the server. To reduce these risks, we can add a delay between two downloads to throttle the crawler's speed.

import time
import urlparse
from datetime import datetime

class Throttle:
    """Throttle downloading by sleeping between requests to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()

The Throttle class records the time of the last access to each domain and sleeps if the time elapsed since that access is shorter than the specified delay. We call throttle.wait() before each download.

throttle = Throttle(delay)
...
throttle.wait(url)
html = download(url, headers, proxy=proxy, num_retries=num_retries)
4. Avoid crawler traps

A simple way to avoid getting caught in a crawler trap is to record the depth, that is, how many links were followed to reach the current page. Once the maximum depth is reached, links from that page are no longer added to the queue. To do this we change the seen variable into a dictionary that also records the depth at which each page was found. To disable this feature, simply set max_depth to a negative number.

def link_crawler(..., max_depth=2):
    seen = {seed_url: 0}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
5. Final version

1.4.4link_crawler4_ultimateversion.py

# coding: utf-8
import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser
import Queue

def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1,
                 headers=None, user_agent='wswp', proxy=None, num_retries=1):
    """Crawl from the given seed URL following links matched by link_regex"""
    # the queue of URL's that still need to be crawled
    crawl_queue = Queue.deque([seed_url])
    # the URL's that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            #if get_robots(seed_url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []
            # ...
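A plausible way to invoke this final crawler, reusing the regex from the earlier traceback example; the parameter values here are illustrative and not taken from the original listing.

if __name__ == '__main__':
    # illustrative invocation: crawl index and view pages up to depth 2
    link_crawler('http://example.webscraping.com', '/(index|view)',
                 delay=5, num_retries=1, max_depth=2, user_agent='GoodCrawler')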