Python Crawler Advanced Features


In the previous article, we introduced the basic crawler implementation and its data-crawling functionality. In practice we run into a few problems: a site's robots.txt file may list URLs that must not be crawled, some sites can only be reached through a proxy, and some sites apply anti-crawling (risk-control) measures, so we also design a download speed-limit feature for the crawler.
1. Parsing robots.txt
First, we need to parse the robots.txt file to avoid downloading URLs that are not allowed to be crawled. This can be done easily with Python's robotparser module, as in the following code.
The robotparser module first loads the robots.txt file and then uses the can_fetch() function to determine whether the specified user agent is allowed to access a given web page.

import robotparser
import urlparse

def get_robots(url):
    """Initialize robots parser for this domain"""
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp
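For example, an interactive session might look like the following. This is only a sketch: it assumes the target site's robots.txt contains a Disallow rule for a user agent named 'BadCrawler' (as the commented-out examples in the final version below suggest), so the actual True/False results depend on the site you query.

>>> rp = get_robots('http://example.webscraping.com')
>>> url = 'http://example.webscraping.com'
>>> rp.can_fetch('BadCrawler', url)   # blocked by a Disallow rule for this agent
False
>>> rp.can_fetch('GoodCrawler', url)  # no matching rule, so access is allowed
True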

In order to integrate this functionality into the crawler, we need to add this check in the crawl loop.

while crawl_queue:
    url = crawl_queue.pop()
    # check url passes robots.txt restrictions
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print 'Blocked by robots.txt:', url

2. Proxy support
Sometimes we need to use a proxy to access a website. Netflix, for example, blocks access from most countries outside the United States. Proxy support with urllib2 is not as straightforward as it could be (for an easier interface, try the friendlier Python HTTP module requests). Below is the code for proxy support using urllib2.
proxy = ...
opener = urllib2.build_opener()
proxy_params = {urlparse.urlparse(url).scheme: proxy}
opener.add_handler(urllib2.ProxyHandler(proxy_params))
response = opener.open(request)
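For comparison, here is a minimal sketch of the same idea with the friendlier requests module mentioned above; the URL and proxy address are placeholders, not values from the original crawler.

import requests

url = 'http://example.webscraping.com'    # page to fetch (illustrative)
proxy = 'http://10.10.1.10:3128'          # hypothetical proxy address
response = requests.get(url, proxies={'http': proxy, 'https': proxy})
html = response.text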
The following is a new version of the download function that integrates this functionality.

def download(url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                html = download(url, headers, proxy, num_retries - 1, data)
        else:
            code = None
    return html
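A quick usage sketch of this function (the URL and header values are only illustrative):

# fetch a page, retrying up to twice on 5XX server errors
html = download('http://example.webscraping.com', {'User-agent': 'wswp'},
                proxy=None, num_retries=2)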

3. Download speed limit
If we crawl a site too quickly, we risk being blocked or overloading its server. To reduce these risks, we can add a delay between consecutive downloads to the same domain, throttling the crawler. The following is the code for the class that implements this feature.

class Throttle:
    """Throttle downloading by sleeping between requests to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        """Delay if we have accessed this domain recently"""
        domain = urlparse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()
The Throttle class records the time each domain was last accessed. If the time since the last access is less than the specified delay, it sleeps for the remaining time. We can call Throttle before each download to rate-limit the crawler.
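A minimal usage sketch, assuming the download() function from the previous section; the two-second delay and the URLs are only illustrative:

throttle = Throttle(2)   # wait at least 2 seconds between requests to one domain
for url in ['http://example.webscraping.com/view/1',
            'http://example.webscraping.com/view/2']:
    throttle.wait(url)   # sleeps if this domain was accessed too recently
    html = download(url, {'User-agent': 'wswp'}, proxy=None, num_retries=1)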

4. Avoid crawler traps
Currently, our crawler follows any link it has not visited before. However, some sites generate page content dynamically, so the number of pages is effectively unlimited. For example, a site with an online calendar may provide links to the next month and the next year; the page for next month will in turn link to the month after, and so on without end. This situation is known as a crawler trap.
A simple way to avoid falling into a crawler trap is to track how many links were followed to reach the current page, i.e. its depth. When the maximum depth is reached, the crawler stops adding the links on that page to the queue. To implement this, we need to modify the seen variable: it originally recorded only which page links had been visited, and is now a dictionary that also records the depth at which each page was found.
def link_crawler(..., max_depth=2):
    ...
    seen = {seed_url: 0}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
With this feature in place, we can be confident that the crawl will eventually finish. To disable the feature, simply set max_depth to a negative number; the current depth will then never equal it, so the check never triggers.
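A quick illustration of why a negative max_depth disables the limit:

max_depth = -1
for depth in range(5):
    # depth is always >= 0, so it can never equal -1: the cut-off never kicks in
    assert depth != max_depth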
Final version

import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser
import Queue
from scrape_callback3 import ScrapeCallback


def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1,
                 headers=None, user_agent='wswp', proxy=None, num_retries=1,
                 scrape_callback=None):
    """Crawl from the given seed URL following links matched by link_regex"""
    # the queue of URL's that still need to be crawled
    crawl_queue = [seed_url]
    # the URL's that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []
            if scrape_callback:
                links.extend(scrape_callback(url, html) or [])

            if depth != max_depth:
                # can still crawl further
                if link_regex:
                    # filter for links matching our regular expression
                    links.extend(link for link in get_links(html)
                                 if re.match(link_regex, link))

                for link in links:
                    link = normalize(seed_url, link)
                    # check whether already crawled this link
                    if link not in seen:
                        seen[link] = depth + 1
                        # check link is within same domain
                        if same_domain(seed_url, link):
                            # success! add this new link to queue
                            crawl_queue.append(link)

            # check whether have reached downloaded maximum
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print 'Blocked by robots.txt:', url


class Throttle:
    """Throttle downloading by sleeping between requests to the same domain"""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        """Delay if we have accessed this domain recently"""
        domain = urlparse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()


def download(url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                html = download(url, headers, proxy, num_retries - 1, data)
        else:
            code = None
    return html


def normalize(seed_url, link):
    """Normalize this URL by removing the hash and adding the domain"""
    link, _ = urlparse.urldefrag(link)  # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)


def same_domain(url1, url2):
    """Return True if both URL's belong to the same domain"""
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc


def get_robots(url):
    """Initialize robots parser for this domain"""
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp


def get_links(html):
    """Return a list of links from html"""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    # link_crawler('http://example.webscraping.com', '/(index|view)', delay=0,
    #              num_retries=1, user_agent='BadCrawler')
    # link_crawler('http://example.webscraping.com', '/(index|view)', delay=0,
    #              num_retries=1, max_depth=1, user_agent='GoodCrawler')
    link_crawler('http://fund.eastmoney.com',
                 r'/fund.html#os_0;isall_0;ft_;pt_1',
                 max_depth=-1, scrape_callback=ScrapeCallback())

If this looks good to you, go ahead and have a play with it yourselves~
