URL Manager for Python crawler modules

Source: Internet
Author: User

URL Manager module

A URL manager keeps track of which URLs have already been crawled and which have not. If a URL found during the current crawl is already known to the manager, it does not need to be crawled again, which also keeps the crawler from getting stuck in an infinite loop. For example:

Suppose I crawl www.baidu.com, and the list of links on that page includes music.baidu.com. I then crawl every link on the music page, which in turn contains www.baidu.com. Without any handling, the crawler would cycle forever between the Baidu homepage and the Baidu Music page, so a structure that keeps track of URLs is essential.
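The loop described above can be reproduced without any network access. The following is a minimal sketch, assuming a hypothetical in-memory link graph (LINKS) in place of real pages; the crawl function and its max_steps guard are illustrative names, not part of the original code.

```python
from collections import deque

# Hypothetical link graph standing in for real pages (no HTTP requests).
LINKS = {
    "www.baidu.com": ["music.baidu.com"],
    "music.baidu.com": ["www.baidu.com"],
}


def crawl(start, max_steps=10):
    """Visit pages breadth-first, skipping URLs that were already seen."""
    pending = deque([start])   # URLs waiting to be crawled
    seen = {start}             # URLs already queued or crawled
    order = []                 # order in which pages were visited
    while pending and len(order) < max_steps:
        url = pending.popleft()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:   # this check is what breaks the cycle
                seen.add(link)
                pending.append(link)
    return order


print(crawl("www.baidu.com"))
# prints ['www.baidu.com', 'music.baidu.com']
```

Without the seen check, the two pages would keep re-queuing each other and the loop would only stop at max_steps.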

The following Python implementation uses a deque (double-ended queue) from the collections module, which makes it cheap to take the oldest queued URL off the front.

    from collections import deque


    class UrlQueue:
        def __init__(self):
            self.queue = deque()   # pages waiting to be crawled
            self.visited = set()   # pages that have already been crawled

        def new_url_size(self):
            """Return the number of URLs that have not been crawled yet."""
            return len(self.queue)

        def old_url_size(self):
            """Return the number of URLs that have already been crawled."""
            return len(self.visited)

        def has_new_url(self):
            """Return True if any URL is still waiting to be crawled."""
            return self.new_url_size() != 0

        def get_new_url(self):
            """Remove and return one URL that has not been crawled yet."""
            new_url = self.queue.popleft()  # take a link from the left end
            self.visited.add(new_url)       # record it as crawled
            return new_url

        def add_new_url(self, url):
            """Add a single URL to the queue of URLs waiting to be crawled."""
            if url is None:
                return False
            if url not in self.queue and url not in self.visited:
                self.queue.append(url)

        def add_new_urls(self, urlset):
            """Add every URL in urlset to the queue of URLs waiting to be crawled."""
            if urlset is None or len(urlset) == 0:
                return
            for url in urlset:
                self.add_new_url(url)
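One design point worth noting: a membership test on a deque scans it element by element, so the duplicate check in add_new_url gets slower as the queue grows. The sketch below is one possible variant, not the original author's code. It assumes a hypothetical class name (IndexedUrlQueue) and a companion set (known) that records every URL ever queued, so the check becomes a constant-time set lookup.

```python
from collections import deque


class IndexedUrlQueue:
    """Hypothetical variant: a deque for ordering plus a set for fast lookups."""

    def __init__(self):
        self.queue = deque()  # URLs waiting to be crawled, oldest first
        self.known = set()    # every URL ever queued (waiting or crawled)

    def add_new_url(self, url):
        """Queue url unless it is None or already known; return True if queued."""
        if url is None or url in self.known:
            return False
        self.known.add(url)
        self.queue.append(url)
        return True

    def has_new_url(self):
        """Return True if any URL is still waiting to be crawled."""
        return bool(self.queue)

    def get_new_url(self):
        """Remove and return the oldest waiting URL."""
        return self.queue.popleft()


manager = IndexedUrlQueue()
manager.add_new_url("www.baidu.com")
manager.add_new_url("music.baidu.com")
manager.add_new_url("www.baidu.com")  # duplicate, ignored
while manager.has_new_url():
    print(manager.get_new_url())
```

The trade-off is extra memory for the set, and a URL stays in known after it has been crawled, so it will never be queued a second time.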
