URL Manager module
The URL manager keeps track of both the URLs that have already been crawled and the URLs that are still waiting to be crawled. If the URL currently being processed already exists in one of these collections, there is no need to fetch it again, which also prevents an infinite loop. As an example,
suppose I crawl www.baidu.com and the crawled page contains a link to music.baidu.com. When I go on to crawl all the links on that page, one of them points back to www.baidu.com. You can imagine that without deduplication the crawler would loop forever between the Baidu homepage and the Baidu Music page, so maintaining a record of the URLs that have been seen is very important.
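As a minimal sketch of that idea, using only the URLs from the example above, a simple set of visited URLs is enough to skip the second visit to the homepage:

visited = set()

for url in ["http://www.baidu.com", "http://music.baidu.com", "http://www.baidu.com"]:
    if url in visited:
        # www.baidu.com appears a second time, so it is skipped instead of re-crawled
        continue
    print("crawling", url)
    visited.add(url)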
The following is an example Python implementation. It uses a deque (double-ended queue) from the collections module so that the oldest pending URL can be removed efficiently from the front.
from collections import deque


class UrlQueue():
    def __init__(self):
        self.new_urls = deque()   # URLs waiting to be crawled
        self.old_urls = set()     # URLs that have already been crawled

    def new_url_size(self):
        """Get the number of URLs waiting to be crawled."""
        return len(self.new_urls)

    def old_url_size(self):
        """Get the number of URLs that have already been crawled."""
        return len(self.old_urls)

    def has_new_url(self):
        """Return True if there are URLs that have not been crawled yet."""
        return self.new_url_size() != 0

    def get_new_url(self):
        """Get one URL that has not been crawled yet."""
        new_url = self.new_urls.popleft()  # take one link from the left side of the queue
        self.old_urls.add(new_url)         # record that it has been crawled
        return new_url

    def add_new_url(self, url):
        """Add a single new URL to the collection of URLs that have not been crawled.

        :param url: a single URL
        """
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.append(url)

    def add_new_urls(self, urlset):
        """Add a batch of new URLs to the collection of URLs that have not been crawled.

        :param urlset: an iterable of URLs
        """
        if urlset is None or len(urlset) == 0:
            return
        for url in urlset:
            self.add_new_url(url)
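A typical way to drive this queue in a crawl loop might look like the sketch below. Here get_links is a hypothetical helper (not part of the class above) that downloads a page and returns the URLs found on it; it is only shown to illustrate how the queue methods fit together.

queue = UrlQueue()
queue.add_new_url("http://www.baidu.com")

while queue.has_new_url():
    url = queue.get_new_url()      # pop one pending URL and mark it as crawled
    links = get_links(url)         # hypothetical helper: fetch the page and extract its links
    queue.add_new_urls(links)      # URLs already seen are filtered out automatically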