Breadth-first traversal crawler in Python
There is a well-known crawler tutorial online, "Write Your Own Web Crawler," in which all of the source code is written in Java.
It covers the breadth-first (width-first) traversal algorithm, and having some spare time I reimplemented it in Python, in less than half the code.
Introduction to the breadth-first algorithm
Reference: http://book.51cto.com/art/201012/236668.htm
A breadth-first crawl starts from a set of seed nodes: the "child nodes" (i.e. hyperlinks) are extracted from the seed pages and placed in a queue. Links that have already been processed are recorded in a table, usually called the visited table. Before a new link is processed, the crawler checks whether it already appears in the visited table; if it does, the link has already been processed and is skipped, otherwise it is processed next.
The initial URLs are the seed URLs supplied to the crawler system (typically specified in the system's configuration file). While parsing the web pages those seed URLs point to, new URLs are extracted (for example, http://www.admin.com is extracted from <a href="http://www.admin.com"> in the page). The crawler then does the following:
(1) Compare the parsed link with the links in the visited table; if it is not in the visited table, it has not been visited yet.
(2) Put the link into the TODO table.
(3) Once the current page has been processed, take a link from the TODO table and put it into the visited table.
(4) Repeat the process for the web page that this link points to, and so on.
Table 1.3 shows the crawl process for the page shown in Figure 1.3.
Table 1.3 Network Crawling
TODO table | Visited table
A | Empty
BCDEF | A
CDEF | A,B
DEF | A,B,C
EF | A,B,C,D
FH | A,B,C,D,E
HG | A,B,C,D,E,F
GI | A,B,C,D,E,F,H
I | A,B,C,D,E,F,H,G
Empty | A,B,C,D,E,F,H,G,I
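Figure 1.3 itself is not reproduced here, but the link graph can be read back out of Table 1.3. The following is a minimal sketch (my own, not from the book) of the TODO/visited bookkeeping described above; the page graph is a hard-coded dict standing in for real link extraction, and names such as graph, todo and bfs are assumptions for illustration only.

#encoding=utf-8
from collections import deque

# A hand-made link graph reconstructed from Table 1.3 (stands in for real pages).
graph = {
    "A": ["B", "C", "D", "E", "F"],
    "B": [],
    "C": [],
    "D": [],
    "E": ["H"],
    "F": ["G"],
    "G": [],
    "H": ["I"],
    "I": [],
}

def bfs(seed):
    todo = deque([seed])   # the TODO table: links waiting to be crawled
    visited = []           # the visited table: links already processed
    while todo:
        node = todo.popleft()          # take the link at the head of the queue
        visited.append(node)           # record it in the visited table
        for child in graph[node]:      # "child nodes", i.e. hyperlinks on the page
            # enqueue only links that appear in neither table yet
            if child not in visited and child not in todo:
                todo.append(child)
    return visited

print(bfs("A"))   # ['A', 'B', 'C', 'D', 'E', 'F', 'H', 'G', 'I'], the last row of Table 1.3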
Breadth-first traversal is the most widely used crawling strategy, for three reasons:
Important pages tend to be close to the seeds. For example, when we open a news site, the first pages we see are the hottest news items; as we keep clicking deeper, the pages we encounter become less and less important.
The actual depth of the World Wide Web can reach as many as 17 levels, but there is always a short path to any given page, and breadth-first traversal reaches that page fastest.
Breadth-first traversal also makes it easier for multiple crawlers to cooperate: in multi-crawler cooperation, each crawler usually fetches the links within its own site first, so the crawl stays well contained.
Python implementation of the breadth-first traversal crawler
Python code
#encoding=utf-8
from BeautifulSoup import BeautifulSoup
import socket
import urllib2
import re

class MyCrawler:
    def __init__(self, seeds):
        # Initialize the URL queue with the seed URLs
        self.linkQuence = LinkQuence()
        if isinstance(seeds, str):
            self.linkQuence.addUnvisitedUrl(seeds)
        if isinstance(seeds, list):
            for i in seeds:
                self.linkQuence.addUnvisitedUrl(i)
        print "Add the seed URLs \"%s\" to the unvisited url list" % str(self.linkQuence.unVisited)

    # Main crawl loop
    def crawling(self, seeds, crawl_count):
        # Loop condition: the unvisited queue is not empty and no more than crawl_count pages have been visited
        while self.linkQuence.unVisitedUrlsEmpty() is False and self.linkQuence.getVisitedUrlCount() <= crawl_count:
            # Dequeue the URL at the head of the queue
            visitUrl = self.linkQuence.unVisitedUrlDeQuence()
            print "Pop out one URL \"%s\" from unvisited url list" % visitUrl
            if visitUrl is None or visitUrl == "":
                continue
            # Extract the hyperlinks from the page
            links = self.getHyperLinks(visitUrl)
            print "Get %d new links" % len(links)
            # Put the URL into the visited list
            self.linkQuence.addVisitedUrl(visitUrl)
            print "Visited URL count: " + str(self.linkQuence.getVisitedUrlCount())
            # Enqueue the URLs that have not been visited yet
            for link in links:
                self.linkQuence.addUnvisitedUrl(link)
            print "%d unvisited links:" % len(self.linkQuence.getUnVisitedUrl())

    # Extract the hyperlinks from the page source
    def getHyperLinks(self, url):
        links = []
        data = self.getPageSource(url)
        if data[0] == "200":
            soup = BeautifulSoup(data[1])
            a = soup.findAll("a", {"href": re.compile(".*")})
            for i in a:
                if i["href"].find("http://") != -1:
                    links.append(i["href"])
        return links

    # Fetch the page source
    def getPageSource(self, url, timeout=100, coding=None):
        try:
            socket.setdefaulttimeout(timeout)
            req = urllib2.Request(url)
            req.add_header('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
            response = urllib2.urlopen(req)
            if coding is None:
                coding = response.headers.getparam("charset")
            if coding is None:
                page = response.read()
            else:
                page = response.read()
                page = page.decode(coding).encode('utf-8')
            return ["200", page]
        except Exception, e:
            print str(e)
            return [str(e), None]

class LinkQuence:
    def __init__(self):
        # URLs that have already been visited
        self.visited = []
        # URLs waiting to be visited
        self.unVisited = []

    # Get the queue of visited URLs
    def getVisitedUrl(self):
        return self.visited

    # Get the queue of unvisited URLs
    def getUnVisitedUrl(self):
        return self.unVisited

    # Add a URL to the visited queue
    def addVisitedUrl(self, url):
        self.visited.append(url)

    # Remove a URL from the visited queue
    def removeVisitedUrl(self, url):
        self.visited.remove(url)

    # Dequeue an unvisited URL
    def unVisitedUrlDeQuence(self):
        try:
            return self.unVisited.pop()
        except:
            return None

    # Make sure every URL is enqueued (and therefore visited) only once
    def addUnvisitedUrl(self, url):
        if url != "" and url not in self.visited and url not in self.unVisited:
            self.unVisited.insert(0, url)

    # Number of visited URLs
    def getVisitedUrlCount(self):
        return len(self.visited)

    # Number of unvisited URLs
    def getUnvisitedUrlCount(self):
        return len(self.unVisited)

    # Check whether the unvisited queue is empty
    def unVisitedUrlsEmpty(self):
        return len(self.unVisited) == 0

def main(seeds, crawl_count):
    craw = MyCrawler(seeds)
    craw.crawling(seeds, crawl_count)

if __name__ == "__main__":
    main(["http://www.baidu.com", "http://www.google.com.hk"], 50)
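Note that the listing above is Python 2 code (print statements, urllib2, and the old BeautifulSoup 3 import), so it will not run unmodified on Python 3. Purely as a hedged sketch, the fetching and link-extraction helpers might look roughly like this on Python 3 with the bs4 package installed; the function names below mirror getPageSource/getHyperLinks but are otherwise my own, not part of the original code.

#encoding=utf-8
# Python 3 sketch of the fetch/extract helpers, assuming the bs4 package is available.
import urllib.request
from bs4 import BeautifulSoup

def get_page_source(url, timeout=100):
    # Fetch a page and decode it using the charset declared in the response headers.
    req = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (compatible; MyCrawler/0.1)"})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

def get_hyperlinks(url):
    # Collect absolute http(s) links from all <a href=...> tags on the page.
    links = []
    soup = BeautifulSoup(get_page_source(url), "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("http://") or a["href"].startswith("https://"):
            links.append(a["href"])
    return links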