Python implementation of a breadth-first (width-first) traversal crawler


There is a well-known crawler tutorial online, "Write Your Own Web Crawler," in which all of the source code is written in Java. It introduces the breadth-first traversal algorithm, and with some spare time I reimplemented it in Python, in less than half the code.

Introduction to the breadth-first algorithm

Reference: http://book.51cto.com/art/201012/236668.htm

The whole breadth-first crawl starts from a set of seed pages. The crawler extracts the "child nodes" (i.e. the hyperlinks) from these pages and puts them into a queue for fetching. Links that have already been processed are recorded in a table, usually called the visited table. Before a new link is processed, the crawler checks whether it is already in the visited table: if it is, the link has already been handled and is skipped; if not, it is processed next.
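As a concrete illustration of the "extract the child nodes" step, here is a minimal sketch that pulls hyperlinks out of an HTML string. It assumes the modern bs4 package (the full crawler later in this post uses the older BeautifulSoup 3 API), and extract_links is only an illustrative helper name:

# Minimal sketch of hyperlink extraction, assuming the bs4 package is installed
from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    # keep only absolute http(s) links, as the crawler below does
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]

print(extract_links('<a href="http://www.admin.com">example</a>'))
# -> ['http://www.admin.com']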

The initial URLs are the seed URLs supplied to the crawler (typically specified in the system's configuration file). When the pages that these seed URLs point to are parsed, new URLs are extracted (for example, http://www.admin.com is extracted from <a href="http://www.admin.com"> in the page). The crawler then does the following:

(1) Compare each parsed link against the links in the visited table; if a link is not in the visited table, it has not been visited yet.

(2) Put the link into the TODO table.

(3) After the current page is processed, take a link from the TODO table and put it directly into the visited table.

(4) Continue the same process for the web page that this link points to, and so on (a minimal sketch of this loop follows the list).
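Put together, steps (1) through (4) are just a breadth-first traversal over pages. A minimal sketch of that loop, assuming a hypothetical get_links(url) helper that returns the hyperlinks found on a page:

from collections import deque

def bfs_crawl(seed_urls, get_links, max_pages=50):
    todo = deque(seed_urls)   # TODO table: links waiting to be processed
    visited = set()           # visited table: links already processed
    while todo and len(visited) < max_pages:
        url = todo.popleft()              # step (3): take a link from the TODO table
        if url in visited:                # step (1): skip links already in the visited table
            continue
        visited.add(url)
        for link in get_links(url):       # step (4): repeat for the page this link points to
            if link not in visited and link not in todo:
                todo.append(link)         # step (2): put new links into the TODO table
    return visited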

Table 1.3 shows the crawl process for the example pages in Figure 1.3 of the referenced tutorial.

Table 1.3 Network crawling

TODO table                 Visited table
A                          (empty)
B, C, D, E, F              A
C, D, E, F                 A, B
D, E, F                    A, B, C
E, F                       A, B, C, D
F, H                       A, B, C, D, E
H, G                       A, B, C, D, E, F
G, I                       A, B, C, D, E, F, H
I                          A, B, C, D, E, F, H, G
(empty)                    A, B, C, D, E, F, H, G, I

Breadth-first traversal is the most widely used crawling strategy, for three main reasons:

(1) Important pages tend to be close to the seeds. For example, when we open a news site, the first pages we see carry the hottest news; as we keep browsing deeper, the pages we reach become less and less important.

(2) The actual depth of the World Wide Web can reach about 17 levels, yet there is always a short path to any given page. Breadth-first traversal reaches such pages fastest.

(3) Breadth-first order also helps when several crawlers cooperate: in such setups each crawler usually fetches the links within its own site first, which keeps each crawler's portion of the crawl well contained.

Python implementation of the breadth-first traversal crawler

Python code
# encoding=utf-8
from BeautifulSoup import BeautifulSoup
import socket
import urllib2
import re

class MyCrawler:
    def __init__(self, seeds):
        # Initialize the URL queue with the seeds
        self.linkQuence = LinkQuence()
        if isinstance(seeds, str):
            self.linkQuence.addUnvisitedUrl(seeds)
        if isinstance(seeds, list):
            for i in seeds:
                self.linkQuence.addUnvisitedUrl(i)
        print "Add the seed URLs \"%s\" to the unvisited url list" % str(self.linkQuence.unVisited)

    # Main crawling loop
    def crawling(self, seeds, crawl_count):
        # Loop condition: there are still links to crawl and no more than crawl_count pages have been fetched
        while not self.linkQuence.unVisitedUrlsEmpty() and self.linkQuence.getVisitedUrlCount() <= crawl_count:
            # Dequeue the URL at the head of the queue
            visitUrl = self.linkQuence.unVisitedUrlDeQuence()
            print "Pop out one URL \"%s\" from unvisited url list" % visitUrl
            if visitUrl is None or visitUrl == "":
                continue
            # Extract the hyperlinks from the page
            links = self.getHyperLinks(visitUrl)
            print "Get %d new links" % len(links)
            # Record this URL as visited
            self.linkQuence.addVisitedUrl(visitUrl)
            print "Visited URL count: " + str(self.linkQuence.getVisitedUrlCount())
            # Enqueue the URLs that have not been visited yet
            for link in links:
                self.linkQuence.addUnvisitedUrl(link)
            print "%d unvisited links:" % len(self.linkQuence.getUnvisitedUrl())

    # Extract the hyperlinks from the page source
    def getHyperLinks(self, url):
        links = []
        data = self.getPageSource(url)
        if data[0] == "200":
            soup = BeautifulSoup(data[1])
            a = soup.findAll("a", {"href": re.compile(".*")})
            for i in a:
                # Keep only absolute http links
                if i["href"].find("http://") != -1:
                    links.append(i["href"])
        return links

    # Download the page source
    def getPageSource(self, url, timeout=100, coding=None):
        try:
            socket.setdefaulttimeout(timeout)
            req = urllib2.Request(url)
            req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
            response = urllib2.urlopen(req)
            if coding is None:
                coding = response.headers.getparam("charset")
            if coding is None:
                page = response.read()
            else:
                page = response.read()
                page = page.decode(coding).encode('utf-8')
            return ["200", page]
        except Exception, e:
            print str(e)
            return [str(e), None]

class LinkQuence:
    def __init__(self):
        # URLs that have already been visited
        self.visited = []
        # URLs waiting to be visited
        self.unVisited = []

    # Get the queue of visited URLs
    def getVisitedUrl(self):
        return self.visited

    # Get the queue of unvisited URLs
    def getUnvisitedUrl(self):
        return self.unVisited

    # Add a URL to the visited queue
    def addVisitedUrl(self, url):
        self.visited.append(url)

    # Remove a URL from the visited queue
    def removeVisitedUrl(self, url):
        self.visited.remove(url)

    # Pop a URL from the unvisited queue
    def unVisitedUrlDeQuence(self):
        try:
            return self.unVisited.pop()
        except IndexError:
            return None

    # Make sure every URL is visited only once
    def addUnvisitedUrl(self, url):
        if url != "" and url not in self.visited and url not in self.unVisited:
            self.unVisited.insert(0, url)

    # Number of visited URLs
    def getVisitedUrlCount(self):
        return len(self.visited)

    # Number of unvisited URLs
    def getUnvisitedUrlCount(self):
        return len(self.unVisited)

    # Whether the unvisited queue is empty
    def unVisitedUrlsEmpty(self):
        return len(self.unVisited) == 0

def main(seeds, crawl_count):
    craw = MyCrawler(seeds)
    craw.crawling(seeds, crawl_count)

if __name__ == "__main__":
    main(["http://www.baidu.com", "http://www.google.com.hk"], 50)
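Note that the code above targets Python 2: urllib2, BeautifulSoup 3, print statements and the "except Exception, e" syntax no longer exist in Python 3. A rough sketch of how the page-fetching part (getPageSource) might look on Python 3, using only urllib.request from the standard library (the link-extraction step would similarly move to the bs4 package):

import urllib.request

def get_page_source(url, timeout=100, coding=None):
    # Illustrative Python 3 counterpart of getPageSource above
    try:
        req = urllib.request.Request(
            url,
            headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"})
        with urllib.request.urlopen(req, timeout=timeout) as response:
            charset = coding or response.headers.get_content_charset() or "utf-8"
            page = response.read().decode(charset, errors="replace")
        return ["200", page]
    except Exception as e:
        print(str(e))
        return [str(e), None]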
