Python implementation of a breadth-first (width-first) traversal crawler


There is a well-known crawler tutorial online, "Write Your Own Web Crawler," in which all of the source code is written in Java. It introduces the breadth-first traversal algorithm, and with some spare time I reimplemented it in Python, in less than half the code.

Introduction to the breadth-first algorithm

Reference: http://book.51cto.com/art/201012/236668.htm

The whole breadth-first crawl starts from a set of seed pages. The crawler extracts the "child nodes" (i.e. the hyperlinks) from these pages and puts them into a queue for fetching. Links that have already been processed are recorded in a table, usually called the visited table. Before a new link is processed, the crawler checks whether it is already in the visited table: if it is, the link has already been handled and is skipped; if not, it is processed next.
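As a concrete illustration of the "extract the child nodes" step, here is a minimal sketch that pulls hyperlinks out of an HTML string. It assumes the modern bs4 package (the full crawler later in this post uses the older BeautifulSoup 3 API), and extract_links is only an illustrative helper name:

# Minimal sketch of hyperlink extraction, assuming the bs4 package is installed
from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    # keep only absolute http(s) links, as the crawler below does
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]

print(extract_links('<a href="http://www.admin.com">example</a>'))
# -> ['http://www.admin.com']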

The initial URLs are the seed URLs supplied to the crawler (typically specified in the system's configuration file). When the pages that these seed URLs point to are parsed, new URLs are extracted (for example, http://www.admin.com is extracted from <a href="http://www.admin.com"> in the page). The crawler then does the following:

(1) Compare each parsed link against the links in the visited table; if a link is not in the visited table, it has not been visited yet.

(2) Put the link into the TODO table.

(3) After the current page is processed, take a link from the TODO table and put it directly into the visited table.

(4) Continue the same process for the web page that this link points to, and so on (a minimal sketch of this loop follows the list).
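Put together, steps (1) through (4) are just a breadth-first traversal over pages. A minimal sketch of that loop, assuming a hypothetical get_links(url) helper that returns the hyperlinks found on a page:

from collections import deque

def bfs_crawl(seed_urls, get_links, max_pages=50):
    todo = deque(seed_urls)   # TODO table: links waiting to be processed
    visited = set()           # visited table: links already processed
    while todo and len(visited) < max_pages:
        url = todo.popleft()              # step (3): take a link from the TODO table
        if url in visited:                # step (1): skip links already in the visited table
            continue
        visited.add(url)
        for link in get_links(url):       # step (4): repeat for the page this link points to
            if link not in visited and link not in todo:
                todo.append(link)         # step (2): put new links into the TODO table
    return visited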

Table 1.3 shows the crawl process for the example pages in Figure 1.3 of the referenced tutorial.

Table 1.3 Network crawling

TODO table                 Visited table
A                          (empty)
B, C, D, E, F              A
C, D, E, F                 A, B
D, E, F                    A, B, C
E, F                       A, B, C, D
F, H                       A, B, C, D, E
H, G                       A, B, C, D, E, F
G, I                       A, B, C, D, E, F, H
I                          A, B, C, D, E, F, H, G
(empty)                    A, B, C, D, E, F, H, G, I

Breadth-first traversal is the most widely used crawling strategy, for three main reasons:

(1) Important pages tend to be close to the seeds. For example, when we open a news site, the first pages we see carry the hottest news; as we keep browsing deeper, the pages we reach become less and less important.

(2) The actual depth of the World Wide Web can reach about 17 levels, yet there is always a short path to any given page. Breadth-first traversal reaches such pages fastest.

(3) Breadth-first order also helps when several crawlers cooperate: in such setups each crawler usually fetches the links within its own site first, which keeps each crawler's portion of the crawl well contained.

Python implementation of the breadth-first traversal crawler

Python code
# encoding=utf-8
from BeautifulSoup import BeautifulSoup
import socket
import urllib2
import re

class MyCrawler:
    def __init__(self, seeds):
        # Initialize the URL queue with the seeds
        self.linkQuence = LinkQuence()
        if isinstance(seeds, str):
            self.linkQuence.addUnvisitedUrl(seeds)
        if isinstance(seeds, list):
            for i in seeds:
                self.linkQuence.addUnvisitedUrl(i)
        print "Add the seed URLs \"%s\" to the unvisited url list" % str(self.linkQuence.unVisited)

    # Main crawling loop
    def crawling(self, seeds, crawl_count):
        # Loop condition: there are still links to crawl and no more than crawl_count pages have been fetched
        while not self.linkQuence.unVisitedUrlsEmpty() and self.linkQuence.getVisitedUrlCount() <= crawl_count:
            # Dequeue the URL at the head of the queue
            visitUrl = self.linkQuence.unVisitedUrlDeQuence()
            print "Pop out one URL \"%s\" from unvisited url list" % visitUrl
            if visitUrl is None or visitUrl == "":
                continue
            # Extract the hyperlinks from the page
            links = self.getHyperLinks(visitUrl)
            print "Get %d new links" % len(links)
            # Record this URL as visited
            self.linkQuence.addVisitedUrl(visitUrl)
            print "Visited URL count: " + str(self.linkQuence.getVisitedUrlCount())
            # Enqueue the URLs that have not been visited yet
            for link in links:
                self.linkQuence.addUnvisitedUrl(link)
            print "%d unvisited links:" % len(self.linkQuence.getUnvisitedUrl())

    # Extract the hyperlinks from the page source
    def getHyperLinks(self, url):
        links = []
        data = self.getPageSource(url)
        if data[0] == "200":
            soup = BeautifulSoup(data[1])
            a = soup.findAll("a", {"href": re.compile(".*")})
            for i in a:
                # Keep only absolute http links
                if i["href"].find("http://") != -1:
                    links.append(i["href"])
        return links

    # Download the page source
    def getPageSource(self, url, timeout=100, coding=None):
        try:
            socket.setdefaulttimeout(timeout)
            req = urllib2.Request(url)
            req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
            response = urllib2.urlopen(req)
            if coding is None:
                coding = response.headers.getparam("charset")
            if coding is None:
                page = response.read()
            else:
                page = response.read()
                page = page.decode(coding).encode('utf-8')
            return ["200", page]
        except Exception, e:
            print str(e)
            return [str(e), None]

class LinkQuence:
    def __init__(self):
        # URLs that have already been visited
        self.visited = []
        # URLs waiting to be visited
        self.unVisited = []

    # Get the queue of visited URLs
    def getVisitedUrl(self):
        return self.visited

    # Get the queue of unvisited URLs
    def getUnvisitedUrl(self):
        return self.unVisited

    # Add a URL to the visited queue
    def addVisitedUrl(self, url):
        self.visited.append(url)

    # Remove a URL from the visited queue
    def removeVisitedUrl(self, url):
        self.visited.remove(url)

    # Pop a URL from the unvisited queue
    def unVisitedUrlDeQuence(self):
        try:
            return self.unVisited.pop()
        except IndexError:
            return None

    # Make sure every URL is visited only once
    def addUnvisitedUrl(self, url):
        if url != "" and url not in self.visited and url not in self.unVisited:
            self.unVisited.insert(0, url)

    # Number of visited URLs
    def getVisitedUrlCount(self):
        return len(self.visited)

    # Number of unvisited URLs
    def getUnvisitedUrlCount(self):
        return len(self.unVisited)

    # Whether the unvisited queue is empty
    def unVisitedUrlsEmpty(self):
        return len(self.unVisited) == 0

def main(seeds, crawl_count):
    craw = MyCrawler(seeds)
    craw.crawling(seeds, crawl_count)

if __name__ == "__main__":
    main(["http://www.baidu.com", "http://www.google.com.hk"], 50)
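Note that the code above targets Python 2: urllib2, BeautifulSoup 3, print statements and the "except Exception, e" syntax no longer exist in Python 3. A rough sketch of how the page-fetching part (getPageSource) might look on Python 3, using only urllib.request from the standard library (the link-extraction step would similarly move to the bs4 package):

import urllib.request

def get_page_source(url, timeout=100, coding=None):
    # Illustrative Python 3 counterpart of getPageSource above
    try:
        req = urllib.request.Request(
            url,
            headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"})
        with urllib.request.urlopen(req, timeout=timeout) as response:
            charset = coding or response.headers.get_content_charset() or "utf-8"
            page = response.read().decode(charset, errors="replace")
        return ["200", page]
    except Exception as e:
        print(str(e))
        return [str(e), None]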
