Also refer to the online tutorial, but also a few of the HTML tags will be reviewed a bit
At the same time Amway a website, I only joined a Community official website (Web Development Association
Www.nutjs.com
The former president belongs to the existence of the Daniel class, the website has been reconstructed many times, the peanut is too hot.
OK, so I used this site to do the next exercise
ImportReImporturllib.requestImportUrllib fromCollectionsImportDequequeue=deque () visited=set () URL='http://www.nutjs.com/'#Initial crawl Sitequeue.append (URL) CNT= 0#Crawl Web Counter whileQueue#Queue Loop BFs CrawlURL =Queue.popleft () visited|= {URL}#deduplication to prevent repetitive crawls Print('Crawling:'+URL) CNT+=1Urlop=urllib.request.urlopen (URL)if 'HTML' not inchUrlop.getheader ('Content-type'):Continue #filter out the text needed for legal Try: Data= Urlop.read (). Decode ('Utf-8') except: ContinueLinkre= Re.compile ('href=\ "(. +?) \"') forXinchLinkre.findall (data):#print (x) if 'http' inchX andX not inchvisited:queue.append (x)
The results are as follows:
The second Python crawler