This example shows how to extract the list of URLs from a web page in Python, then check that each link is reachable and time the request. It is shared here for your reference. The implementation is as follows:
from bs4 import BeautifulSoup
import time
import urllib2

t = time.time()
websiteurls = {}

def scanpage(url):
    websiteurl = url
    t = time.time()
    n = 0
    html = urllib2.urlopen(websiteurl).read()
    soup = BeautifulSoup(html)
    upageurls = {}
    # Collect every <a href=...> that points back to the same site,
    # skipping duplicates.
    pageurls = soup.find_all("a", href=True)
    for link in pageurls:
        href = link.get("href")
        if websiteurl in href and href not in upageurls and href not in websiteurls:
            upageurls[href] = 0
    # Request each collected link, record its HTTP status code,
    # and print how long the request took.
    for link in upageurls.keys():
        t2 = time.time()
        try:
            upageurls[link] = urllib2.urlopen(link).getcode()
        except:
            print "connect failed"
        else:
            print n, link, upageurls[link]
            print time.time() - t2
        n += 1
    print "total is " + repr(n) + " links"
    print time.time() - t

scanpage("http://news.163.com/")
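For readers on Python 3, where urllib2 no longer exists, the link-gathering step above can be sketched with only the standard library's html.parser instead of BeautifulSoup. This is a minimal sketch of the same idea, not the original author's code; the sample HTML and the example.com URLs are made-up stand-ins for a page fetched over the network:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, keeping order and dropping repeats."""

    def __init__(self, base):
        super().__init__()
        self.base = base       # only keep links that contain this base URL
        self.links = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same filter as the original script: same-site and not yet seen.
        if href and self.base in href and href not in self._seen:
            self._seen.add(href)
            self.links.append(href)


# Made-up page standing in for html = urlopen(...).read()
html = (
    '<a href="http://example.com/a">A</a>'
    '<a href="http://example.com/b">B</a>'
    '<a href="http://example.com/a">A again</a>'
    '<a href="http://other.org/c">elsewhere</a>'
)

collector = LinkCollector("http://example.com")
collector.feed(html)
print(collector.links)  # ['http://example.com/a', 'http://example.com/b']
```

The duplicate link and the off-site link are both filtered out, matching the dictionary-based deduplication in the script above.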
I hope this article helps you with your Python programming.