Notes on oreilly.web.scraping.with.python.2015.6 --- Find all the hrefs in a page
1. Find every <a> tag, then check whether it actually carries an href attribute (i.e. looks like <a href="...">); only then extract the href value. The check matters because an <a> tag without an href would otherwise raise an error when the attribute is read.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Run result: every href value on the page is printed (screenshot omitted).
Locating these links in the page's HTML source: each printed href comes from an anchor of the form <a href="...">...</a> (screenshot omitted).
2. Extract only the links that begin with /wiki/
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "lxml")
# Only keep article links: they start with /wiki/ and contain no colon,
# which filters out namespace pages such as /wiki/File:... or /wiki/Talk:...
for link in bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Run result: only article links of the form /wiki/... are printed (screenshot omitted).
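To make the filter concrete, here is a small standalone check of that regex (my own illustration, not from the book; the sample paths are hypothetical):

import re

# ^(/wiki/)((?!:).)*$ : must start with /wiki/ and contain no colon after it
pattern = re.compile("^(/wiki/)((?!:).)*$")

for path in ["/wiki/Kevin_Bacon",           # article link: matches
             "/wiki/File:Kevin_Bacon.jpg",  # file page: rejected (colon)
             "/wiki/Talk:Kevin_Bacon",      # talk page: rejected (colon)
             "/w/index.php"]:               # non-/wiki/ path: rejected
    print(path, "->", bool(pattern.match(path)))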
3. Hop from page to page, extracting the /wiki/ links of each page in turn (a random walk)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed the RNG (the book's style; Python 3.11+ requires an
# int/float/str/bytes seed, e.g. str(datetime.datetime.now()))
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Pick one article link at random and follow it to the next page
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
Run result: a stream of /wiki/ links, one per page visited (screenshot omitted).
After running for a while, the program fails with an error: "An existing connection was forcibly closed by the remote host." Is this the site refusing the scraper's connections? Most likely yes: firing requests back to back with urllib's default User-Agent makes the traffic easy for the server to throttle or drop.
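A plausible mitigation (my own sketch, not from the book): pause between requests, send a browser-like User-Agent, and retry when the connection drops. The header string, delay, and retry count below are arbitrary assumptions.

from urllib.request import urlopen, Request
from urllib.error import URLError
import time

def fetch(url, retries=3, delay=2.0):
    # Identify as a regular browser; the default urllib User-Agent
    # is easy for sites to block. (Header value is an assumption.)
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    for attempt in range(retries):
        try:
            return urlopen(req).read()
        except (URLError, ConnectionResetError):
            time.sleep(delay * (attempt + 1))  # back off, then retry
    raise RuntimeError("giving up on " + url)

In getLinks, urlopen(...) would be replaced with fetch(...), and a short sleep between pages (e.g. time.sleep(1)) keeps the request rate polite.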