Reading notes on oreilly.web.scraping.with.python.2015.6: finding all the hrefs in a page

Source: Internet
Author: User


1. Find every <a> tag on the page, then check whether it has an href attribute (as in <a href="...">). If it does, extract the value of the href.

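from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
# Name the parser explicitly so bs4 does not have to guess one
bsObj = BeautifulSoup(html, "html.parser")
# Visit every <a> tag and print its href, skipping anchors that have none
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])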

Result: [screenshot of the program output]

The same links in the page's HTML source: [screenshot of the page source]
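To check the logic of step 1 without downloading anything, the following is a minimal sketch that runs on an inline HTML string. It swaps BeautifulSoup for html.parser from the Python standard library, so it is not the book's code; the HrefCollector class and the SAMPLE_HTML markup are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)


# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<p><a href="/wiki/Kevin_Bacon">Kevin Bacon</a>
<a name="top">anchor without href</a>
<a href="http://example.com/">external</a></p>
"""

collector = HrefCollector()
collector.feed(SAMPLE_HTML)
print(collector.hrefs)
```

The second <a> tag has no href, so it is skipped, which mirrors the `'href' in link.attrs` check in the BeautifulSoup version.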

2. Extract only the links whose href begins with /wiki/

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "lxml")
# Search only inside the article body, and keep hrefs that start with
# /wiki/ and contain no colon
for link in bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

Result: [screenshot of the program output]
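The regular expression in step 2 does two things: it requires the href to start with /wiki/, and its negative lookahead (?!:) rejects any href containing a colon, which filters out namespace pages such as Category: or File:. The sketch below tries the pattern on a few href values; the sample strings are made up for illustration and are not taken from the real page.

```python
import re

# The pattern from step 2: starts with /wiki/ and contains no colon.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Made-up href values of the kinds that appear on a Wikipedia page.
samples = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_actors",
    "/wiki/File:Kevin_Bacon.jpg",
    "//en.wikipedia.org/wiki/Main_Page",
    "#cite_note-1",
]

# Keep only the hrefs the pattern accepts.
matches = [href for href in samples if ARTICLE_LINK.match(href)]
print(matches)
```

Only the plain article path survives; the two namespace links fail on the colon and the last two do not start with /wiki/.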

3. Walk through a series of pages, extracting the links that begin with /wiki/ from each one and following one of them at random

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with a float timestamp: random.seed() rejects datetime objects on Python 3.11+
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    # Return every article link found in the body of the given Wikipedia page
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
# Keep hopping to a randomly chosen article until a page has no article links
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

Result: [screenshot of the program output]

After running for a while, the program fails with the error "An existing connection was forcibly closed by the remote host". Is the site refusing the program's connections?
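A connection reset can have several causes, and these notes do not establish which one applies here. One possibility is that the server throttles clients that send many requests in quick succession. A common mitigation, sketched below under that assumption, is to catch the error, pause, and retry with an increasing delay. The function name fetch_with_retry, its parameters, and the User-Agent string are all hypothetical and not from the book; opener and sleep are injectable only so the function can be tried without a network.

```python
import time
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_with_retry(url, opener=urlopen, attempts=3, delay=1.0, sleep=time.sleep):
    """Fetch url and return the body as bytes, retrying on connection errors.

    attempts must be at least 1. opener and sleep can be replaced for testing.
    """
    # Sending a descriptive User-Agent is an assumption, not something the notes do.
    request = Request(url, headers={"User-Agent": "reading-notes-crawler/0.1"})
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(request) as response:
                return response.read()
        except (ConnectionError, URLError) as error:
            # ConnectionResetError is a subclass of ConnectionError.
            last_error = error
            if attempt < attempts - 1:
                # Wait longer after each failure: delay, 2*delay, 4*delay, ...
                sleep(delay * 2 ** attempt)
    raise last_error
```

In step 3, the call urlopen("http://en.wikipedia.org" + articleUrl) could be replaced with fetch_with_retry("http://en.wikipedia.org" + articleUrl), since BeautifulSoup also accepts the returned bytes. Adding a short time.sleep between page hops would reduce the request rate further.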
