Question No. 0009: Given an HTML file, find the links inside it.
Idea: To extract hyperlinks from a web page, it is convenient to read the page content first and then parse it with BeautifulSoup. However, I ran into a problem: directly extracting the href of every <a> tag also yields values such as javascript:xxx and #xxx, so these need special handling.
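For illustration, here is a tiny snippet (the sample HTML below is made up, not taken from the blog post) showing the mix of href values that makes this filtering necessary:

from bs4 import BeautifulSoup

sample = '''
<a href="http://example.com/full">absolute URL</a>
<a href="/feed.html">site-relative path</a>
<a href="#comment-1">in-page anchor</a>
<a href="javascript:void(0)">javascript pseudo-link</a>
'''
for a in BeautifulSoup(sample, 'html.parser').find_all('a'):
    # Prints a mix of absolute URLs, /paths, #anchors and javascript: entries
    print(a['href'])

The script below treats each of these cases differently: javascript: entries are dropped, while /paths and #anchors are completed into absolute URLs.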
0009. Extract hyperlinks from web pages (.py)
#!/usr/bin/env python
# coding: utf-8

from bs4 import BeautifulSoup
import urllib
import urllib2
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

# URL of the page to parse
URL = 'http://www.ruanyifeng.com/blog/2015/05/co.html'


def findalllink(url):
    """Extract hyperlinks from a web page."""
    # Get the protocol and domain name
    proto, rest = urllib.splittype(url)
    domain = urllib.splithost(rest)[0]
    # Read the web page content
    html = urllib2.urlopen(url).read()
    # Extract all <a> tags
    a = BeautifulSoup(html).findAll('a')
    # Filter out javascript: links
    alist = [i.attrs['href'] for i in a if i.attrs['href'][0] != 'j']
    # Complete anchors like #comment-text into
    # http://www.ruanyifeng.com/blog/2015/05/co.html#comment-text,
    # and paths like /feed.html into http://www.ruanyifeng.com/feed.html
    alist = map(lambda i: proto + '://' + domain + i if i[0] == '/'
                else URL + i if i[0] == '#'
                else i, alist)
    return alist


if __name__ == '__main__':
    for i in findalllink(URL):
        print i
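The script above targets Python 2 (urllib2, reload(sys)). As a minimal sketch of the same approach for Python 3, assuming BeautifulSoup 4 and only the standard library, urllib.parse.urljoin can replace the manual protocol/domain handling:

#!/usr/bin/env python3
# coding: utf-8
# Minimal Python 3 sketch of the same idea; not the original script.
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = 'http://www.ruanyifeng.com/blog/2015/05/co.html'


def find_all_links(url):
    """Extract and normalize hyperlinks from a web page."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        # Skip javascript: pseudo-links, as in the original filter
        if href.startswith('javascript:'):
            continue
        # urljoin turns /feed.html and #comment-xxx into absolute URLs
        links.append(urljoin(url, href))
    return links


if __name__ == '__main__':
    for link in find_all_links(URL):
        print(link)

urljoin also resolves relative paths without a leading slash, a case the original lambda leaves untouched.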
Tested on an article from Ruan Yifeng's blog, the script prints the list of extracted links.