Notes on O'Reilly's Web Scraping with Python (2015.6): Crawling
1. The function calls itself recursively, so a crawl loop is formed:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        # Find the element with id="mw-content-text", then its <p> tags;
        # [0] selects item 0 of the list returned by findAll
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # Find id="ca-edit", then the <span> inside it, then the <a> inside
        # that, and print the value of its href attribute
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
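The crawl loop keeps only internal article links by matching each href against re.compile("^(/wiki/)"). A small offline sketch of that filter (the sample hrefs below are made up for illustration, not taken from a real page):

```python
import re

# The same pattern as in the crawler: keep only hrefs starting with "/wiki/"
wiki_link = re.compile("^(/wiki/)")

# Hypothetical hrefs of the kinds found on a Wikipedia page
hrefs = [
    "/wiki/Python_(programming_language)",  # internal article link: kept
    "/wiki/Web_scraping",                   # internal article link: kept
    "//en.wiktionary.org/wiki/crawl",       # protocol-relative external link: dropped
    "#cite_note-1",                         # in-page anchor: dropped
]

# re.match anchors at the start of the string, so only true "/wiki/..."
# paths survive the filter
internal = [h for h in hrefs if wiki_link.match(h)]
print(internal)
```

Note that without the check against the pages set, this recursion never terminates on a site as interlinked as Wikipedia, and even with it the crawl can exhaust Python's recursion limit on a long run.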
2. Processing a URL: strip the "http://" prefix, then split the rest on "/":
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("https://hao.360.cn/?a1004")
print(addr)
The output is:

['https:', '', 'hao.360.cn', '?a1004']

(This URL starts with "https://", so replace("http://", "") removes nothing; the two slashes in "//" have nothing between them, which shows up as the empty string ''.)
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(addr)
The output is:

['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']

(Here the "http://" prefix is removed first, so no empty strings appear in the result.)
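Splitting on "/" by hand works for these examples, but it treats the scheme, path, and fragment as ordinary text. The standard library's urllib.parse.urlparse separates those components explicitly; a minimal sketch using the same autohome URL:

```python
from urllib.parse import urlparse

url = "http://www.autohome.com.cn/wuhan/#pvareaid=100519"
parts = urlparse(url)

# urlparse names each component instead of relying on list positions
print(parts.scheme)    # 'http'
print(parts.netloc)    # 'www.autohome.com.cn'
print(parts.path)      # '/wuhan/'
print(parts.fragment)  # 'pvareaid=100519'
```

Unlike the replace("http://", "") trick, this also handles "https://" URLs without leaving stray empty strings in the result.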