One. A first look at web crawlers
All examples use Python 3.
A simple example:
from urllib.request import urlopen

# Fetch the page and print its raw HTML (returned as bytes)
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
Python 2.x uses the urllib2 library. In Python 3.x, urllib2 was renamed urllib and split into submodules, including urllib.request, urllib.parse, and urllib.error.
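As a small illustration of that split, the sketch below touches all three submodules without opening a network connection. The helper name describe_url is hypothetical and not part of the original notes.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlparse
from urllib.request import Request


def describe_url(url):
    """Split a URL with urllib.parse and wrap it in a urllib.request.Request."""
    parts = urlparse(url)
    request = Request(url)  # only built here; nothing is sent over the network
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "method": request.get_method(),
    }


info = describe_url("http://pythonscraping.com/pages/page1.html")
print(info)

# HTTPError is a subclass of URLError, so "except URLError" also catches HTTPError.
print(issubclass(HTTPError, URLError))
```

Catching URLError instead of HTTPError is therefore the broader choice when a script should also survive an unreachable server.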
Two. BeautifulSoup
1. Using BeautifulSoup
Note: 1. Install the module with pip install beautifulsoup4
2. Make the network connection reliable by handling the exceptions that may occur in the program
For example:
from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getTitle(url):
    # The server may answer with an error status (for example 404 or 500)
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    # A missing <body> makes bsObj.body None, so .h1 raises AttributeError
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("title is not found")
else:
    print(title)
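The AttributeError case can be tried without a network connection. The following is a minimal sketch on made-up HTML strings. The helper name get_h1 and the sample markup are assumptions for illustration, and the parser is fixed to "html.parser" because other parsers may insert a missing body tag on their own.

```python
from bs4 import BeautifulSoup


def get_h1(html_text):
    """Return the first <h1> inside <body>, or None if either tag is missing."""
    soup = BeautifulSoup(html_text, "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body was None, so .h1 could not be looked up on it
        return None


with_title = get_h1("<html><body><h1>An Interesting Title</h1></body></html>")
no_body = get_h1("<p>no body tag here</p>")
no_h1 = get_h1("<html><body><p>no heading</p></body></html>")

print(with_title)
print(no_body)
print(no_h1)
```

Only the second call goes through the except branch. In the third call the body exists, so looking up the missing h1 simply returns None without raising.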
2. The web crawler can select specific content by the value of the class attribute
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
# Use the findAll function on the bsObj object to extract the span tags whose class attribute is red
contentList = bsObj.findAll("span", {"class": "red"})
for content in contentList:
    print(content.get_text())
    print('\n')
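The same class filter can be checked offline. This sketch runs on a made-up HTML fragment (the sample text is an assumption, not the real page) and uses find_all, the current bs4 spelling of the findAll method used in these notes.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the real page
sample = """
<div>
  <span class="red">Well, Prince</span>
  <span class="green">Anna Pavlovna</span>
  <span class="red">I warn you</span>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# Filter with an attribute dictionary, as in the notes
red_spans = soup.find_all("span", {"class": "red"})
red_texts = [span.get_text() for span in red_spans]
print(red_texts)

# The class_ keyword is an equivalent spelling ("class" is reserved in Python)
green_spans = soup.find_all("span", class_="green")
green_texts = [span.get_text() for span in green_spans]
print(green_texts)
```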
3. Navigating the tree
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# Find the child tags of the table
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)
# Find the sibling tags that follow the first row
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
# Print the text of every h2 heading
for h2title in bsObj.findAll("h2"):
    print(h2title.get_text())
# From the image, go up to its parent cell, then to the cell before it
print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
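To see what each navigation step returns, here is a minimal sketch on a made-up table. The rows and prices are assumptions for illustration. The markup is written without whitespace between tags, because on a real page .children and .next_siblings also yield the whitespace text nodes that sit between tags.

```python
from bs4 import BeautifulSoup

# Made-up table; only the id and the image paths mirror the notes
sample = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th><th>Image</th></tr>'
    '<tr><td>Item A</td><td>$1.00</td><td><img src="../img/gifts/img1.jpg"></td></tr>'
    '<tr><td>Item B</td><td>$2.00</td><td><img src="../img/gifts/img2.jpg"></td></tr>'
    '</table>'
)

soup = BeautifulSoup(sample, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Direct children of the table: every row, including the header row
rows = list(table.children)

# Siblings after the first row: every row except the header row
data_rows = list(table.tr.next_siblings)

# img -> parent <td> -> the <td> before it, which holds the price
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
price = img.parent.previous_sibling.get_text()

print(len(rows), len(data_rows), price)
```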
5. Regular Expressions and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# findAll returns a list of Tag objects; a Tag's attributes are read like dictionary keys
images = bsObj.findAll("img", {"src": re.compile(r"\.\.\/img\/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
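To check which src values such a pattern accepts, it can be tried on its own with the re module. In this sketch the candidate paths are made up, and the pattern is written without the escaped slashes, which Python does not require. As far as I know, BeautifulSoup tests a compiled pattern against the attribute value with search(), so the same call is used here.

```python
import re

# Same pattern as in the notes, minus the unneeded "\/" escapes
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Made-up candidate values for the src attribute
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img23.jpg",
    "../img/gifts/logo.jpg",   # file name does not start with "img"
    "../img/other/img1.jpg",   # wrong directory
    "../img/gifts/img1.png",   # wrong extension
]

matches = [src for src in candidates if pattern.search(src)]
print(matches)
```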
Notes on BeautifulSoup from "Python Network Data Acquisition"