1. Simple web access
urllib is part of Python's standard library (so you don't need to install anything extra to run the demo). It includes functions for requesting data over the network, handling cookies, and even changing metadata such as headers and the user agent.
urlopen: this function is used to access remote data over the network. By default it sends a GET request to the given URL.
    from urllib.request import urlopen

    html = urlopen('http://pythonscraping.com/pages/page1.html')
    print(html.read())
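Since the text mentions changing headers and the user agent, here is a minimal sketch of how that might look with urllib.request.Request. The helper name build_request and the user agent string "my-crawler/0.1" are made up for illustration; the request is only built here, not sent.

```python
from urllib.request import Request

def build_request(url, user_agent):
    """Return a request object that carries a custom User-Agent header."""
    # build_request is a hypothetical helper, not part of urllib.
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("http://pythonscraping.com/pages/page1.html", "my-crawler/0.1")

# A Request without a body is sent as GET, the same as a plain urlopen(url).
print(req.get_method())
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))

# To actually send it over the network:
#   from urllib.request import urlopen
#   html = urlopen(req).read()
```

Passing the Request object to urlopen instead of a bare URL string is what makes the custom header go out with the request.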
2. A brief introduction to BeautifulSoup
This module is an HTML/XML parser that copes well with non-standard (malformed) tags and builds a parse tree from them.
I. Installing BeautifulSoup
Linux: apt-get install python-bs4 (python3-bs4 for Python 3), or pip install beautifulsoup4
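After installing, one quick way to confirm that the package is visible to your interpreter is to ask importlib whether the module can be found. This is only a sketch; is_installed is a hypothetical helper name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None

# The package is installed as "beautifulsoup4" but imported as "bs4".
print("bs4 installed:", is_installed("bs4"))
```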
II. In this module (bs4), the BeautifulSoup object is the one most commonly used.
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
    bsObj = BeautifulSoup(html.read(), 'html.parser')
    print(bsObj.h1)

    # Output: the first <h1> tag on the page
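To try the parser without a network connection, you can feed BeautifulSoup a string directly. The sketch below uses a made-up snippet of HTML in which the <p> tags are never closed, to illustrate that a parse tree is still built from imperfect markup.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; note that neither <p> tag is closed.
markup = (
    "<html><body>"
    "<h1>An Example Title</h1>"
    "<p>First paragraph<p>Second paragraph"
    "</body></html>"
)

soup = BeautifulSoup(markup, "html.parser")

# soup.h1 is a shortcut for the first <h1> anywhere in the tree.
print(soup.h1)
# The same tag can also be reached by walking down from the root.
print(soup.html.body.h1.get_text())
# Both paragraphs are found even though their tags were left open.
print(len(soup.find_all("p")))
```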
3. Exception Handling
I. If the requested page is not found on the server, or an error occurs while retrieving it, an HTTP error is returned. This may be "404 Page Not Found", "500 Internal Server Error", and so on. In all of these cases, urlopen throws an HTTPError exception.
    from urllib.request import urlopen
    from urllib.error import HTTPError

    try:
        html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
    except HTTPError as e:
        print(e)
        # return None, break, or do some other "plan B"
    else:
        # Program continues
        pass
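What the "plan B" should be often depends on the status code, which an HTTPError exposes as its code attribute. The following is a minimal sketch of one possible policy. The function plan_b, the action names "skip", "retry" and "abort", and the example.com URLs are all invented for illustration, and the errors are built by hand so that nothing is sent over the network.

```python
import io
from email.message import Message
from urllib.error import HTTPError

def plan_b(error):
    """Map an HTTPError to a hypothetical follow-up action."""
    if error.code == 404:
        return "skip"    # the page does not exist, so move on
    if 500 <= error.code < 600:
        return "retry"   # server-side problem, possibly temporary
    return "abort"

# Hand-built errors standing in for what urlopen would raise.
not_found = HTTPError("http://example.com/missing", 404, "Not Found",
                      Message(), io.BytesIO())
server_error = HTTPError("http://example.com/broken", 500,
                         "Internal Server Error", Message(), io.BytesIO())

print(not_found.code, plan_b(not_found))
print(server_error.code, plan_b(server_error))
```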
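II. If the server cannot be found at all (for example, the domain name is wrong or the server is down), urlopen does not return None. It throws a URLError (imported from urllib.error). HTTPError is a subclass of URLError, so when you handle both, catch HTTPError first. Here url is the address being requested:

    try:
        html = urlopen(url)
    except URLError as e:
        print("URL is not found")
    else:
        # Program continues
        pass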
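In Python 3, an unreachable server shows up as a urllib.error.URLError, and HTTPError is a subclass of it, so one function can handle both cases if HTTPError is caught first. The sketch below assumes a hypothetical helper named fetch. Its opener parameter is also an invention, added so that the error paths can be exercised with stand-in functions instead of real network calls.

```python
import io
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request failed."""
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
        return None
    except URLError as e:
        # No usable answer at all: unknown host, refused connection, etc.
        print("Server not reachable:", e.reason)
        return None
    return response.read()

# Stand-ins for urlopen, used only to demonstrate the two outcomes offline.
def fake_unreachable(url):
    raise URLError("name or service not known")

def fake_ok(url):
    return io.BytesIO(b"<html></html>")

print(fetch("http://example.com/", opener=fake_unreachable))
print(fetch("http://example.com/", opener=fake_ok))
```

In real use you would call fetch(url) and let it fall back to urlopen.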
A way of writing this that catches the various exceptions:
    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    def getTitle(url):
        try:
            html = urlopen(url)
        except HTTPError as e:
            return None
        try:
            bsObj = BeautifulSoup(html.read(), 'html.parser')
            title = bsObj.body.h1
        except AttributeError as e:
            # bsObj.body was None, so .h1 could not be looked up
            return None
        return title

    title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
    if title is None:
        print("Title could not be found")
    else:
        print(title)
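The reason for catching AttributeError is that BeautifulSoup returns None for a tag that does not exist, and asking None for a further tag then fails. The sketch below illustrates both situations on made-up HTML strings, with no network access. first_heading is a hypothetical helper that mirrors the second try block above.

```python
from bs4 import BeautifulSoup

def first_heading(markup):
    """Return the first <h1> inside <body>, or None if either is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    try:
        return soup.body.h1
    except AttributeError:
        # soup.body was None, so looking up .h1 on it failed.
        return None

# Made-up sample documents.
with_h1 = "<html><body><h1>A Title</h1></body></html>"
no_h1 = "<html><body><p>No heading here</p></body></html>"
no_body = "<p>No body tag at all</p>"

print(first_heading(with_h1))   # the <h1> tag
print(first_heading(no_h1))     # None: body exists, h1 does not
print(first_heading(no_body))   # None: AttributeError was caught
```

Either way the caller only has to test the result against None, which is what the final if/else does.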
Python crawler (i)