1. Understanding Urllib
urllib is a Python standard library that provides rich functionality such as requesting data from a web server and handling cookies; it corresponds to the urllib2 library in Python 2. Unlike urllib2, the Python 3 urllib is split into several submodules: urllib.request, urllib.parse, urllib.error, etc. For details on using the urllib library, see https://docs.python.org/3/library/urllib.html
from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'
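The submodules listed above can also be used on their own. As a minimal sketch (not from the original text), urllib.parse splits and builds URLs without touching the network:

```python
from urllib.parse import urlparse, urlencode

# Split a URL into its components
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.scheme)   # http
print(parts.netloc)   # pythonscraping.com
print(parts.path)     # /pages/page1.html

# Build a query string from a dict
query = urlencode({"q": "web scraping", "page": 2})
print(query)          # q=web+scraping&page=2
```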
2. Understanding BeautifulSoup
The BeautifulSoup library is used to parse HTML text and convert it into a BeautifulSoup object.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "lxml")
print(bsObj.h1)
The BeautifulSoup constructor takes the name of a parsing library; the following table lists several common parsers and gives their advantages and disadvantages:
| Parsing library | How to use | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(html, "html.parser") | Built into Python; reasonably fast | Poor fault tolerance |
| lxml HTML parser | BeautifulSoup(html, "lxml") | Fast; high fault tolerance | Requires installation and a C library |
| lxml XML parser | BeautifulSoup(html, ["lxml", "xml"]) | Fast; strong fault tolerance; supports XML | Requires a C library |
| html5lib parser | BeautifulSoup(html, "html5lib") | Browser-style parsing; best fault tolerance | Slow |
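To try a parser without installing anything extra, here is a minimal sketch (the HTML string is made up for illustration) using the built-in "html.parser"; swapping in "lxml" or "html5lib" works the same way once those packages are installed:

```python
from bs4 import BeautifulSoup

# A tiny made-up document for demonstration purposes
html = "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"

# "html.parser" ships with Python; no C library or extra install is needed
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)  # Demo
print(soup.h1.string)     # Hello
```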
3. Building a Reliable Crawler
When visiting a website we often hit a 404 Page Not Found, or the server may be temporarily down. In those cases the urlopen function throws an exception and the program cannot continue; we can use the urllib.error module to handle the exception.
from urllib.request import urlopen
from urllib.error import URLError

try:
    html = urlopen("https://www.baid.com/")  # url is wrong
except URLError as e:
    print(e)
<urlopen error [Errno 111] Connection refused>
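Note that urllib.error distinguishes two cases: HTTPError (the server answered, but with an error status such as 404) is a subclass of URLError (the server could not be reached at all). A sketch of catching them separately, using a hypothetical fetch helper not taken from the original text:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch(url):
    """Return the page body, or None if the request fails (hypothetical helper)."""
    try:
        return urlopen(url).read()
    except HTTPError as e:
        # The server replied, but with an error status (e.g. 404)
        print("HTTP error:", e.code)
    except URLError as e:
        # The server could not be reached (bad hostname, refused connection, ...)
        print("URL error:", e.reason)
    return None
```

HTTPError must be caught before URLError, since it is a subclass and would otherwise be swallowed by the more general handler.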
After obtaining a reliable connection, we use BeautifulSoup to process the HTML. A common situation is that a tag can no longer be found after the site is redesigned, which throws an exception.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
try:
    bsObj = BeautifulSoup(html.read(), "lxml")
    li = bsObj.ul.li
    print(li)
except AttributeError as e:
    print(e)
'NoneType' object has no attribute 'li'
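Instead of catching AttributeError, one can also test for None explicitly, since BeautifulSoup's find returns None when a tag is missing. A minimal sketch with a made-up HTML string:

```python
from bs4 import BeautifulSoup

# Made-up document with no <ul> tag, for illustration
html = "<html><body><h1>Only a heading</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find returns None rather than raising, so an explicit check
# avoids the AttributeError produced by chained attribute access
ul = soup.find("ul")
if ul is None:
    print("no <ul> on this page")
else:
    print(ul.li)
```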
4. The First Crawler Program

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title == None:
    print("Title could not be found.")
else:
    print(title)
Web Scraping with Python, Chapter 1.