The usual steps when thinking about "web crawlers":
- Retrieve the HTML data from a website's domain name
- Parse that data for the target information
- Store the target information
- If necessary, move to another page and repeat the process
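The four steps above can be sketched as a small loop. The `PAGES` dict and `TitleParser` class below are made-up stand-ins for real network fetches, so the sketch runs offline; a real crawler would replace the dict lookup with a `urlopen` call:

```python
from html.parser import HTMLParser
from collections import deque

# Canned page contents standing in for real HTTP responses (illustrative only).
PAGES = {
    "/page1": "<html><head><title>Page 1</title></head><a href='/page2'>next</a></html>",
    "/page2": "<html><head><title>Page 2</title></head></html>",
}

class TitleParser(HTMLParser):
    """Collects the <title> text and any linked hrefs from one page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = data

def crawl(start):
    queue, seen, titles = deque([start]), set(), {}
    while queue:                      # step 4: repeat for each newly found page
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        parser = TitleParser()
        parser.feed(PAGES[url])       # steps 1-2: fetch the HTML and parse it
        titles[url] = parser.title    # step 3: store the target information
        queue.extend(parser.links)    # collect links to visit next
    return titles

print(crawl("/page1"))  # {'/page1': 'Page 1', '/page2': 'Page 2'}
```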
1. First look at the urllib library
urllib is part of the standard library. In Python 3.x, urllib2 was renamed urllib and split into submodules: urllib.request, urllib.parse, and urllib.error.
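A quick taste of those submodules: urllib.parse works entirely offline, so this snippet runs without a network connection, and the exception classes used later live in urllib.error:

```python
from urllib.parse import urlparse, urljoin
from urllib.error import HTTPError, URLError  # exception classes used for error handling

# Split a URL into its components.
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.netloc)  # pythonscraping.com
print(parts.path)    # /pages/page1.html

# Resolve a relative link against a base page, as a crawler would.
print(urljoin("http://pythonscraping.com/pages/page1.html", "page2.html"))
```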
urlopen opens and reads a remote object obtained over the network.
Import urlopen, then call html.read() to get the HTML content of the page.
>>> from urllib.request import urlopen
>>> html = urlopen("http://pythonscraping.com/pages/page1.html")
>>> print(html.read())
b'...
2. Using a Virtual Environment
You can keep library files isolated in a virtual environment.
For example: create a new environment called scrapingenv and activate it; inside the new scrapingenv environment you can install and use BeautifulSoup; finally, leave the environment with the deactivate command.
$ virtualenv scrapingenv
$ cd scrapingenv/
$ source bin/activate
(scrapingenv) ryan$ pip install beautifulsoup4
(scrapingenv) ryan$ python
> from bs4 import BeautifulSoup
>
(scrapingenv) ryan$ deactivate
$
You can install the BeautifulSoup library with either of the following commands:
pip3 install bs4
pip3 install beautifulsoup4
3. First look at the BeautifulSoup library
The HTML content is passed into a BeautifulSoup object (html.parser is a built-in parser), and tags such as h1 can then be pulled directly out of that object.
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://pythonscraping.com/pages/page1.html")
>>> bsObj = BeautifulSoup(html.read(), 'html.parser')
>>> print(bsObj.h1)
All of the following function calls produce the same result:
bsObj.h1
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
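You can verify this equivalence on a small document without fetching anything; the HTML string here is made up for illustration, but any page with a single h1 behaves the same way:

```python
from bs4 import BeautifulSoup

# A made-up page fragment used in place of a network fetch.
doc = "<html><body><h1>An Interesting Title</h1></body></html>"
bsObj = BeautifulSoup(doc, "html.parser")

# All four navigation paths land on the same tag.
assert bsObj.h1 == bsObj.html.body.h1 == bsObj.body.h1 == bsObj.html.h1
print(bsObj.h1)  # <h1>An Interesting Title</h1>
```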
4. Handling Exceptions
The urlopen function throws an HTTPError exception if the page does not exist on the server (or an error occurred while retrieving it).
If the server itself cannot be found (the link will not open, or the URL is mistyped), urlopen raises a URLError.
Handle these cases as follows:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen("http://pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # interrupt the program, or fall back to another plan
except URLError as e:
    print("The server could not be found")
else:
    # program continues
    pass
Calling a tag that does not exist on a BeautifulSoup object returns a None object.
Calling a child tag on that None object then raises an AttributeError.
Handle both cases as follows:
try:
    bsObj = BeautifulSoup(html.read(), 'html.parser')
    badContent = bsObj.body.h2
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
5. Re-organize the above code
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title


title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
We created a getTitle function that returns the title of the page, or a None object if any problem occurs while retrieving it.
Inside getTitle, we check for an HTTPError as before, then wrap the two BeautifulSoup lines inside a try statement. If either of those lines hits a problem, an AttributeError can be thrown (for example, if the server does not exist, html would be a None object and html.read() would throw an AttributeError).
Reading notes on "Web Scraping with Python" (Part 1)