HTMLParser is the module in Python's standard library for parsing HTML. It can pick out the tags, data, and other parts of an HTML document, which makes it a simple way to process HTML. HTMLParser is event-driven: when it encounters a particular construct, such as a tag, it calls a user-defined method so that the program can handle it. These callback methods all have names beginning with handle_ and are member functions of the HTMLParser class. To use the parser, we derive a new class from HTMLParser and override the handle_ methods we need. They include the following:
handle_startendtag handles empty tags that open and close in one step, such as <xx/>
handle_starttag handles start tags, such as <xx>
handle_endtag handles end tags, such as </xx>
handle_charref handles character references that begin with &#, which usually give a character by its numeric code
handle_entityref handles entity references that begin with &, such as &nbsp;
handle_data handles data, that is, the text in the middle of <xx>data</xx>
handle_comment handles comments
handle_decl handles declarations that begin with <!, such as <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi handles processing instructions, such as <?instruction>
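To see which of these methods fires for which piece of markup, a small sketch like the following may help. It assumes Python 3, where the module is named html.parser rather than HTMLParser. The class name EventLogger and the sample markup are made up for illustration. convert_charrefs=False is passed because, with the Python 3 default, the parser converts character references itself and handle_charref and handle_entityref are never called.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every callback the parser fires, in order."""

    def __init__(self):
        # Assumption: Python 3. convert_charrefs=False keeps
        # handle_charref and handle_entityref active.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startendtag", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


# Made-up sample markup that touches each handler once.
parser = EventLogger()
parser.feed('<!DOCTYPE html><!-- note --><p class="x">a&amp;b&#65;</p><br/><?php x>')
parser.close()

for event in parser.events:
    print(event)
```

Running it prints one tuple per callback, in document order. The <br/> tag, for instance, is reported through handle_startendtag rather than handle_starttag.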
As an example, here is how to extract the URLs from a web page. To get a URL, parse each <a> tag and read the value of its href attribute. Here is the code:
#-*-encoding:gb2312-*-
import HTMLParser  # Python 2 module name; Python 3 moved it to html.parser
class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
    def handle_starttag(self, tag, attrs):
        # This defines the function that handles start tags
        if tag == 'a':
            # Look through the attributes of the <a> tag
            for name, value in attrs:
                if name == 'href':
                    print value
if __name__ == '__main__':
    a = '