HTMLParser is a Python module for parsing HTML (in Python 3 it lives at html.parser). It can pick out the tags and data in an HTML document and is a simple way to process HTML. HTMLParser is event-driven: when the parser finds a particular piece of markup, it calls a user-defined method so the program can process it. The main callback methods all start with handle_ and are members of the HTMLParser class. To use the module, we derive a new class from HTMLParser and override the handle_ methods we care about. These methods include:
handle_startendtag: processes a tag that opens and closes itself, such as <br/>
handle_starttag: processes a start tag, such as <xx>
handle_endtag: processes an end tag, such as </xx>
handle_charref: processes numeric character references, which start with &# and represent a character by its code, such as &#65;
handle_entityref: processes named entity references, which start with &, such as &nbsp;
handle_data: processes the data between tags, that is, the text in <xx>data</xx>
handle_comment: processes comments, such as <!-- comment -->
handle_decl: processes declarations that start with <!, such as <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi: processes processing instructions, such as <?instruction>
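To see when each callback fires, here is a minimal sketch that simply records every event. The class name EventLogger and the sample input are made up for illustration, and the code assumes Python 3. Note that since Python 3.5 the parser decodes character and entity references by default and passes them to handle_data, so the sketch passes convert_charrefs=False to make handle_charref and handle_entityref fire.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback as an (event, payload) tuple."""

    def __init__(self):
        # convert_charrefs=False so that handle_entityref / handle_charref are called;
        # with the Python 3.5+ default (True) references are decoded into handle_data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


logger = EventLogger()
# Made-up input that exercises each kind of markup from the list above
logger.feed('<!DOCTYPE html><p>a&nbsp;b&#65;</p><br/><!-- note --><?php x>')
logger.close()
for event in logger.events:
    print(event)
```

Running it prints one tuple per callback, in the order the parser meets the markup, which makes it easy to check which method a given piece of HTML ends up in.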
Here I will show how to extract the URLs from a web page. To get a URL, you look at each <a> tag and read the value of its href attribute. The following code is used:
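# -*- coding: gb2312 -*-
from html.parser import HTMLParser  # Python 3; in Python 2 use: from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        # The function for processing the start tag is redefined here.
        # The parser passes tag and attribute names in lowercase.
        if tag == 'a':
            # Look through the attributes of the <a> tag
            for name, value in attrs:
                if name == 'href':
                    print(value)

if __name__ == '__main__':
    # Sample input for illustration only; any HTML string can be used here
    a = '<html><body><a href="http://example.com/">example</a></body></html>'
    my = MyParser()
    # Input the data to be analyzed, which is HTML.
    my.feed(a)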
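In practice you often want to keep the links rather than print them. The following is a small sketch of one way to do that, assuming Python 3. The class name LinkCollector, the base_url parameter, and the example.com addresses are all made up for illustration. It stores each href in a list and uses urljoin from the standard library to turn relative links into absolute ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical variant of the parser above: stores hrefs instead of printing."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url  # assumed page address, used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for an attribute written without a value, e.g. <a href>
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


collector = LinkCollector(base_url="http://example.com/docs/")
collector.feed('<a href="page.html">one</a> <A HREF="http://example.org/">two</A>')
collector.close()
print(collector.links)
# ['http://example.com/docs/page.html', 'http://example.org/']
```

Because the parser lowercases tag and attribute names before calling handle_starttag, the uppercase <A HREF=...> in the sample is matched as well.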