Parsing HTML with the Python HTMLParser module to extract URLs
HTMLParser is a Python module for parsing HTML. It can pick out the tags and data in an HTML document and is a simple way to process HTML. HTMLParser is event-driven: when it encounters a particular construct, such as a tag, it calls a user-defined method so that the program can process it. These callback methods all start with handle_ and are members of the HTMLParser class. To use the module, derive a new class from HTMLParser and override the handle_ methods you need. These methods include:
handle_startendtag: processes a tag that is both a start tag and an end tag, such as <xx/>.
handle_starttag: processes a start tag, such as <xx>.
handle_endtag: processes an end tag, such as </xx>.
handle_charref: processes numeric character references, which start with &# and represent a character by its code.
handle_entityref: processes named entity references, which start with "&".
handle_data: processes data, that is, the text between <xx> and </xx>.
handle_comment: processes comments, such as <!-- comment -->.
handle_decl: processes declarations, which start with <!, for example <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">.
handle_pi: processes processing instructions, such as <?instruction>.
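To see when each of these callbacks fires, a minimal sketch can record them in order. It is written for Python 3, where the class is imported from html.parser rather than from the HTMLParser module. The names EventLogger and log_events are illustrative and not part of the library. Python 3 converts character references into ordinary data by default, so the sketch passes convert_charrefs=False to make handle_charref and handle_entityref fire.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records each callback the parser fires, in order (illustrative name)."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref active.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default, which would call
        # handle_starttag followed by handle_endtag.
        self.events.append(("startendtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))


def log_events(html):
    """Parse an HTML string and return the list of (callback, value) pairs."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events
```

Feeding it a short fragment such as <p>a&amp;b</p> should produce a start tag event, data, an entity reference, more data, and an end tag event, in that order.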
The rest of this article shows how to extract URLs from a web page. To get a URL, you have to handle the <a> tag and read the value of its href attribute. The following code, written for Python 2, does this:
# -*- coding: gb2312 -*-
import HTMLParser

class MyParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        # The handler for start tags is redefined here.
        # The parser passes tag and attribute names in lowercase.
        if tag == 'a':
            # Look through the attributes of the <a> tag for href.
            for name, value in attrs:
                if name == 'href':
                    print value

if __name__ == '__main__':
    a = '
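In Python 3 the module is html.parser and print is a function, so the same idea can be sketched as follows. This is a minimal sketch, not the original program: the names LinkParser and extract_links, the sample HTML string, and the example.com URL are assumptions made for illustration. It collects the URLs in a list rather than printing them inside the handler, which makes the result easier to reuse.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> start tag (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> matches too.
        if tag == "a":
            for name, value in attrs:
                # value is None when an attribute is written without a value.
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> tags in an HTML string."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the original article.
    sample = '<p><a href="http://example.com/a">A</a> <A HREF="/b">B</A></p>'
    for url in extract_links(sample):
        print(url)
```

Relative links such as /b are returned as written; resolving them against the page address would be a separate step, for example with urllib.parse.urljoin.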