Use HTMLParser to parse HTML
Python's HTMLParser (in the html.parser module) works differently from the HTML parsing libraries found in C++ and other languages: it is used through class inheritance.
By subclassing HTMLParser and overriding several of its methods, we can parse HTML.
The main methods to override are:
handle_starttag # called for each start tag
handle_endtag # called for each end tag
handle_data # called for the text data between tags
The following example shows how to use it (it is taken from the Python documentation):
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
The source HTML is:
<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>
Output:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Every tag and the text inside it has now been parsed.
Summary:
1) Inherit from the HTMLParser class:
class MyParser(HTMLParser):
2) def handle_starttag(self, tag, attrs): # override to handle start tags. tag is the tag name (converted to lowercase), and attrs holds the tag's attributes and their values as a list of (name, value) tuples, not a dict.
# For example, this is the place to extract web addresses (URLs) from a page.
3) def handle_endtag(self, tag): # override to handle end tags
4) def handle_data(self, data): # override to handle the text data between tags
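The web address extraction mentioned in step 2 could be sketched roughly as follows. This is only an illustration, not part of the original example: the class name LinkExtractor, the links attribute, and the sample HTML string are all made up here. It relies on html.parser passing attrs as a list of (name, value) pairs and collects the href value of each a tag.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Hypothetical example class: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []  # illustrative attribute, not part of HTMLParser

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) tuples.
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value
                if name == "href" and value is not None:
                    self.links.append(value)


parser = LinkExtractor()
# Sample input invented for this sketch
parser.feed('<p><a href="https://www.python.org/">Python</a> and '
            '<a href="/docs">docs</a></p>')
print(parser.links)
```

If a dict is more convenient, dict(attrs) converts the list, with the caveat that a repeated attribute name keeps only its last value.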