This article introduces how to use HTMLParser to parse HTML in Python, with examples. It groups the methods of HTMLParser into two categories: those that you call explicitly, and those that are called for you during parsing.

A few days ago I ran into a problem: I needed to pick out part of the content of a web page. I found two libraries for the job, urllib and HTMLParser. urllib fetches the web page and hands it to HTMLParser for parsing. Using HTMLParser for the first time, I ran into some problems while reading the official documentation, so I am writing them down here to share.
Example
The code is as follows:
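from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called automatically by feed() for every start tag
    def handle_starttag(self, tag, attrs):
        print "a start tag:", tag, self.getpos()

parser = MyHTMLParser()
# Reconstructed markup: the tags in the original string were lost
parser.feed('<p class="para">"Hello"</p>')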
In this example, HTMLParser is the base class, and MyHTMLParser overrides its handle_starttag method to print some information. parser is an instance of MyHTMLParser, and calling its feed method starts the parsing. Note that handle_starttag runs without ever being called explicitly.
The way HTMLParser's methods get called confused me for a long time. After reading many blog posts, I finally realized that HTMLParser contains two kinds of methods: those that must be called explicitly, and those that are called automatically.
Methods that do not require explicit calling
The following methods are triggered during parsing, but by default they do nothing. We therefore need to override them to suit our own needs.
1. HTMLParser.handle_starttag(tag, attrs): called when a start tag is encountered during parsing, for example <p class="para">.
The tag parameter is the tag name, here 'p' (tag names are converted to lowercase), and attrs is a list of (name, value) pairs for all attributes of the tag, here [('class', 'para')].
2. HTMLParser.handle_endtag(tag): called when an end tag is encountered. The tag parameter is the tag name.
3. HTMLParser.handle_data(data): called when the content inside a tag is encountered, for example <p>data</p>. The data parameter is the text between the opening and closing tags. Note that for nested markup such as <div><p>...</p></div>, it is not called at div, but only at p.
Of course, there are other methods as well; they are not covered here.
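The three callbacks described above can be seen working together in the following minimal sketch. It is written so that it should run on Python 3 (where the class lives in html.parser) as well as Python 2; the EventCollector class and its events list are hypothetical names chosen for illustration, not part of the library.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class EventCollector(HTMLParser):
    """Hypothetical helper: records every callback the parser triggers."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        # Fires only where there is text between tags
        self.events.append(('data', data))

collector = EventCollector()
collector.feed('<div><p class="para">Hello</p></div>')
collector.close()
for event in collector.events:
    print(event)
```

Only one data event is recorded, for the text inside p; the surrounding div contributes just its start and end events.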
Methods that are called explicitly
1. HTMLParser.feed(data): the parameter is the HTML string to be parsed. Calling it parses the string.
2. HTMLParser.getpos(): returns the current line number and offset as a tuple.
3. HTMLParser.get_starttag_text(): returns the text of the most recently opened start tag.
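As a rough illustration of the explicitly called methods, the sketch below calls getpos() and get_starttag_text() from inside a start-tag callback while feed() drives the parsing. PositionReporter and its seen list are hypothetical names used only for this example, and the import fallback is there so the snippet should run on both Python 3 and Python 2.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class PositionReporter(HTMLParser):
    """Hypothetical helper: notes where each start tag was found."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line number, offset);
        # get_starttag_text() gives the raw text of the start tag
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))

reporter = PositionReporter()
reporter.feed('<div>\n  <p class="para">Hello</p>\n</div>')
reporter.close()
for entry in reporter.seen:
    print(entry)
```

Line numbers start at 1 and offsets at 0, so a tag at the very beginning of the input is reported at (1, 0).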
With all of that written down, a few caveats remain. HTMLParser is only a simple module, and its HTML parsing is not perfect. For example, it cannot accurately distinguish ordinary start tags from "self-closing tags"; see the following code:
The code is as follows:
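from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for an ordinary start tag such as <br>
    def handle_starttag(self, tag, attrs):
        print 'begin tag', tag

    # Called only for the explicit self-closing form such as <br/>
    def handle_startendtag(self, tag, attrs):
        print 'begin end tag', tag

# Reconstructed markup: the tags in the original strings were lost
str1 = '<br>'
str2 = '<br/>'
parser = MyHTMLParser()
parser.feed(str1)  # output "begin tag br"
parser.feed(str2)  # output "begin end tag br"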
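One possible way to work around this is to keep a list of tags that never have content and treat them the same way whichever form they are written in. The sketch below is only an illustration under that assumption: VOID_TAGS is a deliberately small, incomplete set picked for the example, and VoidAwareParser is a hypothetical class name, not something the library provides.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

# Assumption: a small, incomplete set of content-less tags, for illustration
VOID_TAGS = set(['br', 'hr', 'img', 'input', 'meta', 'link'])

class VoidAwareParser(HTMLParser):
    """Hypothetical helper: reports <br> and <br/> in the same way."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        kind = 'void' if tag in VOID_TAGS else 'start'
        self.events.append((kind, tag))

    def handle_endtag(self, tag):
        # Ignore stray end tags for content-less elements
        if tag not in VOID_TAGS:
            self.events.append(('end', tag))

    def handle_startendtag(self, tag, attrs):
        # Route the explicit "<br/>" form through the same two handlers
        self.handle_starttag(tag, attrs)
        self.handle_endtag(tag)

parser = VoidAwareParser()
parser.feed('<p>a<br>b<br/>c</p>')
parser.close()
print(parser.events)
```

Both spellings of br end up as the same 'void' event, so code that consumes the events no longer needs to care which form the page used.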