An example of using HTMLParser to parse HTML in Python
A few days ago I needed to extract some content from a web page, so I turned to two libraries: urllib and HTMLParser. urllib fetches the web page and hands it to HTMLParser for parsing. Using HTMLParser for the first time, I ran into some problems while reading the official documentation, so I am writing them down here to share.
Example
The code is as follows:
from HTMLParser import HTMLParser  # Python 2; in Python 3 the module is html.parser

class MyHTMLParser(HTMLParser):
    # Called automatically by feed() for every start tag
    def handle_starttag(self, tag, attrs):
        print "a start tag:", tag, self.getpos()

parser = MyHTMLParser()
parser.feed('<div><p>"hello"</p></div>')
In this example, HTMLParser is the base class, and its handle_starttag method is overridden to print some information. parser is an instance of MyHTMLParser, and calling its feed method starts the parsing. Note that handle_starttag is executed even though it is never called explicitly.
The way HTMLParser's methods get called confused me for a long time. After reading many blog posts, I finally realized that HTMLParser has two kinds of methods: those that you call explicitly, and those that are called for you and never need an explicit call.
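To make the two kinds of methods concrete, here is a minimal sketch. It assumes Python 3, where the same class lives in html.parser rather than HTMLParser. The names EventRecorder and events are illustrative and not part of the library. The only method the script calls is feed(); the handler runs as a consequence of that call.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class EventRecorder(HTMLParser):
    """Hypothetical helper: records each start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    # Never called by our own code; feed() invokes it for every start tag.
    def handle_starttag(self, tag, attrs):
        self.events.append(tag)


recorder = EventRecorder()
print(recorder.events)  # nothing has been parsed yet, so the list is empty
recorder.feed('<div><p>"hello"</p></div>')  # the only explicit call
print(recorder.events)  # now holds the tags the handler was triggered for
```

Recording events in a list instead of printing them directly makes it easy to see that the handler did nothing until feed() was called.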
Methods that are not called explicitly
The following methods are triggered during parsing, but by default they do nothing. We therefore need to override them according to our own needs.
1. HTMLParser.handle_starttag(tag, attrs): called when a start tag is encountered during parsing, for example <p class='para'>. The parameter tag is the tag name, here 'p'; attrs is the list of the tag's attributes as (name, value) pairs, here [('class', 'para')].
2. HTMLParser.handle_endtag(tag): called when an end tag is encountered. The parameter tag is the tag name.
3. HTMLParser.handle_data(data): called when the content between tags is encountered, for example <style>p {color: blue;}</style>. The parameter data is the content between the opening and closing tags. Note that for markup such as <div><p>...</p></div>, it is called only for the text inside p, not for div, which has no text of its own.
There are other handler methods as well, but they are not covered here.
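The three handlers above can be seen working together in a short sketch. It assumes Python 3 (html.parser); TraceParser and trace are made-up names used only for illustration. Each handler appends what it received to a list, so the order of the calls and the shape of attrs become visible.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TraceParser(HTMLParser):
    """Hypothetical helper: logs every start tag, end tag and text chunk."""

    def __init__(self):
        super().__init__()
        self.trace = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.trace.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.trace.append(("end", tag))

    def handle_data(self, data):
        # Only text that actually appears between tags is reported
        self.trace.append(("data", data))


tracer = TraceParser()
tracer.feed("<div><p class='para'>hello</p></div>")
for event in tracer.trace:
    print(event)
```

Because div contains no text of its own in this input, the only "data" entry in the trace is the one for the text inside p.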
Methods that are called explicitly
1. HTMLParser.feed(data): the parameter is the HTML string to parse. Calling it starts parsing that string.
2. HTMLParser.getpos(): returns the current line number and offset as a tuple (lineno, offset).
3. HTMLParser.get_starttag_text(): returns the text of the most recently opened start tag.
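As a rough illustration of these three calls, the sketch below assumes Python 3 (html.parser) and uses the made-up names PositionParser and seen. getpos() and get_starttag_text() are called from inside a handler, which is where they are usually most useful, because they describe the tag currently being processed.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class PositionParser(HTMLParser):
    """Hypothetical helper: stores position and raw text of each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line number, offset within that line);
        # get_starttag_text() gives the start tag exactly as written.
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))


pos_parser = PositionParser()
pos_parser.feed('<div>\n<p class="para">text</p></div>')
for item in pos_parser.seen:
    print(item)
```

Line numbers start at 1 and offsets start at 0, so the p tag on the second line of the input is reported at line 2, offset 0.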
A few notes to finish. HTMLParser is just a simple module, and its HTML parsing is far from perfect. For example, it cannot reliably tell ordinary tags from "self-closing tags": it distinguishes them purely by how they are written, so <br> is reported as an ordinary start tag and only <br/> is reported as self-closing. See the following code:
The code is as follows:
from HTMLParser import HTMLParser  # Python 2; in Python 3 the module is html.parser

class MyHTMLParser(HTMLParser):
    # Called for an ordinary start tag such as <br>
    def handle_starttag(self, tag, attrs):
        print 'begin tag', tag

    # Called for a tag written in self-closing form such as <br/>
    def handle_startendtag(self, tag, attrs):
        print 'begin end', tag

str1 = '<br>'
str2 = '<br/>'
parser = MyHTMLParser()
parser.feed(str1)  # output "begin tag br"
parser.feed(str2)  # output "begin end br"
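One possible way around this limitation, sketched here under assumptions rather than taken from the library: keep a set of tag names that never have a closing tag and treat a plain start tag with one of those names as self-closing. The sketch assumes Python 3 (html.parser); VOID_TAGS (a deliberately incomplete list) and VoidAwareParser are hypothetical names.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

# Illustrative, deliberately incomplete set of tags that take no closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class VoidAwareParser(HTMLParser):
    """Hypothetical helper: reports <br> and <br/> the same way."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Written as <br>, but it will never be followed by </br>
            self.events.append(("startend", tag))
        else:
            self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Written as <br/>: the parser already reports it as self-closing
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


void_parser = VoidAwareParser()
void_parser.feed("<p>a<br>b<br/>c</p>")
print(void_parser.events)
```

With this approach both spellings of br end up as the same kind of event, while p is still reported as a normal start and end pair.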