HTMLParser is an easy-to-use Python class for parsing HTML documents. It lives in the HTMLParser module in Python 2 and in the html.parser module in Python 3.
This article briefly introduces how to use HTMLParser.
To use it, define a class that inherits from HTMLParser and override the following methods to implement the behavior you need:
- handle_starttag(tag, attrs)
- handle_startendtag(tag, attrs)
- handle_endtag(tag)
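As a minimal sketch of when each of these three handlers is called, the subclass below records every callback in order. The class name EventLogger and its events list are illustrative names, not part of the library, and the import assumes Python 3.

```python
from html.parser import HTMLParser  # Python 3 location of the class


class EventLogger(HTMLParser):
    """Hypothetical helper: records which handler fires for each tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for an opening tag such as <p>
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for a self-closing tag such as <br/>
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        # Called for a closing tag such as </p>
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed("<p>Hello<br/>world</p>")
logger.close()
print(logger.events)
# [('start', 'p'), ('startend', 'br'), ('end', 'p')]
```

If handle_startendtag is not overridden, the default implementation calls handle_starttag and then handle_endtag for a self-closing tag.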
tag is the name of the HTML tag.
attrs is a list of (attribute, value) tuples.
For example, for the tag <input type="hidden" name="nxx" id="idxx" value="vxx"/>
the attrs list is [('type', 'hidden'), ('name', 'nxx'), ('id', 'idxx'), ('value', 'vxx')]
HTMLParser automatically converts tag names and attribute names to lowercase; attribute values are left unchanged.
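The attrs format and the lowercasing can be checked with a small sketch like the one below. AttrCollector and its seen list are hypothetical names used only for illustration; the input deliberately uses upper-case names and a mixed-case value, and the import assumes Python 3.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: stores (tag, attrs) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


collector = AttrCollector()
# Upper-case tag and attribute names, mixed-case value
collector.feed('<INPUT TYPE="hidden" NAME="nxx" ID="idxx" VALUE="Vxx"/>')
collector.close()

tag, attrs = collector.seen[0]
print(tag)          # input
print(attrs)        # [('type', 'hidden'), ('name', 'nxx'), ('id', 'idxx'), ('value', 'Vxx')]
print(dict(attrs))  # attrs converts easily to a dict for lookups by name
```

The tag name and the attribute names come back in lowercase, while the value 'Vxx' keeps its original case.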
The following example extracts all links in HTML:
Code:
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []  # collected href values

    def handle_starttag(self, tag, attrs):
        # print("encountered the beginning of a %s tag" % tag)
        if tag == "a":
            # attrs is a list of (name, value) tuples; an empty list is simply skipped
            for (variable, value) in attrs:
                if variable == "href":
                    self.links.append(value)

if __name__ == "__main__":
    html_code = """
    <a href="www.google.com">Google.com</a>
    <a href="www.pythonclub.org">pythonclub</a>
    <a href="www.sina.com.cn">Sina</a>
    """
    hp = MyHTMLParser()
    hp.feed(html_code)
    hp.close()
    print(hp.links)
Output:
['www.google.com', 'www.pythonclub.org', 'www.sina.com.cn']
If you also want to extract image links, which are usually written as self-closing tags such as <img src="..."/>,
override the handle_startendtag(tag, attrs) method.
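One possible way to do this is sketched below. The class name ImageLinkParser, the helper _collect, and the file names in the sample HTML are all made up for illustration, and the import assumes Python 3. Because an image can also be written without the closing slash, the sketch handles both forms.

```python
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Hypothetical example: collects the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def _collect(self, tag, attrs):
        # Shared logic: keep the src attribute of img tags
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.images.append(value)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form: <img src="..."/>
        self._collect(tag, attrs)

    def handle_starttag(self, tag, attrs):
        # Form without the closing slash: <img src="...">
        self._collect(tag, attrs)


parser = ImageLinkParser()
parser.feed('<img src="logo.png"/><p><img src="photo.jpg"></p>')
parser.close()
print(parser.images)
# ['logo.png', 'photo.jpg']
```

Once handle_startendtag is overridden, a self-closing tag no longer reaches handle_starttag, so each image is collected exactly once.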