In Python, three libraries can parse HTML text: HTMLParser, sgmllib, and htmllib. Their implementations differ, but they offer similar functionality. The parsing classes these libraries provide are base classes and do no useful work on their own. When they find a component (such as a tag, a comment, or a declaration), they call the corresponding handler function, which you must override because the base class does not process it.
For example:
"""<p>the <a href="http://ietf.org">IETF admonishes:
<i>be strict in what <b>send</b>.</i></a></p>
<form>
<input type=submit > <input type=text name=start size=4></form>
</body>"""
When this data is processed, HTMLParser invokes the handle_starttag function each time it encounters a start tag.
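To make the callback model concrete, here is a minimal sketch. It assumes Python 3, where this parser is available as the html.parser module; the class name EventLogger and the shortened input string are illustrative and not taken from the original listing.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records every callback it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, with its attributes as (name, value) pairs.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for each closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags; whitespace-only runs are skipped here.
        if data.strip():
            self.events.append(("data", data.strip()))

    def handle_comment(self, data):
        # Called for <!-- ... --> comments.
        self.events.append(("comment", data.strip()))


logger = EventLogger()
logger.feed("<p>the <b>send</b><!-- note --></p>")
logger.close()
for event in logger.events:
    print(event)
```

Any handler that is not overridden simply does nothing, which is why a bare base-class instance produces no output.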
The following sections introduce these libraries in more detail.
1, HTMLParser
#------------------htmlparser_stack.py------------------#
# -*- coding: GBK -*-
import HTMLParser, sys, os, string
html = ""
The output of this program:
/html/body/p >> the
/html/body/p/a >> IETF admonishes:
/html/body/p/a/i >> be strict in what
/html/body/p/a/i/b >> Send
/html/body/p/a/i >>.
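One way to produce output of this shape is to keep a stack of open tags and print it alongside each text run. The sketch below is not the original script. It assumes Python 3's html.parser module, and the names PathPrinter and sample, as well as the shortened input, are hypothetical.

```python
from html.parser import HTMLParser


class PathPrinter(HTMLParser):
    """Hypothetical subclass: records each text run together with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.lines = []   # collected "path >> text" lines

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed input: only pop when the tag matches the top of the stack.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("/" + "/".join(self.stack) + " >> " + text)


sample = ('<html><body><p>the <a href="http://ietf.org">IETF admonishes: '
          '<i>be strict in what <b>send</b>.</i></a></p></body></html>')
parser = PathPrinter()
parser.feed(sample)
parser.close()
for line in parser.lines:
    print(line)
```

The sample deliberately leaves out the form, because an input tag has no end tag and would stay on this naive stack forever.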
Some pages do not have strictly matched start and end tags, in which case some tags have to be ignored. You can write your own stack to handle these tags.
#*---------------TagStack Class Example-----------------#
import HTMLParser

class TagStack:
    def __init__(self, lst=None):
        # Use a fresh list per instance instead of a shared mutable default
        self.lst = lst if lst is not None else []

    def __getitem__(self, pos):
        return self.lst[pos]

    def append(self, tag):
        # Remove every paragraph-level tag if it is one
        if tag.lower() in ('p', 'blockquote'):
            self.lst = [t for t in self.lst if t not in ('p', 'blockquote')]
        self.lst.append(tag)

    def pop(self, tag):
        # "Pop" by tag from nearest pos, not only last item
        self.lst.reverse()
        try:
            pos = self.lst.index(tag)
        except ValueError:
            self.lst.reverse()  # restore the order before reporting the error
            raise HTMLParser.HTMLParseError("Tag not on stack")
        del self.lst[pos]
        self.lst.reverse()

tagstack = TagStack()
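The behaviour of such a stack is easiest to see on a short sequence of tags. The following is a sketch of a Python 3 variant, written under a few assumptions: it raises LookupError because HTMLParseError is no longer available in current Python 3 releases, and the method name push and the attribute name tags are illustrative choices.

```python
class TagStack:
    """Hypothetical Python 3 variant of a tolerant tag stack."""

    PARAGRAPH_TAGS = ("p", "blockquote")

    def __init__(self):
        self.tags = []

    def push(self, tag):
        tag = tag.lower()
        # Opening a paragraph-level tag implicitly closes any open one.
        if tag in self.PARAGRAPH_TAGS:
            self.tags = [t for t in self.tags if t not in self.PARAGRAPH_TAGS]
        self.tags.append(tag)

    def pop(self, tag):
        # Remove the nearest (most recently opened) occurrence of the tag,
        # even if it is not the last item on the stack.
        for pos in range(len(self.tags) - 1, -1, -1):
            if self.tags[pos] == tag:
                del self.tags[pos]
                return
        raise LookupError("Tag not on stack: " + tag)


stack = TagStack()
for name in ("html", "body", "p", "i"):
    stack.push(name)
stack.push("p")   # a second <p> arrives with no </p> in between
stack.pop("i")    # </i> arrives although <p> was opened after <i>
print(stack.tags)
```

Searching from the end of the list avoids reversing it twice, so the stack is left untouched when the tag is missing.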
HTMLParser has a bug: it cannot handle attribute values that contain Chinese characters. For example, suppose a page contains this line:
<input type=submit value= Jump to >
Parsing then fails when it reaches this line.
The cause is the regular expression used to find attributes:
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
attrfind does not match Chinese characters.
You can change this pattern to fix the error. sgmllib does not have this problem.
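The effect can be checked directly with the re module. In this sketch, attrfind is the pattern discussed above, while attrfind_relaxed is a hypothetical replacement that accepts any run of characters other than whitespace and '>' as an unquoted value. It is one possible fix, not the one any library ships. The two-character Chinese value is chosen only for illustration.

```python
import re

# The attribute pattern discussed in the text.
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical relaxed pattern: an unquoted value may contain any character
# except whitespace and '>'.
attrfind_relaxed = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s>]*))?')

# An unquoted attribute whose value is two Chinese characters (illustrative).
attr_text = ' value=\u8df3\u8f6c'

strict = attrfind.match(attr_text)
relaxed = attrfind_relaxed.match(attr_text)

# The strict pattern stops right after '=', leaving the value empty.
print(repr(strict.group(1)), repr(strict.group(3)))
# The relaxed pattern captures the whole value.
print(repr(relaxed.group(1)), repr(relaxed.group(3)))
```

Because the strict pattern consumes only the name and the equals sign, the parser is left with unread characters inside the tag, which is where the error comes from.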
2, sgmllib
HTML is a subset of SGML, so an SGML parser can handle a wider range of input than HTML alone. The following snippet shows how sgmllib is used.
#------------------htmlparser_stack.py------------------#
# -*- coding: GBK -*-
import sgmllib, sys, os, string
html = ""
Output:
Start tag:Start tag:<title>
/lala >> Advice
End Tag:</title>
End Tag:Start tag:<body>
Start tag:<p>
/lala >> the
Start tag:<a>
/lala >> IETF admonishes:
Start tag:<i>
/lala >> be strict in what
Start tag:<b>
/lala >> Send
End Tag:</b>
/lala >>.
End Tag:</i>
End Tag:</a>
End Tag:</p>
Start tag:<form>
Start tag:<input>
/lala >>υ
Start tag:<input>
End Tag:</form>
End Tag:</body>
End Tag:</lala>
As with HTMLParser, to parse HTML with sgmllib you subclass sgmllib.SGMLParser. The handler functions in this class are empty, and the user needs to override them. What the class itself provides is the logic that calls the right function in each situation.
For example, when the parser encounters a start tag, it calls the start function defined for that tag.
SGML tags are customizable: if you define a start_lala function, it will process the <lala> tag.
One point needs explaining. If both a start_tagname function and a handle_starttag function are defined, only handle_starttag is run, so it does not matter if start_tagname is an empty function. If handle_starttag is not defined, start_tagname is run when the <tagname> tag is encountered. If no start function is defined for tagname, the tag is treated as unknown and the unknown_starttag function is called.
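The lookup order can be illustrated without sgmllib itself, which is not available in Python 3. The class MiniDispatcher below is a hypothetical stand-in that only mimics the order described above. Its method names are modelled on sgmllib's, but it parses nothing and leaves out details such as the separate do_tagname handlers.

```python
class MiniDispatcher:
    """Hypothetical stand-in; it only mimics the lookup order described above."""

    def __init__(self):
        self.log = []

    def finish_starttag(self, tag, attrs):
        # Look for a tag-specific handler first.
        method = getattr(self, "start_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attrs)
        else:
            self.handle_starttag(tag, method, attrs)

    def handle_starttag(self, tag, method, attrs):
        # Default behaviour: run the tag-specific handler.
        method(attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append("unknown_starttag:" + tag)


class LalaParser(MiniDispatcher):
    def start_lala(self, attrs):
        self.log.append("start_lala")


class OverridingParser(LalaParser):
    def handle_starttag(self, tag, method, attrs):
        # Once this hook is overridden, start_lala runs only if we call method().
        self.log.append("handle_starttag:" + tag)


plain = LalaParser()
plain.finish_starttag("lala", [])
plain.finish_starttag("foo", [])

overriding = OverridingParser()
overriding.finish_starttag("lala", [])
overriding.finish_starttag("foo", [])

print(plain.log)
print(overriding.log)
```

In this model the overridden hook is reached only for tags that have a start function, so an unknown tag still ends up in unknown_starttag in both cases.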