Understanding Basic HTML Parsing
Before parsing with the HTMLParser module, you generally need to define a subclass of HTMLParser.HTMLParser and add methods that handle the different tags. Example:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import sys
    from HTMLParser import HTMLParser

    class TitleParser(HTMLParser):
        def __init__(self):
            self.title = ''
            self.readingtitle = 0
            HTMLParser.__init__(self)  # initialize and reset the instance

        def handle_starttag(self, tag, attrs):
            if tag == 'title':
                self.readingtitle = 1

        def handle_data(self, data):
            # Only collect text while we are inside <title>...</title>
            if self.readingtitle:
                self.title += data

        def handle_endtag(self, tag):
            if tag == 'title':
                self.readingtitle = 0

        def gettitle(self):
            return self.title

    fd = open(sys.argv[1])
    tp = TitleParser()
    tp.feed(fd.read())
    print 'Title is:', tp.gettitle()
    tp.close()
HTMLParser's feed() method calls handle_starttag(), handle_data(), and handle_endtag() as appropriate. As I understand it, HTMLParser walks through the whole HTML document: when it encounters <...>, it passes the tag to handle_starttag(); when it encounters </...>, it passes the tag to handle_endtag(); text between the two goes to handle_data().
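To make the dispatch order visible, here is a minimal sketch that records every handler call. It assumes Python 3, where the HTMLParser module is available as html.parser; the EventLogger class and its events list are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class EventLogger(HTMLParser):
    """Hypothetical helper: records each handler call in the order it happens."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


logger = EventLogger()
logger.feed('<html><title>Hi</title></html>')
logger.close()
for event in logger.events:
    print(event)
```

The start and end events bracket the data event for the title text, which is exactly what the readingtitle flag in the example above relies on.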
In addition, real HTML contains entities. Entities stand for ordinary characters, for example &amp; for "&". The handle_entityref() method is called to process an entity. Besides entities, there are character references, which look like &#174; and are used to embed characters that cannot be entered directly. The handle_charref() method is called to process a character reference.
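The two handlers can be sketched as follows. This assumes Python 3, where the parser decodes references itself unless convert_charrefs=False is passed, and where the entity table lives in html.entities; the EntityDecoder class is a hypothetical name, and unknown entities are simply passed through unchanged.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityDecoder(HTMLParser):
    """Hypothetical sketch: turns entities and character references into text."""

    def __init__(self):
        # convert_charrefs=False keeps the two handlers below in use;
        # otherwise Python 3.5+ decodes references before handle_data().
        HTMLParser.__init__(self, convert_charrefs=False)
        self.text = ''

    def handle_data(self, data):
        self.text += data

    def handle_entityref(self, name):
        # Called with 'amp' for '&amp;'
        if name in name2codepoint:
            self.handle_data(chr(name2codepoint[name]))
        else:
            # Unknown entity: keep it as written
            self.handle_data('&' + name + ';')

    def handle_charref(self, name):
        # Called with '174' for '&#174;' or 'xAE' for '&#xAE;'
        try:
            if name.lower().startswith('x'):
                codepoint = int(name[1:], 16)
            else:
                codepoint = int(name)
            self.handle_data(chr(codepoint))
        except (ValueError, OverflowError):
            return  # ignore malformed or out-of-range references


decoder = EntityDecoder()
decoder.feed('<p>AT&amp;T &#174; &bogus;</p>')
decoder.close()
print(decoder.text)
```

Routing the decoded characters back through handle_data() means the rest of the parser does not need to know whether a character arrived as plain text or as a reference.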
An annoying problem with HTML code is unbalanced tags, because in HTML some tags are not required to be closed. XHTML, by contrast, requires every opening tag to have a matching closing tag. The mxTidy and uTidylib libraries can be used to automatically fix HTML that is not well-formed. Example of coping with unbalanced tags in the parser itself:
    #!/usr/bin/env python
    from htmlentitydefs import entitydefs
    from HTMLParser import HTMLParser
    import sys, re

    class TitleParser(HTMLParser):
        def __init__(self):
            self.taglevels = []
            self.handledtags = ['title', 'ul', 'li']
            self.processing = None
            self.title = ''
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            if len(self.taglevels) and self.taglevels[-1] == tag:
                # Processing a previous version of this tag. Close it out
                # and then start anew on this one.
                self.handle_endtag(tag)

            # Note that we're now processing this tag
            self.taglevels.append(tag)
            if tag in self.handledtags:
                self.data = ''
                self.processing = tag
                if tag == 'ul':
                    print 'List started.'

        def handle_data(self, data):
            if self.processing:
                self.data += data

        def handle_endtag(self, tag):
            if tag not in self.taglevels:
                return
            # Pop back to the nearest matching start tag
            while len(self.taglevels):
                starttag = self.taglevels.pop()
                if starttag in self.handledtags:
                    self.finishprocessing(starttag)
                if starttag == tag:
                    break

        def cleanse(self):
            # Collapse runs of whitespace into a single space
            self.data = re.sub(r'\s+', ' ', self.data)

        def finishprocessing(self, tag):
            self.cleanse()
            if tag == 'title' and tag == self.processing:
                self.title = self.data  # remembered so gettitle() works
                print 'Document title:', self.data
            elif tag == 'ul':
                print 'List ended.'
            elif tag == 'li' and tag == self.processing:
                print 'List item:', self.data
            self.processing = None

        def handle_entityref(self, name):
            if name in entitydefs:
                self.handle_data(entitydefs[name])
            else:
                self.handle_data('&' + name + ';')

        def handle_charref(self, name):
            try:
                charnum = int(name)
            except ValueError:
                return
            if charnum < 1 or charnum > 255:
                return
            self.handle_data(chr(charnum))

        def gettitle(self):
            return self.title

    fd = open(sys.argv[1])
    tp = TitleParser()
    tp.feed(fd.read())
In handle_starttag(), every start tag is recorded in self.taglevels. If the tag on top of that stack is the same as the new start tag, the earlier one is treated as closed first. If the tag is one of the three that the program handles, self.processing is also set to that tag, which tells the parser to start recording data. handle_endtag() first checks whether the stack contains a start tag that matches the end tag. If not, the end tag is skipped. If so, tags are popped until the nearest matching start tag is reached. The self.finishprocessing() function collapses runs of whitespace in the data string and prints the appropriate message. The tag == self.processing check ensures that the same data is not used twice.
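The stack behaviour described above can be sketched in isolation. This is a reduced illustration rather than the article's program: it assumes Python 3 (html.parser), handles only <li>, and collects the items in a list instead of printing them; ListCollector, items, and data are hypothetical names.

```python
from html.parser import HTMLParser


class ListCollector(HTMLParser):
    """Hypothetical sketch: collects <li> text even when </li> is missing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.taglevels = []  # stack of currently open tags
        self.items = []      # finished list items
        self.data = None     # text of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        # A repeated tag on top of the stack implies the previous one ended.
        if self.taglevels and self.taglevels[-1] == tag:
            self.handle_endtag(tag)
        self.taglevels.append(tag)
        if tag == 'li':
            self.data = ''

    def handle_data(self, data):
        if self.data is not None:
            self.data += data

    def handle_endtag(self, tag):
        if tag not in self.taglevels:
            return  # stray end tag with no matching start tag
        # Pop back to the nearest matching start tag, closing what is above it.
        while self.taglevels:
            starttag = self.taglevels.pop()
            if starttag == 'li' and self.data is not None:
                # Collapse whitespace, then store the finished item once.
                self.items.append(' '.join(self.data.split()))
                self.data = None
            if starttag == tag:
                break


collector = ListCollector()
collector.feed('<ul><li>First<li>Second\n item</ul></b>')
collector.close()
print(collector.items)
```

In this input the first <li> is closed by the second <li>, the second is closed by </ul>, and the stray </b> is ignored because no matching start tag is on the stack.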
This article is from the "Lotus's Thoughts" blog; please be sure to keep this source: http://liandesinian.blog.51cto.com/7737219/1556991
Chapter 7: Parsing HTML and XHTML