Chapter 7: Parsing HTML and XHTML

Source: Internet
Author: User

Understanding Basic HTML Parsing

Before parsing with the HTMLParser module, you generally define a subclass of HTMLParser.HTMLParser and add methods to handle the different tags. Example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 code: the HTMLParser module was renamed html.parser in Python 3.
import sys
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        self.title = ''
        self.readingtitle = 0
        HTMLParser.__init__(self)  # initialize and reset the instance

    def handle_starttag(self, tag, attrs):
        # Start collecting text once <title> opens
        if tag == 'title':
            self.readingtitle = 1

    def handle_data(self, data):
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        # Stop collecting text once </title> is seen
        if tag == 'title':
            self.readingtitle = 0

    def gettitle(self):
        return self.title

fd = open(sys.argv[1])
tp = TitleParser()
tp.feed(fd.read())
print 'Title is: ', tp.gettitle()
tp.close()

The HTMLParser feed() method calls the handle_starttag(), handle_data(), and handle_endtag() methods as appropriate. As I understand it, HTMLParser works through the HTML document from start to finish: when it encounters an opening tag <...>, it passes the tag to handle_starttag(); when it encounters a closing tag </...>, it calls handle_endtag(); and the text between the two is passed to handle_data().
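The order of these callbacks is easiest to see by recording them. The following is a minimal sketch, assuming Python 3, where the class lives in html.parser rather than in the HTMLParser module used above. The names EventLogger and trace are made up for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records each callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


def trace(html):
    """Feed a complete document to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.events


if __name__ == '__main__':
    for event in trace('<p>Hello <b>world</b></p>'):
        print(event)
```

For the input above, the recorded sequence is start p, data "Hello ", start b, data "world", end b, end p, which matches the description of feed().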

In addition, real HTML contains entities. An entity stands for an ordinary character, for example &amp; for "&". The handle_entityref() method is called to process an entity. Besides entities, there are character references, which look similar (for example &#174;) and are used to embed characters that cannot be typed directly. Character references are processed by the handle_charref() method.
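The two handlers can be sketched as follows, again assuming Python 3. In Python 3 the parser decodes references on its own by default, so this sketch passes convert_charrefs=False to make it call handle_entityref() and handle_charref(). RefDecoder and decode_refs are hypothetical names, and html.entities.name2codepoint stands in for the Python 2 htmlentitydefs table.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefDecoder(HTMLParser):
    """Hypothetical helper that turns entities and character references into text."""

    def __init__(self):
        # Assumption: Python 3. With convert_charrefs=False the parser does not
        # decode references itself, so the two handlers below are called.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # name is the text between '&' and ';', e.g. 'amp'
        if name in name2codepoint:
            self.parts.append(chr(name2codepoint[name]))
        else:
            # Unknown entity: put it back unchanged
            self.parts.append('&' + name + ';')

    def handle_charref(self, name):
        # name is the text between '&#' and ';', e.g. '174' or 'xAE'
        try:
            if name[:1] in ('x', 'X'):
                codepoint = int(name[1:], 16)
            else:
                codepoint = int(name)
            self.parts.append(chr(codepoint))
        except (ValueError, OverflowError):
            self.parts.append('&#' + name + ';')


def decode_refs(html):
    """Return the text of html with entities and character references decoded."""
    parser = RefDecoder()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


if __name__ == '__main__':
    print(decode_refs('AT&amp;T &#174; &bogus;'))
```

Unknown entity names are written back unchanged, which is the same choice the second example in this chapter makes.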

An annoying problem with HTML code is unbalanced tags, because in HTML some tags are not required to be closed. XHTML, by contrast, requires every opening tag to have a matching closing tag. The mxTidy and uTidylib libraries can be used to automatically repair HTML that is not well-formed. The example below handles unbalanced tags inside the parser itself:

#!/usr/bin/env python
# Python 2 code: uses the HTMLParser and htmlentitydefs modules.
from htmlentitydefs import entitydefs
from HTMLParser import HTMLParser
import sys, re

class TitleParser(HTMLParser):
    def __init__(self):
        self.taglevels = []
        self.handledtags = ['title', 'ul', 'li']
        self.processing = None
        self.title = ''  # set when the title is finished, so gettitle() works
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if len(self.taglevels) and self.taglevels[-1] == tag:
            # Processing a previous version of this tag. Close it out
            # and then start anew on this one.
            self.handle_endtag(tag)

        # Note that we're now processing this tag
        self.taglevels.append(tag)

        if tag in self.handledtags:
            self.data = ''
            self.processing = tag
            if tag == 'ul':
                print 'List started.'

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        # Ignore end tags that have no matching start tag
        if tag not in self.taglevels:
            return
        # Close every open tag up to and including the nearest match
        while len(self.taglevels):
            starttag = self.taglevels.pop()
            if starttag in self.handledtags:
                self.finishprocessing(starttag)
            if starttag == tag:
                break

    def cleanse(self):
        # Collapse each run of whitespace into a single space
        self.data = re.sub(r'\s+', ' ', self.data)

    def finishprocessing(self, tag):
        self.cleanse()
        if tag == 'title' and tag == self.processing:
            self.title = self.data
            print 'Document title: ', self.data
        elif tag == 'ul':
            print 'List ended.'
        elif tag == 'li' and tag == self.processing:
            print 'List item: ', self.data
        self.processing = None

    def handle_entityref(self, name):
        if name in entitydefs:
            self.handle_data(entitydefs[name])
        else:
            self.handle_data('&' + name + ';')

    def handle_charref(self, name):
        try:
            charnum = int(name)
        except ValueError:
            return
        if charnum < 1 or charnum > 255:
            return
        self.handle_data(chr(charnum))

    def gettitle(self):
        return self.title

fd = open(sys.argv[1])
tp = TitleParser()
tp.feed(fd.read())

In handle_starttag(), every start tag that appears is recorded in self.taglevels. If the tag is one of the three tags the program handles, self.processing is also set to that tag, which tells the parser to start recording data. handle_endtag() first checks whether self.taglevels contains a start tag that corresponds to the end tag. If it does not, the end tag is skipped. If it does, open tags are popped until the nearest matching one is reached. The self.finishprocessing() method collapses runs of whitespace in the data string into single spaces and prints the appropriate message. The tag == self.processing check ensures that the same data is not used twice.
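The tag stack logic described above can be tried in isolation. The following is a reduced sketch, assuming Python 3 (html.parser) and handling only li tags. ListCollector and list_items are hypothetical names, and items are collected in a list instead of being printed.

```python
from html.parser import HTMLParser


class ListCollector(HTMLParser):
    """Hypothetical helper: collects <li> text even when </li> is missing."""

    def __init__(self):
        super().__init__()
        self.taglevels = []     # stack of currently open tags
        self.processing = None  # tag whose text is being recorded
        self.data = ''
        self.items = []

    def handle_starttag(self, tag, attrs):
        # The same tag opening twice in a row means the first was never closed
        if self.taglevels and self.taglevels[-1] == tag:
            self.handle_endtag(tag)
        self.taglevels.append(tag)
        if tag == 'li':
            self.data = ''
            self.processing = tag

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        # An end tag with no matching start tag is skipped
        if tag not in self.taglevels:
            return
        # Pop open tags until the nearest matching start tag is closed
        while self.taglevels:
            starttag = self.taglevels.pop()
            if starttag == 'li' and starttag == self.processing:
                # Collapse whitespace, then store the item exactly once
                self.items.append(' '.join(self.data.split()))
                self.processing = None
            if starttag == tag:
                break


def list_items(html):
    """Return the text of every list item found in html."""
    parser = ListCollector()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == '__main__':
    print(list_items('<ul><li>First<li>Second   item</ul>'))
```

In the sample input neither li is closed. The second li closes the first, and the closing ul pops the remaining li from the stack before closing the list itself.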

This article is from the "Lotus's Thoughts" blog; please be sure to retain this source: http://liandesinian.blog.51cto.com/7737219/1556991
