Chapter 7: Parsing HTML and XHTML

Last Update:2014-09-22 Source: Internet

Author: User


Understanding Basic HTML Parsing

Before parsing with the HTMLParser module, you normally define a subclass of HTMLParser.HTMLParser and add methods that handle the different tags. Example:

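#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 code; in Python 3 the HTMLParser module is named html.parser.
import sys
from HTMLParser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        self.title = ''
        self.readingtitle = 0
        HTMLParser.__init__(self)  # initialize and reset the instance

    def handle_starttag(self, tag, attrs):
        # Start collecting text when <title> opens.
        if tag == 'title':
            self.readingtitle = 1

    def handle_data(self, data):
        # Text only counts while we are inside <title>.
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        # Stop collecting text when </title> is reached.
        if tag == 'title':
            self.readingtitle = 0

    def gettitle(self):
        return self.title


fd = open(sys.argv[1])
tp = TitleParser()
tp.feed(fd.read())
tp.close()  # process any data still buffered before reading the title
fd.close()
print 'Title is:', tp.gettitle()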

The feed() method of HTMLParser calls handle_starttag(), handle_data(), and handle_endtag() as appropriate. As I understand it, HTMLParser walks through every level of the HTML document: when it encounters a start tag <...>, it passes the tag name and attributes to handle_starttag(); when it encounters an end tag </...>, it calls handle_endtag(); and any text between the two goes to handle_data().
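To make the callback order concrete, here is a minimal sketch that records each call instead of acting on it. It assumes Python 3, where the module is named html.parser; the TraceParser class and its events list are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TraceParser(HTMLParser):
    """Hypothetical helper that records every callback feed() triggers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = TraceParser()
parser.feed("<html><title>Hello</title></html>")
parser.close()
for event in parser.events:
    print(event)
```

Each start tag, piece of text, and end tag shows up as one entry, in document order, which is the behavior the title parser above relies on.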

In addition, real HTML contains entities. An entity stands for an ordinary character, for example &amp; for "&". The handle_entityref() method is called to process each entity. Besides entities, there are character references, which look like &#174; and are used to embed characters that are hard to type or print directly. Each character reference is passed to the handle_charref() method.
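The two callbacks can be sketched as follows. This assumes Python 3, where the entity table lives in html.entities and the parser must be created with convert_charrefs=False, otherwise references are decoded before handle_data() is called and these two methods are never invoked. EntityParser and its text attribute are hypothetical names.

```python
from html.entities import name2codepoint  # Python 3 home of the entity table
from html.parser import HTMLParser


class EntityParser(HTMLParser):
    """Hypothetical parser that decodes entities and character references."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref/handle_charref active.
        super().__init__(convert_charrefs=False)
        self.text = ""

    def handle_data(self, data):
        self.text += data

    def handle_entityref(self, name):
        # name is the part between "&" and ";", e.g. "amp".
        if name in name2codepoint:
            self.text += chr(name2codepoint[name])
        else:
            # Unknown entity: keep it verbatim.
            self.text += "&" + name + ";"

    def handle_charref(self, name):
        # name is the part after "&#": decimal ("174") or hex ("xAE").
        try:
            if name[:1] in ("x", "X"):
                codepoint = int(name[1:], 16)
            else:
                codepoint = int(name)
            self.text += chr(codepoint)
        except (ValueError, OverflowError):
            self.text += "&#" + name + ";"


parser = EntityParser()
parser.feed("<p>Fish &amp; Chips&#174; &bogus;</p>")
parser.close()
print(parser.text)
```

Known entities and numeric references are turned into characters, while an unrecognized entity such as &bogus; is passed through unchanged rather than dropped.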

An annoying problem with HTML code is unbalanced tags, because in HTML some tags do not have to be closed. XHTML, by contrast, requires every opening tag to have a matching closing tag. The mxTidy and uTidylib libraries can be used to automatically repair HTML that is not well-formed. The example below instead tolerates unbalanced tags inside the parser itself:


#!/usr/bin/env python
# Python 2 code; in Python 3 the modules are html.entities and html.parser.
from htmlentitydefs import entitydefs
from HTMLParser import HTMLParser
import sys, re


class TitleParser(HTMLParser):
    def __init__(self):
        self.taglevels = []
        self.handledtags = ['title', 'ul', 'li']
        self.processing = None
        self.title = ''  # filled in when </title> is processed
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if len(self.taglevels) and self.taglevels[-1] == tag:
            # Processing a previous version of this tag. Close it out
            # and then start anew on this one.
            self.handle_endtag(tag)

        # Note that we're now processing this tag.
        self.taglevels.append(tag)

        if tag in self.handledtags:
            # Only save the data for tags we handle.
            self.data = ''
            self.processing = tag
            if tag == 'ul':
                print 'List started.'

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag not in self.taglevels:
            # There was no start tag for this end tag. Ignore it.
            return
        while len(self.taglevels):
            starttag = self.taglevels.pop()
            if starttag in self.handledtags:
                self.finishprocessing(starttag)
            if starttag == tag:
                break

    def cleanse(self):
        # Collapse each run of whitespace into a single space.
        self.data = re.sub(r'\s+', ' ', self.data)

    def finishprocessing(self, tag):
        self.cleanse()
        if tag == 'title' and tag == self.processing:
            self.title = self.data  # so that gettitle() has a value to return
            print 'Document title:', self.data
        elif tag == 'ul':
            print 'List ended.'
        elif tag == 'li' and tag == self.processing:
            print 'List item:', self.data
        self.processing = None

    def handle_entityref(self, name):
        if name in entitydefs:
            self.handle_data(entitydefs[name])
        else:
            self.handle_data('&' + name + ';')

    def handle_charref(self, name):
        try:
            charnum = int(name)
        except ValueError:
            return
        if charnum < 1 or charnum > 255:
            return
        self.handle_data(chr(charnum))

    def gettitle(self):
        return self.title


fd = open(sys.argv[1])
tp = TitleParser()
tp.feed(fd.read())

In handle_starttag(), every start tag is recorded in self.taglevels. If the new tag is the same as the innermost open tag (for example, a second <li> with no </li> in between), the earlier one is closed first. If the tag is one of the three that the program handles (title, ul, li), self.processing is also set to that tag, which tells the parser to start collecting data. handle_endtag() first checks whether self.taglevels contains a start tag that matches the end tag. If not, the end tag is ignored. If so, tags are popped off the stack until the nearest matching start tag is reached, and finishprocessing() is called for each handled tag popped along the way. finishprocessing() calls cleanse(), which collapses each run of whitespace in the data string into a single space, and then prints the appropriate message. The tag == self.processing check ensures that the same data is not reported twice.
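The effect of the tag stack is easiest to see on input with missing and stray end tags. The following is a Python 3 sketch of the same idea, not the original listing: it assumes the html.parser module, collects messages in a list instead of printing them as it goes, and uses hypothetical names (ListParser, tag_levels, messages, finish).

```python
import re
from html.parser import HTMLParser


class ListParser(HTMLParser):
    """Python 3 sketch of the tag-stack idea; records messages in a list."""

    handled_tags = ("title", "ul", "li")

    def __init__(self):
        super().__init__()
        self.tag_levels = []    # open tags, innermost last
        self.processing = None  # handled tag whose text is being collected
        self.data = ""
        self.messages = []

    def handle_starttag(self, tag, attrs):
        # A repeated tag (e.g. <li> after an unclosed <li>) implicitly
        # closes the previous one.
        if self.tag_levels and self.tag_levels[-1] == tag:
            self.handle_endtag(tag)
        self.tag_levels.append(tag)
        if tag in self.handled_tags:
            self.data = ""
            self.processing = tag
            if tag == "ul":
                self.messages.append("List started.")

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag not in self.tag_levels:
            return  # stray end tag with no matching start tag
        # Unwind the stack down to and including the matching start tag.
        while self.tag_levels:
            start_tag = self.tag_levels.pop()
            if start_tag in self.handled_tags:
                self.finish(start_tag)
            if start_tag == tag:
                break

    def finish(self, tag):
        text = re.sub(r"\s+", " ", self.data)  # collapse whitespace runs
        if tag == "title" and tag == self.processing:
            self.messages.append("Document title: " + text)
        elif tag == "ul":
            self.messages.append("List ended.")
        elif tag == "li" and tag == self.processing:
            self.messages.append("List item: " + text)
        self.processing = None


parser = ListParser()
# Neither <li> is closed, and </b> has no matching start tag.
parser.feed("<title>Demo</title><ul><li>First<li>Second</ul></b>")
parser.close()
print("\n".join(parser.messages))
```

The second <li> closes the first one, </ul> closes the still-open <li> before ending the list, and the stray </b> is ignored.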
This article comes from the "Lotus's Thoughts" blog. Please keep this source attribution: http://liandesinian.blog.51cto.com/7737219/1556991
This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
