Suspected bug: Under specific circumstances, SGMLParser mis-parses HTML tags whose attributes contain JavaScript.
Library: sgmllib in Python 2.4/2.5
Related library: Beautiful Soup 3.0.3 and 3.0.5
Example:
The HTML is defined as follows:
sExceptionHtml = '''<span>error html Tag:</span><div id='error'><img
onmouseover="if (this.width > screen.width * 0.7) {this.resized = true; this.width = screen.width * 0.7;
this.style.cursor = 'hand'; this.alt = 'click here to open new window\nCTRL + Mouse wheel to zoom in/out';}"
onload="if (this.width > screen.width * 0.7) {this.resized = true; this.width = screen.width * 0.7;
this.alt = 'click here to open new window\nCTRL + Mouse wheel to zoom in/out';}"
/> Cold! <br/></div>'''
This img tag has two attributes, onload and onmouseover. Both contain JavaScript code that uses the ">" comparison operator. SGMLParser parses such HTML incorrectly.
For the HTML above, the img tag comes out of the parser as follows:
this.style.cursor = 'hand'; this.alt = 'click here to open new window
CTRL + Mouse wheel to zoom in/out';}" />screen.width * 0.7) {this.resized = true; this.width = screen.width * 0.7;
this.style.cursor = 'hand'; this.alt = 'click here to open new window
CTRL + Mouse wheel to zoom in/out';}"
onload="if (this.width > screen.width * 0.7) {this.resized = true; this.width = screen.width * 0.7;
this.alt = 'click here to open new window
CTRL + Mouse wheel to zoom in/out';}"
/>
Clearly, the onmouseover attribute has been garbled. Most likely the ">" in the JavaScript expression "this.width > screen.width * 0.7" was mistaken for the closing character of the HTML tag.
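To illustrate the suspected cause, here is a minimal sketch. It is not sgmllib's actual code; naive_tag_end and quote_aware_tag_end are hypothetical names invented for this illustration. The first function takes the first ">" it sees as the end of the tag, the second skips any ">" that sits inside a quoted attribute value.

```python
def naive_tag_end(html, start):
    """Return the index of the first '>' at or after start, ignoring quotes."""
    return html.find(">", start)


def quote_aware_tag_end(html, start):
    """Return the index of the first '>' at or after start that is outside quotes."""
    quote = None  # the quote character we are currently inside, if any
    for i in range(start, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch  # opening quote found
        elif ch == ">":
            return i
    return -1


# Shortened version of the img tag from the example (assumption: one attribute is enough to show the effect).
tag = '<img onload="if (this.width > screen.width * 0.7) {this.resized = true;}" />'
naive_end = naive_tag_end(tag, 0)
aware_end = quote_aware_tag_end(tag, 0)
naive_cut = tag[:naive_end + 1]   # what a quote-blind scan would treat as the whole tag
aware_cut = tag[:aware_end + 1]   # what a quote-aware scan would treat as the whole tag
```

With the quote-blind scan the tag is cut off in the middle of the onload value, and the rest of the JavaScript would spill into the document as text, which resembles the garbled output shown above.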
If that is the cause, the behavior is understandable, but it still affects us. A workaround is to strip the onload and onmouseover attributes before parsing, so the JavaScript cannot interfere:
page_content = re.sub('onload=\"\s*[^\"]*\"', '', page_content)
page_content = re.sub('onmouseover=\"\s*[^\"]*\"', '', page_content)
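The two substitutions above only match a lowercase attribute name, no whitespace around "=", and a double-quoted value. A slightly more tolerant variant might look like the sketch below. EVENT_ATTR_RE and strip_event_attrs are hypothetical names, not part of sgmllib or Beautiful Soup, and the sample markup is a shortened stand-in for the example HTML.

```python
import re

# Hypothetical helper: matches onload/onmouseover attributes with a quoted value.
EVENT_ATTR_RE = re.compile(
    r"""\s+(?:onload|onmouseover)   # attribute name, preceded by whitespace
        \s*=\s*                     # '=' with optional surrounding whitespace
        (?:"[^"]*"|'[^']*')         # double- or single-quoted value
    """,
    re.IGNORECASE | re.VERBOSE,
)


def strip_event_attrs(page_content):
    """Remove quoted onload/onmouseover attributes from the markup."""
    return EVENT_ATTR_RE.sub("", page_content)


# Shortened sample (assumption: same structure as the example, less JavaScript).
sample = (
    "<div id='error'><img\n"
    ' onmouseover = "if (this.width > screen.width * 0.7) {this.alt = \'zoom\';}"\n'
    ' onload="if (this.width > screen.width * 0.7) {this.resized = true;}"\n'
    "/> Cold! <br/></div>"
)
cleaned = strip_event_attrs(sample)
```

Like the original substitutions, this is a text-level workaround: it would also remove a matching string that happens to appear in ordinary page text, and it leaves unquoted attribute values alone.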
Implication: This affects how Beautiful Soup parses HTML.
You can run the following code to reproduce the problem:
# coding=utf-8
import sys, os, urllib, re
from sgmllib import SGMLParser
from BeautifulSoup import BeautifulSoup
def replaceHTMLTag(content):
    htmlextractor = html2txt()
    # Call the feed method defined in SGMLParser to pass the HTML content to the parser.
    htmlextractor.feed(content)
    # Close the parser object as well: feed does not guarantee that it has processed all the HTML passed to it;
    # it may buffer data while waiting for more. Once there is no more content, call close to flush the buffer and force everything to be processed.
    htmlextractor.close()
    # Parsing is finished once the parser is closed. htmlextractor.text holds the text extracted from the HTML document.
    return htmlextractor.text
# To extract data from HTML documents, subclass SGMLParser and define methods for the tags or entities you want to capture.
class html2txt(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self._result = []
        self._data_stack = []
    # reset is called by SGMLParser's __init__ method. You can also call it manually once a parser instance has been created.
    # So if you need initialization, do it in reset instead of in __init__.
    # That way, when someone re-uses a parser instance, it is re-initialized correctly.
    def reset(self):
        self.text = ''
        self.inbody = True
        SGMLParser.reset(self)
    def handle_data(self, text):
        # Collect text only while outside <head>.
        if self.inbody:
            self.text += text
    def _write(self, d):
        if len(self._data_stack) < 2:
            target = self._result
        else:
            target = self._data_stack[-1]
        if type(d) in (list, tuple):
            target += d
        else:
            target.append(str(d))
    def start_head(self, attrs):
        self.inbody = False
    def end_head(self):
        self.inbody = True
    def _get_result(self):
        return "".join(self._result).strip()
    result = property(_get_result)
# Entry point
if __name__ == '__main__':
    sExceptionHtml = '''<span>error html Tag:</span><div id='error'><img
onmouseover="if (this.width > screen.width * 0.7) {this.resized = true; this.width = screen.width * 0.7;
this.style.cursor = 'hand'; this.alt = 'click here to open new window\nCTRL + Mouse wheel to zoom in/out';}"
onload="if (this.width > screen.width * 0.7) {this.resized = true; this.width = screen.width * 0.7;
this.alt = 'click here to open new window\nCTRL + Mouse wheel to zoom in/out';}"
/> Cold! <br/></div>'''
    soup = BeautifulSoup(sExceptionHtml, fromEncoding='gbk')
    body_content = soup.findAll('div', attrs={'id': re.compile("^error")})
    print '----------------------'
    print body_content[0]
    print '----------------------'
    sExceptionHtml = replaceHTMLTag(sExceptionHtml).strip()
    print '----------------------'
    print sExceptionHtml
    print '-----------------------'
Conclusion: This is not a serious problem. It only matters when JavaScript is written inside tag attributes in the HTML: if a ">" character appears there, SGMLParser, and any library built on SGMLParser, will parse the tag incorrectly.
Zhengyun 20080115