A few days ago I ran into a problem: I needed to pick out part of the content of a web page, so I looked into two libraries, urllib and HTMLParser. urllib downloads the page, which is then handed to HTMLParser for parsing. This was my first time using HTMLParser, and I hit a few problems while searching the official documentation, so I am writing them down here to share.
An example
    from HTMLParser import HTMLParser  # Python 2 module; in Python 3 it is html.parser

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            # Called automatically by feed() for every start tag
            print "A start tag:", tag, self.getpos()

    parser = MyHTMLParser()
    parser.feed('<div><p>"hello"</p></div>')
In this example, HTMLParser is the base class; MyHTMLParser overrides its handle_starttag method to print some information. parser is an instance of MyHTMLParser, and calling its feed method starts the parsing. Note that handle_starttag runs without ever being called explicitly.
How HTMLParser's methods get called puzzled me for a long time. Only after reading quite a few blog posts did it dawn on me: HTMLParser's methods fall into two categories, those that you must call explicitly and those that you never call yourself.
Methods that do not need to be called explicitly
The following methods are triggered during parsing. By default they do nothing, so we override them according to our needs.
1. HTMLParser.handle_starttag(tag, attrs): called when a start tag is encountered, such as <p class='para'>. The parameter tag is the tag name, here 'p'; attrs is a list of (name, value) pairs for all of the tag's attributes, here [('class', 'para')].
2. HTMLParser.handle_endtag(tag): called when an end tag is encountered; tag is the tag name.
3. HTMLParser.handle_data(data): called when the content between tags is encountered, such as <style>p {color: blue}</style>; the parameter data is the text between the start and end tags. Note that for <div><p>...</p></div> it is called only for the text inside p, not for the div, which has no text of its own.
Of course there are other handler methods, which are not covered here.
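To see the three handlers fire together, here is a minimal sketch. It is written for Python 3, where the class lives in html.parser rather than in the HTMLParser module used elsewhere in this post. EventRecorder and record_events are hypothetical names made up for the illustration.

```python
from html.parser import HTMLParser  # Python 3; Python 2 used: from HTMLParser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records every callback the parser triggers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Only called where there is text between tags
        self.events.append(("data", data))


def record_events(html):
    """Parse html and return the recorded callbacks in order."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


for event in record_events("<div><p class='para'>hello</p></div>"):
    print(event)
```

None of the handlers is called by our own code; feed triggers them. For this input handle_data fires once, for the text inside p, and never for the div.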
Methods that are called explicitly
1. HTMLParser.feed(data): the parameter is the HTML string to parse; parsing starts as soon as the method is called.
2. HTMLParser.getpos(): returns the current line number and offset, such as (23, 5).
3. HTMLParser.get_starttag_text(): returns the text of the most recently opened start tag.
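The explicitly called methods can be combined as in the following sketch, again written for Python 3 (html.parser). PositionReporter and report_positions are hypothetical names; the sketch calls getpos and get_starttag_text from inside a handler, which is one plausible place to use them.

```python
from html.parser import HTMLParser  # Python 3; Python 2 used: from HTMLParser import HTMLParser


class PositionReporter(HTMLParser):
    """Hypothetical helper that notes where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line, offset); get_starttag_text() gives the raw tag text
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))


def report_positions(html):
    """Feed html to the parser and return (tag, position, raw text) triples."""
    parser = PositionReporter()
    parser.feed(html)  # the explicit call that starts parsing
    parser.close()
    return parser.seen


for entry in report_positions('<div>\n  <p class="para">hi</p>\n</div>'):
    print(entry)
```

Line numbers start at 1 and offsets at 0, so the div is reported at (1, 0) and the p on the second line at (2, 2).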
That covers everything. One final note: HTMLParser is only a simple module, and its HTML parsing is far from complete. For example, it cannot tell by itself that a tag such as <br> is self-closing: it reports <br> as an ordinary start tag and only the explicit <br/> form as a self-closing (start-end) tag. Look at the following code:
    from HTMLParser import HTMLParser  # Python 2 module; in Python 3 it is html.parser

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print 'begin tag', tag
        def handle_startendtag(self, tag, attrs):
            print 'begin end tag', tag

    str1 = '<br>'
    str2 = '<br/>'
    parser = MyHTMLParser()
    parser.feed(str1)  # prints "begin tag br"
    parser.feed(str2)  # prints "begin end tag br"
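One possible workaround, sketched here for Python 3 (html.parser), is to keep your own set of tag names that should count as self-closing. VOID_TAGS, VoidAwareParser and classify_tags are hypothetical names, and the set below is only an illustrative subset, not a complete list of such tags.

```python
from html.parser import HTMLParser  # Python 3; Python 2 used: from HTMLParser import HTMLParser

# Assumption: a small illustrative subset of tags that never take an end tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class VoidAwareParser(HTMLParser):
    """Hypothetical parser that treats <br> and <br/> the same way."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # The parser reports <br> as a plain start tag, so check the name ourselves
        kind = "self-closing" if tag in VOID_TAGS else "start"
        self.events.append((kind, tag))

    def handle_startendtag(self, tag, attrs):
        # Reached only for the explicit <tag/> form
        self.events.append(("self-closing", tag))


def classify_tags(html):
    """Return (kind, tag) pairs for every opening tag in html."""
    parser = VoidAwareParser()
    parser.feed(html)
    parser.close()
    return parser.events


print(classify_tags("<br><br/><p>"))
```

With this sketch both spellings of br are recorded as self-closing, while p is still recorded as an ordinary start tag.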