Using HTMLParser to parse HTML in Python, with examples

Source: Internet
Author: User

A few days ago I ran into a problem: I needed to extract part of the content of a web page, so I turned to two libraries, urllib and HTMLParser. urllib fetches the page and HTMLParser then parses it. This was my first time using HTMLParser, and I hit some problems while reading the official documentation, so I am writing them down here to share.

An example

The code is as follows:

from HTMLParser import HTMLParser  # Python 2 module; in Python 3 use: from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called automatically by feed() for every start tag
    def handle_starttag(self, tag, attrs):
        print "A start tag:", tag, self.getpos()

parser = MyHTMLParser()
parser.feed('<div><p>"hello"</p></div>')

In this example, HTMLParser is the base class, and MyHTMLParser overrides its handle_starttag method to print some information. parser is an instance of MyHTMLParser, and calling its feed method starts the parsing. Note that handle_starttag is never called explicitly, yet it still runs.

How HTMLParser's methods get called puzzled me for a long time, and it only became clear after reading many blog posts: the methods fall into two categories, those you call explicitly and those the parser calls for you.

Methods that do not need to be called explicitly

The following methods are triggered during parsing. By default they do nothing, so we override them according to our needs.

1. HTMLParser.handle_starttag(tag, attrs): Called when the parser encounters a start tag such as <p class="para">. The parameter tag is the tag name, here 'p'; attrs is a list of (name, value) pairs for all attributes of the tag, here [('class', 'para')].

2. HTMLParser.handle_endtag(tag): Called when an end tag is encountered; tag is the tag name.

3. HTMLParser.handle_data(data): Called when the content between tags is encountered. For <style> p {color: blue} </style>, the parameter data is the content between the opening and closing tags. Note that for nested tags such as <div><p>...</p></div>, it is called once for the text inside <p>, not separately for the <div>.

There are other handler methods as well, which are not covered here.
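The three handlers above can be seen working together in a minimal sketch. It assumes Python 3, where the module is named html.parser rather than HTMLParser; the EventRecorder class and its events list are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser  # Python 3 name; in Python 2 the module is HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records every handler call made during feed()."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed('<div><p class="para">hello</p></div>')

# None of the handlers were called by us; feed() triggered them all.
for event in recorder.events:
    print(event)
```

The recorded list contains a single data event, for the text inside <p>; the <div> contributes only its start and end events.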

Methods that are called explicitly

1. HTMLParser.feed(data): The parameter is the HTML string to parse; parsing starts as soon as the method is called.

2. HTMLParser.getpos(): Returns the current line number and offset, such as (23, 5).

3. HTMLParser.get_starttag_text(): Returns the text of the most recently opened start tag.
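As a rough illustration of the explicitly called methods, the sketch below calls getpos() and get_starttag_text() from inside a handler while feed() is running. It assumes Python 3 (module html.parser); PositionParser and its seen list are hypothetical names, not part of the library.

```python
from html.parser import HTMLParser  # Python 3 name; in Python 2 the module is HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical helper: notes where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line number, offset); get_starttag_text() gives
        # the raw source text of the start tag being handled.
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))


parser = PositionParser()
parser.feed('<div>\n  <p class="para">hello</p>\n</div>')

for tag, position, text in parser.seen:
    print(tag, position, text)
```

Line numbers start at 1 and offsets at 0, so the opening <div> is reported at (1, 0).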

That covers everything. One last note: HTMLParser is only a simple module, and its HTML parsing is not perfect. For example, it cannot accurately tell a start tag from a self-closing tag: it relies only on the trailing slash, so <br> and <br/> are reported through different handlers. Look at the following code:

The code is as follows:

from HTMLParser import HTMLParser  # Python 2 module; in Python 3 use: from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print 'begin tag', tag

    def handle_startendtag(self, tag, attrs):
        print 'begin end tag', tag

str1 = '<br>'
str2 = '<br/>'
parser = MyHTMLParser()

parser.feed(str1)  # prints "begin tag br"
parser.feed(str2)  # prints "begin end tag br"
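One possible way to smooth over this difference is to route both forms to the same code path yourself. The sketch below is only an illustration under stated assumptions: it targets Python 3 (module html.parser), VOID_TAGS is a hand-picked subset of tags chosen for this example rather than anything provided by the library, and VoidAwareParser is a hypothetical class name.

```python
from html.parser import HTMLParser  # Python 3 name; in Python 2 the module is HTMLParser

# Assumption: a small, hand-picked set of tags that never have content.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class VoidAwareParser(HTMLParser):
    """Hypothetical helper: reports <br> and <br/> as the same kind of event."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # A tag written without the trailing slash lands here.
        if tag in VOID_TAGS:
            self.events.append(("empty", tag))
        else:
            self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # A tag written with the trailing slash lands here.
        self.events.append(("empty", tag))


parser = VoidAwareParser()
parser.feed("<p>one<br>two<br/>three</p>")
print(parser.events)
```

With this arrangement both spellings of the line break end up as the same "empty" event, while ordinary tags such as <p> are still recorded as start tags.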
