Using HTMLParser to parse HTML in Python, with examples

Source: Internet
Author: User
This article introduces how to use HTMLParser to parse HTML in Python. It walks through an example and divides the methods of HTMLParser into two categories: those that you call explicitly, and those that the parser calls for you.

Over the past few days I ran into a problem where I needed to pick out part of the content of a web page, and I found two libraries for the job: urllib and HTMLParser. urllib fetches the web page and hands it to HTMLParser for parsing. Using HTMLParser for the first time, I ran into some problems while reading the official documentation, so I am writing them down here to share with you.

Example

The code is as follows:


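# Python 2 module name; in Python 3 the class lives in html.parser
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called by the parser for every start tag it encounters
    def handle_starttag(self, tag, attrs):
        print "a start tag:", tag, self.getpos()

parser = MyHTMLParser()
# Illustrative input: a paragraph with a class attribute
parser.feed('<p class="para">"Hello"</p>')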


In this example, HTMLParser is the base class, and the handle_starttag method is overridden to print some information. parser is an instance of MyHTMLParser, and calling its feed method starts the parsing. It is worth noting that handle_starttag runs without ever being called explicitly.

For a long time I was confused about how the methods of HTMLParser get called. After reading many blog posts, I finally realized that HTMLParser contains two kinds of methods: those that must be called explicitly, and those that are called for you during parsing.

Methods that do not require explicit calling

The following methods are triggered during parsing, but by default they do nothing. We therefore need to override them according to our own needs.

1. HTMLParser.handle_starttag(tag, attrs): called when a start tag is encountered during parsing.

The parameter tag is the tag name, and attrs is the list of all (name, value) attribute pairs of the tag. For a start tag such as <p class="para">, tag is 'p' and attrs is [('class', 'para')].

2. HTMLParser.handle_endtag(tag): called when an end tag is encountered. The parameter tag is the tag name.

3. HTMLParser.handle_data(data): called when the text content between tags is encountered. The data parameter is the content between the opening and closing tags. It is worth noting that when the markup is nested, with an outer element wrapping <p>...</p>, handle_data is not called at the outer tag, but only at p.

Of course, there are other handler methods as well; they are not covered here.
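To make the implicit calls easier to see, here is a minimal sketch that records every callback the parser fires. The class name EventRecorder, its events list, and the sample markup are my own illustrative choices, not part of the library. The import is written so that it should work under both the Python 2 module name used in this article and the Python 3 name html.parser.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name, as in this article


class EventRecorder(HTMLParser):
    """Hypothetical subclass that records each callback the parser fires."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Fired only where text actually appears between tags
        self.events.append(("data", data))


recorder = EventRecorder()
# Illustrative input: a paragraph nested inside an outer element
recorder.feed('<div><p class="para">Hello</p></div>')
for event in recorder.events:
    print(event)
```

Only feed is called by hand here; the three handlers run as a consequence of it. The outer element has no text of its own, so the single data event comes from the text inside p.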

Methods that are called explicitly

1. HTMLParser.feed(data): the parameter is the HTML string to be parsed. The string is parsed as soon as the method is called.

2. HTMLParser.getpos(): returns the current line number and offset as a tuple.

3. HTMLParser.get_starttag_text(): returns the text of the most recently opened start tag.
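The three explicit methods can be combined in one short sketch. The class name PositionReporter, its seen list, and the sample markup are hypothetical names chosen for illustration. The sketch calls getpos and get_starttag_text from inside handle_starttag, so the values describe the tag that is being handled at that moment.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name, as in this article


class PositionReporter(HTMLParser):
    """Hypothetical subclass that notes where each start tag was found."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line number, offset); get_starttag_text() gives
        # the raw text of the start tag that was just opened
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))


reporter = PositionReporter()
# Illustrative multi-line input so that the line number changes
reporter.feed('<ul>\n  <li class="item">one</li>\n</ul>')
for entry in reporter.seen:
    print(entry)
```

Line numbers start at 1 and offsets at 0, so the first tag of a document is reported at (1, 0).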

With the main content covered, a few caveats remain. HTMLParser is only a simple module, and its HTML parsing is far from complete. For example, it cannot accurately tell ordinary tags and "self-closing tags" apart: the same element is reported through a different handler depending on how it is written. See the following code:

The code is as follows:


from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for a start tag written as <tag>
    def handle_starttag(self, tag, attrs):
        print 'begin tag', tag

    # Called for a tag written in the self-closing form <tag/>
    def handle_startendtag(self, tag, attrs):
        print 'begin end', tag

str1 = '<br>'
str2 = '<br/>'
parser = MyHTMLParser()

parser.feed(str1)  # output "begin tag br"
parser.feed(str2)  # output "begin end br"
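One possible way around this, offered only as a sketch and not as something the library does for you, is to keep your own set of tags that never have a closing tag and to treat both spellings the same way. The names VOID_TAGS and VoidAwareParser are hypothetical, and the set below is a deliberately partial, hand-picked list.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name, as in this article

# Assumption: a partial, hand-picked list of tags that take no closing tag
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class VoidAwareParser(HTMLParser):
    """Hypothetical subclass that treats <br> and <br/> alike."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.void_seen = []
        self.open_seen = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Written as <br>, but it will never be closed
            self.void_seen.append(tag)
        else:
            self.open_seen.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Written in the explicit self-closing form <br/>
        self.void_seen.append(tag)


parser = VoidAwareParser()
parser.feed('<p>line one<br>line two<br/></p>')
print(parser.open_seen)
print(parser.void_seen)
```

With this arrangement both spellings of br end up in void_seen, while p is recorded as an ordinary opening tag.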
