Using HTMLParser to parse HTML in Python, with examples

Source: Internet
Author: User

I ran into a problem a few days ago: I needed to extract some content from a web page, so I turned to two libraries, urllib and HTMLParser. urllib fetches the web page and hands it to HTMLParser for parsing. Using HTMLParser for the first time, I hit a few difficulties even after consulting the official documentation, so I am writing them down here to share.
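The workflow just described, fetching a page with urllib and handing it to HTMLParser, can be sketched roughly as follows. This is only an illustration under some assumptions: it targets Python 3, where the modules live at html.parser and urllib.request, and the names LinkCollector, extract_links and fetch_html are made up for this sketch. Collecting link targets is just one example of "picking out content".

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Hypothetical parser that collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Parse an HTML string and return the link targets found in it."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_html(url, encoding="utf-8"):
    """Download a page and decode it (the encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


# Parsing a literal string keeps the example independent of the network.
sample = '<p><a href="/docs">Docs</a> and <a href="/blog">Blog</a></p>'
print(extract_links(sample))
```

For a real page, the two steps would be chained as extract_links(fetch_html(url)).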

Example
# Python 2 module name; in Python 3 use: from html.parser import HTMLParser
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Called by the parser for every start tag it encounters
        print "a start tag:", tag, self.getpos()

parser = MyHTMLParser()
parser.feed('<div><p>"hello"</p></div>')

In this example, HTMLParser is the base class, and its handle_starttag method is overridden to print some information. parser is an instance of MyHTMLParser, and calling its feed method starts the parsing. Note that handle_starttag runs even though our code never calls it explicitly.
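For readers on Python 3, the same idea can be sketched as below. The class name StartTagLogger and the seen list are invented for this illustration; the point is only that feed() is the single call we make, and the handler fires on its own.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class StartTagLogger(HTMLParser):
    """Hypothetical subclass that records each start tag and its position."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Our code never calls this method; feed() invokes it per start tag.
        self.seen.append((tag, self.getpos()))


logger = StartTagLogger()
logger.feed('<div><p>"hello"</p></div>')
logger.close()  # tell the parser that no more input is coming

for tag, position in logger.seen:
    print("a start tag:", tag, position)
```

Two lines are printed, one for div and one for p, although handle_starttag never appears in the calling code.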

The way HTMLParser's methods get called confused me for a long time. After reading many blog posts, it finally became clear: HTMLParser has two kinds of methods, those you call explicitly and those the parser calls for you.

Methods that do not require explicit calling

The following methods are triggered during parsing, but their default implementations do nothing, so we override them according to our own needs.

1. HTMLParser.handle_starttag(tag, attrs): called when a start tag is encountered during parsing, for example <p class='para'>. The parameter tag is the tag name converted to lowercase, here 'p'; attrs is the list of (name, value) pairs for all of the tag's attributes, here [('class', 'para')].

2. HTMLParser.handle_endtag(tag): called when an end tag is encountered. The parameter tag is the tag name.

3. HTMLParser.handle_data(data): called when text between tags is encountered, for example <style>p {color: blue;}</style>. The parameter data is the content between the opening and closing tags. Note that for <div><p>...</p></div> it is not called for div, which has no text of its own, but only for the text inside p.
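To see the three handlers listed above working together, here is a small sketch for Python 3 (html.parser). EventRecorder and its events list are names chosen for this illustration; the handlers simply record what they receive, in order.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical subclass that records every handler call in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs.
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags, here only the text inside <p>.
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<div><P class='para'>hello</P></div>")
recorder.close()

for event in recorder.events:
    print(event)
```

The single data event comes from the text inside p. The div contributes only a start and an end event because it contains no text of its own.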

There are other handler methods as well, but they are not covered here.

Methods that are called explicitly

1. HTMLParser.feed(data): the parameter is the HTML string to be parsed. Calling it parses the string.

2. HTMLParser.getpos(): returns the current line number and offset as a tuple.

3. HTMLParser.get_starttag_text(): returns the text of the most recently opened start tag.
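The explicitly called methods above can be combined in one short sketch, again assuming Python 3 (html.parser). TagLocator and its found list are hypothetical names: the code calls feed() itself, and inside the handler it asks the parser where it is and what the raw start tag looked like.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical subclass pairing each start tag with its position."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        position = self.getpos()             # (line number, offset)
        raw_text = self.get_starttag_text()  # the start tag as written
        self.found.append((position, raw_text))


locator = TagLocator()
locator.feed('<div>\n  <p class="para">hello</p>\n</div>')
locator.close()

for position, raw_text in locator.found:
    print(position, raw_text)
```

Line numbers start at 1 and offsets at 0, so the p tag on the second line, indented by two spaces, is reported at (2, 2).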

A few caveats remain. HTMLParser is a simple module, and its HTML parsing is far from complete. For example, it cannot reliably tell ordinary start tags from self-closing tags: it decides purely by syntax, so <br> is reported as a start tag and only <br/> as a self-closing tag. See the following code:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print 'begin tag', tag

    def handle_startendtag(self, tag, attrs):
        print 'begin end tag', tag

str1 = '<br>'
str2 = '<br/>'
parser = MyHTMLParser()

parser.feed(str1)  # output: begin tag br
parser.feed(str2)  # output: begin end tag br
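One possible workaround, offered only as a sketch and not as something the module provides, is to keep your own set of elements that never have a closing tag and treat them the same way however they are written. The example assumes Python 3 (html.parser); VOID_ELEMENTS is a partial list chosen for illustration, and VoidAwareParser is a hypothetical name.

```python
from html.parser import HTMLParser

# Assumption: a partial, hand-maintained list of elements without end tags.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class VoidAwareParser(HTMLParser):
    """Hypothetical subclass that reports <br> and <br/> the same way."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # <br> arrives here; reclassify it if the tag is a known void element.
        kind = "void" if tag in VOID_ELEMENTS else "start"
        self.events.append((kind, tag))

    def handle_startendtag(self, tag, attrs):
        # <br/> arrives here because of the trailing slash.
        self.events.append(("void", tag))


parser = VoidAwareParser()
parser.feed("<br><br/><p>")
parser.close()
print(parser.events)
```

Both spellings of br end up as the same kind of event, while p is still reported as an ordinary start tag.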
