Using HTMLParser to parse HTML in Python, with examples

Last Update:2015-02-10 Source: Internet

Author: User


A few days ago I needed to extract some content from a web page, and I found two libraries for the job: urllib and HTMLParser. urllib fetches the page and hands it to HTMLParser for parsing. Using HTMLParser for the first time, I ran into some problems even while consulting the official documentation, so I am writing them down here to share.

Example
The code is as follows:

from HTMLParser import HTMLParser  # Python 2; in Python 3: from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Called by the parser for every start tag it encounters
        print "a start tag:", tag, self.getpos()

parser = MyHTMLParser()
parser.feed('<div><p>"hello"</p></div>')

In this example, HTMLParser is the base class, and MyHTMLParser overrides its handle_starttag method to print some information. parser is an instance of MyHTMLParser, and calling its feed method starts the parsing. Note that handle_starttag runs even though the code never calls it explicitly.
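To make the implicit call visible, here is a minimal sketch of the same idea. It assumes Python 3, where the module lives at html.parser, and the TagCollector class and its seen list are hypothetical names introduced only for illustration. Instead of printing, the handler stores what it receives, so we can inspect it after feed returns.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Our code never calls this method; feed() invokes it
        # once for each start tag found in the input.
        self.seen.append((tag, self.getpos()))


collector = TagCollector()
collector.feed('<div><p>"hello"</p></div>')

# Each entry is (tag name, (line number, offset)).
for tag, position in collector.seen:
    print("a start tag:", tag, position)
```

After feed returns, collector.seen holds one entry for div and one for p, even though handle_starttag never appears in a call of our own.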

For a long time I was confused about how HTMLParser's methods get called. After reading many blog posts, I finally realized that HTMLParser has two kinds of methods: those you call explicitly, and those the parser calls for you.

Methods that do not require explicit calling

The following methods are triggered during parsing, but by default they do nothing. We therefore override them according to our own needs.

1. HTMLParser.handle_starttag(tag, attrs): called when a start tag is encountered during parsing, for example <p class='para'>. The tag parameter is the tag name, here 'p'; attrs is a list of (name, value) pairs for all of the tag's attributes, here [('class', 'para')].

2. HTMLParser.handle_endtag(tag): called when an end tag is encountered. The tag parameter is the tag name.

3. HTMLParser.handle_data(data): called when text inside a tag is encountered, for example <style>p {color: blue;}</style>. The data parameter is the content between the opening and closing tags. Note that for <div><p>...</p></div>, it is called only for the text inside p and not for div, because div contains no text of its own.

There are other handler methods as well, but they are not covered here.
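The three handlers above can be sketched together as follows. This assumes Python 3 (html.parser); EventRecorder and its events list are hypothetical names used only for this illustration. Each handler appends a tuple, so the order in which the parser fires them becomes visible.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: logs each handler call in the order it happens."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<div><p class='para'>hello</p></div>")
recorder.close()  # flush anything still buffered

for event in recorder.events:
    print(event)
```

The only data event in this run is the text inside p; div has no text of its own, so handle_data is not triggered for it.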

Methods that are called explicitly

1. HTMLParser.feed(data): the parameter is the HTML string to be parsed. Calling it parses the string.

2. HTMLParser.getpos(): returns the current line number and offset.

3. HTMLParser.get_starttag_text(): returns the text of the most recently opened start tag.
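As a rough illustration of how these explicit methods fit together, the sketch below calls getpos and get_starttag_text from inside a handler. It assumes Python 3 (html.parser); PositionReporter and its report list are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class PositionReporter(HTMLParser):
    """Hypothetical helper: pairs each text chunk with its enclosing start tag."""

    def __init__(self):
        super().__init__()
        self.report = []

    def handle_data(self, data):
        # get_starttag_text() returns the raw source of the last start tag;
        # getpos() returns a (line number, offset) tuple.
        self.report.append((data, self.get_starttag_text(), self.getpos()))


reporter = PositionReporter()
reporter.feed("<p class='para'>hello</p>")  # feed() is the call that drives parsing

for data, start_tag, position in reporter.report:
    print(repr(data), "follows", start_tag, "at", position)
```

Note that get_starttag_text returns the tag exactly as written in the source, attributes included, which can be useful when the parsed tag name and attrs list are not enough.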

Having covered all of that, a few caveats remain. HTMLParser is a simple module, and its HTML parsing is not complete. For example, it cannot reliably tell ordinary tags from "self-closing" tags: it reports <br> as an ordinary start tag and treats only the <br/> form as self-closing. See the following code:
from HTMLParser import HTMLParser  # Python 2; in Python 3: from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print 'begin tag', tag

    def handle_startendtag(self, tag, attrs):
        # Called only for the XHTML-style <tag/> form
        print 'begin end tag', tag

str1 = '<br>'
str2 = '<br/>'
parser = MyHTMLParser()

parser.feed(str1)  # output: "begin tag br"
parser.feed(str2)  # output: "begin end tag br"
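One possible workaround for this limitation, offered only as a sketch, is to keep your own list of tags that should count as self-closing and check it inside handle_starttag. The example assumes Python 3 (html.parser); VOID_TAGS is a hand-picked, incomplete set and VoidAwareParser is a hypothetical name, neither of which comes from the module itself.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked subset of tags that never have a closing tag.
# Extend it to match the pages you actually parse.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class VoidAwareParser(HTMLParser):
    """Hypothetical helper: treats <br> and <br/> the same way."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # The parser sends <br> here, so we classify it ourselves.
        kind = "self-closing" if tag in VOID_TAGS else "start"
        self.events.append((kind, tag))

    def handle_startendtag(self, tag, attrs):
        # The parser sends <br/> here.
        self.events.append(("self-closing", tag))


void_parser = VoidAwareParser()
void_parser.feed("<br><br/><p>")

for kind, tag in void_parser.events:
    print(kind, tag)
```

With this check in place, both spellings of br are recorded as self-closing, while p is still recorded as an ordinary start tag.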
