Tutorial on using HTMLParser to parse HTML in Python

Source: Internet
Author: User
This article describes how to use HTMLParser to parse HTML in Python, which is especially useful when writing crawler programs. If you want to write a search engine, the first step is to use a crawler to fetch the pages of the target website, and the second step is to parse each HTML page and check whether its content is news, images, or video.

Assuming that step 1 has been completed, how should I parse HTML in step 2?

HTML looks much like XML, but its syntax is not as strict as that of XML (for example, tags may be left unclosed), so it cannot be reliably parsed with a standard XML DOM or SAX parser.
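
The difference in strictness can be illustrated with a small sketch. It assumes Python 3 module names (html.parser, xml.etree.ElementTree); the snippet and the helper names xml_accepts and TagCounter are made up for this illustration and are not part of the original article.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Ordinary HTML: <br> is never closed, so the snippet is not well-formed XML.
SNIPPET = '<p>Some text<br>END</p>'


def xml_accepts(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCounter(HTMLParser):
    """Record every start tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


xml_ok = xml_accepts(SNIPPET)

counter = TagCounter()
counter.feed(SNIPPET)
counter.close()

print(xml_ok)              # whether the XML parser accepted the snippet
print(counter.start_tags)  # start tags seen by HTMLParser
```

Under these assumptions the XML parser rejects the snippet because of the unclosed tag, while HTMLParser simply reports the tags it encounters.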

Fortunately, Python provides HTMLParser to parse HTML very conveniently, with just a few lines of code:

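    # Python 3 module names; in Python 2 these were HTMLParser and htmlentitydefs.
    from html.parser import HTMLParser
    from html.entities import name2codepoint

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Called for an opening tag such as <p>
            print('<%s>' % tag)

        def handle_endtag(self, tag):
            # Called for a closing tag such as </p>
            print('</%s>' % tag)

        def handle_startendtag(self, tag, attrs):
            # Called for a self-closing tag such as <br/>
            print('<%s/>' % tag)

        def handle_data(self, data):
            # Called for text between tags; prints a fixed marker
            print('data')

        def handle_comment(self, data):
            # Called for a comment; prints a fixed marker
            print('<!-- -->')

        # With the Python 3.5+ default (convert_charrefs=True) the two handlers
        # below are not called; references are decoded and passed to handle_data.
        def handle_entityref(self, name):
            print('&%s;' % name)

        def handle_charref(self, name):
            print('&#%s;' % name)

    parser = MyHTMLParser()
    parser.feed('<html><head></head><body><p>Some <a href="#">html</a> tutorial...<br>END</p></body></html>')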
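
Since the goal stated earlier is to tell what kind of content a page holds, the attrs argument of handle_starttag is often the most useful part. The following is a minimal sketch, assuming Python 3; the class name LinkCollector and the sample markup are hypothetical and only illustrate how link and image addresses might be gathered.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict makes lookups easier.
        attr_map = dict(attrs)
        if tag == 'a' and attr_map.get('href'):
            self.links.append(attr_map['href'])
        elif tag == 'img' and attr_map.get('src'):
            self.images.append(attr_map['src'])


collector = LinkCollector()
collector.feed('<p><a href="/news/1">News</a> <img src="/img/logo.png" alt="logo"></p>')
collector.close()

print(collector.links)
print(collector.images)
```

A crawler could apply the same idea to other tags, for example video elements, depending on what it needs to classify.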

The feed() method can be called multiple times, so the entire HTML string does not have to be passed in at once; it can be fed to the parser piece by piece.
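
A short sketch of feeding a document in pieces, assuming Python 3. The names Recorder and parse_in_chunks are invented for this example. Text may be delivered to handle_data in several pieces, so the sketch joins the pieces before comparing results.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Record start tags and text so two parsing runs can be compared."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse_in_chunks(chunks):
    """Feed each chunk in turn, then close the parser to flush buffered input."""
    parser = Recorder()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser


# The chunk boundaries deliberately fall inside a tag and inside a word.
chunked = parse_in_chunks(['<p>Some <b', '>html</b> tuto', 'rial...</p>'])
whole = parse_in_chunks(['<p>Some <b>html</b> tutorial...</p>'])

print(chunked.tags)
print(''.join(chunked.text))
```

This is convenient when a page arrives over the network in blocks, because each block can be handed to the parser as soon as it is received.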

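There are two kinds of special characters: those written with a name, such as &nbsp;, and those written with a number, such as &#1234;. Both can be parsed by the parser, through handle_entityref and handle_charref respectively.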
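
The two kinds of references can be decoded as in the sketch below. It assumes Python 3, where the parser must be created with convert_charrefs=False for handle_entityref and handle_charref to be called; the class name EntityDecoder and the sample text are hypothetical.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class EntityDecoder(HTMLParser):
    """Rebuild the text of a document, decoding character references by hand."""

    def __init__(self):
        # convert_charrefs=False keeps the two reference callbacks active.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference such as &amp;; unknown names are kept verbatim.
        if name in name2codepoint:
            self.parts.append(chr(name2codepoint[name]))
        else:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        # Numeric reference: decimal (&#169;) or hexadecimal (&#xA9;).
        if name.lower().startswith('x'):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))


decoder = EntityDecoder()
decoder.feed('<p>Tom &amp; Jerry &#169; 2015 &#x3C;ok&#x3E;</p>')
decoder.close()
decoded = ''.join(decoder.parts)
print(decoded)
```

With the default convert_charrefs=True, Python 3 performs this decoding itself and passes the resulting text to handle_data.
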
Summary

To practice, find a web page on any website and try parsing its HTML with HTMLParser.
