This article describes how to use HTMLParser to parse HTML in Python, which is especially useful when writing crawlers. If you want to build a search engine, the first step is to use a crawler to fetch the pages of the target website, and the second step is to parse each HTML page to determine whether its content is news, images, or video.
Assuming that step 1 has been completed, how should I parse HTML in step 2?
HTML looks much like XML, but its syntax is not as strict as that of XML, so it cannot be reliably parsed with a standard XML DOM or SAX parser.
Fortunately, Python provides HTMLParser to parse HTML very conveniently, with just a few lines of code:
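# Python 2 module names; in Python 3 these are html.parser and html.entities.
from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint

class MyHTMLParser(HTMLParser):

    # Called for an opening tag such as <p>.
    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    # Called for a closing tag such as </p>.
    def handle_endtag(self, tag):
        print('</%s>' % tag)

    # Called for a self-closing tag such as <br/>.
    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    # Called for text between tags; prints a fixed marker, not the text itself.
    def handle_data(self, data):
        print('data')

    # Called for an HTML comment.
    def handle_comment(self, data):
        print('<!-- -->')

    # Called for a named entity reference such as &nbsp;.
    def handle_entityref(self, name):
        print('&%s;' % name)

    # Called for a numeric character reference such as &#1234;.
    def handle_charref(self, name):
        print('&#%s;' % name)

parser = MyHTMLParser()
parser.feed('<p>Some html tutorial...<br>END</p>')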
The feed() method can be called multiple times, so the entire HTML string does not have to be passed in at once; it can be fed to the parser piece by piece.
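As a rough illustration of feeding in pieces, the sketch below splits a small document in the middle of a tag and compares the result with a single call. It assumes Python 3, where the class lives in html.parser; TagCollector and its events list are hypothetical names used only for this example.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text in the order seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


# Feed the markup in two pieces, split in the middle of the <br> tag.
collector = TagCollector()
collector.feed('<p>Some html tutorial...<b')
collector.feed('r>END</p>')
collector.close()  # tell the parser no more input is coming

# Feed the same markup in one call for comparison.
whole = TagCollector()
whole.feed('<p>Some html tutorial...<br>END</p>')
whole.close()

print(collector.events)
print(collector.events == whole.events)
```

The parser buffers the incomplete tag from the first call and finishes it when the rest arrives, which is why both collectors should end up with the same events.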
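There are two kinds of special characters: named entity references written with a name, such as &nbsp;, and numeric character references written with a number, such as &#1234;. The parser can handle both.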
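A minimal sketch of how both kinds of reference reach the parser, assuming Python 3 (html.parser and html.entities). In Python 3.5 and later, references are converted into ordinary text by default, so the sketch passes convert_charrefs=False to have the two handlers called. EntityLogger and refs are hypothetical names for this example.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityLogger(HTMLParser):
    """Hypothetical helper: records each reference with its code point."""

    def __init__(self):
        # With convert_charrefs=False the parser reports references through
        # handle_entityref / handle_charref instead of folding them into text.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_entityref(self, name):
        # name is the part between '&' and ';', for example 'nbsp'.
        # Unknown names map to None here.
        self.refs.append(('&%s;' % name, name2codepoint.get(name)))

    def handle_charref(self, name):
        # name is the part after '&#', decimal ('1234') or hex ('x4D2').
        if name[0] in 'xX':
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.refs.append(('&#%s;' % name, codepoint))


logger = EntityLogger()
logger.feed('<p>A&nbsp;B &#1234; C</p>')
logger.close()
print(logger.refs)
```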
Summary
As an exercise, find a web page and try parsing its HTML with HTMLParser.
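One possible starting point, once a page's source is in hand, is to read tag attributes in handle_starttag. The sketch below assumes Python 3 and collects the href of every link. LinkCollector and the sample page string are made up for illustration and do not come from any real site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Made-up page source standing in for HTML copied from a browser.
page = ('<ul>'
        '<li><a href="/news/1">News</a></li>'
        '<li><a href="/video/2">Video</a></li>'
        '</ul>')

link_collector = LinkCollector()
link_collector.feed(page)
link_collector.close()
print(link_collector.links)
```

The same pattern can be adapted to other tags, for example counting img or video tags to get a rough idea of what kind of content a page holds.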