To write a search engine, the first step is to use a crawler to fetch the pages of the target site, and the second step is to parse each HTML page to see whether its content is news, pictures, or video.
Assuming the first step is complete, how do you parse the HTML in the second step?
HTML looks a lot like XML, but its syntax is not as strict as XML's, so you can't parse HTML with the standard XML DOM or SAX parsers.
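To see the problem concretely, the sketch below hands a fragment of ordinary HTML to a strict XML parser. The two fragments and the helper name is_well_formed_xml are made up for illustration; xml.etree.ElementTree is the standard-library XML parser.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup (illustrative helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# <br> has no closing tag: acceptable in HTML, an error in XML.
html_fragment = '<p>Some text<br>END</p>'
# The same content written as well-formed XML.
xml_fragment = '<p>Some text<br/>END</p>'

print(is_well_formed_xml(html_fragment))
print(is_well_formed_xml(xml_fragment))
```

The first fragment is rejected because the XML parser expects every opened element to be closed, which everyday HTML often does not do.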
Fortunately, Python provides HTMLParser (in the html.parser module in Python 3), which makes parsing HTML easy and takes only a few lines of code:
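from html.parser import HTMLParser
# Maps entity names to code points, e.g. name2codepoint['nbsp'] (not used below).
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)

# convert_charrefs=False makes Python 3 report character references through
# handle_entityref/handle_charref instead of decoding them into the text.
parser = MyHTMLParser(convert_charrefs=False)
# Sample markup chosen to exercise the handlers above.
parser.feed('''<html>
<body>
<!-- test html parser -->
<p>Some <a href="#">html</a> HTML&nbsp;tutorial...<br>END</p>
</body></html>''')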
The feed() method can be called multiple times, so the HTML string does not have to be passed in all at once; it can be fed to the parser piece by piece.
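As a rough illustration of feeding the parser in pieces, the sketch below splits a small document in the middle of a tag and feeds the two halves separately. The class name TagCollector and the sample markup are assumptions made for this example, not part of the standard library.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Illustrative helper: records start tags and text as they are parsed."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

parser = TagCollector()
# The split falls in the middle of the <a ...> tag on purpose.
for chunk in ['<p>Some <a hr', 'ef="#">html</a> tutorial</p>']:
    parser.feed(chunk)
parser.close()  # process anything still buffered

print(parser.tags)
print(''.join(parser.text))
```

The parser keeps an incomplete tag in its buffer until a later feed() call supplies the rest, which is what makes it convenient for data that arrives in chunks, for example from a network connection.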
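There are two kinds of special characters: named references such as &nbsp;, and numeric references such as &#1234;. The parser can handle both. Note that in Python 3.5 and later, HTMLParser decodes them into ordinary characters by default and passes them to handle_data(); handle_entityref() and handle_charref() are only called if the parser is created with convert_charrefs=False.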
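The following sketch shows both ways of handling the two kinds of references. The names RefLogger and parse_refs and the sample string are hypothetical, chosen only for this illustration; convert_charrefs is a real keyword argument of HTMLParser.

```python
from html.parser import HTMLParser

class RefLogger(HTMLParser):
    """Illustrative helper: logs text and character-reference events."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_charref(self, name):
        self.events.append(('charref', name))

def parse_refs(markup, convert_charrefs):
    """Parse markup and return the list of logged events."""
    parser = RefLogger(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events

sample = '<p>A&amp;B &#1234;</p>'
# References reported separately, by name or by number.
print(parse_refs(sample, convert_charrefs=False))
# References decoded and merged into the surrounding text.
print(parse_refs(sample, convert_charrefs=True))
```

Leaving the default in place is usually simpler when only the text is needed; turning conversion off is useful when the original references have to be preserved.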
Summary
Find a web page, such as https://www.python.org/events/python-events/, view its source in a browser, and copy it. Then try to parse the HTML and print the time, name, and location of each conference listed on the Python website.