Tutorial: Parsing HTML in Python with HTMLParser

Source: Internet
Author: User
If we want to write a search engine, the first step is to use a crawler to fetch the pages of the target site, and the second step is to parse each HTML page to see whether its content is news, pictures, or video.

Assuming the first step is complete, how do you parse the HTML in the second step?

HTML looks a lot like XML, but its syntax is not as strict as XML's, so you can't reliably parse HTML with a standard XML DOM or SAX parser.
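To make this concrete, here is a minimal sketch that hands a small, ordinary HTML fragment to a strict XML parser from the standard library. The fragments and the helper name parses_as_xml are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


# <br> is never closed: acceptable in HTML, but not well-formed XML.
html_fragment = '<p>Some text<br>END</p>'
# The same content with the tag explicitly closed, as XML requires.
xml_fragment = '<p>Some text<br/>END</p>'

print(parses_as_xml(html_fragment))
print(parses_as_xml(xml_fragment))
```

The unclosed <br> is enough to make the XML parser reject the first fragment, even though any browser would display it without complaint.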

Fortunately, Python's standard library provides the HTMLParser class (in the html.parser module in Python 3), which makes parsing HTML easy with just a few lines of code:

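    # Python 3 module names (in Python 2 these were HTMLParser and htmlentitydefs)
    from html.parser import HTMLParser
    from html.entities import name2codepoint  # entity name -> code point table; not used below


    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Opening tag such as <p>
            print('<%s>' % tag)

        def handle_endtag(self, tag):
            # Closing tag such as </p>
            print('</%s>' % tag)

        def handle_startendtag(self, tag, attrs):
            # Self-closing tag such as <br/>
            print('<%s/>' % tag)

        def handle_data(self, data):
            # Text between tags
            print(data)

        def handle_comment(self, data):
            print('<!--', data, '-->')

        def handle_entityref(self, name):
            # Named entity such as &nbsp;
            print('&%s;' % name)

        def handle_charref(self, name):
            # Numeric character reference such as &#1234;
            print('&#%s;' % name)


    # convert_charrefs=False (Python 3.4+) is needed for handle_entityref and
    # handle_charref to be called; otherwise references are decoded into the data.
    parser = MyHTMLParser(convert_charrefs=False)
    parser.feed('<html><head></head><body>'
                '<!-- test html parser -->'
                '<p>Some <a href="#">html</a> HTML&nbsp;tutorial...<br>END</p>'
                '</body></html>')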

The feed() method can be called multiple times, so the entire HTML string does not have to be passed in at once; it can be fed to the parser piece by piece.
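As a small illustration of feeding in pieces, the sketch below parses the same document once in a single call and once in three chunks, with chunk boundaries that fall inside a tag and inside the text. The TagCollector class and the sample markup are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text as they arrive."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # A single run of text may be delivered in more than one call.
        self.text.append(data)


# Feed the whole document at once.
whole = TagCollector()
whole.feed('<html><body><p>Some text</p></body></html>')
whole.close()

# Feed the same document in pieces.
chunked = TagCollector()
for chunk in ['<html><bo', 'dy><p>Some te', 'xt</p></body></html>']:
    chunked.feed(chunk)
chunked.close()  # signal end of input so any buffered data is processed

print(whole.tags, ''.join(whole.text))
print(chunked.tags, ''.join(chunked.text))
```

Both parsers should report the same tags and the same joined text, which is what makes feed() convenient when a page arrives over the network in blocks.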

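There are two kinds of special characters: named entities such as &nbsp;, and numeric character references such as &#1234;. The parser can handle both. In Python 3.5 and later they are decoded into the text passed to handle_data() by default; pass convert_charrefs=False to receive them through handle_entityref() and handle_charref() instead.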
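One way to turn both kinds of reference back into real characters by hand is sketched below. It assumes Python 3 and uses name2codepoint from html.entities; the EntityDecoder class and the sample markup are hypothetical.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityDecoder(HTMLParser):
    """Hypothetical helper: rebuilds the text, decoding references by hand."""

    def __init__(self):
        # With convert_charrefs=False the parser reports references through
        # handle_entityref / handle_charref instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named entity: &nbsp; arrives here as 'nbsp'.
        if name in name2codepoint:
            self.parts.append(chr(name2codepoint[name]))
        else:
            self.parts.append('&%s;' % name)  # unknown name: keep as written

    def handle_charref(self, name):
        # Numeric reference: &#1234; arrives as '1234', &#x4D2; as 'x4D2'.
        if name[0] in ('x', 'X'):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))


decoder = EntityDecoder()
decoder.feed('<p>HTML&nbsp;tutorial &#1234; &#x4D2;</p>')
decoder.close()
text = ''.join(decoder.parts)
print(repr(text))
```

In practice the default convert_charrefs=True does this decoding for you; the manual version is only useful when you need to see the references themselves.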
Summary

As an exercise, find a web page such as https://www.python.org/events/python-events/, view its source in a browser and copy it, then parse the HTML and print the time, name, and location of each conference listed on the Python website.
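A possible starting point for this exercise is sketched below. It does not download anything: it runs against a made-up sample, and the markup it expects (an h3 with class "event-title", a time element, and a span with class "event-location") is an assumption about how such a page might be structured. Check the real page source and adjust the tag and class names accordingly. EventParser and parse_events are hypothetical names.

```python
from html.parser import HTMLParser

# Made-up sample; the tags and class names are assumptions, not the real page.
SAMPLE_HTML = '''
<ul class="list-recent-events">
  <li>
    <h3 class="event-title"><a href="/events/example-a/">Example Conf A</a></h3>
    <p>
      <time datetime="2030-01-01T00:00:00+00:00">01 Jan. 2030</time>
      <span class="event-location">Example City, Exampleland</span>
    </p>
  </li>
  <li>
    <h3 class="event-title"><a href="/events/example-b/">Example Conf B</a></h3>
    <p>
      <time datetime="2030-02-02T00:00:00+00:00">02 Feb. 2030</time>
      <span class="event-location">Online</span>
    </p>
  </li>
</ul>
'''


class EventParser(HTMLParser):
    """Collect name, time, and location for each event in the assumed markup."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._field = None  # which field the current text belongs to, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'h3' and 'event-title' in classes:
            # A title starts a new event record.
            self.events.append({'name': '', 'time': '', 'location': ''})
            self._field = 'name'
        elif self.events and tag == 'time':
            self._field = 'time'
        elif self.events and tag == 'span' and 'event-location' in classes:
            self._field = 'location'

    def handle_endtag(self, tag):
        if tag in ('h3', 'time', 'span'):
            self._field = None

    def handle_data(self, data):
        if self._field:
            self.events[-1][self._field] += data


def parse_events(html):
    parser = EventParser()
    parser.feed(html)
    parser.close()
    # Trim surrounding whitespace left over from the page layout.
    return [{key: value.strip() for key, value in event.items()}
            for event in parser.events]


for event in parse_events(SAMPLE_HTML):
    print(event['time'], '|', event['name'], '|', event['location'])
```

The parser keeps only a small piece of state, the field currently being filled, which is usually enough for flat listings; markup with nested elements inside the time or location tags would need a more careful end-tag check.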
