Use Python to extract specific data from an HTML file

Source: Internet
Author: User
Python provides the SGMLParser class (in the sgmllib module of Python 2) for parsing HTML files. You simply derive a subclass from SGMLParser and implement the processing you need for the HTML file in that subclass.

For example, consider an HTML file with the following structure:

    <div class='entry-content'>
      <p>interesting content 1</p>
      <p>interesting content 2</p>
      ...
      <p>interesting content n</p>
    </div>
    <div class='content'>
      <p>content 1</p>
      <p>content 2</p>
      ...
      <p>content n</p>
    </div>

We want to extract the 'interesting content'.

The extracted text is saved in idlist.
The question is how to tell that the text we encounter is the content of interest, that is, the text inside

    <div class='entry-content'>
      <p>here is content</p>
      <p>and here</p>
      ...
      <p>and here is content too</p>
    </div>

The idea is as follows:

On encountering <div class='entry-content'>, set flag = True.
On encountering </div>, set flag = False.
On encountering <p> while flag is True, set getdata = True.
On encountering </p> while getdata is True, set getdata = False.
While getdata is True, append the text to idlist.
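The steps above can be sketched as a small state machine. Since sgmllib is not available in Python 3, this sketch swaps in html.parser.HTMLParser from the Python 3 standard library, whose handle_starttag, handle_endtag, and handle_data methods play a similar role. The class name EntryContentParser and the sample string are illustrative assumptions; idlist, flag, and getdata mirror the names in the text.

```python
from html.parser import HTMLParser


class EntryContentParser(HTMLParser):
    """Collect the text of <p> elements inside <div class='entry-content'>."""

    def __init__(self):
        super().__init__()
        self.idlist = []      # extracted text fragments
        self.flag = False     # True while inside the target <div>
        self.getdata = False  # True while inside a <p> of the target <div>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "div":
            cls = dict(attrs).get("class") or ""
            if cls.strip() == "entry-content":
                self.flag = True
        elif tag == "p" and self.flag:
            self.getdata = True

    def handle_endtag(self, tag):
        if tag == "div":
            # Simplification: nested <div> elements are not tracked.
            self.flag = False
        elif tag == "p" and self.getdata:
            self.getdata = False

    def handle_data(self, data):
        if self.getdata:
            self.idlist.append(data.strip())


# Hypothetical sample input modeled on the structure shown earlier.
sample = """
<div class='entry-content'>
  <p>interesting content 1</p>
  <p>interesting content 2</p>
</div>
<div class='content'>
  <p>content 1</p>
  <p>content 2</p>
</div>
"""

parser = EntryContentParser()
parser.feed(sample)
parser.close()
print(parser.idlist)
```

Any closing </div> clears flag, so this only works when the target block contains no nested <div> elements; a depth counter would be needed otherwise.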

Python provides us with the SGMLParser class. SGMLParser breaks HTML into 8 kinds of data [1] and calls a separate method for each kind, so you only need to subclass SGMLParser and write the functions that process the page information.

The following handlers are available:

Start tag: An HTML tag that starts a block. When it finds a start tag, SGMLParser looks for a method named start_tagname or do_tagname. For example, when it finds a <pre> tag, it looks for a start_pre or do_pre method. If found, SGMLParser calls the method with the tag's list of attributes; otherwise it calls unknown_starttag with the tag name and the list of attributes.

End tag: An HTML tag that ends a block. When it finds an end tag, SGMLParser looks for a method named end_tagname. If found, SGMLParser calls this method; otherwise it calls unknown_endtag with the tag name.

Character reference: An escaped character referenced by its decimal or hexadecimal equivalent, like &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal equivalent.

Entity reference: An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.

Comment: An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.

Processing instruction: An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.

Declaration: An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.

Text data: A block of text, that is, anything that does not fit into the other 7 categories. When found, SGMLParser calls handle_data with the text.
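To see these categories in action, here is a minimal sketch that records which handler fires for each piece of input. It is written against html.parser.HTMLParser in Python 3 rather than SGMLParser, so it is an approximation: HTMLParser uses generic handle_starttag and handle_endtag methods instead of start_tagname and end_tagname, while the remaining handle_* names match those listed above. The class name EventLogger and the input string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which handler fires for each piece of the input."""

    def __init__(self):
        # convert_charrefs=False keeps character and entity references
        # as separate events instead of folding them into text data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
# Hypothetical input containing one item of each category.
logger.feed("<!DOCTYPE html><!-- note --><?target data?>"
            "<pre class='c'>a&#160;b&copy;</pre>")
logger.close()
for event in logger.events:
    print(event)
```

Printing the events in order makes it easy to check which category a given fragment of markup falls into before writing the real handlers.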


The complete code is as follows:

    from sgmllib import SGMLParser
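Only the import line of the listing survives here. As a hedged stand-in, and not the author's original code, the sketch below shows what a parser written in the start_tagname style might look like. Because sgmllib was removed in Python 3, it assumes a small hypothetical shim, TagDispatchParser, that reproduces the start_tagname / end_tagname lookup on top of html.parser.HTMLParser. The class name GetIdList and the sample input are also assumptions.

```python
from html.parser import HTMLParser


class TagDispatchParser(HTMLParser):
    """Hypothetical shim: route tags to start_<tag> / end_<tag> methods,
    mimicking the SGMLParser naming convention on top of HTMLParser."""

    def handle_starttag(self, tag, attrs):
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def unknown_starttag(self, tag, attrs):
        pass  # tags without a start_<tag> method are ignored

    def unknown_endtag(self, tag):
        pass  # tags without an end_<tag> method are ignored


class GetIdList(TagDispatchParser):
    """Collect <p> text inside <div class='entry-content'> (assumed name)."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), so state is initialised here.
        super().reset()
        self.idlist = []
        self.flag = False
        self.getdata = False

    def start_div(self, attrs):
        for name, value in attrs:
            if name == "class" and value and value.strip() == "entry-content":
                self.flag = True

    def end_div(self):
        self.flag = False

    def start_p(self, attrs):
        if self.flag:
            self.getdata = True

    def end_p(self):
        self.getdata = False

    def handle_data(self, data):
        if self.getdata:
            self.idlist.append(data)


parser = GetIdList()
parser.feed("<div class='entry-content'><p>one</p><p>two</p></div>"
            "<div class='content'><p>three</p></div>")
parser.close()
print(parser.idlist)
```

The per-tag methods keep each rule short, at the cost of the extra dispatch layer that SGMLParser provided natively in Python 2.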
