Python provides the SGMLParser class (in the sgmllib module) for parsing HTML files. You simply derive a subclass from SGMLParser and implement the processing specific to your HTML file in that subclass.
For example, consider an HTML file with the following structure:
<div class='entry-content'>
    <p>interesting content 1</p>
    <p>interesting content 2</p>
    ...
    <p>interesting content n</p>
</div>
<div class='content'>
    <p>content 1</p>
    <p>content 2</p>
    ...
    <p>content n</p>
</div>
We want to extract the 'interesting content'.
The text content we extract is saved to idlist.
But how do we mark that the text we encounter is the content of interest, that is, the text inside the following block?
<div class='entry-content'>
    <p>content here</p>
    <p>and here</p>
    ...
    <p>and content here</p>
</div>
The idea is as follows:
On encountering <div class='entry-content'>, set the marker flag = True.
On encountering </div>, set the marker flag = False.
While flag is True, on encountering <p>, set the marker getdata = True.
On encountering </p> while getdata is True, set getdata = False.
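The two-marker idea above can be sketched as follows. Note the assumptions: sgmllib was removed in Python 3, so this sketch substitutes the standard library's html.parser.HTMLParser, whose handle_starttag, handle_endtag and handle_data callbacks play the same role. The class name EntryContentParser and the SAMPLE string are hypothetical, chosen only for illustration. The sketch also assumes the divs are not nested, since any </div> resets flag.

```python
from html.parser import HTMLParser  # Python 3 stand-in for sgmllib.SGMLParser


class EntryContentParser(HTMLParser):
    """Collect the text of <p> tags inside <div class='entry-content'>."""

    def __init__(self):
        super().__init__()
        self.flag = False     # True while inside <div class='entry-content'>
        self.getdata = False  # True while inside a <p> within that div
        self.idlist = []      # collected text of interest

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed
        if tag == "div" and ("class", "entry-content") in attrs:
            self.flag = True
        elif tag == "p" and self.flag:
            self.getdata = True

    def handle_endtag(self, tag):
        if tag == "div":
            # Assumption: divs are not nested, so any </div> ends the block
            self.flag = False
        elif tag == "p" and self.getdata:
            self.getdata = False

    def handle_data(self, data):
        # Text is kept only while both markers say we are in a wanted <p>
        if self.getdata:
            self.idlist.append(data.strip())


# Hypothetical sample page, modelled on the structure shown earlier
SAMPLE = """
<div class='entry-content'>
<p>interesting content 1</p>
<p>interesting content 2</p>
</div>
<div class='content'>
<p>content 1</p>
</div>
"""

parser = EntryContentParser()
parser.feed(SAMPLE)
parser.close()
print(parser.idlist)
```

Only the paragraphs inside the entry-content div end up in idlist. The paragraph in the second div is skipped because flag has already been reset by the first </div>.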
Python provides us with the SGMLParser class. SGMLParser breaks HTML down into 8 kinds of data [1] and invokes a separate method for each kind, so all we have to do is subclass SGMLParser and write the functions that process the page information.
The following processing functions are available:
Start tag: an HTML tag that starts a block. For a start tag named tagname, SGMLParser looks for a method called start_tagname or do_tagname. For example, when it finds a <pre> tag, it looks for a start_pre or do_pre method. If found, SGMLParser calls that method with the tag's list of attributes; otherwise it calls unknown_starttag with the tag's name and list of attributes.
End tag: an HTML tag that ends a block. SGMLParser looks for a method called end_tagname. If found, SGMLParser calls that method; otherwise it calls unknown_endtag with the tag's name.
Character reference: an escaped character referenced by its decimal or hexadecimal equivalent, such as &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal equivalent.
Entity reference: an HTML entity, such as &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
Comment: an HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction: an HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.
Declaration: an HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.
Text data: a block of text, that is, anything that does not fit into the other 7 categories. When found, SGMLParser calls handle_data with the text.
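To see which handler fires for which kind of data, here is a small sketch that logs every callback. As an assumption, it again uses Python 3's html.parser.HTMLParser in place of sgmllib, which does not exist in Python 3. HTMLParser does not do the per-tag lookup itself, so the sketch emulates the start_tagname / do_tagname / end_tagname lookup described above with getattr. The names EventLogger, events and SAMPLE are hypothetical.

```python
from html.parser import HTMLParser  # Python 3 stand-in for sgmllib.SGMLParser


class EventLogger(HTMLParser):
    """Record which handler is invoked for each kind of data."""

    def __init__(self):
        # convert_charrefs=False so references get their own callbacks
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Emulate the SGMLParser lookup: start_<tag>, then do_<tag>,
        # falling back to unknown_starttag
        method = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Emulate the SGMLParser lookup: end_<tag>, else unknown_endtag
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def start_pre(self, attrs):
        self.events.append(("start_pre", attrs))

    def end_pre(self):
        self.events.append(("end_pre", None))

    def unknown_starttag(self, tag, attrs):
        self.events.append(("unknown_starttag", tag))

    def unknown_endtag(self, tag):
        self.events.append(("unknown_endtag", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.events.append(("data", data))


# Hypothetical input touching each of the 8 kinds of data
SAMPLE = "<!DOCTYPE html><?proc x><!-- note --><pre>a&#160;b&copy;</pre><b>c</b>"

logger = EventLogger()
logger.feed(SAMPLE)
logger.close()
for kind, value in logger.events:
    print(kind, repr(value))
```

The <pre> tag is routed to start_pre and end_pre because those methods exist, while <b> has no dedicated methods and falls through to unknown_starttag and unknown_endtag.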
The complete code is as follows:
from sgmllib import SGMLParser