Introduction to the Python module HTMLParser: a simple HTML and XHTML parser

Source: Internet
Author: User
Tags: processing instruction

2013-09-11, magnet
# Software automation implementation and training. gtalk: ouyangchongwu#gmail.com, QQ: 37391319
# Blog: http://blog.csdn.net/oychw
# All rights reserved; please contact the author by letter before reprinting or publishing.
# Reference: the Python manual

Note: the HTMLParser module has been renamed to html.parser in Python 3. The 2to3 tool automatically adapts the import statements when converting sources to Python 3.

This module was added in Python 2.2. Source code: Lib/HTMLParser.py

Introduction

This module defines a class, HTMLParser, which serves as the basis for parsing text files formatted in HTML (HyperText Markup Language) and XHTML. Unlike the parser in htmllib, this parser is not based on the SGML parser in sgmllib.

class HTMLParser.HTMLParser

An HTMLParser instance is fed HTML data and calls the corresponding handler method when it encounters a start tag, an end tag, text, a comment, or another markup element. To get the behaviour you want, subclass HTMLParser and override the relevant handler methods. The class is instantiated without arguments.

Unlike the parser in htmllib, this parser does not check that end tags match start tags, and it does not call the end-tag handler for elements that are closed implicitly by closing an outer element. In addition, htmllib and sgmllib were removed in Python 3 and are not recommended.

Exception:

exception HTMLParser.HTMLParseError

HTMLParser can handle broken markup, but in some cases it may raise this exception when it hits an error while parsing. The exception provides three attributes: msg, a brief message describing the error; lineno, the line number; and offset, the column offset within that line.
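The remark above about end tags not being matched against start tags can be seen in a short sketch. It assumes Python 3, where (as noted above) the module is imported as html.parser; the EventRecorder class and the record_events helper are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records each handler call in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# <b> is opened but never closed. The parser reports only what is
# literally in the input and does not invent an end tag for <b>.
events = record_events("<p><b>bold</p>")
for event in events:
    print(event)
```

The recorded events contain an end tag for p only, so any nesting or balance check has to be written in the subclass itself.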
Simple Example:

    from HTMLParser import HTMLParser
    from htmlentitydefs import name2codepoint

    class MyHTMLParser(HTMLParser):
        # Each handler prints what the parser passed to it.
        def handle_starttag(self, tag, attrs):
            print "Start tag:", tag
            for attr in attrs:
                print "attr:", attr
        def handle_endtag(self, tag):
            print "End tag:", tag
        def handle_data(self, data):
            print "Data :", data
        def handle_comment(self, data):
            print "Comment:", data
        def handle_entityref(self, name):
            # Named reference such as &gt; arrives as 'gt'.
            c = unichr(name2codepoint[name])
            print "Named ent:", c
        def handle_charref(self, name):
            # Numeric reference: 'x3E' (hexadecimal) or '62' (decimal).
            if name.startswith('x'):
                c = unichr(int(name[1:], 16))
            else:
                c = unichr(int(name))
            print "Num ent :", c
        def handle_decl(self, data):
            print "Decl:", data

    parser = MyHTMLParser()
    # feed() returns None, so each print also outputs "None".
    print parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
                      '"http://www.w3.org/TR/html4/strict.dtd">')
    print parser.feed('<img src="python-logo.png" alt="The Python logo">')

Execution result:

    Decl: DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
    None
    Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
    None

HTMLParser methods:

HTMLParser.feed(data)
Feeds text to the parser. Incomplete data is buffered until more data is fed or close() is called. The data can be unicode (recommended) or str.

HTMLParser.close()
Forces processing of all buffered data, as if it were followed by an end-of-file mark. A derived class may redefine this method to do extra processing at the end of the input, but the redefined version should always call the close() method of the HTMLParser base class.

HTMLParser.reset()
Resets the instance. All unprocessed data is lost.

HTMLParser.getpos()
Returns the current line number and offset.

HTMLParser.get_starttag_text()
Returns the text of the most recently opened start tag. This is not commonly needed; it is occasionally useful for comparing against the original text.

The following methods are called when data or markup elements are encountered. Except for handle_startendtag(), the base class implementations do nothing, and they are meant to be overridden in a subclass.

HTMLParser.handle_starttag(tag, attrs)
Handles the start of a tag (for example, <div id="main">). The tag name is converted to lowercase. The attrs argument is a list of the (name, value) pairs found in the tag. Each name is converted to lowercase. For example, for the tag <A HREF="http://www.cwi.nl/"> this method is called as handle_starttag('a', [('href', 'http://www.cwi.nl/')]).
Changed in version 2.6: all entity references from htmlentitydefs are now replaced in the attribute values.

HTMLParser.handle_endtag(tag)
Handles the end tag of an element. The tag name is also converted to lowercase.

HTMLParser.handle_startendtag(tag, attrs)
Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (for example, <img ... />). The default implementation simply calls handle_starttag() and handle_endtag().

HTMLParser.handle_data(data)
Handles arbitrary data (for example, text nodes and the content of <script>...</script> and <style>...</style>).

HTMLParser.handle_entityref(name)
Handles a named character reference of the form &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').

HTMLParser.handle_charref(name)
Handles decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent of &gt; is &#62; and the hexadecimal equivalent is &#x3E;. In these cases the method receives '62' or 'x3E'.

HTMLParser.handle_comment(data)
Handles a comment (e.g. <!--comment-->). For example, the comment <!-- comment --> causes this method to be called with the argument ' comment '. The content of Internet Explorer conditional comments (condcoms) is also sent to this method, so for <!--[if IE 9]>IE9-specific content<![endif]--> the method receives '[if IE 9]>IE9-specific content<![endif]'.

HTMLParser.handle_decl(decl)
Handles an HTML doctype declaration (e.g. <!DOCTYPE html>).

HTMLParser.handle_pi(data)
Called when a processing instruction is encountered. The data argument contains the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method is called as handle_pi("proc color='red'").
Note: the HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction that ends with a trailing '?' will have that '?' included in data.

HTMLParser.unknown_decl(data)
Called when the parser reads an unrecognized declaration.

More examples:

These use the same MyHTMLParser class and parser instance as in the simple example above.

    print "DOCTYPE:"
    parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
                '"http://www.w3.org/TR/html4/strict.dtd">')
    print "img:"
    parser.feed('<img src="python-logo.png" alt="The Python logo">')
    print "h1:"

The source text breaks off at the third call, parser.feed(', which follows print "h1:".
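The buffering done by feed() and the arguments passed to the handlers described above can be checked with a small sketch. It assumes Python 3 (html.parser) rather than the Python 2 module this article covers, and it passes convert_charrefs=False, a Python 3 constructor option, so that handle_entityref() and handle_charref() are still called as in Python 2. The CallLogger class is a hypothetical name used only for illustration.

```python
from html.parser import HTMLParser


class CallLogger(HTMLParser):
    """Hypothetical helper: logs handler names with their arguments."""

    def __init__(self):
        # Assumption: Python 3. With convert_charrefs=False the parser
        # still reports character references through the two handlers.
        super().__init__(convert_charrefs=False)
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append(("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.calls.append(("endtag", tag))

    def handle_entityref(self, name):
        self.calls.append(("entityref", name))

    def handle_charref(self, name):
        self.calls.append(("charref", name))

    def handle_comment(self, data):
        self.calls.append(("comment", data))


logger = CallLogger()

# The first chunk stops in the middle of a tag, so it is only buffered.
logger.feed('<A HREF="http://www')
calls_after_first_chunk = list(logger.calls)

# The rest of the tag arrives; tag and attribute names are lowercased.
logger.feed('.cwi.nl/">')

# Named and numeric references, a comment, and an XHTML-style empty tag.
logger.feed('&gt;&#62;&#x3E;<!-- comment --><br/>')
logger.close()

for call in logger.calls:
    print(call)
```

Nothing is logged after the first chunk. The empty tag <br/> is logged as a start tag followed by an end tag, because handle_startendtag() is not overridden here and its default implementation calls both handlers.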
