Parsing and processing of XML and HTML

Source: Internet
Author: User
Tags processing text

First, Python processes xml

XML refers to Extensible Markup Language (eXtensible Markup Language). XML is designed to transmit and store data. XML is a set of rules that define semantic markup that divides a document into parts and identifies the parts. It is also a meta-markup language that defines syntactic languages used to define other, semantic, structured markup languages that are relevant to a particular domain.

Python the XML the parsing : Common XML programming interfaces have DOM and SAX, and the two interfaces handle XML files in different ways, of course, using different scenarios.

1.SAX (Simple API for XML)

The Python standard library contains sax parsers, andSax uses event-driven models to process XML by triggering events in the parsing of XML and invoking user-defined callback functions. file.

2.DOM (Document Object Model)

Parses XML data into a tree in memory, manipulating the XMLby manipulating the tree.

Note: because the DOM needs to map XML data to the tree in memory, one is slower, the other is more memory consumption, and SAX streams the XML file, faster, Consumes less memory, but requires the user to implement a callback function (handler).

Cases:

Cat Book.xml

<?xml version= "1.0" encoding= "iso-8859-1"? ><bookstore><book><title lang= "Eng" >Harry Potter </title><price>29.99</price></book><book><title lang= "Eng" >Learning XML</ Title><price>39.95</price></book></bookstore>

The relevant code that is handled with Python is as follows:

Import stringfrom xml.parsers.expat import parsercreateclass defaultsaxhandler (Object ):     def start_element (self,name,attrs):         self.name=name         #print (' element:%s, attrs:%s '  %   (NAME,STR (attrs)))         print ("<" +name+ ">")      def end_element (self,name):         #print (' end  element:%s '  % name)         print ("</" +name+ ">")      def char_data (Self,text):        if  Text.strip ():             print ("%s ' S text  is %s " %  (Self.name,text)) Handler = defaultsaxhandler () parser =  Parsercreate () parser. StartelemeNthandler=handler.start_elementparser. Endelementhandler=handler.end_elementparser. Characterdatahandler=handler.char_datawith open (' Book.xml ')  as f:    parser . Parse (F.read ())

Capture the country's various provinces zip code example:

650) this.width=650; "src=" https://s4.51cto.com/oss/201711/15/ebbe7c5dabe92e4b624f60f5e9d8e107.jpg "title=" 11.jpg "alt=" Ebbe7c5dabe92e4b624f60f5e9d8e107.jpg "/>

Import requestsfrom xml.parsers.expat import parsercreateclass defaultsaxhandler ( Object):     def __init__ (self,provinces):         self.provinces=provinces    def start_element (self,name,attrs):         if name !=  ' map ':             name = attrs[' title ']             number = attrs[' href ']             self.provinces.append ((Name,number))     def end_element (self,name):         pass    def char_data (Self,text):         passdef get_province_entry (URL):     Content=requests.get (URL).Content.decode (' gb2312 ')     start=content.find (' <map name= "map_86"  id= "map_86 ">")     end=content.find (' </map> ')     content=content[start: End+len (' </map> ')].strip ()      #print (content)     provinces  = []    handler = defaultsaxhandler (provinces)      Parser = parsercreate ()     parser. Startelementhandler = handler.start_element    parser. Endelementhandler = handler.end_element    parser. Characterdatahandler = handler.char_data    parser. Parse (content)     return provincesprovinces=get_province_entry (' http://www.ip138.com/ Post ') print (provinces)

A small example of DOM:

From xml.dom Import Minidomdoc = Minidom.parse (' book.xml ') root = Doc.documentelementprint (root.nodename) Books = Root.getelementsbytagname (' book ') for book in books:titles = Book.getelementsbytagname (' title ') Prices = Book.getele Mentsbytagname (' Price ') print (Titles[0].childnodes[0].nodevalue + ":" + prices[0].childnodes[0].nodevalue)


Second, Htmlparser

The core of Html.parser is the Htmlparser class. The process of working is: when the feed gives it an HTML-like string, it calls the Goahead method to iterate over each label and calls the corresponding Parse_xxxx method to extract Start_tag, tag, data, comment, and End_tag and data, and then call the corresponding method to process the extracted content

Handle_startendtag #处理开始标签和结束标签

Handle_starttag #处理开始标签, such as <xx>

Handle_endtag #处理结束标签, such as </xx> or <....../>

Handle_charref #处理特殊字符串, that is, & #开头的, is usually the inner code representation of the character

Handle_entityref #处理一些特殊字符, beginning with &, for example &nbsp;

Handle_data #处理 <xx>data</xx> The data in the middle

Handle_comment #处理注释

Handle_decl #处理 <!, such as <!. DOCTYPE HTML PUBLIC "-//w3c//dtd HTML 4.01 transitional//en"

Handle_pi #处理形如 <?instruction>

Markupbase Installation Method : Direct ' pip install ' cannot be installed successfully, try Command ' pip search Markupbase ' to get package name ' Micropython-markupbase ', then download this package directly on the webpage, after downloading there is a ' _ markupbase.py ' file, remove the filename prefix and copy the file to the Python installation directory ' \lib\site-packages '. Example: CP markupbase.py/usr/local/lib/python3.6/site-packages/

The following example: Process the specified HTML file

#coding =utf-8from htmlparser import htmlparserclass myparser (Htmlparser):      "" "A simple example of Htmlparser" "    def handle_decl (SELF,&NBSP;DECL):          "" Processing header Document "" "         Htmlparser.handle_decl (SELF,&NBSP;DECL)         print (decl)      def handle_starttag (self, tag, attrs):          "" Processing start Tag "" "        htmlparser.handle_starttag (Self, tag,  attrs)         if not htmlparser.get_starttag_text (self ). EndsWith ("/>"):             print ("<" +tag+ " > ")     def handle_data (Self, data):          "" Processing text Element "" "         htmlparser.handle_data (Self, data)          print (data)     def handle_endtag (Self, tag):          "" Processing end Tag "" "        htmlparser.handle_endtag ( Self, tag)         if not htmlparser.get_starttag_text ( Self). EndsWith ("/>"):             print ("</" + tag+ ">")     def handle_startendtag (self, tag, attrs):          "" "Handle self-closing label" "        htmlparser.handle_ Startendtag (self, tag, attrs)         print (HTMLParser.get_ Starttag_text (self))     def handle_comment (Self, data):          "" " Process comment "" "        htmlparser.handle_comment (self, data)          print (data)     def close (self):         htmlparser.close (self)         print (" Parser over ") Demo=myparser () demo.feed (open (" test.html "). Read ()) Demo.close ()


This article is from the "Worknote" blog, make sure to keep this source http://caiyuanji.blog.51cto.com/11462293/1981977

Parsing and processing of XML and HTML

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.