First, processing XML with Python
XML stands for Extensible Markup Language (eXtensible Markup Language). XML is designed to transmit and store data. It is a set of rules for defining semantic markup that divides a document into parts and identifies those parts. It is also a meta-markup language: it defines the syntax used to create other domain-specific, semantic, structured markup languages.
Parsing XML in Python: the two common XML programming interfaces are DOM and SAX. They handle XML files in different ways and therefore suit different scenarios.
1. SAX (Simple API for XML)
The Python standard library includes a SAX parser. SAX uses an event-driven model: while parsing the XML it fires events and invokes user-defined callback functions to process the file.
2. DOM (Document Object Model)
DOM parses the XML data into a tree in memory, and the XML is manipulated by manipulating that tree.
Note: because DOM must map the XML data to a tree in memory, it is slower and consumes more memory. SAX streams the XML file, so it is faster and consumes less memory, but it requires the user to implement callback functions (handlers).
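The contrast between the two interfaces can be sketched on one small input. This is a minimal illustration, not the article's own code: it uses the standard library's xml.sax and xml.dom.minidom modules, and the inline sample data and the names TitleHandler, titles_with_sax and titles_with_dom are assumptions made up for the example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample data, inlined so the sketch needs no file on disk.
XML = b"<bookstore><book><title>Learning XML</title><price>39.95</price></book></bookstore>"


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element through SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False


def titles_with_sax(data):
    """Streaming approach: nothing is kept except what the handler stores."""
    handler = TitleHandler()
    xml.sax.parseString(data, handler)
    return handler.titles


def titles_with_dom(data):
    """Tree approach: the whole document is built in memory, then queried."""
    doc = minidom.parseString(data)
    return [node.firstChild.nodeValue for node in doc.getElementsByTagName("title")]


if __name__ == "__main__":
    print(titles_with_sax(XML))
    print(titles_with_dom(XML))
```

Both functions return the same titles. The SAX version needs its own state (the in_title flag) because the parser keeps no tree, which is the price of its lower memory use.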
Examples:
cat book.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
The Python code that processes this file is as follows:
from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        # Remember the current element so char_data can report it
        self.name = name
        # print('element: %s, attrs: %s' % (name, str(attrs)))
        print("<" + name + ">")

    def end_element(self, name):
        # print('end element: %s' % name)
        print("</" + name + ">")

    def char_data(self, text):
        # Skip whitespace-only text between elements
        if text.strip():
            print("%s's text is %s" % (self.name, text))

handler = DefaultSaxHandler()
parser = ParserCreate()
# Register the callbacks that expat invokes while parsing
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
with open('book.xml') as f:
    parser.Parse(f.read())
Example: scraping the postal code entry for each province in the country:
import requests
from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def __init__(self, provinces):
        self.provinces = provinces

    def start_element(self, name, attrs):
        # Every child of <map> carries a province name and its link
        if name != 'map':
            name = attrs['title']
            number = attrs['href']
            self.provinces.append((name, number))

    def end_element(self, name):
        pass

    def char_data(self, text):
        pass

def get_province_entry(url):
    content = requests.get(url).content.decode('gb2312')
    # Cut out only the <map>...</map> fragment so expat sees well-formed XML
    start = content.find('<map name="map_86" id="map_86">')
    end = content.find('</map>')
    content = content[start:end + len('</map>')].strip()
    # print(content)
    provinces = []
    handler = DefaultSaxHandler(provinces)
    parser = ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.char_data
    parser.Parse(content)
    return provinces

provinces = get_province_entry('http://www.ip138.com/post')
print(provinces)
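The same slice-then-parse idea can be tried without network access. The sketch below is an assumption-laden illustration: the PAGE fragment, its area elements and their title/href values are invented for the example and may not match the real page's markup, and extract_entries is a hypothetical helper name.

```python
from xml.parsers.expat import ParserCreate

# Hypothetical page fragment; the real page's markup may differ.
PAGE = '''<html><body>
<map name="map_86" id="map_86">
<area title="Beijing" href="/10/" />
<area title="Shanghai" href="/20/" />
</map>
</body></html>'''


def extract_entries(page):
    """Slice out the <map> fragment and collect (title, href) pairs."""
    start_marker = '<map name="map_86" id="map_86">'
    end_marker = '</map>'
    start = page.find(start_marker)
    if start == -1:
        return []
    end = page.find(end_marker, start)
    if end == -1:
        return []
    fragment = page[start:end + len(end_marker)]

    entries = []

    def start_element(name, attrs):
        # attrs is a dict of attribute names to values
        if name != 'map':
            entries.append((attrs.get('title'), attrs.get('href')))

    parser = ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(fragment, True)  # True marks this as the final chunk
    return entries


if __name__ == "__main__":
    print(extract_entries(PAGE))
```

Checking the result of find() before slicing avoids silently parsing the wrong region when the markers are missing. expat only accepts well-formed XML, so a fragment with unclosed tags would raise xml.parsers.expat.ExpatError.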
A small DOM example:
from xml.dom import minidom

doc = minidom.parse('book.xml')
root = doc.documentElement
print(root.nodeName)
books = root.getElementsByTagName('book')
for book in books:
    # Each <book> has one <title> and one <price>; read their text nodes
    titles = book.getElementsByTagName('title')
    prices = book.getElementsByTagName('price')
    print(titles[0].childNodes[0].nodeValue + ":" + prices[0].childNodes[0].nodeValue)
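Because DOM keeps the whole tree in memory, it can also change the document, not just read it. The following is a minimal sketch under assumed inputs: the inline XML string and the helper name set_price are hypothetical and are not part of the original example.

```python
from xml.dom import minidom

# Hypothetical sample data, inlined so the sketch needs no file on disk.
XML = "<bookstore><book><title>Learning XML</title><price>39.95</price></book></bookstore>"


def set_price(xml_text, title, new_price):
    """Return the serialized document with the matching book's price replaced."""
    doc = minidom.parseString(xml_text)
    for book in doc.getElementsByTagName("book"):
        title_node = book.getElementsByTagName("title")[0]
        if title_node.firstChild.nodeValue == title:
            price_node = book.getElementsByTagName("price")[0]
            # The text lives in a child Text node; update its data in place.
            price_node.firstChild.data = new_price
    # Serialize the root element back to a string.
    return doc.documentElement.toxml()


if __name__ == "__main__":
    print(set_price(XML, "Learning XML", "42.00"))
```

A book whose title does not match is left untouched, so calling set_price with an unknown title returns the document unchanged.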
Second, HTMLParser
The core of html.parser is the HTMLParser class. It works as follows: when feed() hands it an HTML-like string, it calls the goahead method to iterate over each tag, calls the corresponding parse_xxxx method to extract the start tag, end tag, data, comment and so on, and then calls the corresponding handle_xxxx method to process the extracted content:
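handle_startendtag #handles a tag that is both a start tag and an end tag (self-closing), such as <xx/>
handle_starttag #handles a start tag, such as <xx>
handle_endtag #handles an end tag, such as </xx> or <....../>
handle_charref #handles character references that start with &#, which usually give the character's numeric code
handle_entityref #handles named entity references that start with &, for example &nbsp;
handle_data #handles data, that is, the text between <xx> and </xx> in <xx>data</xx>
handle_comment #handles comments
handle_decl #handles declarations that start with <!, such as <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
handle_pi #handles processing instructions of the form <?instruction>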
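To see which handler fires for which piece of markup, the calls can be recorded in a list. This is a small sketch rather than the article's code: EventLogger and log_events are hypothetical names, and convert_charrefs=False is passed on the assumption of Python 3.5 or later, where the default of True converts character references into ordinary data, so handle_charref and handle_entityref would otherwise not be called.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records every handler call as an (event, value) tuple."""

    def __init__(self):
        # Keep character references as separate events instead of
        # having them converted into ordinary data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))


def log_events(html):
    """Feed the string to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    sample = '<!DOCTYPE html><p>a&amp;b&#62;<br/><!--note--></p>'
    for event in log_events(sample):
        print(event)
```

Overriding handle_startendtag here means a self-closing tag is reported once. Without the override, the base class reports it as a start tag followed by an end tag.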
Installing markupbase: a direct 'pip install' does not succeed. Try the command 'pip search markupbase' to find the package name 'micropython-markupbase', then download that package from its web page. The download contains a '_markupbase.py' file; remove the underscore prefix from the filename and copy the file to '\lib\site-packages' in the Python installation directory. Example: cp markupbase.py /usr/local/lib/python3.6/site-packages/ Note that on Python 3 the standard library's html.parser can be imported without any extra installation, so this workaround only matters for code that imports markupbase under its old name.
The following example processes a specified HTML file:
# coding=utf-8
from html.parser import HTMLParser

class MyParser(HTMLParser):
    """A simple example of HTMLParser"""

    def handle_decl(self, decl):
        """Handle the document type declaration"""
        HTMLParser.handle_decl(self, decl)
        print(decl)

    def handle_starttag(self, tag, attrs):
        """Handle a start tag"""
        HTMLParser.handle_starttag(self, tag, attrs)
        # Self-closing tags are printed by handle_startendtag instead
        if not HTMLParser.get_starttag_text(self).endswith("/>"):
            print("<" + tag + ">")

    def handle_data(self, data):
        """Handle a text element"""
        HTMLParser.handle_data(self, data)
        print(data)

    def handle_endtag(self, tag):
        """Handle an end tag"""
        HTMLParser.handle_endtag(self, tag)
        if not HTMLParser.get_starttag_text(self).endswith("/>"):
            print("</" + tag + ">")

    def handle_startendtag(self, tag, attrs):
        """Handle a self-closing tag"""
        HTMLParser.handle_startendtag(self, tag, attrs)
        print(HTMLParser.get_starttag_text(self))

    def handle_comment(self, data):
        """Handle a comment"""
        HTMLParser.handle_comment(self, data)
        print(data)

    def close(self):
        HTMLParser.close(self)
        print("Parser over")

demo = MyParser()
demo.feed(open("test.html").read())
demo.close()
This article is from the "Worknote" blog; please keep this source attribution: http://caiyuanji.blog.51cto.com/11462293/1981977
Parsing and processing of XML and HTML