Custom parsing of simple XML files in Python
This example shows how to write a custom parser in Python for simple XML files. It is shared here for reference. The details are as follows:
The company's internal interface can return its results in two forms: a PHP array or an XML string. Python cannot use the PHP array directly, and the XML string is not well-formed XML (some node names start with a digit), so the standard parsing modules cannot handle it. I therefore wrote a small parser of my own for use in interface testing.
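To make the problem concrete, here is a minimal sketch in Python 3 of how a standard parser reacts to a node name that starts with a digit. The sample string and the helper name `try_standard_parse` are made up for illustration; they are not taken from the real interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical response fragment: the node name "1st_item" starts with a
# digit, which is not allowed in well-formed XML.
BAD_SAMPLE = '<resultObject><1st_item>aaaaa</1st_item></resultObject>'
GOOD_SAMPLE = '<resultObject><item>aaaaa</item></resultObject>'


def try_standard_parse(xml_text):
    """Return 'parsed' or a short message describing the parse failure."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return 'rejected: %s' % exc
    return 'parsed'


print(try_standard_parse(BAD_SAMPLE))
print(try_standard_parse(GOOD_SAMPLE))
```

The first call reports a parse error and the second succeeds, which is why a hand-written parser is used for this interface.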
#!/usr/bin/env python
# encoding: utf-8
# Note: this script targets Python 2 (print statements, str.decode).
import re


class xmlparse:

    def __init__(self, xmlstr):
        self.xmlstr = xmlstr
        self.xmldom = self.__convet2utf8()
        self.xmlnodelist = []
        self.xpath = ''

    def __convet2utf8(self):
        # Strip the <?xml ... ?> declaration and convert the body to UTF-8.
        headstr = self.__get_head()
        xmldomstr = self.xmlstr.replace(headstr, '')
        if 'gbk' in headstr:
            xmldomstr = xmldomstr.decode('gbk').encode('utf-8')
        elif 'gb2312' in headstr:
            # Decode the header-stripped body, as the gbk branch does.
            xmldomstr = xmldomstr.decode('gb2312').encode('utf-8')
        return xmldomstr

    def __get_head(self):
        # Return the XML declaration, or '' if the string has none.
        headpat = r'<\?xml.*\?>'
        headpatobj = re.compile(headpat)
        headregobj = headpatobj.match(self.xmlstr)
        if headregobj:
            headstr = headregobj.group()
            return headstr
        else:
            return ''

    def parse(self, xpath):
        self.xpath = xpath
        xpatlist = []
        xpatharr = self.xpath.split('/')
        # Turn each path step into a (regex, index) pair; "name[n]" picks
        # the n-th match, a bare "name" picks the first.
        for xnode in xpatharr:
            if xnode:
                spcindex = xnode.find('[')
                if spcindex > -1:
                    index = int(xnode[spcindex+1:-1])
                    xnode = xnode[:spcindex]
                else:
                    index = 0
                temppat = ('<%s>(.*?)</%s>' % (xnode, xnode), index)
                xpatlist.append(temppat)
        # Apply the patterns in order, narrowing the text at each step.
        xmlnodestr = self.xmldom
        for xpat, index in xpatlist:
            xmlnodelist = re.findall(xpat, xmlnodestr)
            xmlnodestr = xmlnodelist[index]
        # Unwrap a CDATA section: drop the prefix and the trailing "]]>".
        if xmlnodestr.startswith(r'<![CDATA['):
            xmlnodestr = xmlnodestr.replace(r'<![CDATA[', '')[:-3]
        self.xmlnodelist = xmlnodelist
        return xmlnodestr


if '__main__' == __name__:
    xmlstr = '<?xml version="1.0" encoding="utf-8" standalone="yes" ?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>'
    xpath1 = '/product_id'
    xpath2 = '/product_id[1]'
    xpath3 = '/a/product_id'
    xp = xmlparse(xmlstr)
    print 'xmlstr:', xp.xmlstr
    print 'xmldom:', xp.xmldom
    print '------------------------------'
    getstr = xp.parse(xpath1)
    print 'xpath:', xp.xpath
    print 'get list:', xp.xmlnodelist
    print 'get string:', getstr
    print '------------------------------'
    getstr = xp.parse(xpath2)
    print 'xpath:', xp.xpath
    print 'get list:', xp.xmlnodelist
    print 'get string:', getstr
    print '------------------------------'
    getstr = xp.parse(xpath3)
    print 'xpath:', xp.xpath
    print 'get list:', xp.xmlnodelist
    print 'get string:', getstr
Running result:
xmlstr: <?xml version="1.0" encoding="utf-8" standalone="yes" ?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>
xmldom: <resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>
------------------------------
xpath: /product_id
get list: ['aaaaa', 'bbbbb']
get string: aaaaa
------------------------------
xpath: /product_id[1]
get list: ['aaaaa', 'bbbbb']
get string: bbbbb
------------------------------
xpath: /a/product_id
get list: ['aaaaa']
get string: aaaaa
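For readers on Python 3, the same path-to-regex idea can be sketched as a pair of small functions. This is only an illustration under a few assumptions, not a port of the class above: the names `strip_declaration`, `find_by_path` and `SAMPLE` are hypothetical, the GBK/GB2312 handling is left out because Python 3 strings are already text, `re.DOTALL` is added so that node content may span lines, and the sample is a tidied, made-up variant of the test string with one digit-leading node name.

```python
import re

CDATA_OPEN = '<![CDATA['
CDATA_CLOSE = ']]>'


def strip_declaration(xml_text):
    """Drop a leading <?xml ... ?> declaration, if there is one."""
    # Non-greedy, so only the declaration itself is removed.
    return re.sub(r'^<\?xml.*?\?>', '', xml_text, count=1)


def find_by_path(xml_text, path):
    """Follow a '/'-separated path such as '/a/product_id[1]'.

    Returns (matches found at the last step, selected text).
    Raises IndexError if a step has no match at the requested index.
    """
    text = strip_declaration(xml_text)
    matches = []
    for step in path.split('/'):
        if not step:
            continue
        parts = re.fullmatch(r'([^\[\]]+)(?:\[(\d+)\])?', step)
        if parts is None:
            raise ValueError('unsupported path step: %r' % step)
        name = re.escape(parts.group(1))
        index = int(parts.group(2) or 0)
        # Same pattern shape as the class above: non-greedy content
        # between an opening and a closing tag without attributes.
        matches = re.findall('<%s>(.*?)</%s>' % (name, name), text,
                             flags=re.DOTALL)
        text = matches[index]
    # Unwrap a CDATA section around the final value.
    if text.startswith(CDATA_OPEN) and text.endswith(CDATA_CLOSE):
        text = text[len(CDATA_OPEN):-len(CDATA_CLOSE)]
    return matches, text


# Hypothetical sample for illustration only.
SAMPLE = ('<?xml version="1.0" encoding="utf-8"?><resultObject>'
          '<a><product_id>aaaaa</product_id>'
          '<product_name><![CDATA[bbbbb]]></product_name></a>'
          '<b><product_id>bbbbb</product_id></b>'
          '<1st>x</1st></resultObject>')

print(find_by_path(SAMPLE, '/product_id[1]'))
print(find_by_path(SAMPLE, '/a/product_name'))
print(find_by_path(SAMPLE, '/1st'))
```

Because the lookup is purely textual, the digit-leading `1st` node is handled like any other name. The trade-off is that attributes, self-closing tags and same-name nesting are not supported.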
Because the returned XML is relatively simple and none of its nodes carry attributes, it is fairly easy to process. Testing still turned up a bug, though: when a node is nested inside another node of the same name, the non-greedy regular expression stops at the first closing tag and returns a truncated match. This can be worked around by not using the nested node name in the xpath; a proper fix would require a more complex rewrite of the matching mechanism.
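The nesting problem can be reproduced in a few lines. The sketch below is in Python 3 and uses a made-up string and a hypothetical helper, `find_nodes`, which builds the same kind of pattern as the parser.

```python
import re

# Hypothetical input: an <a> node nested inside another <a> node.
NESTED = '<a><a>inner</a><b>tail</b></a>'


def find_nodes(name, text):
    """Return every non-greedy <name>...</name> match, like the parser does."""
    return re.findall('<%s>(.*?)</%s>' % (name, name), text)


# The outer <a> is closed by the first </a> it meets, so the match is cut
# short and still contains the inner opening tag.
print(find_nodes('a', NESTED))   # ['<a>inner']

# A node name that is not nested inside itself matches as expected.
print(find_nodes('b', NESTED))   # ['tail']
```

This is why choosing path steps that avoid the repeated name sidesteps the bug, while supporting true nesting would need tag-depth tracking rather than a single regular expression.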
I hope this article will help you with Python programming.