This article presents a custom Python method for parsing simple XML files. It covers the techniques needed when the XML cannot be handled by the standard parsers, and the example can serve as a reference for parsing simple XML files of your own. The details are as follows:
Our company's internal interface returns strings in one of two forms: a PHP array or XML. Python cannot use the PHP array form directly, and the XML form is not standard (the names of some nodes start with a digit), so the standard modules cannot parse it. I therefore wrote a simple parser of my own for interface testing.
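To illustrate why a standard module rejects such input, here is a minimal sketch, assuming Python 3 and the standard library's `xml.etree.ElementTree`. The helper `parses_with_stdlib` and the node names `item1` and `1item` are made up for illustration and do not come from the company interface.

```python
import xml.etree.ElementTree as ET


def parses_with_stdlib(xml_text):
    """Return True if ElementTree accepts xml_text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A node name that starts with a letter is accepted.
print(parses_with_stdlib('<result><item1>ok</item1></result>'))
# A node name that starts with a digit is not a legal XML name and is rejected.
print(parses_with_stdlib('<result><1item>bad</1item></result>'))
```

XML names may not begin with a digit, which is why a conforming parser stops at the first such tag instead of skipping it.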
#!/usr/bin/env python
# encoding: utf-8
# Python 2 code: it relies on print statements and on byte-string decode()/encode().
import re


class xmlparse:

    def __init__(self, xmlstr):
        self.xmlstr = xmlstr
        self.xmldom = self.__convet2utf8()
        self.xmlnodelist = []
        self.xpath = ''

    def __convet2utf8(self):
        # Strip the <?xml ... ?> declaration and convert the body to UTF-8.
        headstr = self.__get_head()
        xmldomstr = self.xmlstr.replace(headstr, '')
        if 'gbk' in headstr:
            xmldomstr = xmldomstr.decode('gbk').encode('utf-8')
        elif 'gb2312' in headstr:
            # Convert the header-less body, as the gbk branch does.
            xmldomstr = xmldomstr.decode('gb2312').encode('utf-8')
        return xmldomstr

    def __get_head(self):
        # Return the XML declaration if the string starts with one, else ''.
        headpat = r'<\?xml.*\?>'
        headpatobj = re.compile(headpat)
        headregobj = headpatobj.match(self.xmlstr)
        if headregobj:
            headstr = headregobj.group()
            return headstr
        else:
            return ''

    def parse(self, xpath):
        self.xpath = xpath
        xpatlist = []
        xpatharr = self.xpath.split('/')
        # Turn each step such as 'name' or 'name[2]' into a (regex, index) pair.
        for xnode in xpatharr:
            if xnode:
                spcindex = xnode.find('[')
                if spcindex > -1:
                    index = int(xnode[spcindex+1:-1])
                    xnode = xnode[:spcindex]
                else:
                    index = 0
                temppat = ('<%s>(.*?)</%s>' % (xnode, xnode), index)
                xpatlist.append(temppat)
        # Apply the patterns one after another, narrowing the text each time.
        xmlnodestr = self.xmldom
        for xpat, index in xpatlist:
            xmlnodelist = re.findall(xpat, xmlnodestr)
            xmlnodestr = xmlnodelist[index]
            if xmlnodestr.startswith(r'<![CDATA['):
                # Remove the '<![CDATA[' prefix and the trailing ']]>'.
                xmlnodestr = xmlnodestr.replace(r'<![CDATA[', '')[:-3]
        self.xmlnodelist = xmlnodelist
        return xmlnodestr


if '__main__' == __name__:
    # The markup of the original sample string was lost; this reconstruction is
    # an assumption chosen to agree with the three xpath queries and their results.
    xmlstr = '<?xml version="1.0" encoding="utf-8" standalone="yes" ?><resultObject><a><product_id>aaaaa</product_id></a><b><product_id><![CDATA[bbbbb]]></product_id></b></resultObject>'
    xpath1 = '/product_id'
    xpath2 = '/product_id[1]'
    xpath3 = '/a/product_id'
    xp = xmlparse(xmlstr)
    print 'xmlstr:', xp.xmlstr
    print 'xmldom:', xp.xmldom
    print '------------------------------'
    getstr = xp.parse(xpath1)
    print 'xpath:', xp.xpath
    print 'get list:', xp.xmlnodelist
    print 'get string:', getstr
    print '------------------------------'
    getstr = xp.parse(xpath2)
    print 'xpath:', xp.xpath
    print 'get list:', xp.xmlnodelist
    print 'get string:', getstr
    print '------------------------------'
    getstr = xp.parse(xpath3)
    print 'xpath:', xp.xpath
    print 'get list:', xp.xmlnodelist
    print 'get string:', getstr
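Running result (the stripped markup is restored on the assumption that the sample document is the one printed on the xmlstr line):
xmlstr: <?xml version="1.0" encoding="utf-8" standalone="yes" ?><resultObject><a><product_id>aaaaa</product_id></a><b><product_id><![CDATA[bbbbb]]></product_id></b></resultObject>
xmldom: <resultObject><a><product_id>aaaaa</product_id></a><b><product_id><![CDATA[bbbbb]]></product_id></b></resultObject>
------------------------------
xpath: /product_id
get list: ['aaaaa', '<![CDATA[bbbbb]]>']
get string: aaaaa
------------------------------
xpath: /product_id[1]
get list: ['aaaaa', '<![CDATA[bbbbb]]>']
get string: bbbbb
------------------------------
xpath: /a/product_id
get list: ['aaaaa']
get string: aaaaa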
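The `decode('gbk').encode('utf-8')` step in the listing works on Python 2 byte strings; on Python 3 a `str` has no `decode` method. A minimal sketch of the same header handling for Python 3 follows, assuming the response arrives as `bytes`. The function `decode_response` and the two compiled patterns are hypothetical names, not part of the original class, and unlike the listing the sketch reads the codec name from the declaration instead of checking only for gbk and gb2312.

```python
import re

# Hypothetical helpers for illustration; not part of the original xmlparse class.
HEAD_RE = re.compile(rb'<\?xml.*?\?>')
ENC_RE = re.compile(rb'encoding\s*=\s*["\']([\w.-]+)["\']')


def decode_response(raw):
    """Strip the XML declaration from raw bytes and return the body as str.

    The codec is taken from the declaration's encoding attribute and
    defaults to UTF-8 when there is no declaration or no attribute.
    """
    head_match = HEAD_RE.match(raw)
    head = head_match.group() if head_match else b''
    enc_match = ENC_RE.search(head)
    codec = enc_match.group(1).decode('ascii') if enc_match else 'utf-8'
    return raw[len(head):].decode(codec)


raw = '<?xml version="1.0" encoding="gbk"?><name>中文</name>'.encode('gbk')
print(decode_response(raw))
```

The returned `str` can then be searched with the same regular expressions that `parse` builds.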
Because the returned XML is relatively simple and contains no nodes with attributes, it is easy to process. Testing still turned up a bug, however: when a node is nested inside another node of the same name, the regular expression matches the wrong span. The problem can be worked around by not using the nested node name in the xpath; a real fix would require a more complicated rewrite.
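The nested-node problem can be reproduced in a few lines. The sketch below, written for Python 3, uses a non-greedy `<%s>(.*?)</%s>` pattern of the kind the parser builds, and contrasts it with one possible depth-counting alternative. The functions `first_match_naive` and `first_match_balanced` and the `item` sample are assumptions made for illustration, and the balanced version is only one way such a rewrite might look.

```python
import re


def first_match_naive(tag, text):
    """Non-greedy match: stops at the first closing tag it meets."""
    found = re.findall('<%s>(.*?)</%s>' % (tag, tag), text)
    return found[0] if found else None


def first_match_balanced(tag, text):
    """Return the content of the first <tag>, counting nested same-name tags."""
    open_tag = '<%s>' % tag
    start = text.find(open_tag)
    if start == -1:
        return None
    depth = 0
    # Walk over every opening and closing occurrence of this tag in order.
    for m in re.finditer('</?%s>' % re.escape(tag), text[start:]):
        depth += -1 if m.group().startswith('</') else 1
        if depth == 0:
            # m.start() is relative to text[start:], so shift it back.
            return text[start + len(open_tag):start + m.start()]
    return None  # unbalanced input


sample = '<item><item>inner</item><x>1</x></item>'
print(first_match_naive('item', sample))     # '<item>inner'
print(first_match_balanced('item', sample))  # '<item>inner</item><x>1</x>'
```

The naive pattern pairs the outer opening tag with the inner closing tag, which is the mismatch described above; counting depth pairs each opening tag with its own closing tag at the cost of extra code.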
I hope this article will help you with Python programming.