因為公司內部的介面返回的字串支援2種形式:php數組,xml;結果php數組python不能直接用,而xml字串的格式不是標準的,所以也不能用標準模組解析。【不標準的地方是某些節點會的名稱是以數字開頭的】,所以寫個簡單的腳步來解析一下檔案,用來做介面測試。
#!/usr/bin/env python #encoding: utf-8import reclass xmlparse: def __init__(self, xmlstr): self.xmlstr = xmlstr self.xmldom = self.__convet2utf8() self.xmlnodelist = [] self.xpath = '' def __convet2utf8(self): headstr = self.__get_head() xmldomstr = self.xmlstr.replace(headstr, '') if 'gbk' in headstr: xmldomstr = xmldomstr.decode('gbk').encode('utf-8') elif 'gb2312' in headstr: xmldomstr = self.xmlstr.decode('gb2312').encode('utf-8') return xmldomstr def __get_head(self): headpat = r'<\?xml.*\?>' headpatobj = re.compile(headpat) headregobj = headpatobj.match(self.xmlstr) if headregobj: headstr = headregobj.group() return headstr else: return '' def parse(self, xpath): self.xpath = xpath xpatlist = [] xpatharr = self.xpath.split('/') for xnode in xpatharr: if xnode: spcindex = xnode.find('[') if spcindex > -1: index = int(xnode[spcindex+1:-1]) xnode = xnode[:spcindex] else: index = 0; temppat = ('<%s>(.*?)</%s>' % (xnode, xnode),index) xpatlist.append(temppat) xmlnodestr = self.xmldom for xpat,index in xpatlist: xmlnodelist = re.findall(xpat,xmlnodestr) xmlnodestr = xmlnodelist[index] if xmlnodestr.startswith(r'<![CDATA['): xmlnodestr = xmlnodestr.replace(r'<![CDATA[','')[:-3] self.xmlnodelist = xmlnodelist return xmlnodestr if '__main__' == __name__: xmlstr = '<?xml version="1.0" encoding="utf-8" standalone="yes" ?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>' xpath1 = '/product_id' xpath2 = '/product_id[1]' xpath3 = '/a/product_id' xp = xmlparse(xmlstr) print 'xmlstr:',xp.xmlstr print 'xmldom:',xp.xmldom print '------------------------------' getstr = xp.parse(xpath1) print 'xpath:',xp.xpath print 'get list:',xp.xmlnodelist print 'get string:', getstr print '------------------------------' getstr = xp.parse(xpath2) print 'xpath:',xp.xpath print 'get list:',xp.xmlnodelist print 'get string:', getstr print '------------------------------' getstr = xp.parse(xpath3) print 'xpath:',xp.xpath print 'get list:',xp.xmlnodelist print 'get string:', getstr
運行結果:
xmlstr: <?xml version="1.0" encoding="utf-8" standalone="yes" ?><resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>xmldom: <resultObject><a><product_id>aaaaa</product_id><product_name><![CDATA[bbbbb]]></a><b><product_id>bbbbb</product_id><product_name><![CDATA[bbbbb]]></b></product_name></resultObject>------------------------------xpath: /product_idget list: ['aaaaa', 'bbbbb']get string: aaaaa------------------------------xpath: /product_id[1]get list: ['aaaaa', 'bbbbb']get string: bbbbb------------------------------xpath: /a/product_idget list: ['aaaaa']get string: aaaaa
因為返回的xml格式比較簡單,沒有帶屬性的節點,所以處理起來就比較簡單了。但測試還是發現有一個bug。即當相同節點嵌套時會出現正則匹配出問題,該問題的可以通過避免在xpath中出現有嵌套節點的名稱來解決,否則只有重寫複雜的機制了。