When I first learned Python, the only XML parsing methods I knew were DOM and SAX, but neither was efficient enough: with a very large number of files to process, both took unacceptably long.
A web search showed that ElementTree is widely used, comparatively efficient, and frequently recommended, so I added it to the comparison. ElementTree itself offers two approaches: the ordinary ElementTree (ET) and ElementTree.iterparse (ET_iter).
This article compares the four methods, DOM, SAX, ET, and ET_iter, and evaluates the efficiency of each by having it process the same files.
In the program, each parsing method is written as a function, and the main program calls them one at a time to measure their parsing efficiency.
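The measurement itself can be sketched as follows. This is a minimal Python 3 sketch, not the article's main program: `benchmark` and `fake_parser` are hypothetical names, and the only assumption taken from the article is that each parser function returns a `(csv_text, row_count)` pair.

```python
import time

def benchmark(parser, paths):
    """Run parser over every path; return (rows, seconds, rows_per_second)."""
    start = time.perf_counter()
    rows = 0
    for path in paths:
        _csv_text, cnt = parser(path)   # one call, both return values
        rows += cnt
    elapsed = time.perf_counter() - start
    rate = int(rows / elapsed) if elapsed > 0 else 0
    return rows, elapsed, rate

def fake_parser(path):
    """Hypothetical stand-in that pretends every file yields 3 rows."""
    return "a,b\r\n" * 3, 3

rows, seconds, rate = benchmark(fake_parser, ["f1.xml.gz", "f2.xml.gz"])
print("vs row count: %d, run time: %f, rows per second: %d" % (rows, seconds, rate))
```

Swapping `fake_parser` for any one of the four parser functions keeps the timing code identical across methods, so only the parsing step differs between runs.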
An excerpt of the XML file content being extracted:
The function-call part of the main program is:
print("File count: %d/%d." % (gz_cnt, paser_num))
str_s, cnt = dom_parser(gz)
#str_s, cnt = sax_parser(gz)
#str_s, cnt = et_parser(gz)
#str_s, cnt = et_parser_iter(gz)
output.write(str_s)
vs_cnt += cnt
Each function returns two values. In the first version, each return value was fetched with its own function call, so every function ran twice. The call was then changed to receive both return values with two variables in a single call, which removes the redundant invocation.
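The difference between the two calling styles can be shown with a small Python 3 sketch. `parse_stub` is a hypothetical stand-in for one of the parser functions, and the `calls` counter exists only to make the number of invocations visible.

```python
calls = 0

def parse_stub(path):
    """Hypothetical stand-in for a parser; returns (csv_text, row_count)."""
    global calls
    calls += 1
    return "a,b\r\n", 1

# Wasteful: each return value is fetched with its own call.
text = parse_stub("x.xml.gz")[0]
count = parse_stub("x.xml.gz")[1]
calls_double = calls

# Better: one call, both values unpacked into two variables.
calls = 0
text, count = parse_stub("x.xml.gz")
calls_single = calls

print(calls_double, calls_single)
```

With a parser that takes seconds per file, the first style doubles the total run time without changing the output.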
1. DOM parsing
Function definition code:
def dom_parser(gz):
    import os, gzip, cStringIO
    import xml.dom.minidom

    vs_cnt = 0
    str_s = ''
    file_io = cStringIO.StringIO()
    xm = gzip.open(gz, 'rb')
    print("Read in: %s.\nParsing:" % (os.path.abspath(gz)))
    doc = xml.dom.minidom.parseString(xm.read())
    bulkPmMrDataFile = doc.documentElement
    # Read the child elements
    enbs = bulkPmMrDataFile.getElementsByTagName("eNB")
    measurements = enbs[0].getElementsByTagName("measurement")
    objects = measurements[0].getElementsByTagName("object")
    # Write one space-separated row per <v> element
    for obj in objects:
        vs = obj.getElementsByTagName("v")
        vs_cnt += len(vs)
        for v in vs:
            file_io.write(enbs[0].getAttribute("id") + ' ' + obj.getAttribute("id") + ' ' +
                          obj.getAttribute("MmeUeS1apId") + ' ' + obj.getAttribute("MmeGroupId") + ' ' +
                          obj.getAttribute("MmeCode") + ' ' + obj.getAttribute("TimeStamp") + ' ' +
                          v.childNodes[0].data + '\n')  # text value of <v>
    # Convert to CSV: CRLF line endings, commas as separators, empty fields for NIL
    str_s = (((file_io.getvalue().replace('\n', '\r\n')).replace(' ', ',')).replace('T', ' ')).replace('NIL', '')
    xm.close()
    file_io.close()
    return (str_s, vs_cnt)
Program run result:
**************************************************
Program processing starts.
The input directory is: /tmcdata/mro2csv/input31/.
The output directory is: /tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, of which 12 are processed.
**************************************************
File count: 1/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_123798_20160224060000.xml.gz.
Parsing:
.............................................
File count: 12/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_235598_20160224060000.xml.gz.
Parsing:
vs row count: 177849, run time: 107.077867, rows per second: 1660.
Written: /tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
End of program processing.
Because DOM parsing has to read the whole file into memory and build a tree structure, both its memory use and its run time are relatively high. Its advantages are that the logic is simple, no callback functions need to be defined, and it is easy to implement.
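The DOM approach can be tried on a tiny in-memory document. The following is a minimal Python 3 sketch, not the article's function: `SAMPLE` is a made-up miniature document that only borrows the tag and attribute names used in the function above, and `dom_rows` is a hypothetical helper.

```python
import xml.dom.minidom

# Made-up miniature document; the real files are far larger.
SAMPLE = (
    '<bulkPmMrDataFile><fileHeader/>'
    '<eNB id="1001"><measurement><smr>FIELD_A FIELD_B</smr>'
    '<object id="7" MmeUeS1apId="11" MmeGroupId="22" MmeCode="3" '
    'TimeStamp="2016-02-24T06:00:00.000">'
    '<v>40 NIL</v><v>41 20</v>'
    '</object></measurement></eNB></bulkPmMrDataFile>'
)

def dom_rows(xml_text):
    """Build the whole tree in memory, then walk it by tag name."""
    doc = xml.dom.minidom.parseString(xml_text)
    enb = doc.documentElement.getElementsByTagName("eNB")[0]
    rows = []
    for obj in enb.getElementsByTagName("object"):
        for v in obj.getElementsByTagName("v"):
            rows.append([enb.getAttribute("id"), obj.getAttribute("id")]
                        + v.childNodes[0].data.split())
    return rows

for row in dom_rows(SAMPLE):
    print(",".join(row))
```

No callbacks are involved: the code simply asks the finished tree for elements by name, which is what makes DOM easy to write and costly to run.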
2. SAX parsing
Function definition code:
def sax_parser(gz):
    import os, gzip, cStringIO
    from xml.parsers.expat import ParserCreate

    file_io = cStringIO.StringIO()

    # SAX handler class; the parsing state is kept on the instance
    class DefaultSaxHandler(object):
        def __init__(self):
            self.d_eNB = {}
            self.d_obj = {}
            self.vs_cnt = 0
            self.flag = False  # becomes True when parsing should stop

        # Handle a start tag
        def start_element(self, name, attrs):
            if name == 'eNB':
                self.d_eNB = attrs
            elif name == 'object':
                self.d_obj = attrs
            elif name == 'v':
                file_io.write(self.d_eNB['id'] + ' ' + self.d_obj['id'] + ' ' +
                              self.d_obj['MmeUeS1apId'] + ' ' + self.d_obj['MmeGroupId'] + ' ' +
                              self.d_obj['MmeCode'] + ' ' + self.d_obj['TimeStamp'] + ' ')
                self.vs_cnt += 1

        # Handle the text between tags
        def char_data(self, text):
            if text[0:1].isnumeric():
                file_io.write(text)
            elif text[0:17] == 'MR.LteScPlrULQci1':
                self.flag = True

        # Handle an end tag
        def end_element(self, name):
            if name == 'v':
                file_io.write('\n')

    # Create the parser and attach the callbacks
    handler = DefaultSaxHandler()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.char_data
    str_s = ''
    xm = gzip.open(gz, 'rb')
    print("Read in: %s.\nParsing:" % (os.path.abspath(gz)))
    for line in xm.readlines():
        parser.Parse(line)  # parse the XML content line by line
        if handler.flag:
            break
    # Convert to CSV: CRLF line endings, commas as separators, empty fields for NIL
    str_s = file_io.getvalue().replace('\n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
    xm.close()
    file_io.close()
    return (str_s, handler.vs_cnt)
Program run result:
**************************************************
Program processing starts.
The input directory is: /tmcdata/mro2csv/input31/.
The output directory is: /tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, of which 12 are processed.
**************************************************
File count: 1/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_123798_20160224060000.xml.gz.
Parsing:
.........................................
File count: 12/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_235598_20160224060000.xml.gz.
Parsing:
vs row count: 177849, run time: 14.386779, rows per second: 12361.
Written: /tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
End of program processing.
Compared with DOM parsing, SAX parsing cuts the run time sharply (14.386779 versus 107.077867 in the runs above). Because SAX processes the document as a stream, fed to the parser line by line here, instead of building a tree, it uses less memory on large files, which is why it is so widely used. Its drawback is that callback functions have to be implemented, so the logic is more complex.
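The callback style can be illustrated with a minimal Python 3 sketch built on `xml.parsers.expat`. It is not the article's function: `SAMPLE` is a made-up miniature document using the same tag names, and `RowHandler` and `sax_rows` are hypothetical names. The document is deliberately fed in two arbitrary chunks to show that the parser works on a stream.

```python
from xml.parsers.expat import ParserCreate

# Made-up miniature document; the real files are far larger.
SAMPLE = (
    '<bulkPmMrDataFile><fileHeader/>'
    '<eNB id="1001"><measurement><smr>FIELD_A FIELD_B</smr>'
    '<object id="7"><v>40 NIL</v><v>41 20</v></object>'
    '</measurement></eNB></bulkPmMrDataFile>'
)

class RowHandler:
    """Collects one row per <v> element; all state lives on the instance."""
    def __init__(self):
        self.enb, self.obj = {}, {}
        self.in_v, self.text, self.rows = False, [], []

    def start(self, name, attrs):
        if name == "eNB":
            self.enb = attrs
        elif name == "object":
            self.obj = attrs
        elif name == "v":
            self.in_v, self.text = True, []

    def chars(self, data):
        if self.in_v:                 # text may arrive in several pieces
            self.text.append(data)

    def end(self, name):
        if name == "v":
            self.rows.append([self.enb["id"], self.obj["id"]]
                             + "".join(self.text).split())
            self.in_v = False

def sax_rows(chunks):
    handler = RowHandler()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start
    parser.EndElementHandler = handler.end
    parser.CharacterDataHandler = handler.chars
    for chunk in chunks:
        parser.Parse(chunk, False)    # more data will follow
    parser.Parse("", True)            # signal the end of the document
    return handler.rows

print(sax_rows([SAMPLE[:50], SAMPLE[50:]]))
```

Only the current element's state is held at any moment, which is where the memory saving comes from; the price is the three callbacks and the bookkeeping between them.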
3. ET parsing
Function definition code:
def et_parser(gz):
    import os, gzip, cStringIO
    import xml.etree.cElementTree as ET

    vs_cnt = 0
    str_s = ''
    file_io = cStringIO.StringIO()
    xm = gzip.open(gz, 'rb')
    print("Read in: %s.\nParsing:" % (os.path.abspath(gz)))
    tree = ET.ElementTree(file=xm)
    root = tree.getroot()
    # root[1] is the eNB element and root[1][0] its first measurement
    for elem in root[1][0].findall('object'):
        for v in elem.findall('v'):
            file_io.write(root[1].attrib['id'] + ' ' + elem.attrib['TimeStamp'] + ' ' +
                          elem.attrib['MmeCode'] + ' ' + elem.attrib['id'] + ' ' +
                          elem.attrib['MmeUeS1apId'] + ' ' + elem.attrib['MmeGroupId'] + ' ' +
                          v.text + '\n')
            vs_cnt += 1
    # Convert to CSV: CRLF line endings, commas as separators, empty fields for NIL
    str_s = file_io.getvalue().replace('\n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
    xm.close()
    file_io.close()
    return (str_s, vs_cnt)
Program run result:
**************************************************
Program processing starts.
The input directory is: /tmcdata/mro2csv/input31/.
The output directory is: /tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, of which 12 are processed.
**************************************************
File count: 1/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_123798_20160224060000.xml.gz.
Parsing:
...........................................
File count: 12/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_235598_20160224060000.xml.gz.
Parsing:
vs row count: 177849, run time: 4.308103, rows per second: 41282.
Written: /tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
End of program processing.
ET parsing takes even less time than SAX parsing, and the function is relatively simple to write. ET therefore combines logic as simple as DOM's with parsing efficiency at least as good as SAX's, which makes it the first choice for XML parsing.
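A minimal Python 3 sketch of the ET approach follows. It is not the article's function: `SAMPLE` is a made-up miniature document with the same tag names, `et_rows` is a hypothetical helper, and elements are looked up by name with `find` rather than by position.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document; the real files are far larger.
SAMPLE = (
    '<bulkPmMrDataFile><fileHeader/>'
    '<eNB id="1001"><measurement><smr>FIELD_A FIELD_B</smr>'
    '<object id="7"><v>40 NIL</v><v>41 20</v></object>'
    '</measurement></eNB></bulkPmMrDataFile>'
)

def et_rows(xml_text):
    """Parse the whole document, then navigate the element tree."""
    root = ET.fromstring(xml_text)
    enb = root.find("eNB")
    rows = []
    for obj in enb.find("measurement").findall("object"):
        for v in obj.findall("v"):
            rows.append([enb.attrib["id"], obj.attrib["id"]] + v.text.split())
    return rows

print(et_rows(SAMPLE))
```

The code is about as short as the DOM version, with attributes exposed as a plain dictionary and text as a plain string.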
4. ET_iter parsing
Function definition code:
def et_parser_iter(gz):
    import os, gzip, cStringIO
    import xml.etree.cElementTree as ET

    vs_cnt = 0
    str_s = ''
    file_io = cStringIO.StringIO()
    xm = gzip.open(gz, 'rb')
    print("Read in: %s.\nParsing:" % (os.path.abspath(gz)))
    d_eNB = {}
    d_obj = {}
    i = 0
    for event, elem in ET.iterparse(xm, events=('start', 'end')):
        if i >= 2:
            break  # stop once the second smr element has ended
        elif event == 'start':
            if elem.tag == 'eNB':
                d_eNB = elem.attrib
            elif elem.tag == 'object':
                d_obj = elem.attrib
        elif event == 'end' and elem.tag == 'smr':
            i += 1
        elif event == 'end' and elem.tag == 'v':
            file_io.write(d_eNB['id'] + ' ' + d_obj['TimeStamp'] + ' ' + d_obj['MmeCode'] + ' ' +
                          d_obj['id'] + ' ' + d_obj['MmeUeS1apId'] + ' ' + d_obj['MmeGroupId'] + ' ' +
                          str(elem.text) + '\n')
            vs_cnt += 1
            elem.clear()  # release the processed element
    # Convert to CSV: CRLF line endings, commas as separators, empty fields for NIL
    str_s = file_io.getvalue().replace('\n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
    xm.close()
    file_io.close()
    return (str_s, vs_cnt)
Program run result:
**************************************************
Program processing starts.
The input directory is: /tmcdata/mro2csv/input31/.
The output directory is: /tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, of which 12 are processed.
**************************************************
File count: 1/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_123798_20160224060000.xml.gz.
Parsing:
...................................................
File count: 12/12.
Read in: /tmcdata/mro2csv/input31/td-lte_mro_nsn_omc_235598_20160224060000.xml.gz.
Parsing:
vs row count: 177849, run time: 3.043805, rows per second: 58429.
Written: /tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
End of program processing.
With ET_iter parsing, throughput is roughly 40% higher than with ET (58429 versus 41282 rows per second) and about 35 times that of DOM parsing (1660 rows per second). Besides the higher parsing efficiency, iterparse works through the document sequentially, so its memory footprint is also relatively small.
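The sequential style of iterparse can be sketched in Python 3 as a generator. This is an illustration under assumptions, not the article's function: `SAMPLE` is a made-up miniature document with the same tag names, and `iter_rows` is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature document; the real files are far larger.
SAMPLE = (
    '<bulkPmMrDataFile><fileHeader/>'
    '<eNB id="1001"><measurement><smr>FIELD_A FIELD_B</smr>'
    '<object id="7"><v>40 NIL</v><v>41 20</v></object>'
    '</measurement></eNB></bulkPmMrDataFile>'
)

def iter_rows(stream):
    """Yield one row per <v> element while the document is still being read."""
    enb_id = obj_id = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            # Attributes are already available when the start tag is seen.
            if elem.tag == "eNB":
                enb_id = elem.attrib["id"]
            elif elem.tag == "object":
                obj_id = elem.attrib["id"]
        elif elem.tag == "v":
            # An "end" event: the element's text is now complete.
            yield [enb_id, obj_id] + (elem.text or "").split()
            elem.clear()   # drop the element's content once it has been used

rows = list(iter_rows(io.BytesIO(SAMPLE.encode("utf-8"))))
print(rows)
```

Because each `v` element is cleared right after its row is produced, the content held in memory stays small even when the file is large, while the code remains close to the plain ET version.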
So, friends, make good use of these tools.
That is all for this article; I hope it helps with your studies.