A Side-by-Side Comparison of Four Ways to Parse XML in Python

Source: Internet
Author: User

When I first learned Python, the only XML parsing methods I knew were DOM and SAX, and neither was very efficient. Because I had a large number of files to process, both were unacceptably slow.

After some searching online, I found ElementTree, which is widely used, efficient, and recommended by many people, so I included it in the measurements. ElementTree can be used in two ways: ordinary ElementTree parsing (ET) and ElementTree.iterparse (ET_iter).

This article compares DOM, SAX, ET, and ET_iter side by side, measuring the efficiency of each by having it process the same files.

The program implements each of the four parsing methods as a function and calls them from the main program to measure parsing efficiency.
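A minimal sketch of such a measurement loop is shown here, written for Python 3. The names `time_parser` and `fake_parser` are hypothetical and not from the original program; the only assumption taken from the article is that each parser function returns a `(csv_text, row_count)` pair.

```python
import time

def time_parser(parser, paths):
    """Run `parser` over every path; return (rows, seconds, rows_per_second)."""
    start = time.perf_counter()
    total_rows = 0
    for path in paths:
        _text, rows = parser(path)  # each parser returns (csv_text, row_count)
        total_rows += rows
    elapsed = time.perf_counter() - start
    rate = total_rows / elapsed if elapsed > 0 else float("inf")
    return total_rows, elapsed, rate

def fake_parser(path):
    # Hypothetical stand-in for dom_parser / sax_parser / ET_parser / ET_parser_iter.
    return "a,b\r\n", 1

rows, elapsed, rate = time_parser(fake_parser, ["x.xml.gz", "y.xml.gz", "z.xml.gz"])
print("rows: %d, run time: %f, rows per second: %d" % (rows, elapsed, rate))
```

Passing the parser as an argument keeps the timing code identical for all four methods, so only the parsing itself differs between runs.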

The following is an example of the extracted XML file:
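As a rough stand-in, the sketch below builds a hypothetical sample whose element and attribute names are taken from the parser functions in this article. All values, the `fileHeader` element, and the helper `describe` are assumptions made for illustration; a real file may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag and attribute names come from the parser functions
# in this article; every value is invented for illustration.
SAMPLE_XML = """\
<bulkPmMrDataFile>
  <fileHeader/>
  <eNB id="234598">
    <measurement>
      <smr>MR.LteScRSRP MR.LteNcRSRP</smr>
      <object id="1" MmeUeS1apId="100" MmeGroupId="200" MmeCode="3" TimeStamp="2016-02-24T06:00:00.000">
        <v>30 NIL</v>
        <v>28 25</v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>
"""

def describe(xml_text):
    """Return (root tag, eNB id, number of <v> rows) as a quick sanity check."""
    root = ET.fromstring(xml_text)
    enb = root.find("eNB")
    return root.tag, enb.get("id"), len(root.findall(".//v"))

print(describe(SAMPLE_XML))
```

Each `<v>` element becomes one output row, prefixed with attributes from its enclosing `eNB` and `object` elements.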

Part of the main program function call code is:

    print("file count: %d/%d." % (gz_cnt, paser_num))
    str_s, cnt = dom_parser(gz)
    # str_s, cnt = sax_parser(gz)
    # str_s, cnt = ET_parser(gz)
    # str_s, cnt = ET_parser_iter(gz)
    output.write(str_s)
    vs_cnt += cnt

Each parser function returns two values. In the first version of the main program, each value was fetched with its own call, so every function was executed twice per file. The code was then changed to receive both return values from a single call, which removes the redundant call.
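The difference can be illustrated with a small Python 3 sketch. The function `parse` and the call counter are hypothetical, and the indexed form shown first is only one plausible way the original double call might have looked.

```python
calls = 0

def parse(path):
    """Hypothetical stand-in parser that counts how often it runs."""
    global calls
    calls += 1
    return "csv text", 42

# One call per return value: the file is parsed twice.
str_s = parse("a.xml.gz")[0]
cnt = parse("a.xml.gz")[1]
calls_double = calls

# Tuple unpacking: both values come from a single call.
calls = 0
str_s, cnt = parse("a.xml.gz")
calls_single = calls

print(calls_double, calls_single)
```

For an expensive parser, the second form halves the work without changing the results.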

1. DOM Parsing

Function Definition code:

    def dom_parser(gz):
        import os, gzip, cStringIO  # cStringIO is a Python 2 module
        import xml.dom.minidom

        vs_cnt = 0
        str_s = ''
        file_io = cStringIO.StringIO()
        xm = gzip.open(gz, 'rb')
        print("read: %s.\nparsing:" % (os.path.abspath(gz)))
        doc = xml.dom.minidom.parseString(xm.read())
        bulkPmMrDataFile = doc.documentElement
        # read the sub-elements
        enbs = bulkPmMrDataFile.getElementsByTagName("eNB")
        measurements = enbs[0].getElementsByTagName("measurement")
        objects = measurements[0].getElementsByTagName("object")
        # write the CSV rows
        for obj in objects:
            vs = obj.getElementsByTagName("v")
            vs_cnt += len(vs)
            for v in vs:
                file_io.write(enbs[0].getAttribute("id") + ' ' + obj.getAttribute("id") + ' ' +
                              obj.getAttribute("MmeUeS1apId") + ' ' + obj.getAttribute("MmeGroupId") + ' ' +
                              obj.getAttribute("MmeCode") + ' ' + obj.getAttribute("TimeStamp") + ' ' +
                              v.childNodes[0].data + '\n')  # childNodes[0].data is the text value
        # convert to CSV: CRLF line endings, commas as separators, blank out 'T' and 'NIL'
        str_s = file_io.getvalue().replace('\n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
        xm.close()
        file_io.close()
        return (str_s, vs_cnt)

Program running result:

**************************************************
Program processing starts.
The input directory is /tmcdata/mro2csv/input31/.
The output directory is /tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is 12, 12 of which are processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
.............................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, Run Time: 107.077867, number of rows processed per second: 1660.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************
The program processing is complete.

Because DOM parsing reads the entire file into memory and builds a tree structure, its memory and time costs are relatively high. Its advantage is simple logic: no callback functions are required.
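The DOM approach can be sketched in Python 3 on a small in-memory document. The sample XML is hypothetical (names follow the parser function above, values are invented), and `dom_rows` is an illustrative helper rather than the article's function.

```python
from xml.dom import minidom

# Hypothetical sample; tag and attribute names follow dom_parser, values are invented.
SAMPLE_XML = """\
<bulkPmMrDataFile>
  <fileHeader/>
  <eNB id="234598">
    <measurement>
      <object id="1" TimeStamp="2016-02-24T06:00:00.000">
        <v>30 NIL</v>
        <v>28 25</v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>
"""

def dom_rows(xml_text):
    """Build the whole tree in memory, then walk it to collect one row per <v>."""
    doc = minidom.parseString(xml_text)
    rows = []
    for enb in doc.getElementsByTagName("eNB"):
        for obj in enb.getElementsByTagName("object"):
            for v in obj.getElementsByTagName("v"):
                text = v.firstChild.data if v.firstChild else ""
                rows.append([enb.getAttribute("id"), obj.getAttribute("id"), text])
    return rows

print(dom_rows(SAMPLE_XML))
```

Nothing is produced until `parseString` has finished building the complete tree, which is where the memory and time cost comes from.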

2. SAX Parsing

Function Definition code:

    def sax_parser(gz):
        import os, gzip, cStringIO  # cStringIO is a Python 2 module
        from xml.parsers.expat import ParserCreate

        file_io = cStringIO.StringIO()

        # SAX handler; parsing state is kept on the handler instance
        class DefaultSaxHandler(object):
            def __init__(self):
                self.d_eNB = {}
                self.d_obj = {}
                self.vs_cnt = 0
                self.flag = False

            # handle a start tag
            def start_element(self, name, attrs):
                if name == 'eNB':
                    self.d_eNB = attrs
                elif name == 'object':
                    self.d_obj = attrs
                elif name == 'v':
                    file_io.write(self.d_eNB['id'] + ' ' + self.d_obj['id'] + ' ' +
                                  self.d_obj['MmeUeS1apId'] + ' ' + self.d_obj['MmeGroupId'] + ' ' +
                                  self.d_obj['MmeCode'] + ' ' + self.d_obj['TimeStamp'] + ' ')
                    self.vs_cnt += 1

            # handle character data
            def char_data(self, text):
                if text[0:1].isnumeric():
                    file_io.write(text)
                elif text[0:17] == 'MR.LteScPlrULQci1':
                    self.flag = True  # tells the read loop to stop
                    # print(text, self.flag)

            # handle an end tag
            def end_element(self, name):
                if name == 'v':
                    file_io.write('\n')

        # attach the handler callbacks to an expat parser
        handler = DefaultSaxHandler()
        parser = ParserCreate()
        parser.StartElementHandler = handler.start_element
        parser.EndElementHandler = handler.end_element
        parser.CharacterDataHandler = handler.char_data

        xm = gzip.open(gz, 'rb')
        print("read: %s.\nparsing:" % (os.path.abspath(gz)))
        for line in xm.readlines():
            parser.Parse(line)  # feed the XML content to the parser line by line
            if handler.flag:
                break
        # convert to CSV: CRLF line endings, commas as separators, blank out 'T' and 'NIL'
        str_s = file_io.getvalue().replace('\n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
        xm.close()
        file_io.close()
        return (str_s, handler.vs_cnt)

Program running result:

**************************************************
Program processing starts.
The input directory is /tmcdata/mro2csv/input31/.
The output directory is /tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is 12, 12 of which are processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
.........................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, Run Time: 14.386779, number of rows processed per second: 12361.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************
The program processing is complete.

Compared with DOM parsing, SAX parsing cuts the running time substantially, from about 107 seconds to about 14 seconds. Because the parser is event-driven and is fed the file line by line without building a tree, it also uses less memory on large files, which is why SAX is widely used. The drawback is that you have to implement the callback functions yourself, so the logic is more complicated.
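The callback style can be sketched in Python 3 with `xml.parsers.expat`, the same module the function above uses. The sample XML is hypothetical, and `RowHandler` and `sax_rows` are illustrative names, not the article's code.

```python
from xml.parsers.expat import ParserCreate

# Hypothetical sample; tag and attribute names follow sax_parser, values are invented.
SAMPLE_XML = """\
<bulkPmMrDataFile>
  <eNB id="234598">
    <measurement>
      <object id="1" TimeStamp="2016-02-24T06:00:00.000">
        <v>30 NIL</v>
        <v>28 25</v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>
"""

class RowHandler:
    """Collects one row per <v> element from expat callbacks."""

    def __init__(self):
        self.enb = {}
        self.obj = {}
        self.in_v = False
        self.buf = []
        self.rows = []

    def start(self, name, attrs):
        if name == "eNB":
            self.enb = attrs
        elif name == "object":
            self.obj = attrs
        elif name == "v":
            self.in_v = True
            self.buf = []

    def chars(self, text):
        if self.in_v:
            self.buf.append(text)  # text may arrive in several chunks

    def end(self, name):
        if name == "v":
            self.rows.append([self.enb["id"], self.obj["id"], "".join(self.buf).strip()])
            self.in_v = False

def sax_rows(xml_text):
    handler = RowHandler()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start
    parser.EndElementHandler = handler.end
    parser.CharacterDataHandler = handler.chars
    parser.Parse(xml_text, True)  # True marks the final chunk of input
    return handler.rows

print(sax_rows(SAMPLE_XML))
```

The parser never holds more than the current event, so all context (the enclosing `eNB` and `object`) has to be tracked by hand in the handler.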

3. ET Parsing

Function Definition code:

    def ET_parser(gz):
        import os, gzip, cStringIO  # cStringIO is a Python 2 module
        import xml.etree.cElementTree as ET

        vs_cnt = 0
        str_s = ''
        file_io = cStringIO.StringIO()
        xm = gzip.open(gz, 'rb')
        print("read: %s.\nparsing:" % (os.path.abspath(gz)))
        tree = ET.ElementTree(file=xm)
        root = tree.getroot()
        # root[1] is the eNB element; root[1][0] is its first child
        for elem in root[1][0].findall('object'):
            for v in elem.findall('v'):
                file_io.write(root[1].attrib['id'] + ' ' + elem.attrib['TimeStamp'] + ' ' +
                              elem.attrib['MmeCode'] + ' ' + elem.attrib['id'] + ' ' +
                              elem.attrib['MmeUeS1apId'] + ' ' + elem.attrib['MmeGroupId'] + ' ' +
                              v.text + '\n')
                vs_cnt += 1
        # convert to CSV: CRLF line endings, commas as separators, blank out 'T' and 'NIL'
        str_s = file_io.getvalue().replace('\n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
        xm.close()
        file_io.close()
        return (str_s, vs_cnt)

Program running result:

**************************************************
Program processing starts.
The input directory is /tmcdata/mro2csv/input31/.
The output directory is /tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is 12, 12 of which are processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
...........................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, Run Time: 4.308103, number of rows processed per second: 41282.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************
The program processing is complete.

ET parsing is faster still than SAX parsing (about 4.3 seconds versus 14.4 seconds), and the function is also relatively simple to write. ET combines logic as simple as DOM's with parsing efficiency at least as good as SAX's, which makes it a good first choice for XML parsing.
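The same extraction can be sketched in Python 3 with `xml.etree.ElementTree` (the separate `cElementTree` module used above is Python 2 era). The sample XML is hypothetical and `et_rows` is an illustrative helper, not the article's function.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; tag and attribute names follow ET_parser, values are invented.
SAMPLE_XML = """\
<bulkPmMrDataFile>
  <fileHeader/>
  <eNB id="234598">
    <measurement>
      <object id="1" TimeStamp="2016-02-24T06:00:00.000">
        <v>30 NIL</v>
        <v>28 25</v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>
"""

def et_rows(xml_text):
    """Parse the whole document, then collect one row per <v> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for enb in root.iter("eNB"):
        for obj in enb.iter("object"):
            for v in obj.findall("v"):
                rows.append([enb.get("id"), obj.get("id"), v.text or ""])
    return rows

print(et_rows(SAMPLE_XML))
```

Looking elements up by tag name, as done here, is a design choice that avoids depending on child positions such as `root[1][0]`.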

4. ET_iter Parsing

Function Definition code:

    def ET_parser_iter(gz):
        import os, gzip, cStringIO  # cStringIO is a Python 2 module
        import xml.etree.cElementTree as ET

        vs_cnt = 0
        str_s = ''
        file_io = cStringIO.StringIO()
        xm = gzip.open(gz, 'rb')
        print("read: %s.\nparsing:" % (os.path.abspath(gz)))
        d_eNB = {}
        d_obj = {}
        i = 0
        for event, elem in ET.iterparse(xm, events=('start', 'end')):
            if i >= 2:
                break  # stop after the second closed smr element
            elif event == 'start':
                if elem.tag == 'eNB':
                    d_eNB = elem.attrib
                elif elem.tag == 'object':
                    d_obj = elem.attrib
            elif event == 'end' and elem.tag == 'smr':
                i += 1
            elif event == 'end' and elem.tag == 'v':
                file_io.write(d_eNB['id'] + ' ' + d_obj['TimeStamp'] + ' ' + d_obj['MmeCode'] + ' ' +
                              d_obj['id'] + ' ' + d_obj['MmeUeS1apId'] + ' ' + d_obj['MmeGroupId'] + ' ' +
                              str(elem.text) + '\n')
                vs_cnt += 1
                elem.clear()  # release the processed element
        # convert to CSV: CRLF line endings, commas as separators, blank out 'T' and 'NIL'
        str_s = file_io.getvalue().replace('\n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
        xm.close()
        file_io.close()
        return (str_s, vs_cnt)

Program running result:

**************************************************
Program processing starts.
The input directory is /tmcdata/mro2csv/input31/.
The output directory is /tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is 12, 12 of which are processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
........................................ ...........
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, Run Time: 3.043805, number of rows processed per second: 58429.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************
The program processing is complete.

With ET_iter, throughput rises by about 40% over ET (58,429 versus 41,282 rows per second) and is about 35 times that of DOM (1,660 rows per second). Because iterparse parses the document sequentially and the code clears each element after it has been processed, its memory usage is also relatively small.
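The streaming pattern can be sketched in Python 3 as a generator over `ET.iterparse`. The sample XML is hypothetical and `iter_rows` is an illustrative helper, not the article's function.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; tag and attribute names follow ET_parser_iter, values are invented.
SAMPLE_XML = """\
<bulkPmMrDataFile>
  <eNB id="234598">
    <measurement>
      <object id="1" TimeStamp="2016-02-24T06:00:00.000">
        <v>30 NIL</v>
        <v>28 25</v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>
"""

def iter_rows(stream):
    """Yield one row per <v> element while the document is still being read."""
    enb_id = obj_id = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            # attributes are already available when the start tag is seen
            if elem.tag == "eNB":
                enb_id = elem.get("id")
            elif elem.tag == "object":
                obj_id = elem.get("id")
        elif elem.tag == "v":
            # on the end event, the element's text is complete
            yield [enb_id, obj_id, elem.text or ""]
            elem.clear()  # drop the element's contents to keep memory low

rows = list(iter_rows(io.BytesIO(SAMPLE_XML.encode("utf-8"))))
print(rows)
```

Copying the attribute values at the start event, rather than keeping a reference to `elem.attrib`, means a later `clear()` cannot empty them.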

So choose among these tools according to your needs.
That is all for this article; I hope it helps with your learning.
