Four XML parsing techniques in Java

Last Update:2013-10-22 Source: Internet

Author: User


In everyday work, XML is inevitably used as a data storage format. Which of the available parsing solutions suits us best? In this article I make an admittedly incomplete evaluation of four mainstream solutions. I tested only XML traversal, because traversal is the most common operation at work (at least in my experience).

Preparation

Test environment:

AMD Duron 1.4 GHz overclocked to 1.5 GHz, 256 MB DDR333, Windows 2000 Server SP4, Sun JDK 1.4.1 + Eclipse 2.1 + Resin 2.1.8, tested in Debug mode.

The XML file format is as follows:

<?xml version="1.0" encoding="GB2312"?>
<RESULT>
    <VALUE>
        <NO>A1234</NO>
        <ADDR>No. XX, section X, XX Road, XX Town, XX County, Sichuan Province</ADDR>
    </VALUE>
    <VALUE>
        <NO>B1234</NO>
        <ADDR>XX group, XX village, XX Township, XXX City, Sichuan Province</ADDR>
    </VALUE>
</RESULT>

Test method:

Call the Bean from a JSP page (for why JSP is used, see http://blog.csdn.net/rosen/archive/2004/10/15/138324.aspx), have each solution parse XML files of 10 K, 100 K, 1000 K, and 10000 K, and record the time consumed (unit: milliseconds).
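The test files themselves are not included in the article. As a rough sketch of how such a measurement could be reproduced, the class below generates documents in the RESULT/VALUE/NO/ADDR layout and times a parse. The class name XmlBenchSketch, the record contents, the UTF-8 encoding, and the record counts are assumptions made for illustration; the original test used GB2312 files on disk and JDK 1.4.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlBenchSketch {

    // Builds a document in the article's RESULT/VALUE/NO/ADDR layout.
    // The record contents are placeholders, not the author's data.
    static String generate(int records) {
        StringBuilder sb = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<RESULT>\n");
        for (int i = 0; i < records; i++) {
            sb.append("<VALUE><NO>A").append(i).append("</NO><ADDR>Address ")
              .append(i).append("</ADDR></VALUE>\n");
        }
        return sb.append("</RESULT>\n").toString();
    }

    // Parses the text with the JDK's DOM parser and counts the VALUE elements.
    static int countValues(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("VALUE").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (int records : new int[] {100, 1000, 10000}) {
            String xml = generate(records);
            long start = System.nanoTime();
            int found = countValues(xml);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(xml.length() + " chars, " + found + " records: " + elapsedMs + " ms");
        }
    }
}
```

Timings from a single run like this are only indicative; JIT warm-up and garbage collection can easily dominate the small cases.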

JSP file:

<%@ page contentType="text/html; charset=gb2312" %>
<%@ page import="com.test.*" %>

<html>
<body>
<%
String args[] = {""};
MyXMLReader.main(args);
%>
</body>
</html>

Test

First up is DOM (JAXP Crimson parser)

DOM is the official W3C standard for representing XML documents in a platform- and language-independent way. A DOM is a collection of nodes, or pieces of information, organized in a hierarchy. This hierarchy lets developers search the tree for specific information. Analyzing the structure usually requires loading the entire document and building the hierarchy before any work can be done. Because it is based on a hierarchy of information, DOM is considered tree-based or object-based. DOM, and tree-based processing in general, has several advantages. First, because the tree persists in memory, it can be modified, so an application can change both the data and the structure. Second, the application can navigate up and down the tree at any time, rather than being limited to the one-pass processing of SAX. DOM is also much easier to use.

On the other hand, parsing and loading a very large document can be slow and resource-intensive, so such data is better handled by other means, namely event-based models such as SAX.
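To make the "navigate up and down" and "modify the tree" points concrete, here is a minimal sketch using the JDK's built-in DOM parser. The class name DomTreeSketch, the inline sample document, and the added CITY element are hypothetical and exist only for this illustration; they are not part of the article's test.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTreeSketch {

    // Sample written without whitespace so that siblings are elements, not text nodes.
    static final String SAMPLE =
        "<RESULT><VALUE><NO>A1234</NO><ADDR>Old address</ADDR></VALUE></RESULT>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Walks down to ADDR, back up to its parent, across to its sibling,
    // then edits the in-memory tree.
    static String demo() {
        Document doc = parse(SAMPLE);
        Element addr = (Element) doc.getElementsByTagName("ADDR").item(0); // down
        Node parent = addr.getParentNode();                                 // up: VALUE
        Node sibling = addr.getPreviousSibling();                           // across: NO
        addr.setTextContent("New address");                                 // change data
        Element city = doc.createElement("CITY");                           // change structure
        city.setTextContent("XX City");
        parent.appendChild(city);
        return parent.getNodeName() + " > " + sibling.getNodeName() + "=" + sibling.getTextContent()
                + ", ADDR=" + addr.getTextContent()
                + ", children=" + parent.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

None of this is possible in a pure event stream, which is the trade-off the paragraph above describes: the whole tree has to be in memory first.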

Bean file:

package com.test;

import java.io.*;
import java.util.*;
import org.w3c.dom.*;
import javax.xml.parsers.*;

public class MyXMLReader {

    public static void main(String args[]) {
        long lasting = System.currentTimeMillis();
        try {
            File f = new File("data_10k.xml");
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Load the whole document into an in-memory tree
            Document doc = builder.parse(f);
            NodeList nl = doc.getElementsByTagName("VALUE");
            for (int i = 0; i < nl.getLength(); i++) {
                // getElementsByTagName is called on the whole document in every iteration
                System.out.print("license plate number: " + doc.getElementsByTagName("NO").item(i).getFirstChild().getNodeValue());
                System.out.println(" owner address: " + doc.getElementsByTagName("ADDR").item(i).getFirstChild().getNodeValue());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("Run time: " + (System.currentTimeMillis() - lasting) + " milliseconds");
    }
}

10 K elapsed time: 265 203 219 172
100 K elapsed time: 9172 9016 8891 9000
1000 K elapsed time: 691719 675407 708375 739656
10000 K elapsed time: OutOfMemoryError

Next is SAX

The advantages of this kind of processing are very similar to those of streaming media. Analysis can start immediately instead of waiting for all the data to be processed. And because the application only examines the data as it is read, the data does not need to be stored in memory, which is a huge advantage for large documents. In fact, the application does not even have to parse the entire document; it can stop parsing as soon as some condition is met. In general, SAX is much faster than its alternative, DOM.
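The "stop parsing when a condition is met" point is commonly implemented by throwing an exception from a handler callback. The sketch below shows that idiom with the JDK's SAX parser; the class name SaxEarlyStopSketch and the marker exception StopParsing are hypothetical names chosen for this example, and the condition (the first NO element) is just an assumption for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxEarlyStopSketch {

    // Hypothetical marker exception used only to signal "found it, stop now".
    static final class StopParsing extends SAXException {
        StopParsing() { super("stop requested"); }
    }

    // Returns the text of the first NO element and abandons the parse once it closes.
    static String firstNo(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inNo;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                inNo = qName.equals("NO");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inNo) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (qName.equals("NO")) throw new StopParsing(); // rest of the input is never read
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (StopParsing stop) {
            // Normal exit path: the handler asked the parser to stop.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(firstNo("<RESULT><VALUE><NO>A1234</NO><ADDR>x</ADDR></VALUE></RESULT>"));
    }
}
```

Using a dedicated exception type keeps the deliberate stop distinguishable from a genuine parse error.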

DOM or SAX?

For developers who need to write their own code to process XML documents, choosing between the DOM and SAX parsing models is a very important design decision.

DOM accesses XML documents through a tree structure, while SAX uses an event model.

A DOM parser converts an XML document into a tree that holds its content and can then be traversed. The advantage of the DOM model is ease of programming: developers only need to call the instruction that builds the tree and then use the navigation APIs to reach the tree nodes they need. Elements in the tree can easily be added and modified. However, because a DOM parser has to process the entire XML file, its performance and memory requirements are high, especially for large XML files. Thanks to its traversal capability, a DOM parser is often used in services where XML documents change frequently.

A SAX parser uses an event-based model. It fires a series of events as it parses the XML document; when a given tag is found, it invokes a callback method to tell it that the tag has been found. SAX's memory requirements are usually lower because it lets developers decide for themselves which tags to process. Its scalability shows best when developers need only part of the data in a document. However, coding against a SAX parser is harder, and it is difficult to access several different pieces of data in the same document at the same time.
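The difficulty of working with several pieces of data at once comes from the handler having to remember state between callbacks. As a hedged sketch of that bookkeeping, the handler below pairs each NO with its ADDR before emitting a record. The class name SaxRecordCollector and the "NO -> ADDR" output format are assumptions for illustration and are not part of the article's benchmark code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxRecordCollector extends DefaultHandler {

    private final List<String> records = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String no;   // remembered until the enclosing VALUE ends
    private String addr; // remembered until the enclosing VALUE ends

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start collecting character data for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times for one text run
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("NO")) {
            no = text.toString().trim();
        } else if (qName.equals("ADDR")) {
            addr = text.toString().trim();
        } else if (qName.equals("VALUE")) {
            records.add(no + " -> " + addr); // both fields are only known at this point
            no = null;
            addr = null;
        }
        text.setLength(0);
    }

    static List<String> collect(String xml) {
        SaxRecordCollector handler = new SaxRecordCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.records;
    }

    public static void main(String[] args) {
        System.out.println(collect(
            "<RESULT><VALUE><NO>A1234</NO><ADDR>X Road</ADDR></VALUE></RESULT>"));
    }
}
```

With DOM the same pairing is a matter of reading two children of one node; here it takes three fields and three callbacks.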

Bean file:

package com.test;

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;

public class MyXMLReader extends DefaultHandler {

    // Names of the elements that are currently open, innermost on top
    java.util.Stack tags = new java.util.Stack();

    public MyXMLReader() {
        super();
    }

    public static void main(String args[]) {
        long lasting = System.currentTimeMillis();
        try {
            SAXParserFactory sf = SAXParserFactory.newInstance();
            SAXParser sp = sf.newSAXParser();
            MyXMLReader reader = new MyXMLReader();
            sp.parse(new InputSource("data_10k.xml"), reader);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("Run time: " + (System.currentTimeMillis() - lasting) + " milliseconds");
    }

    public void characters(char ch[], int start, int length) throws SAXException {
        String tag = (String) tags.peek();
        if (tag.equals("NO")) {
            System.out.print("license plate number: " + new String(ch, start, length));
        }
        if (tag.equals("ADDR")) {
            System.out.println(" address: " + new String(ch, start, length));
        }
    }

    public void startElement(
            String uri,
            String localName,
            String qName,
            Attributes attrs) {
        tags.push(qName);
    }

    // Pop on close so that whitespace between elements is not attributed to the previous tag
    public void endElement(String uri, String localName, String qName) {
        tags.pop();
    }
}

10 K elapsed time: 110 47 109 78
100 K elapsed time: 344 406 375 422
1000 K elapsed time: 3234 3281 3688 3312
10000 K elapsed time: 32578 34313 31797 31890 30328

Then JDOM http://www.jdom.org/

JDOM aims to be a Java-specific document model that simplifies interaction with XML and is faster than DOM. As the first Java-specific model, JDOM has been heavily promoted, and it is being considered for eventual adoption as a "Java Standard Extension" through Java Specification Request JSR-102. JDOM development began in early 2000.

JDOM differs from DOM in two main respects. First, JDOM uses only concrete classes rather than interfaces. This simplifies the API in some ways but also limits flexibility. Second, the API makes heavy use of the Collections classes, which makes it easier for Java developers who already know those classes.

The JDOM documentation states that its purpose is to "solve 80% (or more) of Java/XML problems with 20% (or less) of the effort" (the 20% presumably refers to the learning curve). JDOM is certainly useful for most Java/XML applications, and most developers find its API much easier to understand than DOM's. JDOM also includes fairly extensive checks on program behavior to prevent users from doing anything that makes no sense in XML. However, it still requires a solid understanding of XML to do anything beyond the basics (or even to understand the errors in some cases). Gaining that understanding is probably more worthwhile than learning either the DOM or the JDOM interfaces.

JDOM itself does not include a parser. It usually uses a SAX2 parser to parse and validate input XML documents (although it can also take a previously constructed DOM representation as input). It includes converters that output the JDOM representation as a SAX2 event stream, a DOM model, or an XML text document. JDOM is open source, released under a variant of the Apache license.

Bean file:

package com.test;

import java.io.*;
import java.util.*;
import org.jdom.*;
import org.jdom.input.*;

public class MyXMLReader {

    public static void main(String args[]) {
        long lasting = System.currentTimeMillis();
        try {
            // JDOM builds its tree on top of a SAX parser
            SAXBuilder builder = new SAXBuilder();
            Document doc = builder.build(new File("data_10k.xml"));
            Element foo = doc.getRootElement();
            List allChildren = foo.getChildren();
            for (int i = 0; i < allChildren.size(); i++) {
                // The cast must be parenthesized before calling getChild on the element
                System.out.print("license plate number: " + ((Element) allChildren.get(i)).getChild("NO").getText());
                System.out.println(" owner address: " + ((Element) allChildren.get(i)).getChild("ADDR").getText());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("Run time: " + (System.currentTimeMillis() - lasting) + " milliseconds");
    }
}

10 K elapsed time: 125 62 187 94
100 K elapsed time: 704 625 640 766
1000 K elapsed time: 27984 30750 27859 30656
10000 K elapsed time: OutOfMemoryError

Finally, DOM4J http://dom4j.sourceforge.net/

Although DOM4J represents an entirely independent development effort, it began as an intelligent fork of JDOM. It incorporates many features beyond basic XML document representation, including integrated XPath support, XML Schema support, and event-based processing for large or streaming documents. It also offers options for how the document representation is built, with parallel access through the DOM4J API and the standard DOM interfaces. It has been under development since the second half of 2000.

To support all these features, DOM4J uses interfaces and abstract base classes. DOM4J makes heavy use of the Collections classes in its API, but in many cases it also provides alternative methods that allow better performance or a more direct coding style. The direct benefit is that, although DOM4J pays the price of a more complex API, it offers much greater flexibility than JDOM.

While adding flexibility, XPath integration, and large-document handling, DOM4J shares JDOM's goals: ease of use and intuitive operation for Java developers. It also aims to be a more complete solution than JDOM, with the goal of handling essentially all Java/XML problems. In pursuing that goal, it places less emphasis than JDOM on preventing incorrect application behavior.

DOM4J is a very good Java XML API, with excellent performance, powerful features, and great ease of use. It is also open source. More and more Java software now uses DOM4J to read and write XML; it is particularly worth noting that even Sun's JAXM uses DOM4J.
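For readers unfamiliar with what XPath selection looks like, the sketch below evaluates two expressions against the article's document layout. Note the substitution: it uses the JDK's own javax.xml.xpath API over a W3C DOM, not DOM4J, so that it runs without extra libraries. In DOM4J the analogous entry points are the selectNodes and selectSingleNode methods on its nodes. The class name XPathSketch and the sample addresses are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    static final String SAMPLE =
        "<RESULT>"
        + "<VALUE><NO>A1234</NO><ADDR>X Road</ADDR></VALUE>"
        + "<VALUE><NO>B1234</NO><ADDR>Y Village</ADDR></VALUE>"
        + "</RESULT>";

    // Evaluates an XPath expression and joins the text of the matching nodes with commas.
    static String select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < hits.getLength(); i++) {
                if (i > 0) out.append(',');
                out.append(hits.item(i).getTextContent().trim());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Address of the record whose NO equals B1234
        System.out.println(select(SAMPLE, "/RESULT/VALUE[NO='B1234']/ADDR"));
        // Every NO in the document
        System.out.println(select(SAMPLE, "//NO"));
    }
}
```

The point of the comparison is that a query such as "the ADDR of the VALUE whose NO is B1234" becomes one expression instead of a hand-written loop.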

Bean file:

package com.test;

import java.io.*;
import java.util.*;
import org.dom4j.*;
import org.dom4j.io.*;

public class MyXMLReader {

    public static void main(String args[]) {
        long lasting = System.currentTimeMillis();
        try {
            File f = new File("data_10k.xml");
            SAXReader reader = new SAXReader();
            Document doc = reader.read(f);
            Element root = doc.getRootElement();
            Element foo;
            // Iterate over the VALUE children of the root element only
            for (Iterator i = root.elementIterator("VALUE"); i.hasNext();) {
                foo = (Element) i.next();
                System.out.print("license plate number: " + foo.elementText("NO"));
                System.out.println(" owner address: " + foo.elementText("ADDR"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("Run time: " + (System.currentTimeMillis() - lasting) + " milliseconds");
    }
}

10 K elapsed time: 109 78 109 31
100 K elapsed time: 297 359 172 312
1000 K elapsed time: 2281 2359 2344 2469
10000 K elapsed time: 20938 19922 20031 21078

JDOM and DOM performed poorly in the tests, and both ran out of memory on the 10 M document. For small documents, however, DOM and JDOM are still worth considering. Although the JDOM developers have said they intend to focus on performance before the official release, from a performance standpoint it has little to recommend it. DOM, on the other hand, remains a good choice. DOM implementations are widely available in many programming languages, and DOM is the basis of many other XML-related standards. Because it is an official W3C recommendation (as opposed to a non-standard, Java-specific model), it may also be required in certain kinds of projects (for example, using DOM in JavaScript).

SAX performed well, which follows from the way it parses: a SAX parser inspects the incoming XML stream without loading it into memory (of course, parts of the document are temporarily buffered in memory while the stream is being read).

Without question, DOM4J is the winner of this test. Many open-source projects now make heavy use of DOM4J; the well-known Hibernate, for example, uses DOM4J to read its XML configuration files. If portability is not a concern, use DOM4J!

