A study of the features and performance of XML document models in Java


XML in Java: Document models, Part 1: Performance


Level: Introductory



Dennis M. Sosnoski, President, Sosnoski Software Solutions, Inc.



September 01, 2001


In this article, Java consultant Dennis Sosnoski compares the performance and functionality of several Java document models. The trade-offs are not always obvious when you choose a model, and switching later can require a lot of recoding if you change your mind. The author puts the performance results in the context of feature sets and standards compliance, and offers some advice on making the right choice for your requirements. The article includes several charts and the source code for the test suite.


Java developers working with in-memory XML documents can choose between the standard DOM representation and any of several Java-specific models. This flexibility has helped establish Java as an excellent platform for XML work. However, as the number of models grows, it becomes harder to compare their functionality, performance, and ease of use.



This first article in the "XML in Java" series examines the features and performance of some of the leading XML document models in Java. It includes the results of a set of performance tests (with downloadable test code, see Resources). The second article in the series will look at usability by comparing sample code that implements the same tasks in the different models.



Document models



The number of document models available in Java keeps growing. For this article, I've included the most widely used models, along with a couple of choices that illustrate especially interesting features that may not be widely known or used. Given the growing importance of XML Namespaces, I have included only models that support this feature. The models are listed below with brief introductions and version information.



Just to clarify the terminology used in this article:


    • A parser is a program that interprets the structure of an XML text document
    • A document representation is the data structure a program uses to hold a document in memory
    • A document model is the library and API that support working with a document representation


Some XML applications do not need a document model at all. If an application can gather the information it needs in a single pass through the document, it can probably use a parser directly. This approach may take some extra work, but it will always perform better than building a document representation in memory.
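To make the parser-only approach concrete, here is a minimal sketch that gathers one piece of information (an element count) in a single pass, using the SAX support bundled with current JDKs. The class name SinglePassCounter and the sample document are illustrative assumptions, not part of the article's test code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Counts elements in one pass; no document representation is ever built. */
class SinglePassCounter extends DefaultHandler {
    private int elementCount;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Called once per start tag as the parser streams through the text.
        elementCount++;
    }

    /** Parses the XML text with the JDK's default SAX parser and returns the element count. */
    static int countElements(String xml) throws Exception {
        SinglePassCounter handler = new SinglePassCounter();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.elementCount;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<play><act><scene/><scene/></act></play>"));
    }
}
```

The handler keeps only the running count, so memory use stays flat regardless of document size; the price is that the application has to do its own bookkeeping in callbacks.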



DOM



The DOM ("Document Object Model") is the official W3C standard for representing XML documents in a platform- and language-independent way. It is a good point of comparison for any Java-specific model. To justify departing from the DOM standard, a Java-specific model should offer significantly better performance, ease of use, or both, compared with the Java DOM implementations.



The DOM definition makes heavy use of interfaces and inheritance for the different components of an XML document. This lets developers use a common interface for several different types of components, but it adds complexity to the API. Because the DOM is language-independent, its interfaces do not take advantage of common Java components such as the collections classes.
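The following sketch illustrates both points using the DOM implementation bundled with current JDKs (not the Crimson or Xerces versions tested here): every component is reached through the common Node interface, and child lists are a DOM-specific NodeList rather than a java.util collection. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomChildNames {
    /** Returns the names of the root element's child elements, in document order. */
    static List<String> childElementNames(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> names = new ArrayList<>();
        // NodeList is a DOM interface, not a java.util collection:
        // there is no iterator, only getLength() and item(index).
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Elements, text, and comments all arrive as Node; the type code tells them apart.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(((Element) child).getTagName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childElementNames("<table><atom/> <atom/><note/></table>"));
    }
}
```

The type check and cast inside the loop are the kind of extra ceremony that the Java-specific models discussed below try to remove.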



This article covers two DOM implementations: Crimson and Xerces Java. Crimson is an Apache project based on the Sun Project X parser. It incorporates a full validating parser with DTD support. The parser is accessed through the SAX2 interface, and the DOM implementation can also work with other SAX2 parsers. Crimson is open source, released under the Apache license. The version used for the performance comparisons is Crimson 1.1.1 (jar file size 0.2MB), with the included SAX2 parser used to build the DOM from text files.



The other DOM implementation tested, Xerces Java, is another Apache project. Xerces was originally based on the IBM Java parser (often called XML4J). The version tested is sometimes called Xerces Java 1; Xerces Java 2, currently in early beta, will eventually succeed it. As with Crimson, the Xerces parser can be accessed through the SAX2 interface as well as through the DOM. However, Xerces does not provide any way to use the Xerces DOM with a different SAX2 parser. Xerces Java includes validation support for both DTDs and XML Schema (with only minor limitations on the Schema support).



Xerces Java also supports a deferred node expansion mode for the DOM (referred to as deferred Xerces or Xerces def. in this article), in which document components are initially stored in a compact format and are expanded into the full DOM representation only when used. This mode is intended to allow faster parsing and lower memory usage, especially for applications that may use only part of the input document. Like Crimson, Xerces is open source, released under the Apache license. The version used for the performance comparisons is Xerces 1.4.2 (jar file size 1.8MB).
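As a rough sketch of how an application might switch this mode on or off, the code below sets the feature URI that Apache Xerces documents for deferred node expansion through the standard JAXP factory. It assumes a Xerces-derived parser behind JAXP (as in current JDKs); with any other parser the feature may be rejected, so the sketch falls back to the parser's default. The class name is hypothetical, and this is not necessarily how the article's tests configured Xerces 1.4.2.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

class DeferredDomSketch {
    // Feature URI documented by Apache Xerces; other parsers may not recognize it.
    static final String DEFER_FEATURE =
            "http://apache.org/xml/features/dom/defer-node-expansion";

    /** Parses the text into a DOM, requesting deferred or immediate node creation. */
    static Document parse(String xml, boolean deferred) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            factory.setFeature(DEFER_FEATURE, deferred);
        } catch (ParserConfigurationException e) {
            // Assumption failed: not a Xerces-derived parser, so keep its default behavior.
        }
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc><part/><part/></doc>", true);
        // With deferral on, full node objects are only created as the tree is touched.
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Either way the application sees the same DOM interfaces; the setting only changes when the node objects are created.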



JDOM



The goal of JDOM is to be a Java-specific document model that simplifies interaction with XML and is faster than using the DOM. Because it was the first Java-specific model, JDOM has been heavily publicized and promoted. It is being considered for eventual use as a "Java Standard Extension" through Java Specification Request JSR-102. The actual form this will take is still under development, though, and the JDOM API changed significantly between the last two beta releases. JDOM has been under development since early 2000.



There are two main differences between JDOM and the DOM. First, JDOM uses only concrete classes rather than interfaces. This simplifies the API in some respects, but it also limits flexibility. Second, the API makes extensive use of the collections classes, which simplifies things for Java developers who are already familiar with these classes.



The JDOM documentation states that its goal is to "solve 80% (or more) of Java/XML problems with 20% (or less) of the effort" (with the 20% presumably referring to the learning curve). JDOM is certainly usable for the majority of Java/XML applications, and most developers find the API much easier to understand than the DOM. JDOM also includes fairly extensive checks on program behavior to prevent users from doing anything that makes no sense in XML. However, it still requires you to understand XML well in order to do anything beyond the basics (or even to understand the errors, in some cases). This is probably a more significant effort than learning either the DOM or JDOM interfaces.



JDOM itself does not include a parser. It normally uses a SAX2 parser to parse and validate input XML documents (although it can also take a previously constructed DOM representation as input). It includes converters to output a JDOM representation as a SAX2 event stream, a DOM model, or an XML text document. JDOM is open source, released under a variation of the Apache license. The version used for the performance comparisons is JDOM Beta 0.7 (jar file size 0.1MB), with the Crimson SAX2 parser used to build the JDOM representation from text files.



dom4j



Although dom4j represents a completely separate development effort, it started out as an intellectual fork of JDOM. It incorporates a number of features beyond basic XML document representation, including integrated XPath support, XML Schema support (currently in alpha form), and event-based processing for very large or streamed documents. It also offers the option of building document representations that can be accessed in parallel through the dom4j API and the standard DOM interfaces. It has been under development since late 2000, and the existing APIs have been preserved across the most recent releases.



To support all these features, dom4j uses an interface and abstract base class approach. dom4j makes heavy use of the collections classes in its API, but in many cases it also provides alternatives that allow better performance or a more direct coding style. The net effect is that dom4j offers considerably more flexibility than JDOM, at the cost of a more complex API.



Along with the goals of flexibility, XPath integration, and large-document handling, dom4j shares JDOM's goals: ease of use and intuitive operation for Java developers. It also aims to be a more complete solution than JDOM, with the goal of handling essentially all Java/XML problems. In pursuit of that goal, it places less emphasis than JDOM on preventing incorrect application behavior.



dom4j uses the same approach to input and output as JDOM: it relies on a SAX2 parser for input handling and on converters for output as a SAX2 event stream, a DOM model, or an XML text document. dom4j is open source, released under a BSD-style license that is essentially equivalent to the Apache-style licenses. The version used for the performance comparisons is dom4j 0.9 (jar file size 0.4MB), with the bundled AElfred SAX2 parser used to build the representation from text files (because of the SAX2 option settings dom4j uses, one of the test files could not be processed with the Crimson SAX2 parser used for the JDOM tests).



Electric XML



Electric XML (EXML) is a spin-off from a commercial project that supports distributed computing. It differs from the other models discussed so far in that it properly supports only a subset of XML documents, provides no support for validation, and has a more restrictive license. However, EXML has the advantage of very small size and provides direct support for a subset of XPath. It has also been promoted in recent articles as an alternative to the other models, which makes it an interesting candidate for this comparison.



EXML takes an approach similar to JDOM's in avoiding interfaces, although it achieves some of the same effect through abstract base classes (the main difference being that interfaces give greater flexibility for extending the implementation). It differs from JDOM in that it also avoids the collections classes. The combination yields a very simple API, in many respects similar to a stripped-down version of the DOM API with XPath operations added.



EXML preserves whitespace in a document only when it is adjacent to non-whitespace text content, which limits EXML to a subset of XML documents. Standard XML requires whitespace to be preserved when a document is read unless the document DTD shows that the whitespace is not significant. The EXML approach works well for the many XML applications that know in advance that whitespace does not matter, but it rules out EXML for documents where whitespace must be preserved (for example, applications that generate documents to be displayed or viewed in a browser). (For the author's humble opinion on this subject, see the sidebar "Purpose of using whitespace".)



This whitespace removal can make performance comparisons misleading: many of the test timings are proportional to the number of components in the document, and each whitespace sequence dropped by EXML is a component in the other models. EXML is included in the results shown in this article, but keep this in mind when interpreting the performance differences.



EXML uses an integrated parser to build the document representation from a text document. It does not provide any way to convert to or from a DOM representation or a SAX2 event stream, other than through text. EXML is open source, released by The Mind Electric under a restrictive license that prohibits embedding it in certain types of applications or libraries. The version used for the performance comparisons is Electric XML 2.2 (jar file size 0.05MB).



XML Pull Parser



XML Pull Parser (XPP) is a recent development that demonstrates a different approach to XML parsing. Like EXML, XPP properly supports only a subset of XML documents and provides no support for validation. It also shares the advantage of very small size. That advantage, combined with its pull-parser approach, makes it an interesting alternative to include in this comparison.



XPP uses interfaces almost exclusively, but with only a small total number of classes. Like EXML, XPP avoids the collections classes in its API. Overall, it offers the simplest document model API in this article.



The limitation that restricts XPP to a subset of XML documents is that it does not support entities, comments, or processing instructions in the document. XPP creates a document structure that contains only elements, attributes (including namespaces), and content text. This is a serious limitation for some types of applications. However, it generally has less effect on performance than EXML's whitespace handling. Only one of the test files I use in this article is incompatible with XPP, and the XPP results in the charts are annotated to show that they do not include that file.



The pull-parser support in XPP (referred to as XPP pull in this article) works by deferring parsing until a component of the document is accessed, then parsing as much of the document as is needed to construct that component. The technique is intended to allow very fast document screening or classification, especially where a document may simply be forwarded or discarded rather than fully parsed and processed. Use of this approach is optional; if XPP is used in non-pull mode, it parses the entire document and builds the complete representation up front.
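The screening use case can be sketched with the JDK's StAX API, used here purely as a stand-in for the pull style: it is not the XPP API, and unlike XPP it does not build a lazily expanded document tree. The application asks for events one at a time and simply stops asking once it has what it needs. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PullSketch {
    /**
     * Returns the text of the first element with the given local name, or null if
     * there is none. Assumes that element holds only text (no child elements).
     */
    static String firstText(String xml, String elementName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // The application pulls the next event; parsing advances only on request.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    return reader.getElementText(); // stop here, ignoring the rest
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String message = "<env><to>alice</to><body>a long payload</body></env>";
        System.out.println(firstText(message, "to"));
    }
}
```

A router that only needs the addressing element can forward or drop the message without ever parsing the payload, which is the situation where deferring the parse pays off.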



Like EXML, XPP uses an integrated parser to build the document representation from a text document, and it does not provide any way to convert to or from a DOM representation or a SAX2 event stream, other than through text. XPP is open source with an Apache-style license. The version used for the performance comparisons is PullParser 2.0.1 Beta 8 (jar file size 0.04MB).


Test Details

The timings shown were measured with Sun Microsystems Java version 1.3.1, using the Java HotSpot Client VM 1.3.1-b24, on an Athlon 1GHz system with 256MB of RAM running Redhat Linux 7.1. The initial and maximum JVM memory sizes were both set to 128MB for these tests, which I intended to be representative of a server-type execution environment.

In tests with the initial JVM memory left at the default 2MB and the maximum set to 64MB, the results for the models with larger jar file sizes (DOM, JDOM, and dom4j) were much worse, especially in the average times for the tests. This was probably caused by inefficient operation of the HotSpot JVM when run with limited memory.

Two of the document models (XPP and EXML) support direct input of a document from a String or character array. This type of direct input is not representative of real applications, so I avoided it in these tests. For both input and output, I used Java streams wrapping byte arrays, which eliminates I/O effects on performance while keeping the stream interfaces that applications typically use for XML document input and output.
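A minimal sketch of that setup follows: the file is read into a byte array once, outside the timed region, and each timed parse or output works against an in-memory stream. The helper names are hypothetical and are not taken from the downloadable test code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class InMemoryStreams {
    /** Reads a whole file into memory; call this once, before timing starts. */
    static byte[] load(String path) throws IOException {
        return Files.readAllBytes(Paths.get(path));
    }

    /** A fresh stream over the same bytes for each timed parse; no disk I/O is involved. */
    static InputStream input(byte[] data) {
        return new ByteArrayInputStream(data);
    }

    /** An in-memory sink for timed output tests, presized so it does not regrow. */
    static ByteArrayOutputStream output(int expectedSize) {
        return new ByteArrayOutputStream(expectedSize);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<doc>sample</doc>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = output(data.length);
        try (InputStream in = input(data)) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b); // stands in for a parse followed by text generation
            }
        }
        System.out.println(out.size() + " bytes copied without touching the disk");
    }
}
```

Because the parser still sees an ordinary InputStream, the measured code path matches what a file or socket based application would exercise, minus the device latency.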








Performance comparison



The performance comparisons in this article are based on parsing and working with a set of XML documents chosen to represent a broad range of applications:


    • much_ado.xml, a Shakespeare play marked up as XML. No attributes and a fairly simple structure (202K bytes).
    • periodic.xml, the periodic table of the elements in XML. Some attributes and a fairly simple structure (117K bytes).
    • soap1.xml, taken from a sample SOAP document in the specification. Heavy on namespaces and attributes (0.4K bytes, repeated 49 times for each test).
    • soap2.xml, a list of values in SOAP document form. Heavy on namespaces and attributes (134K bytes).
    • nt.xml, the New Testament marked up as XML. No attributes and a very simple structure, with a large amount of text content (1047K bytes).
    • xml.xml, the XML specification, with the DTD reference removed and all entities defined internally. Presentation-style markup with heavy mixed content and some attributes (160K bytes).


For more information on the test platform, see the sidebar "Test Details", and see Resources for a link to the source code used for the tests.



Except for the very small soap1.xml document, all measured times are the time taken by each particular test on a single copy of the document. In the case of soap1.xml, the measured time is for 49 consecutive copies of the document (enough copies to total 20K bytes of text).



The test framework runs a particular test on a document a set number of times (10 for the results shown here), tracking the best and average times for that test, and then goes on to the next test on the same document. After completing the full sequence of tests on a document, it repeats the process for the next document. To prevent interactions between the document models, only one model is tested in each execution of the test framework.
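The best-and-average bookkeeping might look something like the sketch below. It is an illustration only, not the author's framework: the names are hypothetical, and it uses System.nanoTime(), which did not exist in the Java 1.3.1 runtime used for the published results.

```java
class TimingHarness {
    /** Best and average elapsed time over a series of passes, in nanoseconds. */
    static final class Timing {
        final long best;
        final long average;

        Timing(long best, long average) {
            this.best = best;
            this.average = average;
        }
    }

    /** Runs the test the given number of times, tracking the best and average times. */
    static Timing time(Runnable test, int passes) {
        if (passes <= 0) {
            throw new IllegalArgumentException("passes must be positive");
        }
        long best = Long.MAX_VALUE;
        long total = 0;
        for (int i = 0; i < passes; i++) {
            long start = System.nanoTime();
            test.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
            total += elapsed;
        }
        return new Timing(best, total / passes);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real run would parse or traverse a document here.
        Timing t = time(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        }, 10);
        System.out.println("best " + t.best + " ns, average " + t.average + " ns");
    }
}
```

Early passes usually include class loading and JIT compilation, which inflates the average far more than the best time; running each model in its own JVM, as described above, keeps one model's warm-up from influencing another's numbers.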



Timing benchmarks on dynamically optimizing JVMs such as HotSpot are notoriously tricky; small changes in the test sequence often cause large changes in the timing results. I've found that this is especially true of the average times for executing a particular piece of code. The best times are much more consistent, and those are the values I've shown in these results. You can see a comparison of the average and best times for the first test (document build time).



Document Build Time



The document build time test measures the time required to parse a text document and construct the document representation. For comparison, SAX2 parse times using the Crimson and Xerces SAX2 parsers are included in the chart, since most of the document models (all except EXML and XPP) use a SAX2 parse event stream as input to the document construction process. Figure 1 shows the results of this test.



Figure 1. Document Build Time


The build time for XPP pull is too short to measure for most of the test documents (because in this case the document is not actually parsed); it shows up only for the very short soap1.xml. For this file, the pull parser's memory size and the associated creation overhead make XPP pull look relatively slow. This is because the test program creates a new copy of the pull parser for each copy of the document being parsed. In the case of soap1.xml, 49 copies are used for each measured time. The overhead of allocating and initializing these parser instances is greater than the time most of the other approaches take to parse the text repeatedly and build the document representations.



In an e-mail discussion, the author of XPP pointed out that pull parser instances could be pooled for reuse in a real application. If this were done, the overhead for the soap1.xml file would drop to a negligible level. For larger files, the pull parser creation overhead becomes negligible even without pooling.



XPP (with full parsing), Xerces with deferred node creation, and dom4j show about the same overall performance in this test. Deferred Xerces does especially well on the larger documents but has high overhead on the smaller ones, even higher than the regular Xerces DOM. The deferred node creation approach also has a high cost the first time a part of the document is used, which reduces the benefit of the fast parse.



Xerces in all its forms (SAX2 parsing, regular DOM, and deferred DOM) has high overhead for the small soap1.xml file. XPP (full parsing) does especially well on this file, and EXML even outperforms the SAX2-based models on soap1.xml. Overall, though, EXML is the worst performer in this test, despite its advantage of discarding isolated whitespace content.



Document Traversal Time



The document traversal time test measures the time required to walk through a constructed document representation, visiting each element, attribute, and text content segment in document order. It is intended to represent the performance of the document model interfaces, which can be important for applications that repeatedly access information from a parsed document. In general, traversal is much faster than parsing. For applications that make only a single pass through each parsed document, parsing time will matter more than traversal time. Figure 2 shows the results.
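For the DOM case, a traversal of this kind might be sketched as below, using the DOM bundled with the JDK. The counting scheme (elements, attributes, characters of text) and all names are assumptions made for illustration; the actual test code is available through Resources.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class TraversalSketch {
    /** Walks the subtree in document order; counts = {elements, attributes, text characters}. */
    static void walk(Node node, int[] counts) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                counts[0]++;
                counts[1] += node.getAttributes().getLength();
                break;
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                counts[2] += node.getNodeValue().length();
                break;
            default:
                break; // comments, processing instructions, and so on are skipped
        }
        // Sibling links avoid building an intermediate child list for each element.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, counts);
        }
    }

    /** Parses the text and returns the three counts for the whole document. */
    static int[] count(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        int[] counts = new int[3];
        walk(doc, counts);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        int[] c = count("<a x='1' y='2'><b>hi</b> <c/></a>");
        System.out.println(c[0] + " elements, " + c[1] + " attributes, " + c[2] + " text chars");
    }
}
```

Note that the single space between the b and c elements is counted as text: it is exactly the kind of isolated whitespace component that EXML drops and the other models keep.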



Figure 2. Document Traversal Time


XPP is well ahead of the other models in this test. The Xerces DOM takes about twice as long as XPP. EXML takes almost three times as long as XPP, even though it has the advantage of discarding isolated whitespace content. dom4j falls in the middle of the pack.



With XPP pull, the document text is not actually parsed until the document representation is accessed. This results in very high overhead the first time the representation is traversed (not shown in the chart). If the whole document representation is eventually accessed, XPP shows a net performance loss when pull parsing is used. With pull parsing, the total time for the first two tests is longer than with regular XPP parsing (by 20% to 100%, depending on the document). However, the pull-parser approach still offers a substantial performance benefit when the parsed document is not fully accessed.



Xerces with deferred node creation shows similar behavior, with a performance hit the first time the document representation is accessed (not shown in the chart). In the case of Xerces, however, the node creation cost is roughly equal to the time saved during parsing compared with regular DOM creation. For larger documents, the total time for the first two tests with deferred Xerces is about the same as with Xerces building a regular DOM. If you use Xerces with very large documents (perhaps 10KB or larger), deferred node creation looks like a good choice.



Document Modification Time



This test measures the time required to systematically modify a constructed document representation; the results are shown in Figure 3. It walks the representation, deleting all isolated whitespace content and wrapping each non-whitespace content string in a newly added element. It also adds an attribute to each element of the original document that contained non-whitespace content. The test is intended to represent the performance of the document models across a range of document modifications. As with traversal, the modification times are much shorter than the parsing times, so parsing time matters more for applications that make only a single pass through each parsed document.
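Expressed against the standard DOM interfaces, the three modifications described above could look like the following sketch. The wrapper element name "text" and the attribute name "hasText" are made up for illustration, and the code is not the author's test implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ModifySketch {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Drops whitespace-only text, wraps other text in a new element, flags the parent. */
    static void modify(Element element) {
        Document doc = element.getOwnerDocument();
        boolean hasText = false;
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // capture before the tree changes
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getNodeValue().trim().isEmpty()) {
                    element.removeChild(child); // isolated whitespace: delete it
                } else {
                    Element wrapper = doc.createElement("text"); // hypothetical name
                    element.replaceChild(wrapper, child);
                    wrapper.appendChild(child);
                    hasText = true;
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                modify((Element) child); // only original elements are visited
            }
            child = next;
        }
        if (hasText) {
            element.setAttribute("hasText", "true"); // hypothetical attribute name
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a> <b>hi</b> <c/> </a>");
        modify(doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getChildNodes().getLength() + " children left");
    }
}
```

Saving the next sibling before touching the current node is the important detail: removing or replacing a child while iterating would otherwise lose the position in the list.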



Figure 3. Document Modification Time


EXML leads in this test, but it has a performance advantage over the other models because it always discards isolated whitespace content during parsing. This means there is nothing to delete from the EXML representation during the test.



XPP comes second only to EXML in modification performance and, unlike EXML, the XPP test includes the deletions. The Xerces DOM and dom4j are near the middle, and the JDOM and Crimson DOM models remain the worst performers.



Document Generation Time



This test measures the time required to output a document representation as a text XML document; the results are shown in Figure 4. This step is likely to be an important part of overall performance for any application that does more than just consume XML documents, especially since the time needed to output a document as text is generally close to the time needed to parse it on input. To make the times directly comparable, the test uses the original documents rather than the modified documents produced by the previous test.
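For a DOM representation, one standard way to produce the text form is an identity transform through JAXP, sketched below with the implementation bundled in the JDK. This is an assumption for illustration: each model in the article has its own output facility, and the tests may well have used those instead. Class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

class GenerationSketch {
    static Document parse(byte[] xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
    }

    /** Writes the document as XML text into an in-memory buffer and returns the bytes. */
    static byte[] toBytes(Document doc) throws Exception {
        // A Transformer created without a stylesheet copies its input to its output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a><b>hi</b></a>".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(toBytes(doc), StandardCharsets.UTF_8));
    }
}
```

Writing to a byte array rather than a file keeps the measurement focused on text generation itself, in line with the in-memory stream setup described in the Test Details sidebar.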



Figure 4. Text Generation Time


The text generation test shows smaller differences between the models than the earlier tests, with the Xerces DOM performing best, though not by much, and JDOM performing worst. EXML does better than JDOM, but again this is partly because EXML discards whitespace content.



Many of the models offer options for controlling the format of the text output, and some of these options appear to affect the text generation time. This test uses only the most basic output format for each model, so the results show default performance rather than the best possible performance.



Document Memory Size



This test measures the memory space used for the document representation. This can be especially important for developers working with large documents or with many smaller documents at once. Figure 5 shows the results of this test.



Figure 5. Document Memory Size


The memory size results differ from the timing tests in that the values shown for the small soap1.xml file represent a single copy of the file rather than the 49 copies used in the timing measurements. For most models, the memory used for this short document is too small to show on the scale of the chart.



Apart from XPP pull (which does not actually build the document representation until it is accessed), the differences between the models in the memory size test are relatively small compared with those in some of the timing tests. Deferred Xerces has the most compact representation (it expands to the basic Xerces size when the representation is first accessed), followed by dom4j. EXML has the least compact representation, even though it discards whitespace content that the other models keep.



Since even the most compact model takes about four times the size of the original document text (in bytes), all of the models appear to require too much memory for very large documents. XPP and dom4j provide the best support for very large documents by offering ways to work with only part of the document representation at a time. XPP does this by building only the portion of the representation that is actually accessed, while dom4j includes support for event-based processing that lets you build or process only part of a document at a time.



Java serialization



These tests measure the time and output size for Java serialization of the document representations. This is mainly of concern for applications that use Java RMI (Remote Method Invocation) to transfer representations between Java programs, including EJB (Enterprise JavaBeans) applications. Only the models that support Java serialization are included in these tests. The following three figures show the results.



Figure 6. Serialization Output Time


Figure 7. Serialization Input Time


Figure 8. Serialized Document Size


dom4j shows the best serialization performance for both output (generating the serialized form) and input (rebuilding the document from the serialized form), while the Xerces DOM shows the worst. EXML's times are close to dom4j's, but EXML again has the advantage of fewer objects in its representation because it discards whitespace content.



All of the results, for both time and size, would be much better if you wrote the document out as text and then parsed it to rebuild the representation, rather than using Java serialization. The problem is that an XML document representation is structured as a large number of unique small objects. Java serialization does not handle this kind of structure efficiently, resulting in high overhead in both time and output size.
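To experiment with this effect yourself, the sketch below builds a tree out of a deliberately tiny, hypothetical node class (Elem, standing in for a real document model), runs it through Java serialization, and prints the serialized size next to the size of the equivalent text. It makes no claim about what the numbers will be for any of the models tested; it only shows how such a comparison can be set up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class SerializationSketch {
    /** Hypothetical stand-in for a document node: one small object per element. */
    static final class Elem implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final String text;
        final List<Elem> children = new ArrayList<>();

        Elem(String name, String text) {
            this.name = name;
            this.text = text;
        }

        /** Appends a simple text form; no escaping, for illustration only. */
        void write(StringBuilder out) {
            out.append('<').append(name).append('>');
            if (text != null) {
                out.append(text);
            }
            for (Elem child : children) {
                child.write(out);
            }
            out.append("</").append(name).append('>');
        }
    }

    static byte[] serialize(Elem root) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(root); // walks the whole object graph
        }
        return bytes.toByteArray();
    }

    static Elem deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Elem) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Elem root = new Elem("list", null);
        for (int i = 0; i < 1000; i++) {
            root.children.add(new Elem("item", "value " + i));
        }
        StringBuilder text = new StringBuilder();
        root.write(text);
        System.out.println("serialized bytes: " + serialize(root).length);
        System.out.println("text characters:  " + text.length());
    }
}
```

Wrapping the serialize and deserialize calls in a timing loop gives the time side of the comparison against a write-as-text and reparse round trip.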



It is possible to design a document serialization format that is both smaller than the text representation and faster than text input and output, but only by bypassing Java serialization. (I have a project that implements such a custom serialization of XML documents, available as open source on my company's Web site; see Resources.)









Conclusion



The different Java XML document models each have their strengths, but from a performance standpoint there are some clear winners.



XPP is the performance leader in most respects. Although it is a new model, XPP is an excellent choice for middleware-type applications that do not require validation, entities, processing instructions, or comments. It is especially suitable for applications that run as browser applets or in memory-constrained environments.



Although dom4j does not match XPP's speed, it provides a more standardized and full-featured implementation, including built-in support for SAX2, DOM, and even XPath. The Xerces DOM (with deferred node creation) also does well in most of the measurements, although it performs poorly on small files and on Java serialization. For general XML processing, dom4j and the Xerces DOM are both good choices, depending on whether you consider Java-specific features or cross-language compatibility more important.



JDOM and the Crimson DOM consistently perform poorly in the performance tests. The Crimson DOM may still be worth considering for small documents, where Xerces does poorly. JDOM has little to recommend it from a performance standpoint, although the JDOM developers have said they intend to focus on performance before the official release. However, it may be difficult for JDOM to match the performance of the other models without some restructuring of the API.


Purpose of using whitespace

The XML specification generally requires that whitespace be preserved, but many XML applications use formats in which whitespace is present only for readability. For these applications, EXML's approach of discarding isolated whitespace works well.

Most of the documents used in these performance tests fall into the "whitespace for readability" category. They are formatted for easy viewing by people, with at most one element per line. As a result, the number of extraneous whitespace content strings actually exceeds the number of elements in the documents. This adds considerable unnecessary overhead to every processing step.

An option to trim this kind of whitespace on input would help improve the performance of all the document models (except EXML) for applications that can ignore whitespace. As long as trimming is only an option, it would not affect applications that need whitespace fully preserved. Support at the parser level would be even better, because the parser already has to process the input characters one at a time. In short, an option of this kind would be very helpful to many XML applications.


EXML is very small (in jar file size) and does well in some of the performance tests. Even with the advantage of removing isolated whitespace content, though, EXML does not perform as well as XPP. Unless you need a feature that EXML supports and XPP lacks, XPP is probably the better choice for memory-constrained environments.



Although dom4j does best, none of the models currently offers good performance for Java serialization. If you need to pass documents between programs, your best option is generally to write the document out as text and then parse it to rebuild the representation. Custom serialization formats may offer a better alternative in the future.









Follow-up content ...



I've covered some of the basic features of the document models and shown performance measurements for several types of document operations. Keep in mind, though, that performance is only one factor in choosing a document model. For most developers, usability is at least as important as performance, and the different APIs these models use may give you reasons to prefer one over another.



The follow-up article will focus on usability, comparing sample code that accomplishes the same operations in each of the models. Check back for the second part of this comparison. In the meantime, you can share your comments and questions about this article through the forum link below.









Resources


  • You can refer to the English version of this article on the developerWorks global site.

  • Participate in the forum on this article.

  • If you need background, try the developerWorks tutorials XML programming in Java, Understanding SAX, and Understanding DOM.

  • Download the test program and document model libraries used in this article from the download page.

  • Review updated tests and test results on the home page for the test program.

  • Get the details of the author's work on XML Serial (XMLS) encoding as an alternative to Java serialization.

  • Study or download the Java XML document model discussed in this article:
    • Xerces Java
    • Crimson
    • JDOM
    • dom4j
    • Electric XML (EXML)
    • XML Pull Parser (XPP)

  • IBM WebSphere Application Server includes the XML4J parser, which is based on Xerces Java. How-to information about the product's XML support can be found in the WAS Advanced Edition 3.0 online documentation.







About the author


Dennis Sosnoski (dms@sosnoski.com) is the founder and lead consultant of Seattle-area Java consulting company Sosnoski Software Solutions, Inc. With more than 30 years of professional software development experience, he has focused on server-side Java technologies in recent years, including servlets, Enterprise JavaBeans, and XML. He is a frequent speaker on Java performance issues and server-side Java technology in general, and chairs the Seattle Java-XML SIG.


