XML in Java: Document models, Part 1: Performance

A look at the features and performance of XML document models in Java

Dennis M. Sosnoski (dms@sosnoski.com)

President, Sosnoski Software Solutions, Inc.

September 2001

In this article, Java consultant Dennis Sosnoski compares the performance and features of several Java document models. The tradeoffs are not always clear when you choose a model, and changing your mind later can require a lot of recoding to switch. The author puts the performance results in the context of feature sets and standards compliance, and offers some advice on making the right choice. This article includes several charts and source code.

Java developers working with XML documents in memory can choose between the standard DOM representation and any of several Java-specific models. This flexibility has helped establish Java as an excellent platform for XML work. However, as the number of models has grown, it has become harder to compare them in terms of features, performance, and ease of use.

This first article in the "XML in Java" series looks at the features and performance of some of the leading XML document models in Java. It includes the results of a set of performance tests (the test code is available for download; see References). The second article in the series will examine ease of use by comparing sample code that implements the same tasks in the different models.

Document models

The number of document models available in Java keeps growing. For this article, I have covered the most popular models along with a couple of alternatives that demonstrate especially interesting features that may not be widely known or used. Because XML Namespaces are becoming increasingly important, I have included only models that support this feature. The models are listed below with brief introductions and version information.

First, a few definitions of the terms used in this article:

    • Parser refers to the program that interprets the structure of an XML text document.
    • Document representation refers to the data structures a program uses to work with a document in memory.
    • Document model refers to a library and API that support working with a document representation.

Some XML applications do not need a document model at all. If an application can gather the information it needs in a single pass through the document, it can work with a parser directly. This approach may take more work, but it always performs better than building a document representation in memory.
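As a rough illustration of working with a parser directly, the sketch below counts elements in a single SAX2 pass using the JAXP classes bundled with current JDKs. The class name `SaxElementCounter` and the `countElements` helper are invented for this example and are not part of the article's test code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: counts elements without building a document representation. */
final class SaxElementCounter extends DefaultHandler {
    private int elements;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++; // one callback per start tag; nothing from the document is retained
    }

    /** Parses the given XML text in one pass and returns the number of elements seen. */
    static int countElements(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SaxElementCounter handler = new SaxElementCounter();
            factory.newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.elements;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Calling `SaxElementCounter.countElements(text)` gives the element count with memory use that does not grow with document size, at the cost of writing a handler for every kind of information the application needs.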

DOM

DOM (Document Object Model) is the official W3C standard for representing XML documents in a platform- and language-neutral manner. It is a good reference point for any Java-specific model. To justify departing from the DOM standard, a Java-specific model should offer a significant advantage over the Java DOM implementations in performance, ease of use, or both.

The DOM definition makes heavy use of interfaces and inheritance for the different components of an XML document. This gives developers the advantage of a common interface for working with several different types of components, but it also adds to the complexity of the API. Because DOM is language-neutral, the API does not use common Java components such as the Collections classes.

This article covers two DOM implementations: Crimson and Xerces Java. Crimson is an Apache project based on the Sun Project X parser. It incorporates a full validating parser with DTD support. The parser is accessible through the SAX2 interface, and the DOM implementation can work with other SAX2 parsers. Crimson is open source, released under the Apache license. The version used for the performance comparisons is Crimson 1.1.1 (JAR file size 0.2 MB), with its included SAX2 parser used to build the DOM from text files.

The other DOM implementation tested, Xerces Java, is another Apache project. Xerces was originally based on the IBM Java parser (often called XML4J). (A redeveloped Xerces Java 2, currently in early beta, will eventually succeed it; the current version is sometimes called Xerces Java 1.) As with Crimson, the Xerces parser is accessible through the SAX2 interface as well as through the DOM. However, Xerces does not provide any way to use the Xerces DOM with a different SAX2 parser. Xerces Java supports validation against both DTDs and XML Schema (with only minor limitations in its Schema support).

Xerces Java also supports a deferred node expansion mode for the DOM (referred to in this article as Xerces deferred or Xerces def.), in which document components are initially represented in a compact format and are expanded to the full DOM representation only when used. This mode is intended to allow faster parsing and lower memory usage, especially for applications that may use only part of the input document. Like Crimson, Xerces is open source, released under the Apache license. The version used for the performance comparisons is Xerces 1.4.2 (JAR file size 1.8 MB).

JDOM

JDOM aims to be a Java-specific document model that simplifies working with XML and is faster than the DOM. As the first Java-specific model, JDOM has been heavily publicized and promoted. It is being considered for eventual adoption as a Java Standard Extension through Java Specification Request JSR-102. The final form this will take is still under development, though, and the JDOM API changed considerably between the two beta versions. JDOM has been under development since early 2000.

JDOM differs from the DOM in two major ways. First, JDOM uses only concrete classes rather than interfaces. This simplifies the API in some respects, but it also limits flexibility. Second, the API makes extensive use of the Collections classes, which simplifies its use for Java developers already familiar with those classes.

The JDOM documentation states its goal as "solving 80% (or more) of Java/XML problems with 20% (or less) of the effort" (with the 20% presumably measured in terms of the learning curve). JDOM is certainly usable for most Java/XML applications, and most developers find the API much easier to understand than the DOM. JDOM also includes fairly extensive checks on program behavior to prevent users from doing anything that makes no sense in XML. However, it still requires you to understand XML well in order to do anything beyond the basics (or even to understand the errors, in some cases). That is probably a more significant effort than learning either the DOM or the JDOM interfaces.

JDOM does not include a parser. It normally uses a SAX2 parser to parse and validate input XML documents (although it can also take a previously constructed DOM representation as input). It includes converters that output a JDOM representation as a SAX2 event stream, a DOM model, or an XML text document. JDOM is open source, released under a variant of the Apache license. The version used for the performance comparisons is JDOM Beta 0.7 (JAR file size 0.1 MB), with the Crimson SAX2 parser used to build the JDOM representation from text files.

dom4j

Although dom4j represents a completely separate development effort, it began as a kind of intelligent fork of JDOM. It incorporates many features beyond basic XML document representation, including integrated XPath support, XML Schema support (currently in alpha form), and event-based processing for very large or streamed documents. It also offers the option of building a document representation that allows parallel access through both the dom4j API and the standard DOM interfaces. It has been under development since the second half of 2000, and the existing API has been preserved across the most recent releases.

To support all these features, dom4j uses an approach based on interfaces and abstract base classes. dom4j makes extensive use of the Collections classes in its API, but in many cases it also provides alternatives that allow better performance or a more direct coding approach. The upshot is that, although dom4j pays the price of a more complex API, it offers much greater flexibility than JDOM.

Along with the added flexibility, XPath integration, and large-document handling, dom4j shares JDOM's goals: ease of use and intuitive operation for Java developers. It also aims to be a more complete solution than JDOM, with the goal of handling essentially all Java/XML problems. In pursuing that goal, it places less emphasis than JDOM on preventing incorrect application behavior.

dom4j handles input and output the same way JDOM does: it relies on a SAX2 parser for input and on converters to output a SAX2 event stream, a DOM model, or an XML text document. dom4j is open source, released under a BSD-style license that is essentially equivalent to an Apache-style license. The version used for the performance comparisons is dom4j 0.9 (JAR file size 0.4 MB), with the bundled AElfred SAX2 parser used to build the representation from text files (because of SAX2 option settings, one of the test files could not be processed by dom4j with the Crimson SAX2 parser that was used for the JDOM tests).

Electric XML

Electric XML (EXML) is a spin-off from a commercial project supporting distributed computing. Unlike the other models discussed so far, it properly supports only a subset of XML documents, provides no support for validation, and has a more restrictive license. However, EXML has the advantages of very small size and direct support for an XPath subset, and it has been promoted in recent articles as an alternative to the other models, which makes it an interesting candidate for this comparison.

EXML takes an approach similar to JDOM's in avoiding interfaces, although it achieves much the same effect by using abstract base classes (the main difference being that interfaces provide more flexibility for extending an implementation). Unlike JDOM, it also avoids using the Collections classes. The combination gives a very simple API that in many respects resembles a simplified version of the DOM API with XPath operations added.

EXML retains whitespace in a document only when it is adjacent to non-whitespace text content, which limits EXML to a subset of XML documents. Standard XML requires whitespace to be preserved when a document is read, unless the document's DTD shows that the whitespace is not significant. The EXML approach works well for the many XML applications in which whitespace is known in advance to be insignificant, but it rules out EXML for documents in which whitespace must be preserved (for example, applications that generate documents to be displayed or viewed in a browser). (See the sidebar on the uses of whitespace for the author's modest proposal on this topic.)

This whitespace removal also skews the performance comparisons: many of the tests scale with the number of components in a document, and each whitespace sequence that EXML drops is a component in the other models. EXML is included in the results shown in this article, but keep this effect in mind when interpreting the performance differences.

EXML uses an integrated parser to build the document representation from a text document. It provides no way to convert to or from a DOM or a SAX2 event stream, except through text. EXML is open source, released by The Mind Electric under a restrictive license that prohibits embedding it in certain types of applications or libraries. The version used for the performance comparisons is Electric XML 2.2 (JAR file size 0.05 MB).

XML Pull Parser

XML Pull Parser (XPP) is a recent development that demonstrates a different approach to XML parsing. Like EXML, XPP properly supports only a subset of XML documents and provides no support for validation. It also shares the advantage of very small size. That advantage, combined with its pull-parser approach, makes it a good alternative to include in this comparison.

XPP uses interfaces almost exclusively, but it involves only a small number of classes in total. Like EXML, XPP avoids the Collections classes in its API. Overall, it has the simplest document model API covered in this article.

The restriction that limits XPP to a subset of XML documents is that it does not support entities, comments, or processing instructions in a document. XPP builds a document structure that contains only elements, attributes (including namespaces), and content text. This is a severe limitation for some types of applications. However, it generally has less effect on performance than EXML's whitespace handling. Only one test file used in this article is incompatible with XPP, and the XPP results are shown in the charts with a note that they do not include this file.

The pull-parser support in XPP (referred to in this article as XPP pull) works by deferring parsing until a component of the document is accessed, then parsing only as much of the document as is needed to construct that component. This technique allows very fast document screening or classification, especially when documents need to be forwarded or discarded rather than fully parsed and processed. The approach is optional: if XPP is used in non-pull mode, it parses the entire document and builds the complete representation at once.
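The screening idea can be sketched with StAX (`javax.xml.stream`), the pull-parsing API that ships with current JDKs. This is not the XPP API, which differs in its details; the class name `PullScreening` and the `firstText` helper are invented for the example. The point it illustrates is that the caller drives the parse and can stop as soon as it has what it needs.

```java
import java.io.ByteArrayInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical example of pull-style screening, using StAX rather than XPP. */
final class PullScreening {
    /**
     * Returns the text of the first element with the given local name, or null
     * if there is none. The element is assumed to contain text only.
     */
    static String firstText(byte[] xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(elementName)) {
                        // Returning here stops the parse; later events are never requested.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

An application that routes messages on a header value could call `firstText` on each incoming document and forward or discard it without processing the rest of the content.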

Like EXML, XPP uses an integrated parser to build the document representation from a text document, and it provides no way to convert to or from a DOM or a SAX2 event stream. XPP is open source with an Apache-style license. The version used for the performance comparisons is PullParser 2.0.1 Beta 8 (JAR file size 0.04 MB).

Test details

The timing results shown are from tests using Sun Microsystems Java version 1.3.1 with the Java HotSpot Client VM 1.3.1-b24, running under Red Hat Linux 7.1 on an Athlon 1 GHz system with 256 MB of RAM. Both the initial and maximum JVM memory sizes were set to 128 MB for these tests, which I intended to be representative of a server-type execution environment.

In tests run with the default JVM memory settings of 2 MB initial and 64 MB maximum, results were much worse for the models with larger JAR file sizes (DOM, JDOM, and dom4j), especially for the average times across test runs. This is probably due to inefficient operation of the HotSpot JVM with limited memory.

Two of the document models (XPP and EXML) support direct input from a String or a character array. Direct input of this type is not representative of real applications, so I avoided it in these tests. For both input and output I used Java streams wrapping byte arrays, which eliminates the effect of I/O on performance while preserving the stream interfaces that applications typically use for XML document input and output.
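A minimal sketch of that input arrangement, assuming the JAXP DOM builder bundled with current JDKs rather than any of the specific libraries tested: the document bytes are loaded once, outside the timed region, and each build reads them through a stream. The class name `InMemoryInput` is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Hypothetical example: builds a DOM from bytes already held in memory. */
final class InMemoryInput {
    /** Parses the bytes through a stream interface, so no file or network I/O is timed. */
    static Document build(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // The parser sees an ordinary InputStream, as it would in an application.
            InputStream in = new ByteArrayInputStream(xmlBytes);
            return factory.newDocumentBuilder().parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("build failed", e);
        }
    }
}
```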

Performance comparisons

The performance comparisons in this article are based on parsing and working with a selected set of XML documents, chosen to represent a broad range of applications:

    • much_ado.xml: a Shakespeare play marked up as XML. No attributes and a fairly simple structure (202 KB).
    • periodic.xml: the periodic table of the elements in XML. Some attributes, also fairly simple (117 KB).
    • soap1.xml: taken from a standard sample SOAP document. Heavy on namespaces and attributes (0.4 KB, repeated 49 times for each test).
    • soap2.xml: a list of values in SOAP document form. Heavy on namespaces and attributes (134 KB).
    • nt.xml: the New Testament marked up as XML. No attributes, a very simple structure, and a large amount of text content (1047 KB).
    • xml.xml: the XML specification, with the DTD reference removed and all entities defined internally. Presentation-style markup with a lot of mixed content and some attributes (160 KB).

For more information on the test platform, see the Test details sidebar, and see References for a link to the source code used for the tests.

Except for the very small soap1.xml document, all timings are for a single pass of the particular test over the document. For soap1.xml, each timing covers 49 consecutive passes over the document (enough copies to total about 20 KB of text).

The test framework runs a particular test on a document a number of times (10 for the results shown here), tracking the best and average times for that test, and then moves on to the next test on the same document. Once the full sequence of tests is complete for one document, it repeats the process for the next document. To prevent interactions between the document models, only one model is tested in each execution of the test framework.

Timing benchmarks for HotSpot and similar dynamically optimizing JVMs are notoriously tricky; small changes in the test sequence often cause large changes in the timing results. I have found this to be true of the average time for executing a particular piece of code; the best time is much more consistent, and it is the value shown in these results. See the comparison of average and best times for the first test (Document build time).
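The measurement loop described above might look roughly like the following. This is a sketch, not the article's test framework: the names `TimingHarness`, `Result`, and `time` are invented here, and `System.nanoTime` is a later API than the Java 1.3.1 runtime used for the published results.

```java
/** Hypothetical example: repeats a test and records best and average elapsed time. */
final class TimingHarness {
    /** Best and average elapsed time, in nanoseconds, over a series of passes. */
    static final class Result {
        final long bestNanos;
        final long averageNanos;

        Result(long bestNanos, long averageNanos) {
            this.bestNanos = bestNanos;
            this.averageNanos = averageNanos;
        }
    }

    /** Runs the test the given number of times, timing each pass separately. */
    static Result time(Runnable test, int passes) {
        if (passes <= 0) {
            throw new IllegalArgumentException("passes must be positive");
        }
        long best = Long.MAX_VALUE;
        long total = 0;
        for (int i = 0; i < passes; i++) {
            long start = System.nanoTime();
            test.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed); // least disturbed by GC and JIT compilation
            total += elapsed;
        }
        return new Result(best, total / passes);
    }
}
```

Reporting `bestNanos` corresponds to the choice made in this article: one slow pass caused by garbage collection or compilation raises the average but leaves the best time unchanged.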

Document build time

The document build time test measures the time required to parse a text document and construct the document representation. For comparison, the chart also includes SAX2 parse times using the Crimson and Xerces SAX2 parsers, since most of the document models (all except EXML and XPP) use a SAX2 parse event stream as input to the document build process. Figure 1 shows the test results.

Figure 1. Document build time

For most of the test documents, the XPP pull build time is too short to register (because in this case the document is not actually parsed); it shows up only for soap1.xml, and even there as a very short time. For this file, the memory size of the pull parser and the associated creation overhead make XPP pull look relatively slow. This is because the test program creates a new pull-parser instance for each copy of the document being parsed, and for soap1.xml each timing uses 49 copies. The overhead of allocating and initializing these parser instances is greater than the time most of the other approaches need to repeatedly parse the text and build the document representation.

In an email discussion, the author of XPP pointed out that a real application could pool pull-parser instances for reuse. In that case, the overhead for the soap1.xml file would drop to a negligible level. For larger files, the pull-parser creation overhead is negligible even without pooling.

XPP (with full parsing), Xerces with deferred node creation, and dom4j all show about the same overall performance in this test. Xerces deferred does especially well on the larger documents, but it has high overhead on small documents, much higher even than the regular Xerces DOM. Deferred node creation also carries a high cost the first time a part of the document is used, which reduces the advantage of the fast parse.

For the small soap1.xml file, Xerces in all forms (SAX2 parse, regular DOM, and deferred DOM) suffers from high overhead. XPP (with full parsing) does especially well on this file. For soap1.xml, EXML may even outperform the SAX2-based models; overall, though, it is the worst performer in this test, despite the advantage of discarding isolated whitespace content.

Document traversal time

The document traversal time test measures the time required to walk the constructed document representation, visiting each element, attribute, and text content segment in document order. It is intended to represent the performance of the document model interface, which can be important for applications that repeatedly access information from a parsed document. In general, traversal times are much shorter than parse times. For applications that make only a single pass through each parsed document, parse time matters more than traversal time. Figure 2 shows the results.

Figure 2. Document traversal time

XPP far outperforms the other models in this test. The Xerces DOM takes about twice as long as XPP. EXML takes almost three times as long as XPP, even though it has the advantage of discarding isolated whitespace content in the documents. dom4j falls in the middle of the pack.

With XPP pull, the document text is not actually parsed until the document is accessed. This results in very high overhead the first time the document is traversed (not shown in the figure). If the entire document is eventually accessed, XPP pull shows a net performance loss: the total time for the first two tests is longer with pull parsing than with regular XPP parsing (20% to 100% longer, depending on the document). However, the pull-parser approach still offers a substantial performance advantage when the parsed document is not fully accessed.

Xerces with deferred node creation shows similar behavior, with a performance penalty the first time the document representation is accessed (not shown in the figure). In the case of Xerces, though, the node-creation overhead is about equal to the time saved relative to the regular DOM build during parsing. For larger documents, the total time for the first two tests with Xerces deferred is roughly the same as with the regular Xerces DOM build. If you use Xerces on large documents (perhaps 10 KB or larger), deferred node creation appears to be a good choice.
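For the standard DOM interfaces, a traversal of the kind measured in this test could be sketched as follows. The class name `DomTraversal`, its `Counts` holder, and the `parse` helper are invented for the example; each of the other models would need equivalent code written against its own API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

/** Hypothetical example: walks a DOM tree in document order and tallies what it visits. */
final class DomTraversal {
    /** Totals gathered during a walk. */
    static final class Counts {
        int elements;
        int attributes;
        int textChars;
    }

    static Counts walk(Node start) {
        Counts counts = new Counts();
        visit(start, counts);
        return counts;
    }

    private static void visit(Node node, Counts counts) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                counts.elements++;
                counts.attributes += node.getAttributes().getLength();
                break;
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                counts.textChars += node.getNodeValue().length();
                break;
            default:
                break; // comments, processing instructions, and so on are skipped
        }
        // Visiting children first-to-last at every level yields document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            visit(child, counts);
        }
    }

    /** Convenience parser so the example is self-contained. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```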

Document modification time

This test measures the time required to systematically modify the constructed document representation; Figure 3 shows the results. The test walks the representation, deleting all isolated whitespace content and wrapping each non-whitespace content string in a newly added element. It also adds an attribute to each element of the original document that contains non-whitespace content. The test is intended to represent the performance of the document models across a range of document modifications. As with traversal, modification times are much shorter than parse times, so parse time matters more for applications that make only a single pass through each parsed document.

Figure 3. Document modification time

EXML comes out on top in this test, but it has a performance advantage over the other models because it always discards isolated whitespace content during parsing. This means there is nothing to delete from the EXML representation during the test.

XPP is second only to EXML in modification performance, and unlike the EXML test, the XPP test does include the deletions. The Xerces DOM and dom4j fall near the middle, while JDOM and the Crimson DOM again perform worst.
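The three modifications can be sketched against the standard DOM interfaces as shown below. The element name `wrap` and the attribute name `marked` are placeholders chosen for this example, and the class `DomModifier` is not taken from the article's test code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

/** Hypothetical example: deletes whitespace text, wraps other text, marks its parent. */
final class DomModifier {
    static void modify(Document doc) {
        modify(doc, doc.getDocumentElement());
    }

    private static void modify(Document doc, Element element) {
        boolean hasText = false;
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // capture before the tree changes
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getNodeValue().trim().isEmpty()) {
                    element.removeChild(child); // isolated whitespace is deleted
                } else {
                    Element wrapper = doc.createElement("wrap"); // placeholder name
                    element.replaceChild(wrapper, child);
                    wrapper.appendChild(child);
                    hasText = true;
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                modify(doc, (Element) child);
            }
            child = next;
        }
        if (hasText) {
            element.setAttribute("marked", "true"); // placeholder attribute
        }
    }

    /** Convenience parser so the example is self-contained. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Because the next sibling is captured before each change, the newly added wrapper elements are never visited themselves, so the attribute is added only to elements of the original document.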

Text generation time

This test measures the time required to output a document representation as a text XML document; Figure 4 shows the results. This step is likely to be a significant part of overall performance for any application that does not exclusively consume XML documents, especially because the time needed to output a document as text is generally close to the time needed to parse it on input. To make these times directly comparable, the test uses the original documents rather than the modified documents produced by the previous test.

Figure 4. Text generation time

The text generation time test shows less difference between the models than the previous tests. The Xerces DOM has the best performance, though not by a large margin, and JDOM has the worst. EXML performs better than JDOM, but again this is because EXML discards whitespace content.

Many of the models offer options to control the format of text output, and some of these options appear to affect text generation time. This test uses only the most basic output format for each model, so the results show only default performance, not the best possible performance.
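For a DOM representation, text generation with default formatting could be sketched with the JAXP identity transformer, as below. This is one of several ways to write a DOM as text and is not necessarily the mechanism used in the tests; the class name `TextOutput` is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Hypothetical example: writes a DOM as XML text into an in-memory byte stream. */
final class TextOutput {
    static byte[] toBytes(Document doc) {
        try {
            // A transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException("output failed", e);
        }
    }

    /** Convenience parser so the example is self-contained. */
    static Document parse(byte[] xmlBytes) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Output properties such as `OutputKeys.INDENT` are the kind of formatting option mentioned above; leaving them at their defaults corresponds to the basic output format used in this test.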

Document memory size

This test measures the memory space used by the document representation. This is especially important for developers working with large documents or with many smaller documents at the same time. Figure 5 shows the results.

Figure 5. Document memory size

The memory size results differ from the timing tests in that the value shown for the small soap1.xml file represents a single copy of the file rather than the 49 copies used in the timing measurements. For most models, the memory used for this small document is too small to show at the scale of the chart.

Except for XPP pull (where the document representation is not built until it is accessed), the differences between the models in the memory size test are relatively small compared with the differences shown in some of the timing tests. Xerces deferred has the most compact representation (which expands to the basic Xerces size when the representation is first accessed), followed by dom4j. EXML has the least compact representation, even though it discards whitespace content that the other models include.

Even the most compact models take about four times the size of the original document text (in bytes), so all of them appear to need too much memory for very large documents. XPP pull and dom4j offer the best support for very large documents by providing ways to work with partial document representations. XPP pull does this by building only the part of the representation that is actually accessed, while dom4j supports event-based processing so that only part of the document is built or processed at a time.

Java serialization

These tests measure the time and output size for Java serialization of the document representations. This is mainly of interest for applications that use Java RMI (Remote Method Invocation) to transfer representations between Java programs, including EJB (Enterprise JavaBeans) applications. Only the models that support Java serialization are included in these tests. The following three figures show the results.

Figure 6. Serialization output time

Figure 7. Serialization input time

Figure 8. Serialized document size

dom4j shows the best serialization performance for both output (generating the serialized form) and input (rebuilding the document from the serialized form), while the Xerces DOM shows the worst. EXML's times are close to dom4j's, but EXML again has the advantage of working with fewer objects in its representation because it discards whitespace content.

All of the results, for both time and size, would be much better if the document were output as text and then parsed to rebuild it, rather than using Java serialization. The problem lies in the structure of an XML document representation, which consists of a large number of unique small objects. Java serialization does not handle this type of structure efficiently, resulting in high overhead in both time and output size.

It is possible to design a serialization format for documents that is both smaller than the text representation and faster than text input and output, but only by bypassing Java serialization. (I have a project that implements such a custom serialization for XML documents, available as open source on my company's Web site; see References for details.)
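The serialization measurements could be gathered with a helper along these lines, which works for any object that implements `java.io.Serializable`, including the document representations of the models that support it. The class name `SerializationProbe` is invented for the example, and nothing here is specific to a particular document model.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical example: round-trips an object through Java serialization in memory. */
final class SerializationProbe {
    /** Returns the serialized form; its length is the serialized size in bytes. */
    static byte[] serialize(Serializable object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // the stream was flushed when it closed
    }

    /** Rebuilds the object from its serialized form. */
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class not available", e);
        }
    }
}
```

Comparing `serialize(representation).length` with the byte length of the same document written as text gives the size comparison discussed above, and timing the two methods gives the output and input times.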

Conclusion

The different Java XML document models each have their own strengths, but from a performance standpoint some are clear winners.

XPP is the performance leader in most respects. Although XPP is a new model, it is an excellent choice for middleware-type applications that do not need validation, entities, processing instructions, or comments. It is especially well suited to applications that run in a browser or in memory-constrained environments.

dom4j does not match the speed of XPP, but it delivers very good performance with a more standardized and fully featured implementation, including built-in support for SAX2, DOM, and even XPath. The Xerces DOM (with deferred node creation) also does well on most measurements, although it performs poorly on small files and in Java serialization. Both dom4j and the Xerces DOM are good choices for general XML handling; the choice between them depends on whether Java-specific features or cross-language compatibility matters more to you.

JDOM and the Crimson DOM consistently rank poorly in the performance tests. The Crimson DOM may be worth considering for small documents, where Xerces does poorly. JDOM has little to recommend it from a performance standpoint, although its developers have said they intend to focus on performance before the official release. Even so, JDOM may find it difficult to match the other models without some restructuring of its API.

EXML is very small (in terms of JAR file size) and does well in some of the performance tests. Even with the advantage of deleting isolated whitespace content, though, it falls short of XPP in performance. Unless you need a feature that EXML supports and XPP lacks, XPP is probably the better choice in memory-constrained environments.

None of the current models offers good performance for Java serialization, although dom4j does best. If you need to transfer document representations between programs, your best option is generally to write the document out as text and parse it to rebuild the representation. In the future, custom serialization formats may offer a better alternative.

The uses of whitespace

The XML specification generally requires whitespace to be preserved, but many XML applications use formats in which whitespace is there only for readability. For these applications, EXML's approach of discarding isolated whitespace works well.

Most of the documents used in these performance tests fall into the "whitespace for readability" category. They are formatted for human viewing, with at most one element per line. As a result, the insignificant whitespace content strings actually outnumber the elements in the documents. This adds a great deal of unnecessary overhead to every processing step.

An option to trim this type of whitespace on input would help the performance of all the document models (except EXML) for applications in which whitespace can be ignored. As long as trimming is only an option, it would not affect applications that need whitespace fully preserved. Support at the parser level would be even better, because the parser has to process the input characters one by one anyway. Either way, this type of option would be very helpful to many XML applications.

Coming next...

I have covered the basic features of several document models and shown performance measurements for several types of document operations. Keep in mind, though, that performance is only one factor in choosing a document model. For most developers, usability is at least as important as performance, and the different APIs these models use each offer reasons to like or dislike them.

The follow-up article will focus on usability, where I will compare sample code that performs the same operations in these different models. Check back for the second part of this comparison. In the meantime, you can share your comments and questions about this article through the forum link below.

References

    • Participate in the discussion forum on this article.
    • If you need background, try the developerWorks tutorials XML programming in Java, Understanding SAX, and Understanding DOM.
    • Download the test program and the document model libraries used for this article from the download page.
    • Check for updated tests and test results on the home page for the test program.
    • Get the author's details on XML Serial (XMLS) encoding as an alternative to Java serialization.
    • Research or download the Java XML document models discussed in this article:
      • Xerces Java
      • Crimson
      • JDOM
      • dom4j
      • Electric XML (EXML)
      • XML Pull Parser (XPP)
    • IBM WebSphere Application Server includes the XML4J parser, which is based on Xerces Java. You can find how-to information on the product's XML support in the WAS Advanced Edition 3.0 online documentation.

About the author

Dennis Sosnoski (dms@sosnoski.com) is the founder and lead consultant of Sosnoski Software Solutions, Inc., a Java consulting company in Seattle. He has over 30 years of professional software development experience. In recent years he has concentrated on server-side Java technologies, including servlets, Enterprise JavaBeans, and XML. He has given many presentations on Java performance issues and on server-side Java technology in general. He is also chairman of the Seattle Java-XML SIG.

