Improve performance in XML applications, Part 1

Today, XML plays an important role in many performance-critical scenarios. While many developers know how to write XML documents, XML schemas, or DTDs, some may not realize that the performance of an XML application depends on decisions made when constructing an XML document and on which features are set on the parser before the document is parsed.

Many developers also know when to use SAX and when to use the DOM API. Typically, if memory is limited and your application must handle large documents, or if you want to build your own in-memory representation, SAX is the better choice. On the other hand, if your application needs random access to document data, needs to modify it, implements complex searches, or traverses the document tree multiple times, DOM is usually the better fit. This article explains which SAX and DOM operations and features affect application performance and describes how to write well-performing applications.

Writing XML Documents

Developers who write XML documents can do a variety of things to improve the performance of XML applications.

Each XML document can specify a character encoding in its XML declaration. For optimal performance, use US-ASCII ("US-ASCII") as the encoding when writing XML documents. Documents written in ASCII are the quickest to parse, because every character is guaranteed to be a single byte and maps directly to the corresponding Unicode value. If a document is encoded in UTF-8 but contains only ASCII characters, some parsers (such as Xerces2) process it in much the same way as an equivalent US-ASCII document. For documents that contain Unicode characters outside the ASCII range, the parser must read and convert a multibyte sequence for each such character, and this conversion costs some performance. UTF-16 reduces that cost to some extent because, assuming no surrogate characters, every character takes exactly two bytes. However, a document that is mostly ASCII roughly doubles in size when encoded in UTF-16, so it takes more time to read and parse.
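The size trade-off between encodings can be checked with a few lines of Java. This is only a sketch: the class and method names (EncodingSizes, encodedLength) are made up for illustration, and it measures encoded size only, not parse time.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: compares the encoded size of the same XML text. */
final class EncodingSizes {

    /** Returns the number of bytes needed to encode the text in the named charset. */
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String asciiOnly = "<greeting>hello</greeting>";
        // UTF-16BE is used instead of UTF-16 so that no byte order mark is counted.
        for (String name : new String[] {"US-ASCII", "UTF-8", "UTF-16BE"}) {
            System.out.println(name + ": " + encodedLength(asciiOnly, name) + " bytes");
        }
    }
}
```

For ASCII-only text, US-ASCII and UTF-8 produce the same byte count, while UTF-16 produces twice as many bytes.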

You can also improve performance by reducing the number of line breaks and spaces in the document. For editing purposes, developers typically break documents into lines using carriage returns (#xD) and line feeds (#xA). The XML parser must convert every two-character sequence #xD #xA, and every #xD not followed by #xA, to a single #xA character. This conversion is not free. Its overall impact on parsing depends on the proportion of line breaks relative to the total number of characters in the document. The same is true of spaces: every space you add is one more character for the parser to handle, which ultimately affects parsing performance.
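The line-break normalization described above can be observed directly through SAX. The following is a minimal sketch using the JAXP parser bundled with the JDK; the class and method names (LineEndingDemo, characterData) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: shows the character data a SAX parser reports. */
final class LineEndingDemo {

    /** Parses the XML text and returns all character data reported by the parser. */
    static String characterData(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }
}
```

Passing "<a>x\r\ny\rz</a>" to characterData returns the text with both the #xD #xA pair and the lone #xD replaced by a single #xA.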

In addition, avoid using namespaces in your application unless they are really necessary. Processing a document with namespace support enabled slows down the processing of the entire document: the parser must not only process namespace declarations and verify their correctness, but also ensure that the document is namespace well-formed.
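On the parser side, namespace processing is controlled in JAXP through SAXParserFactory.setNamespaceAware. The sketch below (NamespaceDemo and rootNamespaceUri are illustrative names, not from the article) parses the same document with the setting on and off and returns the namespace URI reported for the root element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: toggles namespace processing on a JAXP SAX parser. */
final class NamespaceDemo {

    /** Returns the namespace URI the parser reports for the root element. */
    static String rootNamespaceUri(String xml, boolean namespaceAware) {
        String[] rootUri = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (rootUri[0] == null) {
                    rootUri[0] = uri; // remember only the first (root) element
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware); // false skips namespace processing
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return rootUri[0];
    }
}
```

With namespace awareness turned off, the parser reports an empty namespace URI and the application sees only the qualified name (for example "p:a").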

If your application does not need validation, its documents should not include a <!DOCTYPE ...> declaration. According to the XML specification, a validating processor (such as Xerces2) must process the internal and external DTD subsets to obtain information about default attributes, attribute types, and so on. Even if the validation feature is turned off, the processor will still process the DTD.
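One way to see that the DTD is still processed when validation is off is to look for a default attribute declared in the internal subset. This sketch assumes the JAXP DOM parser shipped with the JDK; DoctypeDemo and rootAttribute are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: reads a root attribute using a non-validating parser. */
final class DoctypeDemo {

    /** Returns the value of the named attribute on the root element, or "" if absent. */
    static String rootAttribute(String xml, String attributeName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // validation off; the DTD is still read
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(attributeName);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

For the input `<!DOCTYPE a [<!ATTLIST a k CDATA "v">]><a/>` the attribute k is reported with its default value even though the document never sets it, which shows the DTD work being done on every parse.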

If your application does need validation, keep in mind that processing and validating against a DTD is typically cheaper than processing and validating against an XML Schema. In addition, avoid using a large number of external entities, such as external DTDs or imported XML schemas, because opening and reading files is a costly operation. Also avoid using too many default attributes, because they increase validation time. The redefine construct and identity constraints of XML Schema should also be avoided where possible, as both slow down validation.
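When a document cannot be changed and still refers to an external DTD, the cost of fetching it can be reduced on the parser side with a SAX entity resolver. The sketch below is an assumption-laden illustration, not the article's own code: InMemoryResolverDemo and requestedSystemIds are made-up names, and the resolver simply answers every request with an empty in-memory entity, which is only appropriate when the DTD content is not needed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: serves external entities from memory instead of the network. */
final class InMemoryResolverDemo {

    /** Parses the XML text and returns the system IDs the parser asked to resolve. */
    static List<String> requestedSystemIds(String xml) {
        List<String> requested = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                requested.add(systemId);
                // Assumption: an empty entity is acceptable here; a real resolver
                // might return a cached copy of the DTD instead.
                return new InputSource(new StringReader(""));
            }
        };
        try {
            // DefaultHandler also acts as the EntityResolver for this parse.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return requested;
    }
}
```

Because the resolver returns an in-memory source, the parser never opens the URL named in the DOCTYPE declaration.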

Common SAX Performance Tips

Choosing SAX over a more memory-intensive API (such as DOM) can improve the performance of your application. However, there are a number of things you can do to improve performance further. Try the following tips to improve the performance of your SAX application:

Use string interning for XML names.

Switch the content handler among multiple handlers when working with large documents.

Load external entities with an entity resolver.

Avoid processing external entities.
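The handler-switching tip can be sketched as follows. SAX allows an application to register a different content handler in the middle of a parse, so a small outer handler can hand a section of the document to a specialized handler and take control back afterwards. All names here (HandlerSwitchDemo, SectionHandler, collectSection) are hypothetical, and the sketch assumes the section element is not nested inside itself.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: delegates one section of a document to its own handler. */
final class HandlerSwitchDemo {

    /** Collects character data inside one section, then returns control to the parent. */
    private static final class SectionHandler extends DefaultHandler {
        private final XMLReader reader;
        private final ContentHandler parent;
        private final String sectionName;
        private final StringBuilder out;

        SectionHandler(XMLReader reader, ContentHandler parent, String sectionName, StringBuilder out) {
            this.reader = reader;
            this.parent = parent;
            this.sectionName = sectionName;
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals(sectionName)) {
                reader.setContentHandler(parent); // section finished: switch back
            }
        }
    }

    /** Returns the character data found inside the named section element. */
    static String collectSection(String xml, String sectionName) {
        StringBuilder out = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if (qName.equals(sectionName)) {
                        // Hand the section over to a dedicated handler.
                        reader.setContentHandler(new SectionHandler(reader, this, sectionName, out));
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return out.toString();
    }
}
```

Each handler stays small and only contains the logic for its own part of the document, which avoids long chains of state checks inside a single handler.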

String interning

SAX specifies a feature that is identified by the feature URI http://xml.org/sax/features/string-interning. When this feature is set to true, it instructs the parser to report XML names (such as the names of elements and attributes) and namespace URIs as interned strings, that is, strings obtained by calling java.lang.String.intern.

To speed up string equality tests, you can turn this feature on. Instead of calling the equals() method on each string to be compared, you can compare a name reported by the parser with a string constant by reference. If you use the XML names reported by the parser as keys in a hash table, interned java.lang.String objects can also shorten lookup time when the table calls the hashCode method. Although the Javadoc does not specify it, implementations of hashCode usually cache the hash code in the object after computing it. Once the hash code has been computed, retrieving it again for an interned string is very cheap.

Some parser implementations may not support the string-interning feature. Xerces2 uses interned strings internally to achieve faster comparisons, so this feature is always turned on.
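A minimal sketch of how an application might use the feature is shown below. The class and method names (InterningDemo, tryEnableInterning, countItems) are hypothetical. The code falls back to equals() when the parser does not recognize or support the feature, since reference comparison is only safe when names are interned.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: compares element names by reference when interning is on. */
final class InterningDemo {
    static final String STRING_INTERNING = "http://xml.org/sax/features/string-interning";

    // String literals are interned by the JVM, so this constant can be compared by reference.
    private static final String ITEM = "item";

    /** Tries to turn string interning on; returns false if the parser refuses. */
    static boolean tryEnableInterning(XMLReader reader) {
        try {
            reader.setFeature(STRING_INTERNING, true);
            return reader.getFeature(STRING_INTERNING);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    /** Counts the "item" elements in the XML text. */
    static int countItems(String xml) {
        int[] count = new int[1];
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            boolean interned = tryEnableInterning(reader);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    // Reference comparison is deliberate and valid only for interned names.
                    boolean isItem = interned ? qName == ITEM : ITEM.equals(qName);
                    if (isItem) {
                        count[0]++;
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }
}
```

The saving per comparison is small, so this pattern pays off mainly in handlers that test many element names for every event in a large document.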
