Disclaimer: This article is reposted from Http://www.blogjava.net/orangelizq/archive/2009/07/19/287330.html, with some changes to the original text.
Abstract: XML has been one of the most popular and widely used technologies of the past decade, and XML parsing is the key to any XML application. This article introduces the research trends in XML parsing, analyzes and compares the advantages and disadvantages of four kinds of XML parsing techniques, and summarizes the principles for choosing an appropriate technique when designing an application.
1 Introduction
XML (Extensible Markup Language) is a metalanguage, that is, a language for describing other languages, defined by the World Wide Web Consortium (W3C). Its design derives from SGML (Standard Generalized Markup Language): XML is a subset of SGML intended to make it easy to exchange structured documents over the Internet. Put simply, XML is a set of rules and conventions for describing structured data in plain text. The W3C began standardizing XML in 1996 and published XML 1.0 on February 10, 1998.
The emergence of XML has had a major impact on distributed computing, and its strength comes from data independence. XML is a pure description of data, independent of programming language, operating system, and transport protocol. This frees data from the constraints of code-centric infrastructure and lets it flow more freely across the Web.
However, XML is only a format for encoding data as plain text. To use the data encoded in an XML file, an application must first parse it out of that text, which requires a parser that can recognize the information in an XML document, interpret it, and extract the data. Different data extraction needs have led to several parsing approaches, each with its own advantages, disadvantages, and suitable environments. Choosing an appropriate XML parsing technique can noticeably improve application performance, so it is important to understand and distinguish the different techniques.
2 Analysis of XML parsing technologies
All XML processing starts with parsing. Whether you use XSLT or the Java language, the first step is to read the XML file, decode its structure, and retrieve its information. That step is parsing: it turns the unstructured sequence of characters that represents an XML document into structured components that conform to XML syntax.
2.1 Classification of XML parsing techniques
Based on how easily data can be obtained from the XML, on performance, and on the data model that parsing produces, XML parsing techniques can be broadly grouped into four categories:
(1) Document-oriented streaming parsing (SAX, StAX);
(2) Document-oriented object parsing (DOM);
(3) Document-oriented pointer parsing (VTD-XML);
(4) Application-oriented object parsing (XML data binding, comparable to what Hibernate does for databases).
These four kinds of parsing techniques sit at different levels of abstraction, suit different application scenarios, and each has its own advantages and disadvantages. Choosing the technique that matches the application's requirements can reduce memory consumption, shorten processing time, make data easier to obtain, and improve the overall performance of the application.
2.2 Document-oriented streaming parsing technology
Streaming parsing is an event-based process. The parser reads the XML document sequentially, generates a corresponding stream of events such as the start and end of an element, and passes each event to an event handler. Handlers can process these events in whatever way the application requires.
Streaming parsing treats an XML document as a stream of data, so, like streaming media, it can start processing data immediately instead of waiting for the whole document to be read. And because the application examines data only as it is read, the entire document never has to be loaded into memory at once, which makes streaming parsing efficient in both time and space for large documents. The price of this efficiency is reduced ease of use: streaming parsers are more complex to program against, and the programmer is responsible for more of the work. Because the parser does not store the data in any form, the application cannot change the data or move backwards in the stream. Combined with its single-pass nature, this means streaming parsing does not support random access either.
Streaming parsing comes in two forms: push parsing (SAX) and pull parsing (StAX). The main difference is whether the parser or the application controls the read loop, that is, the loop that reads the document.
2.2.1 Push parsing (SAX)
SAX (Simple API for XML) is a push parsing technique: the parser controls the read loop and does not return control to the application until the end of the document. The application processes data through callbacks invoked by the parser.
SAX provides a simple, event-driven API for working with XML. Its design grew out of discussions among members of the xml-dev mailing list. The first draft of the interface was released in January 1998 and led to SAX 1.0 in May 1998; SAX 2.0 followed in May 2000, and the latest version is SAX 2.0.2, released in April 2004. SAX is not endorsed by any official standards body and is not maintained by the W3C or any other official organization (it is currently maintained by David Megginson), but it is widely used and is regarded as a de facto standard in the XML community. SAX was originally defined for Java, but it is also available in other languages such as Python, Perl, and C++.
Being event-driven, a SAX parser generates a stream of events as it reads an XML document, for example element start and end tags, element content, entities, and parsing errors, and dispatches each event to the corresponding method of a callback event handler. For the following simple XML document, the resulting events are shown in Figure 1. Note that a text event is also generated for whitespace, such as spaces or line breaks, inside an element.
The core event handler in SAX is a class that implements the ContentHandler interface. This interface defines methods for handling events related to the XML document itself, such as startDocument, endDocument, startElement, endElement, and characters.
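As a rough illustration of the callback style described above, the following sketch logs the events that a SAX parser reports. It relies on the JDK's built-in SAXParserFactory and on DefaultHandler, a convenience class that implements ContentHandler with empty methods. The class name SaxEventLogger and the sample document are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records every callback it receives.
class SaxEventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Whitespace between tags is reported here too, and a parser is
        // allowed to split one run of text across several calls.
        events.add("characters:" + new String(ch, start, length).replace("\n", "\\n"));
    }

    // The parser owns the read loop: parse() returns only when the document ends.
    static List<String> parse(String xml) throws Exception {
        SaxEventLogger handler = new SaxEventLogger();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        parse("<order>\n<item>pen</item>\n</order>").forEach(System.out::println);
    }
}
```

Running main prints one line per callback, including the characters events produced by the line breaks between the tags.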
SAX has the advantages and disadvantages common to all streaming parsers. In addition, because the parser keeps control until the end of the document, it is hard for the application to stop parsing once it has the data it needs. It can terminate parsing by throwing an exception, but this is awkward, and parsing cannot be resumed afterwards. This limitation led to pull parsing, in which the application controls the process.
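The exception-based way of stopping a SAX parse mentioned above might look like the following sketch. The handler throws a private SAXException subclass as soon as it has the text of the first matching element, and the caller catches it. The names SaxEarlyStop, Found, and firstText are invented for this example; it assumes the JDK's default parser, which rethrows a SAXException raised by a handler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxEarlyStop {
    // Private signal used only to unwind out of the parser's read loop.
    static final class Found extends SAXException {
        final String value;

        Found(String value) {
            super("value found, parsing stopped on purpose");
            this.value = value;
        }
    }

    // Returns the text of the first element with the given name, or null if absent.
    static String firstText(String xml, String elementName) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder buffer; // non-null only while inside the target element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (buffer == null && qName.equals(elementName)) {
                    buffer = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (buffer != null) {
                    buffer.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (buffer != null && qName.equals(elementName)) {
                    throw new Found(buffer.toString()); // the only way to leave the loop early
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Found found) {
            return found.value;
        }
        return null; // the whole document was read without a match
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstText("<a><b>one</b><b>two</b></a>", "b"));
    }
}
```

Note how the application has to keep its own state (the buffer field) between callbacks, and that the parse cannot be resumed after the exception.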
2.2.2 Pull parsing (StAX)
StAX (Streaming API for XML) is a pull parsing technique: the application controls the read loop and repeatedly calls the parser to get the next event until the end of the document. Because the application keeps control of the parsing process, the calling code can be simplified to handle exactly the content it expects, and it can stop parsing at any time. And because this approach does not rely on handler callbacks, the application does not need to track the parser's state itself, as it must with SAX.
For the same XML document, StAX produces essentially the same event types as SAX. StAX, however, offers two APIs for processing XML, a cursor-based (pointer-based) API and an iterator-based API, which provide different levels of abstraction.
The cursor-based API simply returns the next event, represented as an integer code. It is a low-level API that provides no abstraction of the underlying XML structure: all state information is obtained directly from the stream reader and no additional objects need to be created, which saves memory and makes it highly efficient.
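A minimal sketch of the cursor (pointer) style, using the JDK's XMLStreamReader: the application drives the loop, events arrive as int constants, and all data is read from the reader itself. The class and method names are made up for this example, and getElementText is used on the assumption that the target element contains only text.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxCursorDemo {
    // Returns the text of the first element with the given name, or null if absent.
    static String firstText(String xml, String elementName) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next(); // events are plain int codes, not objects
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // Stopping early is just a return: the application owns the loop.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstText("<a><b>one</b><b>two</b></a>", "b"));
    }
}
```

Compared with stopping a SAX parse, no exception or handler state is needed here.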
The higher-level iterator-based API returns events as objects. Each object encapsulates the information belonging to the XML construct it represents, so it can be queried directly for that information, at the cost of extra object creation. Compared with the cursor-based API, the iterator-based API is more object-oriented and therefore fits more easily into modular architectures.
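For comparison, here is a small sketch of the iterator style using the JDK's XMLEventReader. Each event is an XMLEvent object that can be passed around and queried on its own. The class name StaxIteratorDemo and the method elementNames are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class StaxIteratorDemo {
    // Collects the local names of all elements in document order.
    static List<String> elementNames(String xml) throws Exception {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent(); // one object per event
                if (event.isStartElement()) {
                    // The event object itself carries the element's information.
                    StartElement start = event.asStartElement();
                    names.add(start.getName().getLocalPart());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<a><b/><c>x</c></a>"));
    }
}
```

Because events are self-contained objects, they can be handed to other components without sharing the reader, which is what makes this API convenient in modular designs.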
StAX is also defined for Java. StAX 1.0 was released in March 2004 as the JSR-173 specification, and the latest version is StAX 1.2, released in June 2006. As the newest standard for processing XML in Java, StAX is more powerful and more widely used than the earlier XPP (XML Pull Parser).
2.3 Document-oriented object-based parsing technology
Because streaming parsing inherently cannot change data and does not support random access, it is hard for an application to search, modify, add to, or delete from an XML document without building its own model of the document structure. Document-oriented object parsing, represented by DOM, was created to solve these problems.
The goal of DOM (Document Object Model) is to model an XML document in a platform- and language-independent way, providing an interface that can be used across programming languages, operating systems, and applications. DOM originated as a way for web browsers to identify and manipulate page elements; this functionality, which predates the W3C's involvement, is known as "DOM Level 0". In October 1998 the W3C published the "DOM Level 1" recommendation, supporting XML 1.0 and HTML. The "DOM Level 2" recommendation followed in November 2000, extending Level 1 with support for namespaces and CSS, user interface and tree manipulation events, and additional tree manipulation operations. The latest recommendation, "DOM Level 3", whose Core module was published in April 2004, adds support for DTDs, XML schemas, and XPath to Level 2.
As an object-based parsing technique, DOM defines a hierarchical object model to represent XML documents. It defines a class for each concept in XML syntax, such as elements, attributes, entities, and documents, and as the parser reads the XML document it maps each syntactic construct one-to-one onto an instance of the corresponding class. The resulting model is a tree: DOM treats an XML document as a tree of nodes, each representing a component of the document. DOM has five basic node objects: (1) the Document object, the root of the tree and the entry point for operations on the whole document; (2) Element and Attr objects, which map the elements and element attributes of the document; (3) the Text object, a child node of an Element or Attr that holds the text content of the element or attribute; and (4) the NodeList object, which is used to traverse nodes in a specified order.
For example, the DOM node tree for the XML document in 2.2.1 is shown in the following figure. Note that whitespace inside an element, such as spaces or line breaks, is represented as a Text object.
Because DOM builds a tree of the complete XML document in memory, developers can conveniently traverse the document and add, delete, or modify its content, with good navigation support. DOM's object model also fits naturally with object-oriented programming. However, DOM must read the entire document and build the tree in memory before any data can be used, so it consumes a lot of memory, and performance degrades quickly for large documents. It also has to parse the whole document at once and cannot parse only part of it, so it is very inefficient when the application needs only a small portion of the data. (The AXIOM object model in the Axis2 project supports partial parsing of XML documents and can build an incomplete node tree, but it is more complex.)
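The kind of in-memory editing that DOM makes possible can be sketched with the JDK's standard DOM and transformer classes. The example below removes the first item element, appends a new one, and serializes the tree again; it also counts the child nodes of the root to show that whitespace becomes Text nodes. The class name, method names, and the element name "item" are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomEditDemo {
    // The whole document is read and turned into a node tree before we get it back.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Number of direct children of the root, including whitespace Text nodes.
    static int rootChildCount(String xml) throws Exception {
        return parse(xml).getDocumentElement().getChildNodes().getLength();
    }

    // Removes the first <item>, appends <item>ink</item>, and serializes the result.
    static String edit(String xml) throws Exception {
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();

        NodeList items = root.getElementsByTagName("item");
        if (items.getLength() > 0) {
            Node first = items.item(0);
            first.getParentNode().removeChild(first); // delete a node anywhere in the tree
        }

        Element added = doc.createElement("item");
        added.setTextContent("ink");
        root.appendChild(added); // add a node

        // Every change requires serializing the whole tree back to text.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(edit("<order><item>pen</item><item>paper</item></order>"));
        System.out.println(rootChildCount("<a>\n<b/>\n</a>"));
    }
}
```

Random access and modification are straightforward here, but only after the complete tree has been built, which is the memory cost discussed above.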
Because DOM is language-independent, mapping the DOM interfaces onto the data structures of a particular language introduces unnecessary complexity and fails to exploit the strengths of that language. As a result, a number of language-specific object models similar to DOM have appeared. JDOM, for example, is a document object model designed specifically for Java: it uses concrete classes instead of interfaces, simplifies the API, and makes extensive use of the Java collection classes. dom4j began as a fork of JDOM; it adds support for XPath and XML schemas and lets a document be accessed through both the dom4j API and the standard DOM interfaces. All of these are document-oriented object parsing techniques.
2.4 Document-oriented pointer-based parsing technology
Document-oriented streaming parsing, described above, is efficient but hard to use, while object parsing is easy to use but inefficient; the two approaches sit at opposite extremes. The efficiency of both is limited mainly because they are extractive: during parsing they extract parts of the source document, generally as strings, and then build their structures in memory from those pieces. This inevitably creates and destroys a large number of objects. DOM also has an update efficiency problem (SAX does not support updates at all): every change requires the whole DOM model to be serialized back into an XML string, without reusing the original file. In other words, DOM does not support incremental updates. VTD-XML, a novel pointer-based parsing technique, was proposed to solve these problems.
VTD-XML is a non-extractive XML parsing method. It avoids DOM's excessive memory use and provides fast parsing and traversal, XPath support, and incremental updates. VTD-XML is an open source project that currently supports two platforms, Java and C. The first release was VTD-XML 0.5 in June 2004, version 1.0 was released in October 2005, and the latest version is VTD-XML 2.2, released in October 2007.
A VTD (Virtual Token Descriptor) is a 64-bit numeric value that records information about a token, such as its starting offset, length, nesting depth, and token type, as shown in Figure 3. Because every VTD has the same fixed length of 64 bits, VTDs can be stored in an array, an efficient structure that greatly improves performance. The VTD is the key to non-extractive parsing: it acts like a pointer to an element in the XML document and allows that element to be located quickly.
The token starting offset (the distance from the start of the XML document) takes 30 bits, so the largest file that can be parsed is 1 GB (2^30 bytes). The token length takes 20 bits, so a token can be at most 1 MB long. The token type takes 4 bits, which allows 16 token types.
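To make the bit budget concrete, the sketch below packs a token record into a single 64-bit long and reads the fields back. It uses a 30-bit offset (which matches the 1 GB limit mentioned above), a 20-bit length, and a 4-bit type. The 8-bit depth field and the order of the fields inside the long are assumptions chosen for this illustration and are not meant to reproduce the actual VTD-XML layout; the class name TokenRecord is also made up.

```java
// Illustrative 64-bit token record; field order and depth width are assumptions.
class TokenRecord {
    static final int OFFSET_BITS = 30; // up to 2^30 bytes, i.e. 1 GB documents
    static final int LENGTH_BITS = 20; // tokens up to 2^20 bytes, i.e. 1 MB
    static final int TYPE_BITS = 4;    // 16 token types
    static final int DEPTH_BITS = 8;   // assumed width for nesting depth

    static final int LENGTH_SHIFT = OFFSET_BITS;
    static final int TYPE_SHIFT = LENGTH_SHIFT + LENGTH_BITS;
    static final int DEPTH_SHIFT = TYPE_SHIFT + TYPE_BITS;

    static long mask(int bits) {
        return (1L << bits) - 1;
    }

    private static void check(String name, int value, int bits) {
        if (value < 0 || value > mask(bits)) {
            throw new IllegalArgumentException(name + " out of range: " + value);
        }
    }

    // Packs the four fields into one long; no object is allocated per token.
    static long pack(int offset, int length, int type, int depth) {
        check("offset", offset, OFFSET_BITS);
        check("length", length, LENGTH_BITS);
        check("type", type, TYPE_BITS);
        check("depth", depth, DEPTH_BITS);
        return ((long) depth << DEPTH_SHIFT)
                | ((long) type << TYPE_SHIFT)
                | ((long) length << LENGTH_SHIFT)
                | offset;
    }

    static int offset(long record) {
        return (int) (record & mask(OFFSET_BITS));
    }

    static int length(long record) {
        return (int) ((record >>> LENGTH_SHIFT) & mask(LENGTH_BITS));
    }

    static int type(long record) {
        return (int) ((record >>> TYPE_SHIFT) & mask(TYPE_BITS));
    }

    static int depth(long record) {
        return (int) ((record >>> DEPTH_SHIFT) & mask(DEPTH_BITS));
    }

    public static void main(String[] args) {
        long record = pack(1_000_000, 12, 3, 5);
        System.out.println(offset(record) + " " + length(record)
                + " " + type(record) + " " + depth(record));
    }
}
```

Because each record is a primitive long, a whole document's tokens fit in a long[] array, which is the property the text credits for the performance gain.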
To avoid extraction, VTD-XML reads the original XML file into memory as raw bytes without decoding it. It then parses the byte array to locate each token and records the token's starting offset, length, depth, and type as an entry in a VTD array. Subsequent traversal operates on the VTD array. When XML content needs to be extracted, VTD-XML looks up the VTD record, uses the position it holds to decode the corresponding bytes of the original array, and returns a string.
VTD-XML can also perform incremental updates efficiently. For example, to find and delete an element in a large XML document, you only need to find the element's VTD and remove it from the VTD array. The remaining VTDs are then used to write a new byte array. Because the removed VTD marks the position of the deleted element, that element does not appear in the new array. Writing the new byte array from the VTDs is essentially a byte array copy, so it is very efficient.
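The two ideas above, decoding content only on demand and deleting by copying bytes, can be sketched without any parser at all. The code below does not use the VTD-XML library; it only illustrates the principle on a raw byte array, given an offset and length of the kind a VTD record would hold. The class and method names are invented, and the offsets in main are found with String.indexOf purely for demonstration (a real parser would record them while scanning).

```java
import java.nio.charset.StandardCharsets;

class NonExtractiveDemo {
    // Decodes a slice of the original bytes only when the content is actually needed.
    static String decode(byte[] doc, int offset, int length) {
        return new String(doc, offset, length, StandardCharsets.UTF_8);
    }

    // "Deletes" a region by copying the bytes before and after it into a new array.
    static byte[] delete(byte[] doc, int offset, int length) {
        byte[] out = new byte[doc.length - length];
        System.arraycopy(doc, 0, out, 0, offset);
        System.arraycopy(doc, offset + length, out, offset, doc.length - offset - length);
        return out;
    }

    public static void main(String[] args) {
        String xml = "<a><b>x</b><c>y</c></a>";
        byte[] doc = xml.getBytes(StandardCharsets.UTF_8);

        // For this ASCII sample, character positions equal byte positions.
        String element = "<b>x</b>";
        int offset = xml.indexOf(element);
        int length = element.length();

        System.out.println(decode(doc, offset, length));
        System.out.println(new String(delete(doc, offset, length), StandardCharsets.UTF_8));
    }
}
```

The untouched parts of the document are never re-parsed or re-serialized; they are copied as bytes, which is why this kind of update is cheap compared with rewriting a whole DOM tree.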
VTD-XML thus addresses the shortcomings of the previous two approaches. Its design gives a small memory footprint and high efficiency when parsing XML documents, supports fast parsing and traversal, and provides XPath support. The emergence of VTD is a significant advance in XML parsing technology and is likely to have a strong influence on its development.
2.5 Application-oriented object-based parsing technology
The three parsing techniques discussed so far all process and model documents from the XML point of view. That suits applications that mainly care about the XML structure of a document, but many applications use XML only as a medium for exchanging data and care more about the data itself. For these, application-oriented object parsing (also called XML data binding) lets the application largely ignore the actual structure of the XML document and work directly with its data content.
Data binding is the process of taking data from a storage medium, such as an XML document, a text file, or a database, and representing it in an application; that is, binding the data to an in-memory structure that the virtual machine can understand and manipulate. Data binding is not a new concept. It is already widely used with relational databases; Hibernate, for example, is a lightweight data binding framework for databases. For XML, the Castor data binding framework appeared in 2000, and many similar frameworks followed, such as JBind, JAXB, JiBX, Quick, and Zeus.
Among these, JAXB (Java Architecture for XML Binding) is a continually evolving data binding framework for the Java platform. It provides an API for mapping automatically between XML documents and Java objects and complies with JSR 31, the XML Data Binding Specification. The project began in August 1999 and was developed under the Java Community Process; version 1.0 was released in October 2002.
The following figure shows the application of data binding in databases and XML documents.
There are three important concepts in data binding:
(1) Marshalling: the process of converting in-memory data into a storage medium. In a Java and XML environment, marshalling transforms Java objects into one or more XML documents. At its core, it converts the object-oriented structures of Java into the flat structure suited to XML.
(2) Unmarshalling: the process of converting data from a storage medium into memory. In a Java and XML environment, it loads an XML document into the Java virtual machine; the complexity lies in mapping the data to variables in Java code.
(3) Mapping: the set of rules that govern marshalling and unmarshalling.
At first glance XML data binding and document-oriented object parsing look similar: both build an in-memory representation of the document, and the internal representation and the standard XML document can be converted into each other. The difference is that a document model stays as close as possible to the structure of the XML document, whereas data binding is concerned only with the document data that the application uses. The document model and the data binding model for the same XML document are completely different, as shown in the following figure.
If the application uses the document model, it must find the required data by walking the node tree along parent-child relationships. With data binding, the data is accessed through ordinary Java programming, which is simpler and faster than using the document model. XML data binding does more than simplify programming. Because it abstracts away many document details, it needs less memory than a document model; in the previous figure, the document model uses 10 separate objects while data binding uses 2. And because there are fewer objects to build, constructing the data binding representation of an XML document may also be faster.
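To show what marshalling and unmarshalling mean in practice, here is a deliberately hand-written sketch that binds a small XML document to a plain Java object and back. It is not JAXB or any other framework mentioned above; a real binding framework would generate or configure this logic. The Person class, its fields, and the person, name, and age element names are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class PersonBinding {
    // The application-facing object: plain fields, no trace of XML structure.
    static final class Person {
        String name;
        int age;
    }

    // Unmarshalling: XML text -> Java object.
    static Person unmarshal(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        Person person = new Person();
        person.name = childText(root, "name");
        person.age = Integer.parseInt(childText(root, "age").trim());
        return person;
    }

    // Marshalling: Java object -> XML text.
    static String marshal(Person person) {
        return "<person><name>" + escape(person.name) + "</name><age>"
                + person.age + "</age></person>";
    }

    // Text of the first descendant element with the given tag name.
    private static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    // Escapes the characters that are not allowed in element content.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        Person person = unmarshal("<person><name>Ann</name><age>30</age></person>");
        person.age++; // ordinary Java code, no tree navigation
        System.out.println(marshal(person));
    }
}
```

After unmarshalling, the application works with person.name and person.age directly, which is the simplification over node-tree traversal that the text describes.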
The core of XML data binding is how Java objects are produced from XML documents. There are two approaches: mapping binding and code generation. With mapping binding, you write your own Java classes and tell the binding framework how they relate to the XML document; Castor and Quick support this approach. Code generation instead builds the corresponding Java classes automatically from the XML document structure (that is, a DTD or schema grammar). JAXB, Castor, and JBind generate Java code from a schema description of the XML document, while Quick and Zeus can generate Java code from a DTD description.
Classes built by code generation can carry complete data type information and can validate their own contents. However, this approach couples the program code tightly to the document structure: if the structure changes, the code must be regenerated. Mapping binding is more flexible. It lets you combine data and behavior in object classes that you write yourself, and it partly decouples those classes from the actual XML document, because small changes in the document structure can be handled by modifying the mapping definition rather than the application code. The disadvantage is that you have to write more complex mapping files.
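The decoupling that mapping binding offers can be illustrated with a toy mapping table. In the sketch below, the mapping is a Map from XML element names to setters on a hand-written class, standing in for the mapping file of a real framework such as Castor. When an element is renamed in the document, only the mapping changes. All names here (Customer, fullName, city, and the element names) are hypothetical, and elements are assumed to contain only text.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class MappingBindingDemo {
    // Hand-written application class; it knows nothing about XML.
    static final class Customer {
        String fullName;
        String city;
    }

    // Applies the mapping rules: element name -> how to store its text in a Customer.
    static Customer bind(String xml, Map<String, BiConsumer<Customer, String>> mapping)
            throws Exception {
        Customer customer = new Customer();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    BiConsumer<Customer, String> setter = mapping.get(reader.getLocalName());
                    if (setter != null) {
                        setter.accept(customer, reader.getElementText());
                    }
                }
            }
        } finally {
            reader.close();
        }
        return customer;
    }

    public static void main(String[] args) throws Exception {
        // Version 1 of the document format.
        Map<String, BiConsumer<Customer, String>> v1 = Map.of(
                "name", (c, v) -> c.fullName = v,
                "city", (c, v) -> c.city = v);
        // Version 2 renamed <name> to <full-name>: only the mapping is updated.
        Map<String, BiConsumer<Customer, String>> v2 = Map.of(
                "full-name", (c, v) -> c.fullName = v,
                "city", (c, v) -> c.city = v);

        Customer a = bind("<customer><name>Ann</name><city>Oslo</city></customer>", v1);
        Customer b = bind("<customer><full-name>Ann</full-name><city>Oslo</city></customer>", v2);
        System.out.println(a.fullName + " / " + b.fullName);
    }
}
```

The Customer class and the code that uses it stay the same across both document versions, which is the flexibility the text attributes to mapping binding; the cost is that the mapping itself has to be written and maintained.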
3 Comparison of the parsing techniques
Document-oriented streaming parsing is highly efficient but hard to use, whereas document-oriented object parsing is easy to use but inefficient.
VTD-XML addresses the weaknesses of both: its design keeps the memory footprint small and efficiency high while still supporting fast parsing and traversal of XML documents and XPath queries. Its emergence is a significant advance in XML parsing technology and is likely to have a strong influence on the field's development.