XML emerged about nine years ago. That is a short history for the Extensible Markup Language, yet it is already hard to find an application that does not use XML at all.
However, when working with customers, I keep finding that the basics are not fully understood. It is a little surprising that developers with a thorough grasp of advanced XML topics still have gaps in their understanding of fundamentals such as parsing.
Where does XML processing start? With parsing. Parsing may be the most basic service available to developers. A parser reads the XML document, interprets the syntax, and passes meaningful objects to the application. The parser may also provide other services, such as validation (ensuring that the document complies with an XML schema or DTD) or namespace resolution.
This article introduces the various parsing methods, weighs their respective advantages and disadvantages, and helps you select the appropriate tool for your next project. It also links to a large number of articles, so that once you have selected a tool you can study its API in detail.
Why is parsing important? Because all XML processing starts with parsing. Whether you use a high-level language (such as XSLT) or low-level Java programming, the first step is to read the XML file, decode its structure, and retrieve the information. That is parsing.
The first choice when parsing a document is whether to use a ready-made parsing library (available for practically every programming language, including COBOL [Common Business Oriented Language]) or to write one yourself. The answer is simple: use a ready-made library.
Frankly, XML is not a complex syntax, so it is understandable to think that you can parse it with regular expressions or other ad hoc methods. In practice this rarely succeeds: the XML syntax requires support for multiple encodings and has many obscure features, such as CDATA sections and entities. It is nearly impossible for a custom implementation to handle all of them, and the result is incompatibility.
By contrast, most of the parsers that ship with development environments have passed conformance tests. The main reason for adopting a standard syntax such as XML is compatibility with other applications and toolkits, so this is one of the cases where a well-tested library is really worth using.
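As a rough illustration of the point about ad hoc parsing, the sketch below compares a naive regular expression with a library parser on a document that contains a CDATA section and an entity. It uses Python's standard `re` and `xml.etree.ElementTree` modules; the sample document and the `naive_text` helper are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one CDATA section and one predefined entity.
DOC = "<note><title><![CDATA[Tom & Jerry]]></title><body>1 &lt; 2</body></note>"

def naive_text(tag, text):
    """Ad hoc extraction: grab whatever sits between <tag> and </tag>."""
    match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), text, re.S)
    return match.group(1) if match else None

root = ET.fromstring(DOC)

# The regular expression returns raw markup; the parser returns decoded text.
naive_title = naive_text("title", DOC)
naive_body = naive_text("body", DOC)
parsed_title = root.findtext("title")
parsed_body = root.findtext("body")
```

The regular expression leaves the CDATA wrapper and the `&lt;` entity in its results, while the library parser decodes both. Encodings, comments, and nested elements would need further special cases in the ad hoc version.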
Most parsers provide at least two APIs, usually an object model API and an event API (also known as a stream API). For example, the Java platform provides both DOM (Document Object Model) and SAX (Simple API for XML).
Both APIs provide the same services: document decoding, optional validation, and namespace resolution. The difference lies not in the services but in the data model each API uses.
An object model API defines a hierarchical object model to represent XML documents. In other words, a class is defined for each concept in the XML syntax: elements, attributes, entities, and documents. When the parser reads an XML document, it establishes a one-to-one mapping between the XML syntax and these classes. For example, an element class is instantiated every time a tag is encountered.
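A minimal sketch of that one-to-one mapping, using `xml.dom.minidom` from the Python standard library as a stand-in for any DOM implementation. The invoice document is made up for the example.

```python
from xml.dom import minidom

# Hypothetical sample document.
doc = minidom.parseString(
    '<invoice id="42"><sender>Acme</sender><total>30.00</total></invoice>'
)

root = doc.documentElement                      # the <invoice> element
sender = root.getElementsByTagName("sender")[0]

# Each syntactic construct is represented by an instance of a matching class.
node_types = [
    type(node).__name__
    for node in (doc, root, root.getAttributeNode("id"), sender.firstChild)
]

invoice_id = root.getAttribute("id")            # attribute value as a string
sender_name = sender.firstChild.data            # text inside <sender>
has_back_reference = sender.parentNode is root  # child keeps a link to its parent
```

Here `node_types` lists a document, an element, an attribute, and a text node, one object per construct in the source text.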
Not surprisingly, there is some debate over which data model is best. The W3C standardized DOM. Its main advantage is portability: its interfaces are defined in CORBA IDL and mapped to many languages. Therefore, if you know the DOM in JavaScript, you also know the DOM in Java, C++, Perl, Python, and other languages.
Another data model is JDOM, an object model designed specifically for Java. It integrates more closely with the Java language but, by definition, lacks portability.
People can keep debating which data model best represents the XML syntax, but I don't think it matters much, because the advantages and disadvantages of the various object-based APIs are essentially the same. On the plus side, if you are familiar with the XML syntax, an object model API is easy to understand. Because it maps the XML syntax directly to classes, it is easy to learn, use, and debug.
The price of simplicity is efficiency, at least for many projects. When reading a document, the parser creates objects that mirror the syntax structure. For many applications, a model based on the XML syntax is not a good fit:
- The XML syntax is verbose. Even if the document is small, the parser creates many objects.
- XML vocabularies are usually optimized for storage and data transmission rather than for processing, so the application may need to pre-process the data. For example, before the real processing starts, it may need to compute derived values or merge data from other sources. In many cases, data must be copied from the XML object model to an application-specific object model or a database before processing.
- The object model is generic and contains references between objects that many applications do not need (for example, back references from child elements to their parent elements). These references further increase memory consumption.
Processing small documents on the desktop may not be a big problem, but in other environments, such as servers, the inherent inefficiency of the object model is unacceptable.
The second option is an event API, such as SAX. The concept is the mirror image of the object model: instead of defining a generic data model based on the XML syntax, the parser relies on the application programmer to build a customized data model.
As a result, the parser can be smaller, because it only needs to pass along a minimal amount of information. More importantly, a data model that the programmer tailors to the needs of the application is more efficient than a one-size-fits-all object model, however good that object model is.
This has obvious advantages:
- Any application that computes statistics or summarizes information benefits, because its data model only needs to hold the running totals and does not need to replicate the entire document.
- Similarly, applications that process documents on the fly (such as loading a document into a database) need to keep little or no data in memory, because they do not need to store the document at all.
Because the memory requirements are lower, an event API can process documents of any size, including documents larger than the available memory. For the same reason, this type of API is also well suited to servers, where many processes run concurrently and share memory.
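The summarizing case can be sketched with Python's standard `xml.sax` module. The `TotalHandler` class, the `price` element name, and the sample document are assumptions made for this example; the only state the application keeps is the running total.

```python
import xml.sax

class TotalHandler(xml.sax.ContentHandler):
    """Application-defined data model: a single running total."""

    def __init__(self):
        super().__init__()
        self.total = 0.0
        self._in_price = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "price":
            self._in_price = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it until the end tag.
        if self._in_price:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "price":
            self.total += float("".join(self._buffer))
            self._in_price = False

# Hypothetical sample document.
DOC = (b"<invoice><item><price>10.50</price></item>"
       b"<item><price>4.25</price></item></invoice>")

handler = TotalHandler()
xml.sax.parseString(DOC, handler)  # the parser calls back into the handler
total = handler.total
```

No element objects are retained after their end tag has been seen, so memory use does not grow with the number of items.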
The price of efficiency is simplicity. Event APIs are harder to use because the application programmer is responsible for more of the work. That is a real cost in the short term, but in my experience the efficiency gains over the medium and long term more than offset the slightly increased complexity.
Stream APIs can be push or pull. Historically, the push approach has been more popular, because it is the model adopted by SAX. The pull approach is being standardized and will soon be integrated into the Java platform as StAX.
What is the difference between the two? The difference is who controls the read loop. Like any software that reads files, a parser is built around a read loop (a loop that reads the file).
In push mode (SAX), the parser controls the loop. Once the application calls the parser, control does not return to the application until the end of the file. As described above, the parser calls back into the application to build the data model, but the parser remains in control.
In pull mode, the application controls the loop. Inside the loop, the application calls the parser repeatedly until the end of the file.
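A minimal sketch of an application-controlled read loop, using `xml.dom.pulldom` from the Python standard library as one example of a pull-style API. The sample feed is invented for the example, and the loop deliberately stops after two start tags to show that the application, not the parser, decides when to stop asking for events.

```python
from xml.dom import pulldom

# Hypothetical sample document.
DOC = "<feed><entry>first</entry><entry>second</entry><entry>third</entry></feed>"

events = pulldom.parseString(DOC)

seen = []
for event, node in events:        # each iteration pulls the next event
    if event == pulldom.START_ELEMENT:
        seen.append(node.tagName)
        if len(seen) == 2:
            break                 # the application ends the loop early
```

With a push API the equivalent logic would live in callbacks, and stopping early would typically mean raising an exception out of a handler.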
Push mode is best suited to applications that process an XML document as it is read, such as reading an RSS feed and displaying it as an HTML page. For most applications that use XML to store data, reading the document with a single call to the parser is the most convenient approach.
Pull mode is better suited to applications that handle documents in several XML vocabularies. Such applications usually need to sniff the input (read the first few lines) to decide which subroutine to call for the vocabulary.
This is where it pays for the application to control the loop: it can easily stop reading after sniffing the first few lines.
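Sniffing can be sketched as follows, again with `xml.dom.pulldom`. The `invoice` and `order` vocabularies and the functions `sniff_vocabulary`, `handle_invoice`, `handle_order`, and `dispatch` are hypothetical names chosen for the example.

```python
from xml.dom import pulldom

def sniff_vocabulary(xml_text):
    """Return the name of the root element, consuming no further events."""
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            return node.tagName   # stop pulling as soon as the root is known
    return None

# Hypothetical per-vocabulary subroutines.
def handle_invoice(xml_text):
    return "processed as invoice"

def handle_order(xml_text):
    return "processed as order"

HANDLERS = {"invoice": handle_invoice, "order": handle_order}

def dispatch(xml_text):
    """Pick the subroutine that matches the document's vocabulary."""
    root_name = sniff_vocabulary(xml_text)
    handler = HANDLERS.get(root_name)
    if handler is None:
        raise ValueError("unknown vocabulary: %r" % root_name)
    return handler(xml_text)

invoice_result = dispatch("<invoice><total>30.00</total></invoice>")
order_result = dispatch("<order><quantity>3</quantity></order>")
```

A real application would more likely dispatch on the namespace of the root element than on its bare name; the bare name keeps the sketch short.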
This article would be incomplete without mentioning one more option: XML data binding libraries, such as Castor. This approach sits between the object model and the event approach.
The idea is to generate an object model from an XML schema rather than rely on a generic model (such as DOM). The parser then produces a data model specific to the vocabulary in use. For example, a vocabulary for invoices can be expected to include the sender, recipient, date, product category, product identifier, unit price, and total price. DOM maps all of these to a generic element class. A data binding library creates a dedicated class for the sender, recipient, date, product category, product identifier, unit price, total price, and the other elements in the document.
Insofar as it works with a data model tailored to the vocabulary (which may or may not coincide with the model the application needs), data binding offers some of the advantages of the event API.
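The contrast with a generic element class can be sketched in Python. A data binding tool would generate classes like this from the schema; here the `Invoice` class and its `from_xml` method are written by hand as a stand-in, and the element names are assumptions based on the invoice example above.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class Invoice:
    """Vocabulary-specific class (hand-written stand-in for generated code)."""
    sender: str
    recipient: str
    date: str
    unit_price: float
    total_price: float

    @classmethod
    def from_xml(cls, text):
        root = ET.fromstring(text)
        return cls(
            sender=root.findtext("sender"),
            recipient=root.findtext("recipient"),
            date=root.findtext("date"),
            unit_price=float(root.findtext("unitPrice")),
            total_price=float(root.findtext("totalPrice")),
        )

# Hypothetical sample document.
DOC = (
    "<invoice><sender>Acme</sender><recipient>Globex</recipient>"
    "<date>2007-03-01</date><unitPrice>10.00</unitPrice>"
    "<totalPrice>30.00</totalPrice></invoice>"
)

invoice = Invoice.from_xml(DOC)
```

The application then works with typed fields such as `invoice.total_price` instead of navigating generic element nodes.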
The parser reads and decodes XML documents, moving them from disk into memory. What about the other direction? What if an application needs to store data in an XML file?
Although I recommend against ad hoc routines for decoding XML documents, I have no such reservations about writing XML. A reader must implement every rule, including the obscure ones. A writer, however, can get by with a small, workable subset of the syntax.
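A minimal sketch of such a writer in Python. It builds the markup with plain string formatting and relies on `escape` and `quoteattr` from the standard `xml.sax.saxutils` module for the one rule a writer cannot skip, escaping special characters. The `invoice_to_xml` function and the element names are hypothetical, and the result is read back with a library parser as a sanity check.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def invoice_to_xml(sender, recipient, total):
    """Write a tiny subset of XML: elements, one attribute, escaped text."""
    return (
        "<invoice sender=%s>" % quoteattr(sender)   # quoteattr adds the quotes
        + "<recipient>%s</recipient>" % escape(recipient)
        + "<total>%.2f</total>" % total
        + "</invoice>"
    )

xml_text = invoice_to_xml('Smith & "Sons"', "A < B Ltd", 30)

# Round trip through a real parser to confirm the output is well-formed.
root = ET.fromstring(xml_text)
```

No CDATA sections, entity declarations, or processing instructions are needed on the writing side, which is why the subset stays small.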
That said, most object model APIs do double duty. In addition to reading, they can also write the object tree to disk. If you use an event API, you can generate write events from your data structure (see references).
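Generating write events from a data structure might look like the sketch below, which feeds SAX-style events to `XMLGenerator` from Python's standard `xml.sax.saxutils` module. The `write_items` function, the element names, and the list-of-tuples data structure are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

def write_items(items):
    """Walk a list of (name, price) pairs and emit one event per construct."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("invoice", AttributesImpl({}))
    for name, price in items:
        gen.startElement("item", AttributesImpl({"name": name}))
        gen.characters("%.2f" % price)   # the generator escapes text itself
        gen.endElement("item")
    gen.endElement("invoice")
    gen.endDocument()
    return out.getvalue()

xml_text = write_items([("pen & ink", 10.5), ("paper", 4.25)])

# Read the result back to confirm it is well-formed.
root = ET.fromstring(xml_text)
item_names = [item.get("name") for item in root]
item_prices = [item.text for item in root]
```

The events are the same ones a SAX parser would deliver when reading, so the same handler interface serves both directions.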
So what is the conclusion? The API you use to read XML documents has a significant impact on the overall quality of the application. Take the time to familiarize yourself with the options available for your platform and programming language and, more importantly, choose the one that best fits your project.
Generally, event APIs consume fewer resources and are therefore more efficient. However, if you need to keep the entire document in memory anyway, an object API is the better choice because it saves you a lot of code.