Introduction to VTD-XML, an emerging XML processing method

Source: Internet
Author: User
Problem

When people talk about using XML, the two biggest headaches are its verbosity and its parsing speed. The problem becomes especially serious when large XML files have to be processed. This article looks at how to make XML processing faster.

When we process XML files, we generally have two options:

DOM is the W3C standard model. It builds the structure of the XML document as a tree in memory and provides interfaces and methods for traversing that tree.
SAX is a low-level parser. It makes a read-only pass over the elements and keeps no structure information.
Each option has its advantages and disadvantages, and neither is a particularly good solution. They compare as follows:


DOM
Advantage: ease of use. All of the XML structure information is held in memory, so traversal is simple and XPath is supported.
Disadvantage: parsing is slow and memory usage is high (5x to 10x the size of the original file), which makes DOM almost unusable for large files.

SAX
Advantage: parsing is fast, and memory usage does not depend on the size of the XML (memory does not grow as the XML grows).
Disadvantage: poor usability. There is no structure information, so the document cannot be traversed and XPath is not supported. If structure is needed, the application has to build it piece by piece while reading, which is especially hard to maintain.
DOM and SAX are opposites, and neither meets most of our requirements, so we need another approach. Note that XML's efficiency problem is not a problem of XML itself but of the parser that processes it, as the different trade-offs of the two approaches above show.
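
The contrast between the two styles can be sketched with Python's standard library. This is only an illustration of the trade-off described above, not a benchmark; the sample document and the `ItemCollector` class are made up for the example.

```python
import xml.sax
from xml.dom import minidom

XML = (b"<orders><order id='1'><item>pen</item></order>"
       b"<order id='2'><item>ink</item></order></orders>")

# DOM: the whole document becomes an in-memory tree that can be navigated freely.
doc = minidom.parseString(XML)
orders = doc.getElementsByTagName("order")
second_item = orders[1].getElementsByTagName("item")[0].firstChild.data


# SAX: the parser pushes events at us; any structure must be tracked by hand.
class ItemCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_item = False   # hand-written state standing in for "where am I?"
        self.items = []

    def startElement(self, name, attrs):
        if name == "item":
            self.in_item = True
            self.items.append("")

    def characters(self, content):
        if self.in_item:
            self.items[-1] += content

    def endElement(self, name):
        if name == "item":
            self.in_item = False


collector = ItemCollector()
xml.sax.parseString(XML, collector)
print(second_item, collector.items)
```

With DOM, reaching the second order's item is a one-line navigation. With SAX, even this small query needs a state flag, and the state grows with every extra piece of structure the application cares about.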


We like the way DOM is used: because the document can be traversed, XPath can be supported, which greatly improves ease of use. But DOM is very inefficient. We already know that the efficiency problem lies in the processing mechanism, so which aspects of DOM hurt its efficiency? Let's dissect it thoroughly:

On most of today's platforms built on virtual machines (managed runtimes or similar mechanisms), creating and destroying objects is expensive, mainly because of garbage collection. The large number of objects that DOM creates and destroys is undoubtedly one of the causes of its inefficiency, since it leads to excessive garbage collection.
Each object costs an extra 32 bits to store its memory address. With as many objects as DOM creates, this overhead is not small.
The root cause of the two problems above is that both DOM and SAX are extractive parsers, which forces them to create (and destroy) a large number of objects. Extractive parsing means that, while parsing, DOM or SAX extracts pieces of the original file (generally as strings) and then parses them and builds data in memory (the output is naturally one or more objects). DOM, for example, turns every element, attribute, processing instruction, comment, and so on into an object and attaches it to the tree. This is what extractive parsing means.
Extractive parsing also hurts update efficiency. In DOM (SAX does not support updates at all, so it is not discussed here), every change requires serializing the object model back into an XML string. This serialization is complete: it does not reuse the original file but regenerates the whole XML string from the DOM model. In other words, DOM does not support incremental update.
Another "small" problem that is easy to overlook is XML encoding. Whatever the parsing method, it must handle the encoding: decode when reading and encode when writing. With DOM, even a small change to a large XML file requires decoding the entire file and building the whole structure first, which is yet another overhead.
To summarize: DOM's efficiency problem lies mainly in its extractive parsing model (SAX has the same problem), which causes the whole series of issues above. If these bottlenecks can be removed, XML processing becomes considerably more efficient. And if both the ease of use and the processing efficiency of XML improve greatly, XML can be applied more widely and in new ways, which may lead to many wonderful XML-based products that nobody has thought of before.
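
To get a feel for what extractive parsing costs, the sketch below counts the node objects that Python's `xml.dom.minidom` builds for a tiny document, and then performs an update the DOM way. The helper `count_nodes` and the sample XML are illustrative assumptions, and the numbers say nothing about any particular Java DOM implementation.

```python
from xml.dom import minidom


def count_nodes(node):
    """Count `node` plus every attribute and descendant node object under it."""
    attrs = getattr(node, "attributes", None)
    total = 1 + (attrs.length if attrs is not None else 0)
    for child in node.childNodes:
        total += count_nodes(child)
    return total


XML = b"<a x='1'><b>hi</b><!--note--><c/></a>"
doc = minidom.parseString(XML)
node_count = count_nodes(doc)
print(len(XML), "bytes of XML became", node_count, "node objects")

# An update goes through the same objects: change the tree, then
# re-serialize the whole document instead of patching the original bytes.
doc.documentElement.removeChild(doc.getElementsByTagName("c")[0])
updated = doc.toxml()
print(updated)
```

Every element, attribute, text run, and comment gets its own object, and the only way to write a change back is to regenerate the full XML string from the tree.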


VTD-XML is the answer that came out of thinking about the problems above. It is a non-extractive XML parser. Thanks to its design it solves (or avoids) the problems described above, and non-extractive parsing brings further benefits as well, such as fast parsing and traversal, XPath support, and incremental update. Here is a set of figures taken from the official VTD-XML website:

VTD-XML parses at up to 2.0x the speed of SAX with a NULL content handler. "With a NULL content handler" means that no additional processing logic is attached to the SAX parse, that is, SAX running at its maximum speed.
VTD-XML uses 1.3x to 1.5x the size of the original XML in memory (1.0x is the original XML itself, and 0.3x to 0.5x is what VTD-XML adds), while DOM uses 5x to 10x the size of the original XML. For example, reading a 50 MB XML file takes 65 MB to 75 MB with VTD-XML and 250 MB to 500 MB with DOM. Based on these figures, using DOM to process large XML files is almost impossible.
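
The 50 MB example follows directly from the multipliers quoted above. A small sketch of the arithmetic (the function name `memory_range_mb` is made up for this illustration):

```python
def memory_range_mb(file_mb, low_factor, high_factor):
    """Estimated memory range in MB for a file, given low/high multipliers."""
    return file_mb * low_factor, file_mb * high_factor


# Multipliers as quoted in the text: VTD-XML 1.3x to 1.5x, DOM 5x to 10x.
vtd_low, vtd_high = memory_range_mb(50, 1.3, 1.5)
dom_low, dom_high = memory_range_mb(50, 5, 10)

print(f"VTD-XML: {vtd_low:.0f} MB to {vtd_high:.0f} MB")
print(f"DOM:     {dom_low:.0f} MB to {dom_high:.0f} MB")
```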
This may sound incredible: can there really be an XML parser that is easier to use than DOM and faster than SAX? Don't jump to conclusions yet. Let's look at how VTD-XML works.

Basic principles

Like most good products, VTD-XML rests on a principle that is not complex but clever. To stay non-extractive, it reads the original XML file as raw bytes without decoding it, parses the byte array to locate each element, and records some information about each location. Traversal then operates on these stored records. To extract XML content, it uses the offset and other information in a record to decode the relevant part of the original byte array and return a string. This all looks simple, but the process involves several performance details and hides a few latent capabilities. The performance details are described first:

To avoid creating too many objects, VTD-XML uses a primitive numeric type as its record type, so records do not need to be allocated as objects on the heap. The record mechanism is called VTD (Virtual Token Descriptor). Using VTD to remove the performance bottleneck at the tokenization stage is a clever and careful design. A VTD record is a 64-bit value that stores information such as the starting offset, length, depth, and token type of each element.
Note that a VTD record has a fixed length (the official design settles on 64 bits). This is for performance: because the length is fixed, reads, lookups, and similar operations take constant time, O(1). In other words, VTD records can be organized in an array, an efficient structure that greatly reduces the performance problems caused by large numbers of objects.
The super power of VTD (this is no exaggeration) is that it turns operations on the tree-shaped XML data structure into operations on a byte array: anything you can imagine doing to a byte array can be done to the XML. This works because the XML is read as binary (a byte array) and each VTD record stores the location and other access information of an element. Once the right VTD record is found, its offset and length can be used to perform any operation on the original byte array, or the VTD records themselves can be manipulated directly. For example, to find and delete an element in a large XML file, find the element's VTD record (traversal is discussed below), remove it from the VTD array, and then use the remaining VTD records to write out a new byte array. Because the removed record marked the location of the element to delete, that element does not appear in the new byte array. Writing the new byte array from VTD records is really just copying bytes, which is very efficient. This is what incremental update means.
For traversal, VTD-XML uses an LC (Location Cache), which is, simply put, a set of tables organized as a tree by depth. An LC entry is also a 64-bit value: the first 32 bits hold a VTD index, and the last 32 bits hold the index of that VTD record's first child. With this information you can compute any location you want to reach. For the details of traversal, see the articles on the official website. Because of this traversal mechanism, VTD-XML's interface differs from DOM's, which is understandable. It also lets you reach the place you need in a few steps, so traversal performance is outstanding.
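
The ideas above (undecoded bytes, fixed-width numeric records, lazy decoding, and deletion by byte copy) can be sketched in a few lines of Python. This is a toy model and not VTD-XML's implementation or API: the bit layout in `pack`, the scanner, and all function names are assumptions made for illustration, and the scanner only handles simple well-formed XML without comments, CDATA sections, or processing instructions.

```python
from array import array

START_TAG = 0  # hypothetical token-type code


# Hypothetical 64-bit layout, for illustration only:
# [ type: 4 bits | depth: 8 bits | length: 20 bits | offset: 32 bits ]
def pack(offset, length, depth, token_type):
    return (token_type << 60) | (depth << 52) | (length << 32) | offset


def unpack(record):
    return (record & 0xFFFFFFFF, (record >> 32) & 0xFFFFF,
            (record >> 52) & 0xFF, (record >> 60) & 0xF)


def index_start_tags(data):
    """Record offset, length and depth of every start-tag name in `data`
    without decoding any of the bytes."""
    records = array("Q")  # contiguous unsigned 64-bit values, one per token
    depth, i = -1, 0
    while True:
        lt = data.find(b"<", i)
        if lt < 0:
            return records
        gt = data.find(b">", lt)
        if data[lt + 1:lt + 2] == b"/":        # end tag: go up one level
            depth -= 1
        else:
            depth += 1
            end = lt + 1
            while end < gt and data[end:end + 1] not in (b" ", b"/", b"\t", b"\n", b"\r"):
                end += 1
            records.append(pack(lt + 1, end - lt - 1, depth, START_TAG))
            if data[gt - 1:gt] == b"/":        # self-closing tag
                depth -= 1
        i = gt + 1


def element_span(data, name_offset):
    """Byte range [start, end) of the element whose name starts at name_offset."""
    start, depth, i = name_offset - 1, 0, name_offset - 1
    while True:
        lt = data.find(b"<", i)
        gt = data.find(b">", lt)
        if data[lt + 1:lt + 2] == b"/":
            depth -= 1
        elif data[gt - 1:gt] != b"/":
            depth += 1
        if depth == 0:
            return start, gt + 1
        i = gt + 1


XML = b"<root><a>one</a><b><c/></b><d>two</d></root>"
records = index_start_tags(XML)

# Content is decoded only when somebody asks for it.
names = [XML[off:off + ln].decode("utf-8") for off, ln, _, _ in map(unpack, records)]
depths = [unpack(r)[2] for r in records]

# "Incremental update": delete <b>...</b> by copying the bytes around it.
b_offset = unpack(records[names.index("b")])[0]
start, end = element_span(XML, b_offset)
updated = XML[:start] + XML[end:]
print(names, depths, updated)
```

The records live in a flat array of integers rather than in one object per token, and the delete never re-serializes the document: it only copies the bytes before and after the removed span.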


As the description above shows, VTD-XML has attractive features, and version 1.5 adds XPath support (anything that can be traversed can support XPath, so this was only a matter of time :-). Its practicality already goes beyond what we imagine today. Another super power of VTD-XML is that, given the way it processes data, it could fully support future binary XML standards and use them to take XML applications to a higher level. That is what I am looking forward to now! :-)

VTD-XML still has plenty of room for improvement, though, which deserves our effort and discussion.
