Preface
VTD-XML was not written by the author of this article; the article only introduces it.
Problem
Generally, when we talk about using XML, the biggest headaches are its verbosity and its parsing speed. The problem becomes especially serious when we need to process large XML files. What this article discusses is how to improve XML processing speed.
When we need to process an XML file, we generally have two options:
- DOM, the W3C standard model. It builds the XML structure information as an in-memory tree and provides interfaces and methods to traverse that tree.
- SAX, a low-level, event-driven parser that reads elements one by one without retaining any structure information.
Each option has its own advantages and disadvantages, and neither is an ideal solution:
DOM
- Advantage: easy to use. All of the XML structure information is kept in memory, traversal is straightforward, and XPath is supported.
- Disadvantage: parsing is slow and memory usage is high (roughly 5x to 10x the size of the original file), which makes it nearly unusable for large files.
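To make the DOM style concrete, here is a minimal sketch using the standard Java DOM API; the file name orders.xml and the element and attribute names are invented purely for illustration:

```java
// Minimal DOM sketch: the whole document is decoded and built as an
// in-memory tree before we touch any of it. Names below are placeholders.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("orders.xml");              // entire tree lives in memory

        NodeList orders = doc.getElementsByTagName("order");
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            System.out.println(order.getAttribute("id"));
        }
    }
}
```

Traversal is trivial, but every element, attribute, and text node has become a heap object, which is where the 5x to 10x memory cost above comes from.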
SAX
- Advantage: parsing is fast, and memory usage is independent of the XML size (it stays flat even as the XML grows).
- Disadvantage: poor usability. There is no structure information, the document cannot be traversed, and XPath is not supported. If you need the structure, you have to build it up yourself piece by piece as you read, which is especially hard to maintain.
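For contrast, a minimal SAX sketch with the standard Java API (again with invented names): the parser streams events to a handler and keeps nothing around, so any structure you need has to be rebuilt by hand in the callbacks:

```java
// Minimal SAX sketch: events are pushed to a handler as the parser streams
// through the file; nothing is retained unless the handler stores it itself.
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse("orders.xml",
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    if ("order".equals(qName)) {
                        System.out.println(attrs.getValue("id"));
                    }
                }
            });
    }
}
```

Nothing outlives the callback unless you store it yourself, which is why there is no tree to traverse and no XPath.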
We can see that DOM and SAX sit at opposite extremes, yet neither satisfies most of our requirements; we need to find another way. Note that the efficiency problem is not a problem of XML itself, but of the parser that processes it, as the different trade-offs of the two approaches above show.
Thoughts
We like the DOM style of usage because it lets us traverse the document, which in turn means XPath can be supported, greatly improving ease of use; but DOM is very inefficient. As noted above, the efficiency problem lies in the processing mechanism. So which aspects of DOM hurt its efficiency? Let's dissect it:
- On most of today's platforms based on virtual machine (or similar managed-runtime) technology, object creation and destruction is expensive, the main cost being garbage collection. The large number of objects that DOM creates and destroys is undoubtedly one reason for its inefficiency, since it leads to excessive garbage collection.
- Each object carries an additional 32 bits of overhead to store its memory address. With as many objects as DOM creates, this extra cost is not small.
- The most important efficiency problem underlying the two points above is that both DOM and SAX use an extractive parsing model, which means both must create (and destroy) large numbers of objects. So-called extractive parsing means that while parsing the XML, DOM or SAX extracts a portion of the original document (usually a string), then parses it and builds data in memory (the output is naturally one or more objects). In the case of DOM, every element, attribute, processing instruction, comment, and so on is parsed into an object and hung onto the tree structure; this is extractive parsing.
- Another problem caused by the extractive model is update efficiency. In DOM (SAX does not support updates at all, so it is not discussed here), every time a change is needed, the whole object model has to be serialized back into an XML string. Note that this serialization is complete: the original file is not reused at all; the DOM model is written out into an XML string from scratch. In other words, DOM does not support incremental update. A sketch after this list shows this full round trip.
- another "small" problem that may not be noticed is the XML encoding. No matter what resolution method is used, it must be able to process the XML encoding, that is, decoding during reading, encoding when writing. Another efficiency problem of Dom is that when I only want to make small modifications to a large XML file, it must first decode the entire file and then build the structure. It is an overhead.
To summarize: the efficiency problem of DOM lies mainly in its extractive parsing model (SAX shares the same model and the same problem), which in turn leads to a series of related problems. If we can break through these bottlenecks, XML processing efficiency can clearly be pushed further. And if both the ease of use and the processing efficiency of XML improve dramatically, the range and style of XML applications will expand as well, which may lead to many wonderful XML-based products nobody has thought of before.
The Way Out
VTD-XML is the answer to the questions above. It is a non-extractive XML parser. Thanks to its clever design it effectively solves (or avoids) the problems described above, and its non-extractive nature brings other benefits as well, such as fast parsing and traversal, XPath support, and incremental update. Here is a set of figures taken from the official VTD-XML website:
- VTD-XML's parsing speed is up to 2.0x that of SAX with a null content handler. "With a null content handler" means that no additional processing logic is plugged into the SAX parse, i.e. SAX at its maximum possible speed.
- VTD-XML's memory usage is 1.3x to 1.5x the size of the original XML (the 1.0x part is the original XML itself; the remaining 0.3x to 0.5x is VTD-XML's overhead), while DOM's memory usage is 5x to 10x the size of the original XML. For example, for a 50 MB XML file, reading it with VTD-XML takes roughly 65 MB to 75 MB of memory, while DOM needs somewhere between 250 MB and 500 MB. Based on these figures, using DOM to process large XML files is practically impossible.
This may sound incredible: can an XML parser really be easier to use than DOM and faster than SAX? Don't rush to a conclusion; let's look at how VTD-XML works.
Basic Principles
Like most good products, the principle behind VTD-XML is not complex, but it is clever. To stay non-extractive, it reads the original XML file into memory as raw bytes without decoding them, then parses the location of each element on that byte array and records some information about it. Traversal is performed over these stored records, and to extract XML content it decodes the relevant slice of the original byte array, using the location and other information in the record, and returns it as a string. It all sounds simple, but this simple process contains several performance-critical details and hides several potential capabilities. The performance details first:
- To avoid creating too many objects, VTD-XML uses a plain numeric type for its records, so no heap allocation is needed per token. This record mechanism is called VTD (Virtual Token Descriptor), and solving the performance bottleneck at the tokenization stage this way is a genuinely clever and careful piece of design. A VTD record is a 64-bit value that stores the starting offset, length, depth, token type, and similar information for each token.
- Note that a VTD record has a fixed length (64 bits in the official implementation). This is for performance: because the length is fixed, the records can be organized in arrays, and lookups and similar operations are extremely efficient (O(1)). This largely eliminates the performance problems caused by huge numbers of objects.
- VTD's real super power (so to speak) is that it reduces XML's tree-shaped data structure to operations on a byte array, so anything you can imagine doing to a byte array can be applied to XML. This works because the XML is held in memory as raw bytes, while the VTD records store each token's offset, length, and other access information. Once you have located the VTD record you care about, you can operate either directly on the original byte array, using the offset and length, or on the VTD records themselves. For example, suppose I want to find an element in a huge XML file and delete it. I only need to find that element's VTD record (traversal is discussed below), remove it from the VTD array, and then write all remaining records out to another byte array. Because the deleted record described exactly where the element lived, that element simply does not appear in the newly written byte array. Writing the new byte array from the VTD records is really just copying ranges of bytes, which is very efficient; this is what is meant by incremental update. A conceptual sketch of this record layout follows this list.
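The following is a purely conceptual sketch of the idea, not VTD-XML's actual internals: fixed-width 64-bit records packed into a long[] can be indexed in O(1) without per-token objects, and removing an element reduces to skipping its byte range while copying the original bytes. The bit layout and helper names are invented for illustration.

```java
// Conceptual sketch only: pack offset (30 bits), length (20 bits),
// depth (8 bits), and type (6 bits) into one long per token.
import java.io.ByteArrayOutputStream;

public class TokenRecordSketch {
    static long record(int offset, int length, int depth, int type) {
        return ((long) offset << 34) | ((long) length << 14)
             | ((long) depth << 6) | type;
    }

    static int offset(long rec) { return (int) (rec >>> 34); }
    static int length(long rec) { return (int) ((rec >>> 14) & 0xFFFFF); }

    // "Incremental update": copy every byte except the deleted element's
    // range, assuming the record's length spans the whole element.
    static byte[] deleteToken(byte[] xml, long[] records, int deleted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(xml, 0, offset(records[deleted]));
        int resume = offset(records[deleted]) + length(records[deleted]);
        out.write(xml, resume, xml.length - resume);
        return out.toByteArray();
    }
}
```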
As for traversal, VTD-XML uses an LC (Location Cache), which is, simply put, a table-like tree structure organized by depth. An LC entry is also a 64-bit value: the upper 32 bits hold a VTD record index and the lower 32 bits hold the index of that record's first child. From this information, any position you want to reach can be computed. For the details of traversal, see the articles on the official website. Because of this traversal model, VTD-XML's API differs from DOM's, which is understandable; and this style of traversal can take you where you need to go in just a few steps, so traversal performance is outstanding. A small usage sketch follows.
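Here is the small usage sketch mentioned above, written against the VTD-XML Java API as I understand it (classes from the com.ximpleware package; the file name and element structure are assumed for illustration). Parsing produces a VTDNav cursor that moves over the token records rather than over an object tree.

```java
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdDemo {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("orders.xml", true)) {   // true = namespace aware
            System.err.println("parse failed");
            return;
        }
        VTDNav vn = vg.getNav();                   // cursor over the VTD records

        // Walk the first level of children of the root element.
        if (vn.toElement(VTDNav.FIRST_CHILD)) {
            do {
                int t = vn.getText();              // index of the text token, or -1
                if (t != -1) {
                    System.out.println(vn.toNormalizedString(t));
                }
            } while (vn.toElement(VTDNav.NEXT_SIBLING));
        }
    }
}
```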
Summary
As you can see, VTD-XML has some charming features, and version 1.5 has added XPath support (once you can traverse, supporting XPath is only a matter of time :-). Its practical value already goes beyond what we can see today. Another super power of VTD-XML is that, given the way it processes documents, it could fully support a future binary XML standard and use binary formats to push XML applications to a whole new level. That is what I am looking forward to! :-)
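As a quick illustration of that XPath support, here is a short sketch using the AutoPilot class from the Java API (again, as I understand the API; the XPath expression and file name are assumptions):

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdXPathDemo {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("orders.xml", true)) return;

        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("/orders/order");           // assumed document structure

        int i;
        while ((i = ap.evalXPath()) != -1) {       // -1 means no more matches
            System.out.println("matched element: " + vn.toString(i));
        }
    }
}
```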
That said, VTD-XML still has plenty of room for improvement, and it deserves our effort and discussion.
By the way, VTD-XML is an open source project (GPL) and currently has Java and C implementations. If you want to try it on .NET, I recommend using IKVM (BSD-style license) to convert the VTD-XML Java library into a .NET assembly. I believe you will like it! ;-)