Implementing a High-Performance Java Parser


Note: This article is an updated version of an earlier article on the same topic. The earlier article covered some of the main points of creating a high-performance parser, but it drew some criticism from readers. It has been thoroughly revised and supplemented with more complete code. We hope you like this update.

From time to time you may need to implement your own data or language parser in Java, for example when no standard or open source Java parser exists for the data format or language in question. Or there may be several parsers to choose from, but they are too slow, use too much memory, or lack the specific functionality you need. Or the open source parser has bugs, or the open source parser project has been abandoned, and so on. The reason matters less than the fact that you will need to implement your own parser.

When you have to implement your own parser, you want it to perform well and to be flexible, feature-rich, easy to use and, last but not least, easy to implement. After all, your name will appear in the code. In this article, I'll introduce one way to implement a high-performance parser in Java. It is not the only way, but it is reasonably simple and achieves both high performance and a reasonably modular design. The design is inspired by VTD-XML, the fastest Java XML parser I have seen, faster even than the StAX and SAX standard Java XML parsers.

Two basic parser types

There are many ways to classify parsers. Here, I only compare the differences between the two basic parser types:

Sequential access parser

Random access parser

Sequential access means that the parser parses the data and hands the parsed data over to the data processor as parsing proceeds. The data processor has access only to the data currently being parsed; it cannot go back to earlier data or look ahead. Sequential access parsers are very common and are also known as event-based parsers. The SAX and StAX parsers are the best-known examples.

A random access parser lets the data processing code move forward and backward over the parsed data at will (random access). XML DOM parsers are examples of random access parsers.

A sequential access parser lets you access only the window or event that was just parsed in the document stream, whereas a random access parser lets you navigate the parsed data in whatever way you want.
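The difference between the two access styles can be sketched with the SAX and DOM parsers that ship with the JDK. This is only an illustration: the class name AccessStyleSketch, its method names and the sample XML are assumptions made for this example and are not part of the parser design described in this article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: contrasts sequential (SAX) and random (DOM) access.
class AccessStyleSketch {
    static final String XML = "<list><item>a</item><item>b</item></list>";

    static ByteArrayInputStream input() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    // Sequential access: the handler is called once per element, in document
    // order, and cannot ask the parser to return to an earlier element.
    static List<String> sequentialElementNames() {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(input(), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    names.add(qName);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    // Random access: the whole document is parsed into a tree first, so the
    // processing code can visit the last item before the first one.
    static String[] lastThenFirstItem() {
        try {
            NodeList items = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(input()).getElementsByTagName("item");
            return new String[] {
                items.item(items.getLength() - 1).getTextContent(),
                items.item(0).getTextContent()
            };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sequentialElementNames());
        System.out.println(String.join(",", lastThenFirstItem()));
    }
}
```

With SAX, any state the processing code needs later has to be collected inside the handler while the events go by. With DOM, the tree stays available after parsing, at the cost of building it.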

Design outline

The parser design I'm introducing here is a random access variant.

Random access parser implementations are often slower than sequential access parsers, because they typically build a tree of objects from the parsed data, through which the data processing code accesses that data. Building this object tree is slow in CPU time and can consume a lot of memory.

Instead of building an object tree from the parsed data, a higher-performance approach is to build a buffer of indices into the original data buffer. The indices point to the start and end of each element found in the parsed data. Instead of accessing the data through an object tree, the data processing code accesses the parsed data directly in the buffer containing the original data.

Because I couldn't find a better name, I call this kind of parser an index overlay parser. The parser creates an index overlay on top of the original data. This is reminiscent of how a database indexes data stored on disk: it builds indices on top of the raw, unprocessed data so that navigating and searching the data is faster.
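The index overlay idea can be sketched in a few lines of Java. This is a minimal sketch under simplifying assumptions, not the parser developed in this article: the class name IndexOverlaySketch and its fields are hypothetical, and the "format" being parsed is just whitespace-separated tokens. The point is only that parsing records start offsets and lengths in int arrays and leaves the original character buffer untouched.

```java
// Hypothetical sketch of an index overlay over whitespace-separated tokens.
class IndexOverlaySketch {
    final char[] data;    // original, unmodified input buffer
    final int[] start;    // start offset of each element within data
    final int[] length;   // length of each element
    int count;            // number of elements indexed so far

    IndexOverlaySketch(String input) {
        data = input.toCharArray();
        // There can never be more elements than characters, so this is a
        // safe (if generous) upper bound for the index arrays.
        start = new int[data.length];
        length = new int[data.length];
        index();
    }

    // Single pass over the buffer: record where each element begins and how
    // long it is. No per-element objects are created during parsing.
    private void index() {
        int i = 0;
        while (i < data.length) {
            if (Character.isWhitespace(data[i])) {
                i++;
                continue;
            }
            int begin = i;
            while (i < data.length && !Character.isWhitespace(data[i])) {
                i++;
            }
            start[count] = begin;
            length[count] = i - begin;
            count++;
        }
    }

    // Materialize an element as a String only when the caller asks for it.
    String element(int n) {
        return new String(data, start[n], length[n]);
    }

    // Compare an element with an expected value directly in the original
    // buffer, without allocating a String for the element.
    boolean elementEquals(int n, String expected) {
        if (length[n] != expected.length()) {
            return false;
        }
        for (int k = 0; k < expected.length(); k++) {
            if (data[start[n] + k] != expected.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        IndexOverlaySketch overlay = new IndexOverlaySketch("alpha  beta gamma");
        // Random access: elements can be read in any order after indexing.
        System.out.println(overlay.element(2));
        System.out.println(overlay.elementEquals(0, "alpha"));
    }
}
```

Because the index holds only integers, the processing code can jump to any element in any order, and it pays for a String allocation only for the elements it actually needs.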

As mentioned earlier, the design is inspired by VTD-XML. VTD stands for Virtual Token Descriptor, so this kind of parser could also be called a virtual token descriptor parser. However, I prefer the name index overlay, because that is what the virtual token descriptors represent: an index into the original data.
