Implementing a high performance parser in Java


In some cases you may need to implement your own data or language parser in Java, for instance when no standard Java or open source parser exists for the format or language. Parsers may exist but be too slow, use too much memory, or lack the features you need. An open source parser may have bugs, or its project may have been abandoned. Whatever the reason, you end up having to implement the parser yourself.

When you implement a parser yourself, you want it to perform well and to be flexible, feature-rich, easy to use, and easy to maintain. After all, it is your own code. In this article I describe one way to implement a high-performance parser in Java. It is not the only possible approach, and it is of moderate complexity, but it achieves both high performance and a reasonably modular design. The design is inspired by VTD-XML, the fastest Java XML parser I have seen, and much faster than both the StAX and SAX standard Java XML parsers.

Two basic types of parsers

Parsers can be classified in several ways. Here I divide them into two basic types:

Sequential Access Parser

Random Access Parser

A sequential access parser hands data over to the data processing code (the data processor) as the data is parsed. The data processor can only access the data that is currently being parsed. It can neither go back to data that has already been parsed nor look ahead to data that has not been parsed yet. Such parsers are also known as event-based parsers. SAX and StAX parsers are examples.

A random access parser lets the data processing code move back and forth over the parsed data and access any part of it at will (random access). An XML DOM parser is an example.

The following figure shows the difference between the sequential access parser and the random access parser:

A sequential access parser only lets you access the "window" or "event" that was just parsed, whereas a random access parser lets you navigate through all of the parsed data at will.
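The sequential access model can be illustrated with a minimal sketch. The class name, the `parse` method, and the whitespace-separated toy format below are assumptions made for illustration; they are not the SAX or StAX API. The point is only that the handler sees one token at a time, and anything it wants to revisit later it must store itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class SequentialAccessSketch {

    /**
     * Hypothetical event-style parser for whitespace-separated words.
     * Each word is handed to the handler as soon as it is complete;
     * the parser itself keeps no record of earlier words.
     */
    static void parse(String data, Consumer<String> handler) {
        int start = -1; // start offset of the word in progress, or -1 if none
        for (int i = 0; i <= data.length(); i++) {
            boolean atEnd = i == data.length();
            boolean isSpace = !atEnd && Character.isWhitespace(data.charAt(i));
            if (!atEnd && !isSpace) {
                if (start < 0) start = i;                 // a new word begins here
            } else if (start >= 0) {
                handler.accept(data.substring(start, i)); // "event": one word parsed
                start = -1;
            }
        }
    }

    public static void main(String[] args) {
        // The handler must collect the words itself if it needs them later.
        List<String> seen = new ArrayList<>();
        parse("alpha beta  gamma", seen::add);
        System.out.println(seen);
    }
}
```

A random access parser would instead return a structure covering the whole input, so the caller could ask for the third word first and the first word afterwards.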

Design Overview

The parser design I describe here is a random access parser.

Random access parser implementations are typically slower than sequential access parsers, because they usually build an object tree from the parsed data, and the data processing code then accesses the data through that tree. Building such an object tree not only costs CPU time, it can also consume a large amount of memory.

An alternative that performs better than building an object tree from the parsed data is to build an index buffer over the original data buffer. The indices point to the start and end of each element found in the parsed data. Instead of going through an object tree, the data processing code accesses the parsed data directly in the buffer that holds the original data. The following illustrates the two approaches:
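To show where the cost of an object tree comes from, here is a minimal sketch for a made-up nested format of the form `name(child,child,...)`. The format, the `Node` class, and all method names are assumptions for illustration only and do not correspond to any DOM API. Note that every parsed element allocates a node object, a child list, and a `String` copy of characters that already exist in the input.

```java
import java.util.ArrayList;
import java.util.List;

final class ObjectTreeSketch {

    /** One heap object per parsed element, plus a copied String and a child list. */
    static final class Node {
        final String text;                            // copy of characters from the input
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    private final String data;
    private int pos = 0;

    ObjectTreeSketch(String data) { this.data = data; }

    /** Parses the toy format name or name(child,child,...); expects well-formed input. */
    Node parseNode() {
        int start = pos;
        while (pos < data.length() && Character.isLetterOrDigit(data.charAt(pos))) pos++;
        if (pos == start) throw new IllegalArgumentException("name expected at " + pos);
        Node node = new Node(data.substring(start, pos)); // allocation per element
        if (pos < data.length() && data.charAt(pos) == '(') {
            pos++;                                        // skip '('
            while (data.charAt(pos) != ')') {
                node.children.add(parseNode());
                if (data.charAt(pos) == ',') pos++;       // skip separator
            }
            pos++;                                        // skip ')'
        }
        return node;
    }

    /** Counts the node objects in the tree rooted at n. */
    static int countNodes(Node n) {
        int count = 1;
        for (Node c : n.children) count += countNodes(c);
        return count;
    }

    public static void main(String[] args) {
        Node root = new ObjectTreeSketch("root(a,b(c,d))").parseNode();
        System.out.println(root.text + " has " + countNodes(root) + " nodes");
    }
}
```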

For lack of a better name, I simply call this an "index overlay parser". The parser creates an index that is overlaid on the original data. This is reminiscent of how a database indexes data stored on disk: it builds indices on top of the original, raw data so that the data can be navigated and searched faster.

As mentioned earlier, this design is inspired by VTD-XML (VTD stands for Virtual Token Descriptor), so you could also call this a virtual token descriptor parser. I prefer the name index overlay, because it describes what the virtual token descriptors are: an index into the original data.
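To make the index idea concrete, here is a minimal sketch for a toy format of whitespace-separated words. The class name, the field names `position` and `length`, and the helper methods are hypothetical and are not taken from VTD-XML. The parser leaves the input buffer untouched and only records where each element starts and how long it is, in plain `int` arrays.

```java
final class IndexOverlaySketch {

    final char[] data;     // the original, unmodified input
    final int[] position;  // start offset of each element in data
    final int[] length;    // length of each element
    int count = 0;         // number of elements indexed so far

    IndexOverlaySketch(char[] data) {
        this.data = data;
        // Words need a separator between them, so (n + 1) / 2 slots always suffice.
        int capacity = (data.length + 1) / 2;
        this.position = new int[capacity];
        this.length = new int[capacity];
    }

    /** Records where each whitespace-separated word starts and how long it is. */
    void parse() {
        int i = 0;
        while (i < data.length) {
            while (i < data.length && Character.isWhitespace(data[i])) i++;
            int start = i;
            while (i < data.length && !Character.isWhitespace(data[i])) i++;
            if (i > start) {
                position[count] = start;
                length[count] = i - start;
                count++;
            }
        }
    }

    /** Compares element n with an expected value directly in the buffer, without creating a String. */
    boolean elementEquals(int n, String expected) {
        if (length[n] != expected.length()) return false;
        for (int k = 0; k < length[n]; k++) {
            if (data[position[n] + k] != expected.charAt(k)) return false;
        }
        return true;
    }

    /** Materializes element n as a String only when the caller asks for it. */
    String elementAsString(int n) {
        return new String(data, position[n], length[n]);
    }

    public static void main(String[] args) {
        IndexOverlaySketch index =
                new IndexOverlaySketch("GET /index.html HTTP/1.1".toCharArray());
        index.parse();
        // Elements can be visited in any order, since the index covers the whole input.
        System.out.println(index.elementAsString(2));
        System.out.println(index.elementEquals(0, "GET"));
    }
}
```

In this sketch the only allocations made while parsing are the two index arrays, and a `String` is created only if the processing code explicitly requests one. A real parser would probably also record an element type per entry; that is left out here to keep the example short.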
