Implementing a high performance parser in Java

Last Update:2017-02-27 Source: Internet

Author: User

Tags xml parser


In some cases you may need to implement your own data or language parser in Java, perhaps because the format or language has no standard Java or open source parser. Or parsers exist, but they are too slow, use too much memory, or lack the features you need. Or an open source parser has a defect, or its project has been abandoned. Whatever the reason, the fact is that you have to implement the parser yourself.

When you implement a parser yourself, you expect a lot from it: good performance, flexibility, a rich feature set, ease of use, ease of maintenance, and so on. After all, it is your own code. In this article I introduce one way to implement a high-performance parser in Java. The approach is not the only one possible, but it is reasonably simple and achieves both high performance and a reasonably modular design. The design is inspired by VTD-XML, the fastest Java XML parser I have seen, faster even than the standard Java XML parsers StAX and SAX.

Two basic types of parsers

There are several ways to classify parsers. Here I divide them into two basic types:

Sequential Access Parser

Random Access Parser

With sequential access, the parser hands the parsed data to a data processor as the data is parsed. The data processor can only access the data currently being parsed; it can neither look back at data that has already been parsed nor look ahead at data that has not been parsed yet. Such parsers are also known as event-based parsers; the SAX and StAX parsers are examples.

A random access parser lets the data processing code move back and forth over the parsed data and access any part of it at will (random access). An example of such a parser is an XML DOM parser.

The following figure shows the difference between the sequential access parser and the random access parser:

A sequential access parser only lets you access the "window" or "event" that is currently being parsed, whereas a random access parser lets you navigate through all of the parsed data as you please.
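To make the difference concrete, here is a minimal sketch that uses the SAX and DOM parsers shipped with the JDK. The class name `AccessStyles` and its two methods are made up for this illustration. The SAX handler only sees each element at the moment it is parsed, while the DOM code can visit the last child before the first one because the whole tree is already in memory.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class AccessStyles {

    // Sequential access: the handler is called once per element, in document
    // order, and only ever sees the element currently being parsed.
    static List<String> sequentialElementNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return names;
    }

    // Random access: the document is parsed into a tree first, after which
    // the processing code may visit nodes in any order it likes.
    static String lastThenFirstChild(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Node root = doc.getDocumentElement();
        return root.getLastChild().getNodeName() + "," + root.getFirstChild().getNodeName();
    }
}
```

For the input `<a><b/><c/></a>`, the first method collects the names in parse order, and the second reads the `c` element before the `b` element.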

Design Overview

The parser design I introduce here is of the random access type.

Random access parser implementations are often slower than sequential access parsers, because they typically build an object tree from the parsed data, and the data processing code accesses the data through that tree. Building such an object tree not only costs CPU time but can also consume a large amount of memory.

A better-performing alternative to building an object tree from the parsed data is to build a buffer of indices into the original data buffer. The indices point to the start and end of each element found in the parsed data. Instead of going through an object tree, the data processing code accesses the parsed data directly in the buffer that contains the original data. The following are illustrations of these two approaches:

Since I could not find a better name, I simply call this an "index overlay parser". The parser creates an index overlay on top of the original data. This is reminiscent of how a database indexes data stored on disk: it creates indices over the original, raw data so that the data can be navigated and searched faster.

As I said before, this design is inspired by VTD-XML (VTD stands for Virtual Token Descriptor), so you could also call this a virtual token descriptor parser. I prefer the name "index overlay", though, because that is what the virtual token descriptors represent: an index into the original data.
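The index overlay idea can be sketched in a few lines. The example below is an assumption-laden illustration, not the VTD-XML implementation: it indexes the fields of a simple comma-separated line rather than XML, and the names `IndexBuffer` and `IndexOverlaySketch` are invented here. What it shows is the core technique: the parser records only the start offset and length of each element in `int` arrays, and the characters stay in the original buffer until the processing code asks for them.

```java
// Hypothetical index buffer: parallel int arrays instead of an object tree.
class IndexBuffer {
    final int[] position; // start offset of each element in the data buffer
    final int[] length;   // length of each element, in characters
    int count;            // number of elements indexed so far

    IndexBuffer(int capacity) {
        position = new int[capacity];
        length = new int[capacity];
    }

    void add(int start, int len) {
        position[count] = start;
        length[count] = len;
        count++;
    }
}

// Hypothetical parser for a single comma-separated line, for illustration only.
class IndexOverlaySketch {

    // Scans the data once and records where each field starts and how long
    // it is. No characters are copied and no per-field objects are created.
    static IndexBuffer index(char[] data) {
        // Worst case: every character is a comma, giving data.length + 1 fields.
        IndexBuffer idx = new IndexBuffer(data.length + 1);
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            if (i == data.length || data[i] == ',') {
                idx.add(start, i - start);
                start = i + 1;
            }
        }
        return idx;
    }

    // Random access: any field can be read at any time, straight from the
    // original buffer, and a String is only created on request.
    static String field(char[] data, IndexBuffer idx, int n) {
        return new String(data, idx.position[n], idx.length[n]);
    }
}
```

With the input `id,name,42`, the index holds three entries, and calling `field` with `n = 1` reads `name` directly out of the original character array. A real parser would typically also store an element type per entry, but that is beyond what this sketch attempts.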
