Build your own lightweight XML DOM parser

XML is rapidly becoming the standard format for data storage and exchange. The full-featured Java XML parsers available today are powerful, but they are also large and consume resources in proportion to their features. For example, the popular Apache Xerces-J parser is more than 1.7 MB, and the latest full Sun JAXP (Java API for XML Processing) implementation is a package of more than 3 MB. Using such a parser for a simple task can therefore be wasteful. If the deployment environment is a Java applet or a J2ME application, network bandwidth or memory constraints may rule out a full XML parser altogether. This article shows how to build a lightweight XML DOM parser.

Getting started with SimpleDOMParser

SimpleDOMParser is a highly simplified, ultra-lightweight XML DOM parser written in Java. The entire parser can be deployed as a .jar file of less than 4 KB, and its source is less than 400 lines long.

Obviously, with so little code, SimpleDOMParser does not support XML namespaces, does not understand multiple character encodings, and does not validate against a DTD or schema. What it can do is parse well-formed XML into a DOM-like element tree, which lets you perform the common task of extracting data from XML-formatted text.

Why use DOM as the model rather than SAX? Because DOM offers a friendlier programming interface. With DOM, once a document has been parsed into a tree, all of the information in it is available at once. The SAX model performs better and uses less memory, but most developers find themselves building a complete or partial DOM tree anyway when they use SAX. With SAX, an application can handle only one tag at a time; if the contents of other tags are needed during processing, the application must maintain global state throughout the parse. Maintaining that global state is exactly what the DOM model is for. Many small XML applications, however, do not need the complete DOM model. SimpleDOMParser therefore provides access to tag names, hierarchy, and content, but leaves out many features of the full W3C DOM.
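
To make the point about SAX and application-maintained state concrete, here is a minimal sketch that uses the SAX parser bundled with the JDK (javax.xml.parsers), not SimpleDOMParser. The class name SaxStateSketch, the method itemTexts, and the element names order and item are assumptions made up for this illustration. Note how the handler has to remember, between callbacks, whether it is currently inside an item element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: collect the text of every <item> element with SAX.
class SaxStateSketch {

    static List<String> itemTexts(String xml) {
        final List<String> items = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // State the application must carry from one callback to the next.
            private boolean insideItem = false;
            private final StringBuilder current = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals("item")) {
                    insideItem = true;
                    current.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (insideItem) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("item")) {
                    items.add(current.toString());
                    insideItem = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample XML", e);
        }
        return items;
    }

    public static void main(String[] args) {
        // prints [pen, ink]
        System.out.println(itemTexts("<order><item>pen</item><item>ink</item></order>"));
    }
}
```

With a tree model the same task needs no flags: the program simply walks the children of the root after parsing has finished.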

Simplifying the DOM model

A DOM tree is made up of nodes that are produced by parsing an XML file. A node is an in-memory representation of an XML entity. The standard W3C DOM model defines several types of nodes. For example, a text node represents a piece of text in an XML file, an element node represents an XML element, and an attribute node represents an attribute name and value within an element.

The DOM is a tree because every node except the root, or document, node has a parent. For example, an attribute node is always associated with an element node, and the text enclosed by an element's start and end tags is mapped to a text node, which is a child of that element node. As a result, even a very simple XML file may need several node types to represent it. For example, Figure 1 shows the W3C DOM tree for the following XML file.

SimpleDOMParser

As Figure 1 shows, the DOM model uses a document node to encapsulate the entire XML file, so this DOM tree uses three different node types. SimpleDOMParser simplifies the model as far as possible by collapsing all DOM node types into a single type, SimpleElement. A SimpleElement captures the key information about an XML element: its tag name, its attributes, and any enclosed text or child elements. In addition, SimpleDOMParser uses no special node type to represent the top-level document. The result is a much simpler DOM tree that contains only SimpleElement nodes. Figure 2 shows the simplified tree.
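
The three node types of the standard model can be observed with the DOM parser that ships with the JDK. The sketch below is only an illustration: the class name DomNodeTypesSketch, the method describe, and the tag name parser in the sample input are assumptions, not part of the article's code. It follows the first-child chain from the document node down and records each node name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomNodeTypesSketch {

    // Returns the node names found by following first children from the document node.
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            for (Node n = doc; n != null; n = n.getFirstChild()) {
                if (sb.length() > 0) {
                    sb.append(" > ");
                }
                sb.append(n.getNodeName());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        // prints #document > parser > #text
        System.out.println(describe("<parser>SimpleDOMParser</parser>"));
    }
}
```

A document node, an element node, and a text node are needed for a one-element file. In the simplified model the same file becomes a single SimpleElement whose text is stored on the element itself.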

Code Snippet 1 gives the complete source for the SimpleElement class.

import java.util.HashMap;
import java.util.LinkedList;

public class SimpleElement {
    private String tagName;
    private String text;
    private HashMap attributes;        // attribute name -> attribute value
    private LinkedList childElements;  // child SimpleElement nodes, in document order

    public SimpleElement(String tagName) {
        this.tagName = tagName;
        attributes = new HashMap();
        childElements = new LinkedList();
    }

    public String getTagName() {
        return tagName;
    }

    public void setTagName(String tagName) {
        this.tagName = tagName;
    }

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    // Returns null if the element has no attribute with this name.
    public String getAttribute(String name) {
        return (String) attributes.get(name);
    }

    public void setAttribute(String name, String value) {
        attributes.put(name, value);
    }

    public void addChildElement(SimpleElement element) {
        childElements.add(element);
    }

    // Each entry of the returned array is a SimpleElement.
    public Object[] getChildElements() {
        return childElements.toArray();
    }
}
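
As a rough usage sketch, the following builds a two-node tree by hand and prints it as an indented outline. So that it compiles on its own, it carries a trimmed copy of the class above as a nested class; the wrapper class SimpleElementDemo, the helpers outline and sampleTree, and the tag names book and title are hypothetical and exist only for this example.

```java
import java.util.HashMap;
import java.util.LinkedList;

class SimpleElementDemo {

    // Trimmed copy of SimpleElement, included only to keep the sketch self-contained.
    static class SimpleElement {
        private final String tagName;
        private String text;
        private final HashMap<String, String> attributes = new HashMap<>();
        private final LinkedList<SimpleElement> childElements = new LinkedList<>();

        SimpleElement(String tagName) { this.tagName = tagName; }
        String getTagName() { return tagName; }
        String getText() { return text; }
        void setText(String text) { this.text = text; }
        String getAttribute(String name) { return attributes.get(name); }
        void setAttribute(String name, String value) { attributes.put(name, value); }
        void addChildElement(SimpleElement element) { childElements.add(element); }
        Object[] getChildElements() { return childElements.toArray(); }
    }

    // Hypothetical helper: renders an element and its descendants, one per line.
    static String outline(SimpleElement element, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(element.getTagName());
        if (element.getText() != null) {
            sb.append(": ").append(element.getText());
        }
        sb.append('\n');
        Object[] children = element.getChildElements();
        for (int i = 0; i < children.length; i++) {
            sb.append(outline((SimpleElement) children[i], depth + 1));
        }
        return sb.toString();
    }

    // Builds the tree that a parser would produce for a small book record.
    static SimpleElement sampleTree() {
        SimpleElement book = new SimpleElement("book");
        book.setAttribute("id", "42");
        SimpleElement title = new SimpleElement("title");
        title.setText("XML Basics");
        book.addChildElement(title);
        return book;
    }

    public static void main(String[] args) {
        System.out.print(outline(sampleTree(), 0));
    }
}
```

Because getChildElements returns Object[], callers have to cast each entry back to SimpleElement, as the recursive call in outline does.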

Defining the basic XML parsing primitives

To turn an XML file into the simplified DOM tree described above, we first define a few basic parsing primitives. With them, the parser can easily extract tags or blocks of text from the input XML file.

The first is peek, which returns the next character of the input XML file without actually consuming it from the underlying stream. Because the input stream is left intact, higher-level functions such as readTag and readText (described later) can decide what to do based on the character that comes next.

private int peek() throws IOException {
    // mark/reset requires a Reader that supports marking, such as BufferedReader or StringReader
    reader.mark(1);
    int result = reader.read();
    reader.reset();
    return result;
}
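
The non-consuming behaviour is easy to check in isolation. The following sketch applies the same mark/read/reset sequence to a java.io.StringReader; the class name PeekSketch and the method demo are assumptions for this example, and peek takes the reader as a parameter here instead of using a field.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PeekSketch {

    // Same idea as peek() above, written against an explicit Reader argument.
    static int peek(Reader reader) throws IOException {
        reader.mark(1);
        int result = reader.read();
        reader.reset();
        return result;
    }

    // Peeks twice, reads once, then peeks again, and returns the four characters seen.
    static String demo(String input) {
        try {
            Reader reader = new StringReader(input);
            char first = (char) peek(reader);     // does not consume
            char second = (char) peek(reader);    // same character again
            char consumed = (char) reader.read(); // now it is consumed
            char after = (char) peek(reader);     // the following character
            return "" + first + second + consumed + after;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints <<<a
        System.out.println(demo("<a>"));
    }
}
```

Readers that do not support marking, such as FileReader and InputStreamReader, throw an IOException from mark, so they need to be wrapped in a BufferedReader before being handed to a parser built on this primitive.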

The next method is skipWhitespace, which skips spaces, tabs, and line breaks in the input XML stream.

private void skipWhitespace() throws IOException {
    while (Character.isWhitespace((char) peek())) {
        reader.read();
    }
}

After creating the two methods as described above, we can write a function to retrieve the XML tags from the input file.

private String readTag() throws IOException {
    skipWhitespace();

    StringBuffer sb = new StringBuffer();

    // a tag must start with '<'
    int next = peek();
    if (next != '<') {
        throw new IOException("Expected < but got " + (char) next);
    }

    // copy everything up to and including the closing '>'
    sb.append((char) reader.read());
    while (peek() != '>') {
        sb.append((char) reader.read());
    }
    sb.append((char) reader.read());

    return sb.toString();
}
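
One case the method above does not appear to cover is input that ends in the middle of a tag: peek then keeps returning -1, which never equals '>', so the loop does not terminate on its own. The sketch below is one possible variant, not the article's implementation, that reports end of input with java.io.EOFException. The class name ReadTagSketch and the wrapper firstTag are assumptions added for the example.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadTagSketch {
    private final Reader reader;

    ReadTagSketch(Reader reader) {
        this.reader = reader;
    }

    private int peek() throws IOException {
        reader.mark(1);
        int result = reader.read();
        reader.reset();
        return result;
    }

    // Variant of readTag() that stops at end of input instead of looping.
    String readTag() throws IOException {
        int next;
        while ((next = peek()) != -1 && Character.isWhitespace((char) next)) {
            reader.read(); // skip leading whitespace
        }
        if (next == -1) {
            throw new EOFException("No more tags.");
        }
        if (next != '<') {
            throw new IOException("Expected < but got " + (char) next);
        }
        StringBuffer sb = new StringBuffer();
        while (true) {
            int c = reader.read();
            if (c == -1) {
                throw new EOFException("Unterminated tag: " + sb);
            }
            sb.append((char) c);
            if (c == '>') {
                return sb.toString();
            }
        }
    }

    // Hypothetical wrapper for experiments: the first tag, "EOF", or an error message.
    static String firstTag(String xml) {
        try {
            return new ReadTagSketch(new StringReader(xml)).readTag();
        } catch (EOFException e) {
            return "EOF";
        } catch (IOException e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTag("  <a href='x'>text")); // prints <a href='x'>
        System.out.println(firstTag("<a"));                  // prints EOF
    }
}
```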

Together with peek, readTag consumes exactly one tag and leaves everything else for other functions to handle. The last primitive is readText, which reads the text between XML tags.

private String readText() throws IOException {
    int[] cdata_start = {'<', '!', '[', 'C', 'D', 'A', 'T', 'A', '['};
    int[] cdata_end = {']', ']', '>'};

    StringBuffer sb = new StringBuffer();

    int[] next = new int[cdata_start.length];
    peek(next);
    if (compareIntArrays(next, cdata_start) == true) {
        // CDATA section: copy everything up to the closing "]]>" verbatim
        reader.skip(next.length);

        int[] buffer = new int[cdata_end.length];
        while (true) {
            peek(buffer);
            if (compareIntArrays(buffer, cdata_end) == true) {
                reader.skip(buffer.length);
                break;
            } else {
                sb.append((char) reader.read());
            }
        }
    } else {
        // ordinary text: read up to the start of the next tag
        while (peek() != '<') {
            sb.append((char) reader.read());
        }
    }
    return sb.toString();
}

The peek used here is a variant of the earlier peek method: it fills an array with the next several characters of the underlying XML document instead of returning just one. This variant lets the parser determine whether the text it is about to read is wrapped in a CDATA section. The compareIntArrays function is a simple helper that compares two integer arrays element by element.
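
The CDATA branch can be tried on its own with a small harness. In the sketch below the class name ReadTextSketch and the wrapper text are assumptions, the reader is passed as a parameter, java.util.Arrays.equals stands in for compareIntArrays, and end-of-input checks have been added that the listing above does not have.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

class ReadTextSketch {
    private static final int[] CDATA_START = {'<', '!', '[', 'C', 'D', 'A', 'T', 'A', '['};
    private static final int[] CDATA_END = {']', ']', '>'};

    // Fills buffer with the next buffer.length characters without consuming them.
    static void peek(Reader reader, int[] buffer) throws IOException {
        reader.mark(buffer.length);
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = reader.read();
        }
        reader.reset();
    }

    static String readText(Reader reader) throws IOException {
        StringBuffer sb = new StringBuffer();
        int[] next = new int[CDATA_START.length];
        peek(reader, next);
        if (Arrays.equals(next, CDATA_START)) {
            // CDATA section: copy verbatim until "]]>"
            reader.skip(next.length);
            int[] buffer = new int[CDATA_END.length];
            while (true) {
                peek(reader, buffer);
                if (Arrays.equals(buffer, CDATA_END)) {
                    reader.skip(buffer.length);
                    break;
                }
                if (buffer[0] == -1) {
                    throw new IOException("Unterminated CDATA section");
                }
                sb.append((char) reader.read());
            }
        } else {
            // ordinary text: stop at the next tag or at end of input
            int[] one = new int[1];
            peek(reader, one);
            while (one[0] != '<' && one[0] != -1) {
                sb.append((char) reader.read());
                peek(reader, one);
            }
        }
        return sb.toString();
    }

    // Hypothetical wrapper that hides the checked exception for quick experiments.
    static String text(String input) {
        try {
            return readText(new StringReader(input));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text("<![CDATA[a < b]]></x>")); // prints a < b
        System.out.println(text("plain</x>"));             // prints plain
    }
}
```

The first call shows why the look-ahead is needed: inside a CDATA section the '<' character is data, so a parser that stopped at the first '<' would cut the text short.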

XML parsing strategy and the SimpleDOMParser implementation

Unlike ordinary text, an XML document follows strict grammar rules, and a few of them make parsing much easier:

All tags in an XML document must match. Each start tag must have a matching end tag, unless the tag is an empty-element tag, which acts as both start and end tag and is written in the short form that ends with />. Tag and attribute names are case sensitive.

All tags in an XML document must be nested correctly; they cannot overlap. For example, a document in which an outer element's end tag appears before the end tag of an element opened inside it is not well formed.

With these rules in mind, the Simpledomparser parsing strategy should follow the pattern shown in the following pseudocode:

While not EOF (input XML document)
    Tag = next tag from the document
    LastOpenTag = top tag in Stack
    If Tag is an open tag
        Add Tag as the child of LastOpenTag
        Push Tag onto Stack
    Else
        // end tag
        If Tag is the matching close tag of LastOpenTag
            Pop Stack
            If Stack is empty
                Parse is complete
            End If
        Else
            // invalid tag nesting
            Report an error
        End If
    End If
End While

The key to this algorithm is the tag stack, which holds the start tags that have been read from the input but not yet matched with their end tags. The top of the stack is always the most recent open start tag.
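
The stack discipline of the pseudocode can be sketched in a few lines of Java. This is a simplified illustration that only checks nesting and does not build a tree. It assumes the tags have already been split into strings and carry no attributes; the class name NestingSketch and the method isWellNested are hypothetical.

```java
import java.util.Stack;

class NestingSketch {

    // Returns true if every start tag is closed by a matching end tag in the right order.
    static boolean isWellNested(String[] tags) {
        Stack<String> open = new Stack<>();
        for (String tag : tags) {
            if (tag.endsWith("/>")) {
                continue; // empty-element tag: opens and closes at once
            }
            if (tag.startsWith("</")) {
                // end tag: must match the most recent open start tag
                String name = tag.substring(2, tag.length() - 1);
                if (open.empty() || !open.pop().equals(name)) {
                    return false;
                }
            } else {
                // start tag: becomes the most recent open tag
                open.push(tag.substring(1, tag.length() - 1));
            }
        }
        return open.empty(); // anything left on the stack was never closed
    }

    public static void main(String[] args) {
        System.out.println(isWellNested(new String[] {"<a>", "<b>", "</b>", "</a>"})); // prints true
        System.out.println(isWellNested(new String[] {"<a>", "<b>", "</a>", "</b>"})); // prints false
    }
}
```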

Except for the first tag, each new start tag is a child of the most recent open start tag. The parser therefore adds the new tag as a child of that tag and then pushes it onto the stack, where it becomes the most recent open start tag. If, on the other hand, the incoming tag is an end tag, it must match the most recent open start tag; under the nesting rule, a mismatched end tag is an XML syntax error. When the end tag matches, the parser pops the start tag from the stack because that element is now complete. This continues until the stack is empty, at which point the entire document has been parsed. Code Snippet 2 gives the complete source of SimpleDOMParser, including the parse method.

SimpleDOMParser.java

package simpledomparser;

import java.io.Reader;
import java.io.IOException;
import java.io.EOFException;
import java.util.Stack;

public class SimpleDOMParser {
    private static final int[] cdata_start = {'<', '!', '[', 'C', 'D', 'A', 'T', 'A', '['};
    private static final int[] cdata_end = {']', ']', '>'};

    private Reader reader;                 // must support mark() and reset()
    private Stack elements;                // open elements that enclose currentElement
    private SimpleElement currentElement;  // innermost element that is still open

    public SimpleDOMParser() {
        elements = new Stack();
        currentElement = null;
    }

    public SimpleElement parse(Reader reader) throws IOException {
        this.reader = reader;

        // skip the XML declaration and any DOCTYPE
        skipPrologs();

        while (true) {
            int index;
            String tagName;

            String currentTag = readTag().trim();
            if (currentTag.startsWith("</")) {
                // end tag
                tagName = currentTag.substring(2, currentTag.length() - 1);

                // no start tag
                if (currentElement == null) {
                    throw new IOException("Got close tag '" + tagName +
                            "' without open tag.");
                }

                // end tag does not match start tag
                if (!tagName.equals(currentElement.getTagName())) {
                    throw new IOException("Expected close tag for '" +
                            currentElement.getTagName() + "' but got '" +
                            tagName + "'.");
                }

                if (elements.empty()) {
                    // end of document processing
                    return currentElement;
                } else {
                    // pop the previous start tag
                    currentElement = (SimpleElement) elements.pop();
                }
            } else {
                // start tag, or tag that is both start and end tag
                index = currentTag.indexOf(" ");
                if (index < 0) {
                    // tag with no attributes
                    if (currentTag.endsWith("/>")) {
                        // both start and end tag
                        tagName = currentTag.substring(1, currentTag.length() - 2);
                        currentTag = "/>";
                    } else {
                        // start tag
                        tagName = currentTag.substring(1, currentTag.length() - 1);
                        currentTag = "";
                    }
                } else {
                    // tag with attributes
                    tagName = currentTag.substring(1, index);
                    currentTag = currentTag.substring(index + 1);
                }

                // create the element
                SimpleElement element = new SimpleElement(tagName);

                // parse the attributes
                boolean isTagClosed = false;
                while (currentTag.length() > 0) {
                    // remove leading and trailing whitespace
                    currentTag = currentTag.trim();

                    if (currentTag.equals("/>")) {
                        // end tag
                        isTagClosed = true;
                        break;
                    } else if (currentTag.equals(">")) {
                        // start tag
                        break;
                    }

                    index = currentTag.indexOf("=");
                    if (index < 0) {
                        throw new IOException("Invalid attribute for tag '" +
                                tagName + "'.");
                    }

                    // get the attribute name
                    String attributeName = currentTag.substring(0, index);
                    currentTag = currentTag.substring(index + 1);

                    // get the attribute value
                    String attributeValue;
                    boolean isQuoted = true;
                    if (currentTag.startsWith("\"")) {
                        index = currentTag.indexOf('"', 1);
                    } else if (currentTag.startsWith("'")) {
                        index = currentTag.indexOf('\'', 1);
                    } else {
                        isQuoted = false;
                        index = currentTag.indexOf(' ');
                        if (index < 0) {
                            index = currentTag.indexOf('>');
                            if (index < 0) {
                                index = currentTag.indexOf('/');
                            }
                        }
                    }

                    if (index < 0) {
                        throw new IOException("Invalid attribute for tag '" +
                                tagName + "'.");
                    }

                    if (isQuoted) {
                        attributeValue = currentTag.substring(1, index);
                    } else {
                        attributeValue = currentTag.substring(0, index);
                    }

                    // add the attribute to the new element
                    element.setAttribute(attributeName, attributeValue);

                    currentTag = currentTag.substring(index + 1);
                }

                // read the text between the start and end tags
                if (!isTagClosed) {
                    element.setText(readText());
                }

                // add the new element as a child of the current element
                if (currentElement != null) {
                    currentElement.addChildElement(element);
                }

                if (!isTagClosed) {
                    if (currentElement != null) {
                        elements.push(currentElement);
                    }

                    currentElement = element;
                } else if (currentElement == null) {
                    // there is only one tag in the document
                    return element;
                }
            }
        }
    }

    // Returns the next character without consuming it.
    private int peek() throws IOException {
        reader.mark(1);
        int result = reader.read();
        reader.reset();

        return result;
    }

    // Fills buffer with the next buffer.length characters without consuming them.
    private void peek(int[] buffer) throws IOException {
        reader.mark(buffer.length);
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = reader.read();
        }
        reader.reset();
    }

    private void skipWhitespace() throws IOException {
        while (Character.isWhitespace((char) peek())) {
            reader.read();
        }
    }

    private void skipProlog() throws IOException {
        // skip "<?" or "<!"
        reader.skip(2);

        while (true) {
            int next = peek();

            if (next == '>') {
                reader.read();
                break;
            } else if (next == '<') {
                // nested prolog
                skipProlog();
            } else {
                reader.read();
            }
        }
    }

    private void skipPrologs() throws IOException {
        while (true) {
            skipWhitespace();

            int[] next = new int[2];
            peek(next);

            if (next[0] != '<') {
                throw new IOException("Expected '<' but got '" + (char) next[0] + "'.");
            }

            if ((next[1] == '?') || (next[1] == '!')) {
                skipProlog();
            } else {
                break;
            }
        }
    }

    private String readTag() throws IOException {
        skipWhitespace();

        StringBuffer sb = new StringBuffer();

        int next = peek();
        if (next != '<') {
            throw new IOException("Expected < but got " + (char) next);
        }

        sb.append((char) reader.read());
        while (peek() != '>') {
            sb.append((char) reader.read());
        }
        sb.append((char) reader.read());

        return sb.toString();
    }

    private String readText() throws IOException {
        StringBuffer sb = new StringBuffer();

        int[] next = new int[cdata_start.length];
        peek(next);
        if (compareIntArrays(next, cdata_start) == true) {
            // CDATA
            reader.skip(next.length);

            int[] buffer = new int[cdata_end.length];
            while (true) {
                peek(buffer);

                if (compareIntArrays(buffer, cdata_end) == true) {
                    reader.skip(buffer.length);
                    break;
                } else {
                    sb.append((char) reader.read());
                }
            }
        } else {
            while (peek() != '<') {
                sb.append((char) reader.read());
            }
        }

        return sb.toString();
    }

    // Returns true if both arrays have the same length and the same contents.
    private boolean compareIntArrays(int[] a1, int[] a2) {
        if (a1.length != a2.length) {
            return false;
        }

        for (int i = 0; i < a1.length; i++) {
            if (a1[i] != a2[i]) {
                return false;
            }
        }

        return true;
    }
}
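
The attribute loop inside parse is the densest part of the listing, so it can help to exercise it in isolation. The sketch below is a reduced version under stated assumptions: it takes one complete tag as a string, returns the attributes in a map, and, unlike the listing, rejects unquoted values instead of trying to delimit them. The class name AttributeSketch and the method attributes are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeSketch {

    // Extracts name/value pairs from a complete tag such as <a x="1" y='2'>.
    static Map<String, String> attributes(String tag) {
        Map<String, String> result = new LinkedHashMap<>();
        int index = tag.indexOf(" ");
        if (index < 0) {
            return result; // tag with no attributes
        }
        String rest = tag.substring(index + 1);
        while (rest.length() > 0) {
            rest = rest.trim();
            if (rest.equals("/>") || rest.equals(">")) {
                break; // reached the end of the tag
            }
            index = rest.indexOf("=");
            if (index < 0) {
                throw new IllegalArgumentException("Invalid attribute in " + tag);
            }
            String name = rest.substring(0, index);
            rest = rest.substring(index + 1);

            // the value must be wrapped in matching double or single quotes
            if (rest.startsWith("\"")) {
                index = rest.indexOf('"', 1);
            } else if (rest.startsWith("'")) {
                index = rest.indexOf('\'', 1);
            } else {
                throw new IllegalArgumentException("Unquoted attribute value in " + tag);
            }
            if (index < 0) {
                throw new IllegalArgumentException("Unterminated attribute value in " + tag);
            }
            result.put(name, rest.substring(1, index));
            rest = rest.substring(index + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(attributes("<a x=\"1\" y='2'>"));     // prints {x=1, y=2}
        System.out.println(attributes("<img src=\"a.png\"/>")); // prints {src=a.png}
    }
}
```

Each pass through the loop consumes one name="value" pair from the front of the remaining string, which is the same shrinking-string technique that parse applies to currentTag.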

For simplicity, SimpleDOMParser does not allow comments in XML documents and ignores XML declarations and DOCTYPE declarations entirely. It uses the following method to skip the XML declaration and the DOCTYPE declaration. The method calls itself recursively so that it can deal with a DOCTYPE that contains an internal DTD.

private void skipProlog() throws IOException {
    // skip "<?" or "<!"
    reader.skip(2);

    while (true) {
        int next = peek();

        if (next == '>') {
            reader.read();
            break;
        } else if (next == '<') {
            // nested declaration, as found in an internal DTD
            skipProlog();
        } else {
            reader.read();
        }
    }
}
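
To see the recursion at work, the sketch below runs the same skipping logic over a string that starts with an XML declaration and a DOCTYPE with an internal DTD, and returns whatever is left. The class name PrologSketch and the method afterPrologs are assumptions for the example, and an end-of-input check has been added that the method above does not have.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PrologSketch {
    private final Reader reader;

    PrologSketch(Reader reader) {
        this.reader = reader;
    }

    // Fills buffer with the next characters without consuming them.
    private void peek(int[] buffer) throws IOException {
        reader.mark(buffer.length);
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = reader.read();
        }
        reader.reset();
    }

    // Same logic as skipProlog() above, plus a check for end of input.
    private void skipProlog() throws IOException {
        reader.skip(2); // skip "<?" or "<!"
        int[] next = new int[1];
        while (true) {
            peek(next);
            if (next[0] == -1) {
                throw new IOException("Unterminated declaration");
            }
            if (next[0] == '>') {
                reader.read();
                return;
            }
            if (next[0] == '<') {
                skipProlog(); // nested declaration inside an internal DTD
            } else {
                reader.read();
            }
        }
    }

    // Skips leading whitespace, <?...?> and <!...> blocks, then returns the rest of the input.
    static String afterPrologs(String xml) {
        try {
            Reader r = new StringReader(xml);
            PrologSketch sketch = new PrologSketch(r);
            int[] next = new int[2];
            while (true) {
                sketch.peek(next);
                if (next[0] != -1 && Character.isWhitespace((char) next[0])) {
                    r.read();
                } else if (next[0] == '<' && (next[1] == '?' || next[1] == '!')) {
                    sketch.skipProlog();
                } else {
                    break;
                }
            }
            StringBuilder rest = new StringBuilder();
            for (int c = r.read(); c != -1; c = r.read()) {
                rest.append((char) c);
            }
            return rest.toString();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE a [<!ELEMENT a (#PCDATA)>]>\n"
                + "<a>hi</a>";
        System.out.println(afterPrologs(xml)); // prints <a>hi</a>
    }
}
```

Without the recursive call, the '>' that closes the nested ELEMENT declaration would be taken as the end of the DOCTYPE, and the parser would then trip over the remaining "]>".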

Although the SimpleDOMParser described in this article has only limited functionality, it is still useful for many simple applications. For example, a Java applet can use it to exchange XML-formatted data with a back-end server application. Because it is so lightweight, SimpleDOMParser is especially attractive in resource-constrained environments. Its implementation is also very simple: although the current version keeps only elements and discards declarations and DOCTYPE, you can readily modify it to handle the XML text you need to work with.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
