Implementing the XMLReader interface: Completing XI

XML columnist Benoit Marchal continues his description of XI, an open source project that converts legacy text into XML. To improve efficiency, XI now implements the SAX XMLReader interface, which makes it easy to link XI to an XSLT processor. Code samples demonstrate these techniques, and the complete source code is also available. Each month, this column reports on the author's open source projects, which aim to help like-minded XML developers, especially those using Java technology.
In the last two columns, I have been discussing XI (short for XML Import), a project that converts legacy files into XML (see Resources). The motivation for XI came from the need to publish an address book as part of an XML site. Because the address book is maintained in the proprietary format of an e-mail client, I need to convert the text into XML.

I took this opportunity to try the new regular expression library built into JDK 1.4. Regular expressions make for a flexible conversion solution: I can describe how to parse legacy documents as a set of regular expressions rather than hard-coding the conversion routines. I use one set of rules for the address book, but I can write different rules for calendars, chemical analysis data, Web server logs, or other formats. That makes XI a general tool that you or I can use and reuse in many projects.

Now on to XML
In the last column, "Wrestling with Java NIO" (see Resources), I spent a considerable amount of time studying the regular expression library. It turned out that some of my assumptions were wrong, but I managed to use regular expressions to parse the address book into elements.

Because my goal is a general solution, I created a small data structure to hold the set of rules. It associates XML tag names with regular expressions. Although I limited myself to a fixed data structure for testing, I organized the code so that it would be simple to populate the data structure from a file, a feature that I have now implemented.
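
As a rough illustration of such a data structure, the sketch below pairs an element name with a compiled regular expression. The class name SimpleRule, the sample line, and the pattern are assumptions made for this example; they are not XI's actual classes or rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for XI's rule classes: one element name per regular expression.
final class SimpleRule {
    final String elementName;
    final Pattern pattern;

    SimpleRule(String elementName, String regex) {
        this.elementName = elementName;
        this.pattern = Pattern.compile(regex);
    }

    // Returns the captured groups if the whole line matches, or null otherwise.
    List<String> match(String line) {
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> groups = new ArrayList<>();
        for (int j = 1; j <= m.groupCount(); j++) {
            groups.add(m.group(j));
        }
        return groups;
    }

    public static void main(String[] args) {
        // The line format and the pattern are invented for the example.
        SimpleRule rule = new SimpleRule("alias", "^alias (\\S+) (.*)$");
        System.out.println(rule.elementName + " " + rule.match("alias jdoe jdoe@example.org"));
    }
}
```

Keeping the name and the pattern together means that a new file format only needs new data, not new parsing code.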

This column is about cleaning up the code and making sure that it produces a valid XML document. I also wrap the existing algorithm in an XML parser interface. As you will see, the XML parser interface makes it easy to work with an XSLT processor.

The best way to write XML documents
The easiest way to complete XI would be to revisit the code and modify the various print statements so that they write XML tags. Indeed, the logic that parses the document and associates XML elements with the parsed text is already in place. For example, when the logic matches a regular expression, the algorithm prints the name of the associated element:

System.out.print(ruleset.getMatchAt(i).getQualifiedName());


It is not difficult to modify this statement to produce the correct XML:

System.out.print("<" + ruleset.getMatchAt(i).getQualifiedName() + ">");


Of course, the statement above only prints the start tag, so I would need more print statements for the end tags and the content, but that is not hard to do.

If I were only interested in writing XML documents, this is probably what I would do, because it is the least laborious solution. I would have to take care to escape angle brackets, ampersands, and other reserved characters, but that is trivial. I might also want to save the XML document in a file instead of printing it to the console but, again, that is trivial.
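
The escaping mentioned above could look like the following minimal sketch. The class name XmlEscape is an assumption for illustration; it is not part of XI.

```java
// Minimal sketch: escape the characters that are reserved in XML text and attribute values.
final class XmlEscape {
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Wraps an escaped value in a hand-written pair of tags.
        System.out.println("<email>" + escape("John Doe <jdoe@example.org>") + "</email>");
    }
}
```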

However, I am not interested in writing XML documents to a file. As you may recall from the previous columns, I do not intend to use XI's output directly. Experience has shown that legacy documents often need to be rearranged. For example, in the address book, I must group the alias and note lines together. I could add logic to XI to handle this and similar situations, but I find it advantageous to split the import process into two steps:

1. Syntax conversion
2. Data structure reorganization

The syntax conversion takes the text and wraps it in the simplest possible XML structure. Typically, the resulting XML document is very close to the original file. In most cases, it is as simple as replacing delimiters with XML tags. This is what XI does.

The second step transforms the raw XML document into the target vocabulary. I find XSLT especially good for this purpose because it is a powerful transformation language. And because XSLT is a standard, there is no shortage of supporting tools, such as editors.
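
To make the second step concrete, here is a small sketch that applies a style sheet through the standard javax.xml.transform API. The style sheet and the element names (raw, alias, id, email, address-book, entry) are invented for the example and are not XI's actual vocabulary.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class RestructureStep {
    // Invented style sheet: turns each <alias> child of <raw> into an <entry id="..."> element.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/raw'>"
      + "<address-book><xsl:apply-templates select='alias'/></address-book>"
      + "</xsl:template>"
      + "<xsl:template match='alias'>"
      + "<entry id='{id}'><xsl:value-of select='email'/></entry>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the style sheet to the raw XML and returns the restructured document.
    static String restructure(String rawXml) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(rawXml)), new StreamResult(out));
        return out.toString();
    }
}
```

The raw document stays close to the legacy file, and all restructuring decisions live in the style sheet, where they are easy to change.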

In short, I do not need XI to write an XML document to a file; I would rather optimize it to interact with an XSLT processor. JDK 1.4 ships with a version of Apache Xalan that accepts input from files (streams), SAX events, and DOM trees. Of these three interfaces, I personally like SAX best.

SAX is attractive because it is easy to program and is a fairly efficient interface for working with XML documents. Compared to a file, it avoids writing the content to a temporary file, and it requires less memory than a DOM tree.

Programming a SAX interface
In the remainder of this article, I assume that you are familiar with SAX programming. If you are not, you may want to read "SAX, the power API" (see Resources), also on developerWorks.

Two of the most important interfaces in SAX are XMLReader and ContentHandler. XMLReader describes how to initialize and start an XML parser, while ContentHandler lists the events that an XMLReader emits as it parses an XML document.
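
For readers who want a reminder of how the two interfaces fit together from the user's side, the sketch below registers a ContentHandler with the JDK's own XMLReader and records the events it receives. The class name SaxEventLog is an assumption for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventLog {
    // Parses the document and returns the events the XMLReader reported, in order.
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // DefaultHandler implements ContentHandler with empty methods.
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                log.add("chars " + new String(ch, start, length));
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }
}
```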

You have probably used both interfaces when reading XML documents. However, even if you are familiar with them, this application requires that you look at SAX from a slightly different perspective. In this case, I am writing my own parser instead of being a user of SAX. Strictly speaking, XI is not an XML parser because it does not read XML documents. However, it provides an XML view of a text document, so it makes sense for it to conform to the XMLReader interface.

The SAX implementation of XI is in the class XIReader. The class is too large to reproduce in full here. Before proceeding, I encourage you to obtain a copy from the "Open source" section of developerWorks (see Resources).

XIReader handles two concerns: implementing the SAX interface, and the actual text parsing and XML document generation. Listing 1 illustrates the implementation of the interface.

Listing 1: SAX implementation of XIReader

public class XIReader
   implements XMLReader, Locator
{
   // the handler that receives the parsing events
   protected ContentHandler contentHandler = null;

   public ContentHandler getContentHandler()
   {
      return contentHandler;
   }

   public void setContentHandler(ContentHandler value)
      throws NullPointerException
   {
      // SAX requires that a null handler be rejected
      if(value == null)
         throw new NullPointerException("contentHandler");
      else
         contentHandler = value;
   }

   // ...
}


To support XMLReader, XIReader provides methods for registering and accessing the various SAX handlers: ContentHandler, ErrorHandler, DTDHandler, and EntityResolver.

Strictly speaking, DTDHandler and EntityResolver are not used. Legacy text has no DTD, so XIReader never issues a DTD-related event.

Likewise, there is no need for the EntityResolver. If you think about it, a parser should not use it for the top-level document entity; the interface is useful only for external entities (such as DTDs). Therefore, it is not used for legacy text documents. Still, SAX mandates methods to set and get both handlers, so XIReader has to provide them.

XIReader also implements limited support for SAX features and properties. Features and properties control various aspects of parsing; they are identified by URLs such as http://xml.org/sax/features/namespaces. Note that the URLs serve only as identifiers, so do not try to open them: there is no Web site to visit.

The specification states that an XMLReader must support setting http://xml.org/sax/features/namespaces to true (supporting false is optional) and setting http://xml.org/sax/features/namespace-prefixes to false (supporting true is optional).

The first feature controls whether the parser performs XML namespace processing (true). XIReader always processes namespaces. The second feature controls whether namespace declarations are reported in the attribute list (true). XIReader supports both values.

As you can see, XIReader provides minimal conformance. Still, I found it necessary to support setting http://xml.org/sax/features/namespace-prefixes to true (and not just to false, as the specification requires) because Apache Xalan needs to set the feature to true to handle namespaces correctly.

The specification defines other features and their URLs, but does not require parsers to support them. Because most of these features deal with validation and XML schemas, I chose to omit them.
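
The feature handling described above can be sketched as follows. This is not XIReader's code: the class name FeatureSupport is hypothetical, and a real reader would implement these methods as part of org.xml.sax.XMLReader.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

// Sketch of feature handling only; the rest of the XMLReader interface is left out.
final class FeatureSupport {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES = "http://xml.org/sax/features/namespace-prefixes";

    private boolean namespacePrefixes = false;

    void setFeature(String name, boolean value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (NAMESPACES.equals(name)) {
            // Namespace processing is always on, so only true is accepted.
            if (!value) {
                throw new SAXNotSupportedException(name + " cannot be set to false");
            }
        } else if (NAMESPACE_PREFIXES.equals(name)) {
            // Both values are supported.
            namespacePrefixes = value;
        } else {
            // Validation and the other optional features are not recognized.
            throw new SAXNotRecognizedException(name);
        }
    }

    boolean getFeature(String name) throws SAXNotRecognizedException {
        if (NAMESPACES.equals(name)) {
            return true;
        }
        if (NAMESPACE_PREFIXES.equals(name)) {
            return namespacePrefixes;
        }
        throw new SAXNotRecognizedException(name);
    }
}
```

Throwing SAXNotRecognizedException for unknown URLs and SAXNotSupportedException for a known feature with an unsupported value follows the distinction the SAX API makes between the two cases.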

I also defined a new property, http://ananas.org/xi/features/rulesets, to provide the parser with its rule file. This property accepts an InputSource that points to the rule file.

ContentHandler and parsing
In the code discussed in the previous column, "Wrestling with Java NIO" (see Resources), much of the processing took place in a method named read(). I have renamed it match() to improve readability and modified it to call the ContentHandler as it decodes the input document. Listing 2 illustrates this. If you compare this code with the code in "Wrestling with Java NIO", you will find that the structure is very similar. The only significant difference is that the print() statements have been replaced by calls to the ContentHandler.

Listing 2: match() and ContentHandler

public void match(Ruleset ruleset,String st,boolean firstMatch)
   throws SAXException
{
   attributes.clear();
   int i = 0;
   // try each regular expression of the ruleset in turn
   while(i < ruleset.getMatchCount())
   {
      if(ruleset.getMatchAt(i).matches(st))
      {
         Match match = ruleset.getMatchAt(i);
         // open the element only on the first pass over this text
         if(firstMatch && contentHandler != null)
            contentHandler.startElement(match.getNamespaceURI(),
                                        match.getLocalName(),
                                        match.getQualifiedName(),
                                        attributes);
         for(int j = 1;j <= match.getGroupCount();j++)
         {
            QName qname = match.getGroupNameAt(j);
            Ruleset nextRuleset = (Ruleset)rulesetsMap.get(qname);
            // a group named after a ruleset is parsed recursively
            if(nextRuleset != null)
               match(nextRuleset,match.getGroupValueAt(j),true);
            else
            {
               Group group = match.getGroupNameAt(j);
               if(contentHandler != null)
               {
                  contentHandler.startElement(group.getNamespaceURI(),
                                              group.getLocalName(),
                                              group.getQualifiedName(),
                                              attributes);
                  String value = match.getGroupValueAt(j);
                  int begin = 0,
                      end = 0;
                  // copy the value through the chars buffer, one chunk at a time
                  while(begin < value.length())
                  {
                     if(value.length() - begin < chars.length)
                        end = value.length();
                     else
                        end = begin + chars.length;
                     value.getChars(begin,end,chars,0);
                     contentHandler.characters(chars,0,end - begin);
                     begin = end;
                  }
                  contentHandler.endElement(group.getNamespaceURI(),
                                            group.getLocalName(),
                                            group.getQualifiedName());
               }
            }
         }
         // whatever the expression did not consume is matched again
         String rest = match.rest();
         if(rest != null)
            match(ruleset,rest,false);
         if(firstMatch && contentHandler != null)
            contentHandler.endElement(match.getNamespaceURI(),
                                      match.getLocalName(),
                                      match.getQualifiedName());
         break;
      }
      else
         i++;
   }
   // no expression matched: report the ruleset's error message, if any
   if(i >= ruleset.getMatchCount()
      && ruleset.getError() != null
      && errorHandler != null)
      errorHandler.error(new SAXParseException(ruleset.getError(),this));
}
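
The buffer loop in the middle of Listing 2 is easier to follow in isolation. The sketch below reproduces the same chunking logic outside of SAX; the class name ChunkedCharacters is an assumption, and each returned string stands for one call to characters().

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedCharacters {
    // Copies value through a fixed-size buffer; bufferSize must be greater than zero.
    static List<String> chunks(String value, int bufferSize) {
        char[] chars = new char[bufferSize];
        List<String> calls = new ArrayList<>();
        int begin = 0;
        while (begin < value.length()) {
            // The last chunk may be shorter than the buffer.
            int end = Math.min(value.length(), begin + chars.length);
            value.getChars(begin, end, chars, 0);
            calls.add(new String(chars, 0, end - begin));
            begin = end;
        }
        return calls;
    }
}
```

Reusing one buffer avoids allocating a new character array for every value, which is presumably why the listing copies the string in pieces.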


If you remember XM, the publishing project introduced in the first XML column, you are already familiar with emitting ContentHandler events. XM does this to fix dangling hyperlinks. XIReader is built on the same logic, but it is more ambitious: it emits enough events to describe a complete document, rather than just the events for a link.

I admit that I was initially wary of writing a complete XMLReader implementation. But, as this column shows, it is surprisingly easy, which just goes to show that SAX is indeed, as its name suggests, a simple API for XML.

Using ContentHandler is particularly simple. Consider that you would normally write methods to print start tags, end tags, and content. These methods would deal with character escaping, indentation, and other syntax-related issues. ContentHandler essentially defines these methods for you: call startElement() to print a start tag, endElement() to print an end tag, and characters() to print content.
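
As a small illustration of this point, the sketch below writes one element through ContentHandler calls and lets the JDK's identity TransformerHandler take care of the serialization, including escaping. The class and method names are assumptions made for the example.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class EmitElement {
    // Writes one element without a namespace through ContentHandler calls.
    static void emit(ContentHandler handler, String name, String text) throws SAXException {
        handler.startElement("", name, name, new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", name, name);
    }

    // Serializes the element to a string; the handler does the escaping.
    static String serialize(String name, String text) throws Exception {
        // The JDK's default factory supports the SAX extensions.
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = factory.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));
        serializer.startDocument();
        emit(serializer, name, text);
        serializer.endDocument();
        return out.toString();
    }
}
```

The emit() method never deals with angle brackets or entities; that work moves to whichever ContentHandler receives the events.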

Reading rule files
With the XMLReader in place, I wanted to enable XI to read rule files. The address book is not the only application for XI, so I wanted to break free of hard-coded regular expressions.

I have mostly kept the vocabulary that I introduced in the previous two columns. A rule file looks like Listing 3. The root element is rules; it contains one or more ruleset elements.

Each ruleset contains a list of match elements, each of which represents a regular expression. The error element specifies what to report when XI cannot match any of the regular expressions. Finally, group elements represent the groups in a regular expression. Each match and group element carries a name, which is the element name that XI uses in its output.

Listing 3: rules.xml

<?xml version="1.0"?>
<xi:rules version="1.0"
          xmlns:xi="http://ananas.org/2002/xi/rules"
          defaultPrefix="an"
          targetNamespace="http://ananas.org/2002/sample">
   <xi:ruleset name="address-book">
      <xi:match name="alias"
                pattern="^alias (.*):(.*)$">
         <xi:group name="id"/>
         <xi:group name="email"/>
      </xi:match>
      <xi:match name="note"
                pattern="^note .*:(.*)$">
         <xi:group name="fields"/>
      </xi:match>
      <xi:error message="unknown line type"/>
   </xi:ruleset>
   <xi:ruleset name="fields">
      <xi:match name="field"
                pattern="[\s]*&lt;([^&lt;]*)>">
         <xi:group name="field"/>
      </xi:match>
   </xi:ruleset>
</xi:rules>


I made one change between the original vocabulary and the vocabulary in Listing 3: the document now supports a single global namespace that applies to the entire rule file. My initial idea was to let the user specify multiple namespaces in the rule file, but that made XIReader unnecessarily complicated.

As I dug into the problem, I realized that a global namespace meets 99% of the requirements. But what if you really need multiple namespaces? You can still have them because, in any case, the document is always processed with XSLT. Adding a new namespace in the style sheet is simple.

HC to the rescue
One of the pleasures of writing this column is that I can reuse earlier projects as the column progresses. In this case, I use HC, the Handler Compiler introduced a few months ago, to simplify parsing the rule file.

If you have not read the corresponding column, HC is a compiler that takes a Java class annotated with XPaths and turns it into a SAX ContentHandler. Each method in the class matches one or more XPaths. In practice, it saves writing a lot of tedious state-management code.
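
To show the kind of state management that is being avoided, here is a sketch of a hand-written handler that tracks the current element path itself. It does not use HC; the class name PathTrackingHandler and the unprefixed element names are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hand-written path tracking: bookkeeping that an XPath-annotated handler would not need.
final class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    final List<String> matchNames = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(localName);
        // Manual equivalent of matching the path rules/ruleset/match.
        if (String.join("/", path).equals("rules/ruleset/match")) {
            matchNames.add(atts.getValue("name"));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.removeLast();
    }

    // Parses the document and returns the name attribute of every matched element.
    static List<String> matchNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        PathTrackingHandler handler = new PathTrackingHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.matchNames;
    }
}
```

With only one path to recognize the bookkeeping is small, but it grows with every additional element the handler has to distinguish.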

Listing 4 is the handler for the rule file. You can see the XPaths in the Javadoc comments. The handler defines one method for each element in the rule vocabulary. As it walks through a rule file, it populates the data structure with the regular expressions.

Listing 4: RulesHandler.java

package org.ananas.xi;

import java.util.*;
import org.xml.sax.*;

/**
 * @xmlns xi http://ananas.org/2002/xi/rules
 */
public class RulesHandler
   implements org.ananas.hc.HCHandler
{
   private String namespaceURI = null;
   private String prefix = null;
   private List rulesets = null;

   private Ruleset getLastRuleset()
   {
      return (Ruleset)rulesets.get(rulesets.size() - 1);
   }

   /**
    * @xpath xi:rules
    */
   public void init(Attributes attributes)
   {
      rulesets = new ArrayList();
      namespaceURI = attributes.getValue("targetNamespace");
      prefix = attributes.getValue("defaultPrefix");
      // treat empty attributes as missing
      if(namespaceURI != null)
      {
         namespaceURI = namespaceURI.trim();
         if(namespaceURI.equals(""))
            namespaceURI = null;
      }
      if(prefix != null)
      {
         prefix = prefix.trim();
         if(prefix.equals(""))
            prefix = null;
      }
   }

   /**
    * @xpath xi:rules/xi:ruleset
    */
   public void doRuleset(Attributes attributes)
      throws SAXException
   {
      String name = attributes.getValue("name");
      if(name != null)
         rulesets.add(new Ruleset(namespaceURI,
                                  name,
                                  prefix));
      else
         throw new SAXException("name attribute required for xi:ruleset");
   }

   /**
    * @xpath xi:rules/xi:ruleset/xi:match
    */
   public void doMatch(Attributes attributes)
      throws SAXException
   {
      String name = attributes.getValue("name"),
             pattern = attributes.getValue("pattern");
      if(name != null && pattern != null)
      {
         Ruleset ruleset = getLastRuleset();
         ruleset.addMatch(new Match(namespaceURI,
                                    name,
                                    prefix,
                                    pattern));
      }
      else
         throw new SAXException("name and pattern attributes " +
                                "required for xi:match");
   }

   /**
    * @xpath xi:rules/xi:ruleset/xi:error
    */
   public void doError(Attributes attributes)
      throws SAXException
   {
      String message = attributes.getValue("message");
      if(message != null)
      {
         Ruleset ruleset = getLastRuleset();
         // only one error message is allowed per ruleset
         if(ruleset.getError() == null)
            ruleset.setError(message);
         else
            throw new SAXException("no more than one error per xi:ruleset");
      }
      else
         throw new SAXException("message attribute required for xi:error");
   }

   /**
    * @xpath xi:rules/xi:ruleset/xi:match/xi:group
    */
   public void doGroup(Attributes attributes)
      throws SAXException
   {
      String name = attributes.getValue("name");
      if(name != null)
      {
         Ruleset ruleset = getLastRuleset();
         Match match = ruleset.getLastMatch();
         match.addGroup(new Group(namespaceURI,
                                  name,
                                  prefix));
      }
      else
         throw new SAXException("name attribute required for xi:group");
   }

   public Ruleset[] getRulesets()
   {
      Ruleset[] array = new Ruleset[rulesets.size()];
      return (Ruleset[])rulesets.toArray(array);
   }

   public String getNamespaceURI()
   {
      return namespaceURI;
   }

   public String getPrefix()
   {
      return prefix;
   }
}


Until next time
Work on XI is nearing completion. You now have a working processor and, as Listing 5 shows, it is easy to connect it to an XSLT processor. In the next column, I will wrap a simple user interface around the existing core to make XI more usable.

Listing 5: Sample main method

public static void main(String[] params)
   throws TransformerException, TransformerConfigurationException,
          SAXException, IOException
{
   // the legacy text file to convert
   InputSource inputSource = new InputSource(new FileInputStream(params[0]));
   inputSource.setSystemId(params[0]);
   XMLReader xmlReader =
      XMLReaderFactory.createXMLReader("org.ananas.xi.XIReader");
   // tell the reader where to find its rules
   xmlReader.setProperty(XIReader.RULESETS_URI,new InputSource("rules.xml"));
   TransformerFactory factory = TransformerFactory.newInstance();
   // no style sheet: the identity transformer writes the events to a file
   Transformer transformer = factory.newTransformer();
   transformer.transform(new SAXSource(xmlReader,inputSource),
                         new StreamResult("result.xml"));
}
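
Because Listing 5 depends on the XI classes, here is a self-contained sketch of the same wiring that runs with the JDK alone. LineReader is a toy reader invented for this example, not XI: it reports each line of text as a line element, accepts every feature and property without acting on it, and only reads InputSource objects that carry a character stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

// Toy reader (not XI): reports each input line as a <line> element inside <lines>.
final class LineReader implements XMLReader {
    private ContentHandler contentHandler;
    private ErrorHandler errorHandler;
    private DTDHandler dtdHandler;
    private EntityResolver entityResolver;

    public void parse(InputSource input) throws IOException, SAXException {
        if (contentHandler == null) {
            return;
        }
        // Assumption of this sketch: the InputSource carries a character stream.
        BufferedReader in = new BufferedReader(input.getCharacterStream());
        AttributesImpl none = new AttributesImpl();
        contentHandler.startDocument();
        contentHandler.startElement("", "lines", "lines", none);
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            contentHandler.startElement("", "line", "line", none);
            contentHandler.characters(line.toCharArray(), 0, line.length());
            contentHandler.endElement("", "line", "line");
        }
        contentHandler.endElement("", "lines", "lines");
        contentHandler.endDocument();
    }

    public void parse(String systemId) throws IOException, SAXException {
        throw new SAXException("this sketch only reads character streams");
    }

    // Simplification: every feature and property is accepted and ignored.
    public boolean getFeature(String name) { return false; }
    public void setFeature(String name, boolean value) { }
    public Object getProperty(String name) { return null; }
    public void setProperty(String name, Object value) { }

    public void setContentHandler(ContentHandler h) { contentHandler = h; }
    public ContentHandler getContentHandler() { return contentHandler; }
    public void setErrorHandler(ErrorHandler h) { errorHandler = h; }
    public ErrorHandler getErrorHandler() { return errorHandler; }
    public void setDTDHandler(DTDHandler h) { dtdHandler = h; }
    public DTDHandler getDTDHandler() { return dtdHandler; }
    public void setEntityResolver(EntityResolver r) { entityResolver = r; }
    public EntityResolver getEntityResolver() { return entityResolver; }

    // Feeds the reader's events to an identity transformer and returns the XML text.
    static String toXml(String text) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(
            new SAXSource(new LineReader(), new InputSource(new StringReader(text))),
            new StreamResult(out));
        return out.toString();
    }
}
```

Replacing the identity transformer with one built from a style sheet would apply the restructuring step to the same stream of events.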

