About this tutorial
In this tutorial, we'll discuss how to use an XML parser to:
Process an XML document
Create an XML document
Manipulate an XML document
We will also discuss some useful XML parser features that are not widely known. Best of all, every tool we discuss is available free of charge from IBM's alphaWorks site (www.alphaworks.ibm.com) and other sites.
Not discussed:
Some important programming concepts are not covered here:
1 Using visual tools to build XML applications
2 Transforming an XML document from one form to another
3 Creating interfaces for end users or other processes, and interfaces to back-end data stores
All of these concepts are important when you build an XML application. We are preparing new tutorials that discuss them, so please visit our website regularly!
XML Application Architecture
An XML application is usually built around an XML parser. The application provides an interface to its users, as well as an interface to a back-end data store.
This tutorial focuses on writing Java code that uses an XML parser to manipulate XML documents. In the image below, that is the middle piece.
Chapter II Parser Basics
The basics
An XML parser is a piece of code that reads a document and analyzes its structure. In this section, we'll discuss how to use an XML parser to read an XML document. We will also discuss the different types of parsers and when you might use each.
Later chapters of this tutorial discuss what you get back from the parser and how to use those results.
How to use a parser
We'll discuss this in detail in a later section, but in general, you use a parser as follows:
1 Create a parser object
2 Pass your XML document to the parser
3 Process the results
Building an XML application obviously involves much more than this, but most XML applications include these steps.
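The three steps can be sketched in a few lines of Java. This is a minimal sketch that assumes the JAXP DocumentBuilder API that ships with the JDK, not the XML4J parser used elsewhere in this tutorial; the class name ParserSteps and the inline sample document are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class name; uses the JDK's built-in JAXP parser, not XML4J.
class ParserSteps {
    // Returns the name of the root element of the given XML text.
    static String rootName(String xml) {
        try {
            // 1. Create a parser object.
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // 2. Pass the XML document to the parser.
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            // 3. Process the results.
            return doc.getDocumentElement().getTagName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sonnet type=\"Shakespearean\"><title>Sonnet 130</title></sonnet>";
        System.out.println(rootName(xml)); // prints "sonnet"
    }
}
```

A real application would do far more in step 3, but the shape of the code stays the same.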
Types of parsers
There are several ways to classify parsers:
Validating versus non-validating parsers
Parsers that support the Document Object Model (DOM)
Parsers that support the Simple API for XML (SAX)
Parsers written in a particular language (Java, C++, Perl, etc.)
Validating versus non-validating parsers
As we mentioned in the first tutorial, XML documents that use a DTD and follow the rules defined in that DTD are called valid documents. XML documents that follow the basic tagging rules are called well-formed documents.
The XML specification requires all parsers to report an error when they find that a document is not well-formed. Validation is a different issue. A validating parser validates the XML document as it parses it. A non-validating parser ignores any validation errors. In other words, if an XML document is well-formed, a non-validating parser doesn't care whether the document follows the rules specified in its DTD (if any).
Why use a non-validating parser?
Speed and efficiency. It takes considerable overhead for an XML parser to process a DTD and make sure that every element in the XML document follows the rules of the DTD. If you are sure that an XML document is valid (perhaps because of the data source it came from), there is no need to validate it again.
Also, sometimes all you need is to find the XML tags in a document. Once you have the tags, you can extract the data from them and process it. If that is all you need, a non-validating parser is the right choice.
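To make the difference concrete, here is a hedged sketch that parses the same well-formed but invalid document twice, once without validation and once with it. It assumes the JDK's JAXP API (DocumentBuilderFactory.setValidating) rather than XML4J, and the class name ValidationSketch and the tiny note document are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; the sample document below is made up for illustration.
class ValidationSketch {
    // Well-formed, but not valid: the DTD requires <to>, the document contains <from>.
    static final String NOTE =
        "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>"
        + "<note><from>Will</from></note>";

    // Parses the text and returns how many validity errors the parser reported.
    static int validityErrors(String xml, boolean validating) {
        final int[] errors = {0};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating); // switch DTD validation on or off
            DocumentBuilder parser = factory.newDocumentBuilder();
            parser.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors[0]++; // validity errors are recoverable, so just count them
                }
            });
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // A document that is not well-formed ends up here in either mode.
            throw new IllegalStateException("Document could not be parsed", e);
        }
        return errors[0];
    }

    public static void main(String[] args) {
        System.out.println("non-validating: " + validityErrors(NOTE, false) + " errors");
        System.out.println("validating:     " + validityErrors(NOTE, true) + " errors");
    }
}
```

The non-validating run reports no errors because the document is well-formed; only the validating run notices that the content does not match the DTD.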
Document Object Model (DOM)
The Document Object Model is an official recommendation of the World Wide Web Consortium (W3C). It defines an interface that enables programs to access and update the style, structure, and content of XML documents. XML parsers that support the DOM implement that interface.
The first version of the specification, DOM Level 1, is available at http://www.w3.org/TR/REC-DOM-Level-1, if you enjoy reading specifications.
What a DOM parser gives you
When you parse an XML document with a DOM parser, you get back a tree structure that contains all of the elements in your document. The DOM provides a variety of functions you can use to examine the content and structure of the document.
A word about standards
Now that we are getting into developing XML applications, we should also mention the XML standards. Officially, XML is a trademark of MIT and a product of the World Wide Web Consortium (W3C).
The XML specification, an official recommendation of the W3C, is available at www.w3.org/TR/REC-xml. The W3C site contains specifications for XML, the DOM, and a large number of other XML-related standards.
The Simple API for XML (SAX)
The SAX API is an alternative way of working with the contents of XML documents. It is a de facto standard that was developed by David Megginson and other members of the XML-DEV mailing list.
To see the complete SAX standard, see www.megginson.com/SAX/. To subscribe to the XML-DEV mailing list, send a message to majordomo@ic.ac.uk containing the following: subscribe xml-dev.
What a SAX parser gives you
When you parse an XML document with a SAX parser, the parser generates events at various points in the document. It is up to you to decide what to do with each of those events.
A SAX parser generates events at the start and end of a document, at the start and end of an element, when it finds characters inside an element, and at several other points. You write the Java code that handles each event, and you decide what to do with the information you get from the parser.
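The event model is easiest to see by logging the events as they arrive. The following is a minimal sketch that assumes the JDK's JAXP SAXParser and the standard org.xml.sax.helpers.DefaultHandler class; the class name SaxEventLog is hypothetical. Note that a parser is allowed to deliver the text of one element in several characters calls.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; records each SAX event as a line of text.
class SaxEventLog {
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() { log.add("startDocument"); }

            @Override
            public void endDocument() { log.add("endDocument"); }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("startElement " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("endElement " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                log.add("characters \"" + new String(ch, start, length) + "\"");
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<title>Sonnet 130</title>")) {
            System.out.println(event);
        }
    }
}
```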
When should I use SAX? When should I use DOM?
We'll talk about this in more detail in a later section, but in general, you should use a DOM parser when:
You need to know a lot about the structure of your document
You need to move parts of the document around (you might want to sort certain elements, for example)
You need to use the information in the document more than once
Use a SAX parser when you only need to extract a few elements from an XML document. SAX parsers are also a good fit when you don't have much memory to work with, or when you only need to use the information in the document once (as opposed to parsing the document once and then using it again and again).
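As an illustration of the "extract a few elements" case, the sketch below pulls the text of the first title element out of a document without building a tree. It assumes the JDK's JAXP SAX parser; the class name SaxTitleExtractor and the choice of the title element are ours, not part of the tutorial's sample code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; keeps only the text of the first <title> element.
class SaxTitleExtractor {
    static String firstTitle(String xml) {
        final StringBuilder title = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle = false;
            private boolean done = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!done && qName.equals("title")) {
                    inTitle = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so append rather than assign.
                if (inTitle) {
                    title.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (inTitle && qName.equals("title")) {
                    inTitle = false;
                    done = true;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
        return title.toString();
    }

    public static void main(String[] args) {
        String xml = "<sonnet><author><last-name>Shakespeare</last-name></author>"
            + "<title>Sonnet 130</title></sonnet>";
        System.out.println(firstTitle(xml)); // prints "Sonnet 130"
    }
}
```

Everything except the title is discarded as soon as its event has been handled, which is why this approach needs so little memory.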
XML parsers in different languages
Most languages used on the Web have XML parsers and libraries available, including Java, C++, Perl, and Python. The following sections list links to parsers from IBM and other vendors.
The vast majority of the examples in this tutorial use IBM's XML4J parser. All of the code we discuss uses standard interfaces. In the final chapter of this tutorial, we will show you how easy it is to write code that works with different parsers.
Java
IBM's parser, XML4J, is available at www.alphaWorks.ibm.com/tech/xml4j.
James Clark's parser, XP, is available at www.jclark.com/xml/xp.
Sun's XML parser can be downloaded from developer.java.sun.com/developer/products/xml/ (you must be a member of the Java Developer Connection).
DataChannel's XJParser is available at xdev.datachannel.com/downloads/xjparser/.
C++
IBM's XML4C parser is available from www.alphaWorks.ibm.com/tech/xml4c.
James Clark's C++ parser, expat, is available at www.jclark.com/xml/expat.html.
Perl
A variety of XML parsers are available for Perl. For more information, see www.perlxml.com/faq/perl-xml-faq.html.
Python
For information on XML parsers for Python, see www.python.org/topics/xml/.
Summary
The heart of any XML application is an XML parser. To process an XML document, your application creates a parser object, passes it an XML document, and then processes the results that come back from the parser object.
We discussed the different kinds of XML parsers and why you might choose one over another. We classified parsers in several ways:
Validating versus non-validating parsers
Parsers that support the Document Object Model (DOM)
Parsers that support the Simple API for XML (SAX)
Parsers written in a particular language (Java, C++, Perl, etc.)
In the next section, we'll look at DOM parsers and how to use them.
Chapter III DOM (Document Object Model)
DOM, DOM, DOM, DOM, DOM,
Doobie, Doobie,
DOM, DOM, DOM, DOM, DOM ...
The DOM is a common interface for manipulating document structures. One of its design goals is that Java code written for one DOM-compliant parser should run on any other DOM-compliant parser without changes. (We'll demonstrate this later.)
As we mentioned earlier, a DOM parser returns a tree structure that represents your entire document.
Sample code
Before we proceed, please download our sample XML applications. Unzip the file xmljava.zip and you're ready to go! (blueski:*** or view the appendix to this tutorial)
The DOM interfaces
The DOM defines several Java interfaces. Here are the most common:
Node: The base data type of the DOM.
Element: The object you will work with most often.
Attr: Represents an attribute of an element.
Text: The actual content of an Element or Attr.
Document: Represents the entire XML document. A Document object is often referred to as a DOM tree.
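To see how these interfaces relate to one another, consider the following sketch. It assumes the JDK's JAXP DocumentBuilder in place of XML4J, and it only works for documents shaped like the inline sample (a root element with a type attribute whose first child contains text); the class name DomInterfaces is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class; touches one object of each common DOM interface.
class DomInterfaces {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    // Assumes a root element with a "type" attribute whose first child holds text.
    static String describe(String xml) {
        Document doc = parse(xml);                     // Document: the whole DOM tree
        Element root = doc.getDocumentElement();       // Element: the root tag
        Attr type = root.getAttributeNode("type");     // Attr: one attribute of the root
        Node firstChild = root.getFirstChild();        // Node: the generic base type
        Text text = (Text) firstChild.getFirstChild(); // Text: the content of the child
        return root.getTagName() + " " + type.getName() + "=" + type.getValue()
            + " " + firstChild.getNodeName() + "=" + text.getData();
    }

    public static void main(String[] args) {
        String xml = "<sonnet type=\"Shakespearean\"><title>Sonnet 130</title></sonnet>";
        System.out.println(describe(xml)); // prints "sonnet type=Shakespearean title=Sonnet 130"
    }
}
```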
Common DOM methods
When you work with the DOM, these are the methods you will use most often:
Document.getDocumentElement()
Returns the root element of the document.
Node.getFirstChild() and Node.getLastChild()
Return the first or last child of a given Node.
Node.getNextSibling() and Node.getPreviousSibling()
These methods delete everything in the DOM tree, format your hard disk, and send an abusive message to everyone in your address book. (Not really. These methods return the next or previous sibling of a given Node.)
Element.getAttribute(attrName)
Returns the value of the attribute with the given name for a given Element. For example, to get the value of the attribute named id, call getAttribute("id").
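The methods above can be tried out on a small document. This sketch assumes the JDK's JAXP DocumentBuilder; the class name DomMethods, the id attribute, and its value are made up for illustration. The sample is written on one line so that the first and last children are elements rather than whitespace.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class; exercises the navigation methods listed above.
class DomMethods {
    // The id attribute is invented for this example.
    static final String SAMPLE = "<author id=\"ws\">"
        + "<last-name>Shakespeare</last-name><first-name>William</first-name></author>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        Element root = parse(SAMPLE).getDocumentElement();
        Node first = root.getFirstChild();
        Node last = root.getLastChild();
        System.out.println(root.getAttribute("id"));                 // ws
        System.out.println(first.getNodeName());                     // last-name
        System.out.println(last.getNodeName());                      // first-name
        System.out.println(first.getNextSibling().getNodeName());    // first-name
        System.out.println(last.getPreviousSibling().getNodeName()); // last-name
    }
}
```

If the attribute does not exist, getAttribute returns an empty string rather than null.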
Our first DOM application!
We've covered a lot of concepts, so let's move on. Our first application simply reads an XML document and writes its contents to standard output.
At a command prompt, run this command:
java Domone sonnet.xml
This command loads our application and tells it to parse the file sonnet.xml. If everything goes well, you will see the contents of the XML document written to standard output.
<?xml version="1.0"?>
<sonnet type="Shakespearean">
  <author>
    <last-name>Shakespeare</last-name>
    <first-name>William</first-name>
    <nationality>British</nationality>
    <year-of-birth>1564</year-of-birth>
    <year-of-death>1616</year-of-death>
  </author>
  <title>Sonnet 130</title>
  <lines>
    <line>My mistress' eyes are ...
A look at Domone
The source of Domone is pretty straightforward. We create a new class called Domone; it has two methods, parseAndPrint and printDOMTree.
In the main method, we process the command line, create a Domone object, and pass the file name to that object. The Domone object creates a parser object, parses the document, and then processes the DOM tree (that is, the Document object) with the printDOMTree method.
We will look at each of these steps in detail.
public class Domone
{
  public void parseAndPrint(String uri)
  ...
  public void printDOMTree(Node node)
  ...
  public static void main(String argv[])
  ...
}
Processing the command line
The code that processes the command line is shown below. We check whether the user entered any arguments on the command line. If not, we print a usage message and exit; otherwise, we assume that the first argument on the command line (argv[0] in the Java language) is the name of the document. We ignore anything else the user may have entered.
We use command-line arguments to keep our examples simple. In most cases, an XML application would be built with servlets, Java Beans, and other types of components, where command-line arguments are not an issue.
public static void main(String argv[])
{
  if (argv.length == 0)
  {
    System.out.println("Usage: ...");
    ...
    System.exit(1);
  }
  Domone d1 = new Domone();
  d1.parseAndPrint(argv[0]);
}
Create a Domone object
In our sample code, we create a separate class called Domone. To parse the file and print the results, we create a new instance of the Domone class and then tell that object to parse and print the XML document.
Why do we do this? Because we want to use a recursive method to walk through the DOM tree and print the results. We can't easily do that in a static method such as main, so we create a separate class to handle it.
Create a parser object
Now that we have asked our Domone instance to parse and process our XML document, its first order of business is to create a new parser object. In this example, we use a DOMParser object, a Java class that implements the DOM interfaces. The XML4J package contains other parser objects, such as SAXParser, ValidatingSAXParser, and NonValidatingDOMParser.
Notice that we put this code inside a try block. The parser throws an exception in a number of circumstances, including an invalid URI, a DTD that cannot be found, or an XML document that is not valid or not well-formed. To handle this gracefully, we need to catch the exception.
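The try block might look something like the sketch below. This is not the tutorial's XML4J code: it swaps in the JDK's JAXP DocumentBuilder, and the class name SafeParse and the status strings are invented for illustration. An unreadable URI surfaces as an IOException, and a malformed document as a SAXException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; shows one way to catch the exceptions a parser can throw.
class SafeParse {
    static String parseStatus(InputSource source) {
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays silent otherwise.
            parser.setErrorHandler(new DefaultHandler());
            Document doc = parser.parse(source);
            return "parsed: " + doc.getDocumentElement().getTagName();
        } catch (SAXException e) {
            return "not well-formed: " + e.getMessage();
        } catch (IOException e) {
            return "could not read: " + e.getMessage();
        } catch (ParserConfigurationException e) {
            return "no parser available: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        if (args.length > 0) {
            // Treat the first argument as the URI of the document.
            System.out.println(parseStatus(new InputSource(args[0])));
        } else {
            // Without arguments, parse a deliberately malformed sample.
            System.out.println(parseStatus(new InputSource(new StringReader("<a><b></a>"))));
        }
    }
}
```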
Now that parsing is done, we walk through the DOM tree. Notice that this code is recursive. For each node, we process the node itself, then call the printDOMTree method recursively for each of the node's children. The recursive calls are shown below.
Keep in mind that although some XML documents are very large, they tend not to have many levels of tags. A telephone book for Shanghai, for example, might have millions of entries, but its tags would probably not go more than a few levels deep. For this reason, stack overflow is not a concern for the recursive algorithm.
public void printDOMTree(Node node)
{
  int nodeType = node.getNodeType();
  switch (nodeType)
  {
    case Node.DOCUMENT_NODE:
      printDOMTree(((Document) node).getDocumentElement());
      ...
    case Node.ELEMENT_NODE:
      ...
      NodeList children = node.getChildNodes();
      if (children != null)
      {
        for (int i = 0; i < children.getLength(); i++)
          printDOMTree(children.item(i));
      }
      ...
  }
}
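For readers who want something they can run straight away, here is a small self-contained sketch of the same recursive idea. It is not the Domone source: it assumes the JDK's JAXP DocumentBuilder, writes to a StringBuilder instead of standard output, and ignores attributes, character escaping, and every node type other than documents, elements, and text. The class name TreePrinter is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class; a simplified stand-in for a recursive DOM tree printer.
class TreePrinter {
    static String render(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            printTree(doc, out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    static void printTree(Node node, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                // The document itself has no markup; descend to its root element.
                printTree(((Document) node).getDocumentElement(), out);
                break;
            case Node.ELEMENT_NODE:
                out.append('<').append(node.getNodeName()).append('>');
                NodeList children = node.getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    printTree(children.item(i), out); // the recursive call
                }
                out.append("</").append(node.getNodeName()).append('>');
                break;
            case Node.TEXT_NODE:
                out.append(node.getNodeValue());
                break;
            default:
                break; // other node types are skipped in this sketch
        }
    }

    public static void main(String[] args) {
        System.out.println(render("<a><b>hi</b><c/></a>")); // prints "<a><b>hi</b><c></c></a>"
    }
}
```

The recursion depth equals the nesting depth of the tags, not the number of nodes, which is why large but shallow documents are not a problem.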
Lots of nodes
If you look at sonnet.xml, there are 24 tags. You might think that means 24 nodes. However, that is not the case. There are actually 69 nodes in sonnet.xml: one document node, 23 element nodes, and 45 text nodes. We ran java Domcounter sonnet.xml to get the results shown below.
Domcounter.java
This code parses an XML document, then walks through the DOM tree to gather statistics about the document. When the statistics have been gathered, they are written to standard output.
Statistics for sonnet.xml data:
====================================
Document nodes: 1
Element nodes: 23
Entity Reference nodes: 0
CDATA sections: 0
Text nodes: 45
Processing instructions: 0
----------
Total: 69 Nodes
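A counter of this kind can be sketched in a few lines. The code below is not Domcounter.java: it assumes the JDK's JAXP DocumentBuilder, and the class name NodeCounter and the short sample document are made up for illustration. It tallies nodes by their DOM node type code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class; counts the nodes of each type in a DOM tree.
class NodeCounter {
    // A shortened, indented piece of the sonnet, closed so that it is well-formed.
    static final String SAMPLE = "<sonnet type=\"Shakespearean\">\n"
        + "  <author>\n"
        + "    <last-name>Shakespeare</last-name>\n"
        + "  </author>\n"
        + "</sonnet>";

    // Returns an array indexed by node type code (Node.ELEMENT_NODE and so on).
    static int[] count(String xml) {
        int[] counts = new int[13]; // DOM node type codes run from 1 to 12
        try {
            tally(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))), counts);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
        return counts;
    }

    private static void tally(Node node, int[] counts) {
        counts[node.getNodeType()]++;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            tally(child, counts);
        }
    }

    public static void main(String[] args) {
        int[] counts = count(SAMPLE);
        System.out.println("Document nodes: " + counts[Node.DOCUMENT_NODE]); // 1
        System.out.println("Element nodes: " + counts[Node.ELEMENT_NODE]);   // 3
        System.out.println("Text nodes: " + counts[Node.TEXT_NODE]);         // 5
    }
}
```

Attributes are not children of their element, so they do not show up in these totals.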
Sample node listing
For the fragment below,
<sonnet type="Shakespearean">
  <author>
    <last-name>Shakespeare</last-name>
the parser returns the following nodes:
The Document node
The Element node corresponding to the <sonnet> tag
A Text node containing the carriage return at the end of the <sonnet> tag and the two spaces in front of the <author> tag
The Element node corresponding to the <author> tag
A Text node containing the carriage return at the end of the <author> tag and the four spaces in front of the <last-name> tag
The Element node corresponding to the <last-name> tag
All those text nodes
If you go through a detailed listing of all the nodes returned by the parser, you will find that many of them are of little use. The whitespace at the start of each line becomes a Text node that can usually be ignored.
Notice that we would not get these useless nodes if we put all of the tags on a single line. We added the line breaks and spaces to make the document easier to read.
If human readability is not a concern when you build an XML document, you can leave out the line breaks and spaces. That makes your document smaller, and the program that processes it does not have to build all those useless nodes.
<sonnet type="Shakespearean">
  <author>
    <last-name>Shakespeare</last-name>
    <first-name>William</first-name>
    <nationality>British</nationality>
    <year-of-birth>1564</year-of-birth>
    <year-of-death>1616</year-of-death>
  </author>
  <title>Sonnet 130</title>
  <lines>
    <line>My mistress' eyes are nothing like the sun,</line>
A Text node containing the characters "Shakespeare"
When you look at all the whitespace between tags, you can see why we get so many more nodes than you might expect.
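The effect of the whitespace is easy to measure. The sketch below parses an indented document and a single-line version of the same document and counts the Text nodes that contain nothing but whitespace. It assumes the JDK's JAXP DocumentBuilder and documents without a DTD; the class name WhitespaceNodes and both sample strings are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class; counts Text nodes that hold only whitespace.
class WhitespaceNodes {
    static final String INDENTED = "<sonnet type=\"Shakespearean\">\n"
        + "  <author>\n"
        + "    <last-name>Shakespeare</last-name>\n"
        + "  </author>\n"
        + "</sonnet>";

    static final String COMPACT = "<sonnet type=\"Shakespearean\"><author>"
        + "<last-name>Shakespeare</last-name></author></sonnet>";

    static int whitespaceOnly(String xml) {
        try {
            return count(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    private static int count(Node node) {
        int total = 0;
        if (node.getNodeType() == Node.TEXT_NODE && node.getNodeValue().trim().isEmpty()) {
            total++; // a Text node made up entirely of line breaks and spaces
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            total += count(child);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("indented: " + whitespaceOnly(INDENTED)); // indented: 4
        System.out.println("compact:  " + whitespaceOnly(COMPACT));  // compact:  0
    }
}
```

The same trim-and-test check is a common way to skip these nodes when you cannot control how the document was written.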
Know your Nodes
The final point we will make about nodes in the DOM tree is that you should check the type of each node before you work with it. Certain methods, such as getAttributes, return null for some node types. If you don't check the node type, you will get unexpected results (at best) or exceptions (at worst).
The switch statement shown here is common in code that uses a DOM parser.
switch (nodeType)
{
  case Node.DOCUMENT_NODE:
    ...
  case Node.ELEMENT_NODE:
    ...
  case Node.TEXT_NODE:
    ...
}
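The getAttributes case can be demonstrated directly. In this sketch, which assumes the JDK's JAXP DocumentBuilder, a helper checks the node type before touching the attribute map. The class name NodeTypeCheck, the helper attributeCount, and the n attribute in the sample are all invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class; shows why the node type is checked first.
class NodeTypeCheck {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    // Safe for any node: only Element nodes have an attribute map.
    static int attributeCount(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return 0; // getAttributes() would return null here
        }
        return node.getAttributes().getLength();
    }

    public static void main(String[] args) {
        // The n attribute is made up for this example.
        Element line = parse("<line n=\"1\">My mistress</line>").getDocumentElement();
        Node text = line.getFirstChild();
        System.out.println(text.getAttributes()); // null
        System.out.println(attributeCount(line)); // 1
        System.out.println(attributeCount(text)); // 0
    }
}
```

Calling text.getAttributes().getLength() without the check would throw a NullPointerException.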
Summary
Believe it or not, that is about all you need to know to work with DOM objects. Our Domone code did the following:
Created a parser object
Passed an XML document to the parser to be parsed
Got the Document object from the parser and examined it
In the final chapter of this tutorial, we'll discuss how to build a DOM tree without an XML source file, and show how to sort the elements in an XML document. Both build on the concepts covered here.
Before we move on to those more advanced topics, we will take a closer look at the SAX API. We will use a similar set of examples to show the differences between SAX and DOM.