A beginner's guide to XML and DTDs: writing well-formed and valid XML


Level: Introductory

Jane Fung (jcyfung@ca.ibm.com), VisualAge for Java support, IBM Canada

This introductory article (July 1, 2001) describes how to create XML document type definitions (DTDs) and well-formed XML files that can be validated by the XML parser of your choice. Although it is not necessary to include a DTD with every XML file you produce, doing so will make your life much easier. A DTD not only enforces the grammar established for the XML file, it also allows the file to be checked by a validating XML parser. Code samples include example DTDs and XML documents.

The Extensible Markup Language has been around long enough that most people are now familiar with its most basic requirements: every XML document must be well-formed, and a document that is checked against a DTD must also be valid. But how do you determine whether your XML document meets these requirements? The short answer is that you don't have to work it out yourself. Most of the time, you will rely on an XML parser to handle these chores for you.

After a little research (see Resources), you'll find that the market is littered with XML parsers, most of which are available free of charge on the Web. A validating XML parser enforces both the XML syntax rules (that is, it ensures that the file is well-formed) and the validity of the file; a non-validating parser checks only the former. XML parsers are available for almost every relevant programming language, including C, C++, Perl, Python, Tcl, and Java.

When it comes to making sure that your XML is well-formed, you can more or less point a parser at the file and run it. However, to ensure that the document is valid, you need to provide the parser with a document type definition, or DTD.

Sidebar: XML Schema

Recently, the W3C advanced the long-discussed XML Schema specification to "Recommendation" status, meaning it is likely to be widely adopted by developers. In some ways, XML Schema will replace DTDs. In other ways, DTDs are still the best solution. For a developerWorks article that explains XML Schema and compares its functionality and processing with those of DTDs, see Resources.

This article reviews what exactly it means for an XML document to be well-formed, and then turns to the less-discussed topic of validation, and more specifically the DTD. I'll discuss why you would want to include a DTD with an XML file, introduce some of the most common DTD syntax, and use a few simple samples to get you started writing your own DTDs.

Why must it be well-formed?

When XML developers talk about well-formed and malformed XML, we are not having an aesthetic discussion. A well-formed XML document is one that meets three basic structural requirements: a single parent (or root) element contains all the other elements; every start tag has an end tag; and all elements are nested correctly.

Listing 1 is an example of well-formed XML. Note that the root element of the document is <person>, each start tag has an end tag, and each end tag matches its start tag exactly (element names are case-sensitive). Typically, information or text is included between the start and end tags. In some cases, however, nothing is included between the tags. Such an empty element can be written as a single tag that ends with a forward slash: <nothing/> is an empty tag.
Listing 1. Well-formed XML

<person>
<firstname>Jane</firstname>
<lastname>Fung</lastname>
<nothing/>
</person>

Listing 2 is an example of malformed XML. It illustrates three common errors. First, the start and end <firstname> tags do not match exactly (they differ in case). Second, the <lastname> tag has no end tag. Finally, the empty tag does not end with a forward slash.
Listing 2. Malformed XML

<person>
<Firstname>Jane</firstname>
<lastname>fung
<nothing>
</person>
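
As a quick check of the two listings above, here is a minimal sketch in Python (the article itself does not prescribe a language or parser). It assumes the standard library's xml.etree.ElementTree, which is a non-validating parser, so it tests well-formedness only. The helper name is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML. No DTD validation is performed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Listing 1: one root element, matching tags, correct nesting.
listing1 = """<person>
<firstname>Jane</firstname>
<lastname>Fung</lastname>
<nothing/>
</person>"""

# Listing 2: mismatched case, missing end tag, unclosed empty tag.
listing2 = """<person>
<Firstname>Jane</firstname>
<lastname>fung
<nothing>
</person>"""

print(is_well_formed(listing1))
print(is_well_formed(listing2))
```

The parser stops at the first error it meets, so a malformed file usually has to be fixed and re-run one error at a time.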







What's in a DTD?

The advantage of XML is that it allows you to define your own meaningful markup, so you can customize a document to the fullest extent. But XML being what it is (extensible) and people being what they are (unruly), things can quickly get out of hand. The solution is a DTD, which specifies the markup for an XML document. In short, the DTD specifies the elements that can exist in the document, the attributes those elements can have, the hierarchy of elements within elements, and the order in which the elements appear throughout the document.

Although DTDs are not required, they are certainly convenient. A DTD serves three basic purposes. It can: document the markup; enforce consistency within the markup's parameters; and enable an XML parser to validate the document.

If you do not define a DTD for an XML document, the document cannot be validated by an XML parser. (XML Schema can be used in place of a DTD; see the sidebar on XML Schema.) Listing 3 is the DTD for the XML document shown in Listing 1.
Listing 3. A simple DTD for person.xml

<!ELEMENT person (firstname, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT nothing EMPTY>

A few notes on the example

The first line of the DTD in Listing 3 defines the root element of the XML document: person. The person element has two child elements: firstname and lastname.

The second and third lines contain the content specification #PCDATA, which indicates that the firstname and lastname elements may contain parsed character data (in this case, text). The last line of the DTD describes an empty element: nothing. Note that the content model for person does not list nothing, so a validating parser would accept the <nothing/> child in Listing 1 only if the first line were extended, for example to (firstname, lastname, nothing).

As you can see from the DTD in Listing 3, anyone who reads our XML document (and any parser that parses it) knows that the person element contains only two text elements: firstname and lastname. In addition, the DTD stipulates that the firstname element must appear before the lastname element.
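
To make the ordering rule concrete, here is a minimal sketch in Python. Python's standard library does not include a validating parser, so this is not DTD validation: the function matches_person_model is a hypothetical, hand-coded stand-in for the single content model (firstname, lastname).

```python
import xml.etree.ElementTree as ET

def matches_person_model(text):
    """Hand-coded stand-in for the content model (firstname, lastname)."""
    root = ET.fromstring(text)
    child_names = [child.tag for child in root]
    return root.tag == "person" and child_names == ["firstname", "lastname"]

in_order = "<person><firstname>Jane</firstname><lastname>Fung</lastname></person>"
swapped = "<person><lastname>Fung</lastname><firstname>Jane</firstname></person>"

print(matches_person_model(in_order))
print(matches_person_model(swapped))
```

Both documents are well-formed; only the first follows the order the DTD lays down. A real validating parser performs this kind of check for every element, driven by the DTD rather than by hand-written code.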

Before moving on to a more complex example, let's review some of the most common DTD syntax. The complete DTD specification (see Resources) can be found on the W3C home page.






Quick Guide to DTD syntax

In the following examples, A, B, C, and D are placeholders for element names.

The element must contain exactly one A, at least one B (indicated by the plus sign), zero or more C (indicated by the asterisk), and zero or one D (indicated by the question mark), in that order:

<!ELEMENT element (A, B+, C*, D?)>

The element must contain exactly one of A, B, or C:

<!ELEMENT element (A | B | C)>
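
The occurrence indicators behave much like regular-expression operators. The following minimal sketch in Python leans on that analogy; it is an illustration only, not how a validating parser is implemented, and the helper children_match and the "name," encoding of child elements are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Each child element name is written as "<name>," so that a content model
# can be matched as a regular expression over the sequence of children.
SEQUENCE_MODEL = re.compile(r"(A,)(B,)+(C,)*(D,)?")  # (A, B+, C*, D?)
CHOICE_MODEL = re.compile(r"(A,|B,|C,)")             # (A | B | C)

def children_match(model, text):
    """Check the direct children of the root element against a model."""
    root = ET.fromstring(text)
    sequence = "".join(child.tag + "," for child in root)
    return model.fullmatch(sequence) is not None

print(children_match(SEQUENCE_MODEL, "<element><A/><B/><B/><D/></element>"))
print(children_match(SEQUENCE_MODEL, "<element><A/><C/></element>"))
print(children_match(CHOICE_MODEL, "<element><B/></element>"))
print(children_match(CHOICE_MODEL, "<element><A/><B/></element>"))
```

The second document fails because the required B is missing, and the fourth fails because a choice group allows only one of its alternatives.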

The element does not contain any content:

<!ELEMENT element EMPTY>

The element can contain any of the elements declared in the DTD, as well as character data:

<!ELEMENT element ANY>

The element may contain parsed character data and another element (element2), in any order and any number of times. This is the mixed content model; the asterisk (*) is required whenever #PCDATA is combined with element names.

<!ELEMENT element (#PCDATA | element2)*>

The following declaration defines an entity named element. Wherever a reference to it appears in the document, the parser substitutes the text "Entity Reference":

<!ENTITY element "Entity Reference">

A reference to this entity in an XML document looks like this:

&element;
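
As a small illustration of entity substitution, the sketch below (Python, standard library only) places the declaration in an internal DTD subset and lets the parser expand the reference. The note root element is made up for this example; xml.etree.ElementTree is non-validating, but its underlying parser does expand internal entities declared this way.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!DOCTYPE note [
<!ENTITY element "Entity Reference">
]>
<note>Text: &element;</note>"""

root = ET.fromstring(doc)
# The reference &element; has been replaced by the entity's text.
print(root.text)
```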

The following example declares element as an empty element with three attributes: att1 is an optional attribute, att2 is an attribute with the fixed value "A", and att3 is a required text attribute.

<!ELEMENT element EMPTY>
<!ATTLIST element
att1 ID #IMPLIED
att2 CDATA #FIXED "A"
att3 CDATA #REQUIRED>

This element is used in an XML document as follows:

<element att2="A" att3="Musthave"/>

The attribute type CDATA indicates that the value is character data, so att2 and att3 may contain any string. The attribute type ID indicates that the value must be an identifier that is unique within the document. An element can have at most one attribute of type ID.
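
The sketch below (Python, standard library only) shows one visible effect of an attribute declaration: the #FIXED value is supplied by the parser even when the document omits it. It assumes xml.etree.ElementTree on top of expat, which reads the internal DTD subset and applies declared attribute defaults but does not validate, so it would not complain about a missing att3 or a duplicate ID.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE element [
<!ELEMENT element EMPTY>
<!ATTLIST element
    att1 ID    #IMPLIED
    att2 CDATA #FIXED "A"
    att3 CDATA #REQUIRED>
]>
<element att3="Musthave"/>"""

elem = ET.fromstring(doc)
print(elem.get("att1"))  # optional and not supplied in the document
print(elem.get("att2"))  # not written in the document, taken from the DTD
print(elem.get("att3"))  # written in the document
```

Enforcing #REQUIRED and the uniqueness of ID values is the job of a validating parser such as the ones listed in Resources.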

If the syntax is not yet entirely clear, read on. The worked example in the next section should help clear up any doubts.






A worked example

You can use Microsoft Internet Explorer 5 or later to view the XML document shown in Listing 4, people.xml, an extended version of the person.xml file used in the previous example. If you open people.xml in IE5, you should see a tree structure. This is because IE5 has an XML parser that can parse the XML document into a tree of elements.

You can also find this file and its DTD in Resources.
Listing 4. The complete people.xml

<?xml version="1.0"?>
<!DOCTYPE people SYSTEM "people.dtd">
<people>
<person>
<name>
<firstname>Jane</firstname>
<lastname>Fung</lastname>
</name>
<look>good-looking</look>
<possession>
<car>
<model>Civic</model>
</car>
<job>&IBM;</job>
</possession>
</person>
<person>
<name>
<firstname>G.I.</firstname>
<lastname>Jane</lastname>
</name>
<look>tough</look>
<possession>
<townhouse townhouse_type="Good"/>
<bankaccount bankaccount_number="sg-123">
<![CDATA[<greeting>5000</greeting>]]>
</bankaccount>
</possession>
<other>
<car>she has a car</car>
<townhouse townhouse_type="Good"/>
</other>
</person>
</people>

A few notes on XML

A closer look at the XML starts with a few items in the header of the document, beginning with the XML declaration:

<?xml version="1.0"?>

Every XML document should begin with such a declaration, which tells the XML parser that it is an XML document. The declaration can also state the character encoding the document was created with (for example, encoding="UTF-8"); XML documents created on Unix systems may use a different encoding than XML documents created on Windows systems. The next line in the header is the document type declaration, which names the root element (people) and tells the XML parser where to find the DTD:

<!DOCTYPE people SYSTEM "people.dtd">
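
To see what a parser takes from these header lines, here is a minimal sketch in Python using the standard library's xml.dom.minidom. It is a non-validating parser and, in this sketch, people.dtd is never opened; only the information in the document type declaration itself is reported.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE people SYSTEM "people.dtd">\n'
    '<people/>'
)

doctype = doc.doctype
print(doctype.name)                 # root element named in the DOCTYPE
print(doctype.systemId)             # location of the external DTD
print(doc.documentElement.tagName)  # actual root element of the document
```

A validating parser goes one step further: it fetches the file named by the system identifier and checks the document against it.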

You can also set the optional standalone attribute in the first line. The default value for standalone is no. A value of no indicates that the document may depend on declarations kept in another file, such as an external DTD. A value of yes indicates that the document does not depend on any external declarations, so any DTD it needs is defined within the XML document itself. I did not set this attribute in the example; if you want to set it, it should look like this:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE people [
<!ELEMENT people (person+)>
<!ELEMENT person (#PCDATA)>
]>

You should also note how this document is well-formed. For example, every empty tag ends with a forward slash, as follows:

<townhouse townhouse_type="Good"/>

Also note the CDATA section, which is used to escape data that would otherwise be interpreted as XML markup, for example:

<![CDATA[<greeting>5000</greeting>]]>

After parsing, the line appears as literal text content:

<greeting>5000</greeting>
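
The effect of the CDATA section can be checked with a few lines of Python (standard library only; a sketch, using just the bankaccount fragment from Listing 4):

```python
import xml.etree.ElementTree as ET

fragment = """<bankaccount bankaccount_number="sg-123">
<![CDATA[<greeting>5000</greeting>]]>
</bankaccount>"""

account = ET.fromstring(fragment)
print(len(account))          # number of child elements found
print(account.text.strip())  # the CDATA content, delivered as plain text
```

The parser reports no greeting child element: everything inside the CDATA section is treated as character data belonging to bankaccount.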

You may find it useful to study the XML file further, and even to run an XML parser on your own files (see Resources). But for now, let's take a look at the DTD for the people.xml file.
Listing 5. The complete people.dtd

<!ELEMENT people (person+)>
<!ELEMENT person (name, look*, possession, other?)>
<!ELEMENT name (firstname, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT look (#PCDATA)>
<!ELEMENT possession (car, house, bankaccount, job?)>
<!ELEMENT car (#PCDATA | model)*>
<!ELEMENT model (#PCDATA)>
<!ELEMENT house (apartment | standalone | townhouse)>
<!ATTLIST house
house_area ID #IMPLIED
country CDATA #FIXED "CANADA"
city CDATA #IMPLIED>
<!ELEMENT apartment EMPTY>
<!ELEMENT standalone EMPTY>
<!ELEMENT townhouse EMPTY>
<!ATTLIST townhouse townhouse_type ID #IMPLIED>
<!ELEMENT bankaccount (#PCDATA)>
<!ATTLIST bankaccount bankaccount_number ID #REQUIRED>
<!ELEMENT job (#PCDATA)>
<!ELEMENT other ANY>
<!ENTITY IBM "Proud to work for IBM">

A few notes on DTDs

Using the Quick Guide as a reference, you should be able to work out the relationships between the elements in the DTD and the XML file by comparing the XML file with its DTD. However, there are two remaining items that may be of interest.

Listing 4 contains a reference to an entity.

<job>&IBM;</job>

An entity reference stands in for a specific character or string defined in the DTD. After parsing, the entity reference reads as follows:

<job>Proud to work for IBM</job>

Note also that the content type of the <other> tag is ANY. This means that <other> may contain any element declared in the DTD. Therefore, the other element may contain car and townhouse elements, as follows:

<other>
<car>she has a car</car>
<townhouse townhouse_type="Good"/>
</other>







Conclusion

This concludes the basic introduction to creating well-formed and valid XML files. You may want to continue studying the people.xml and people.dtd files on your own. If you want to try parsing these files with an XML parser, see Resources for a list of parsers that you can download.

Resources

The English original of this article is available on the developerWorks worldwide site.

If you want more information about XML and DTD syntax, the XML 1.0 Recommendation should be your first stop.

Tim Bray is one of the original editors of the XML 1.0 specification. He maintains textuality.com, where you can find his thoughts on XML, DTDs, and more. You can also find Lark and Larval, Bray's own XML parsers, there.

Doug Tidwell's tutorial, Introduction to XML, offers a near-complete discussion of the Extensible Markup Language.

You may also want to take a closer look at Mark Johnson's XML for the absolute beginner, published on JavaWorld.

Download the people.xml and people.dtd files used in this article for further study and analysis.
XML parsers: a summary

IBM's XML Parser for Java (XML4J), currently at version 3.1.1, is a validating XML parser written in 100% pure Java. The package (com.ibm.xml.parser) contains classes and methods for parsing, generating, manipulating, and validating XML documents.

IBM's XML for C++ parser (XML4C) is based on the Apache Xerces-C XML parser, a validating XML parser written in a portable subset of C++.

TclXML is a full Tcl XML parser.

Xerces is a Java parser from the Apache Software Foundation, currently at version 1.4.0.

Lars Marius Garshol maintains this exhaustive list of XML parsers and other XML tools as a public service.
Related links

View the latest information on the XML Zone page.

Like DTDs, style sheets are not required when you create XML files, but they are important if you want to control how your files are displayed in a browser. Alan Knox's developerWorks article, Style sheets can write style sheets, too, shows you how to use XSL to convert XML data into complex display markup for browsers.

After reading the above article, you may want to take a look at the XSL Editor from IBM alphaWorks.

If you want to explore XML Schema and its relationship to DTDs, see David Mertz's comparison of DTDs and XML Schema in his developerWorks column, "XML Matters #7," and Kevin Williams's developerWorks Soapbox article, which argues for using XML Schema for the structural definition of XML documents for data. For a brief description of how XML Schema is used, see the introductory article Basics of using XML Schema to define elements.


About the author

Jane Fung currently works in the IBM VisualAge for Java Technical Support Group, which supports enterprise developers who use VisualAge for Java. Jane received a bachelor's degree in Applied Science in electronic Engineering from the University of Ontario Prov., Canada, and is a Sun Certified Java 2 Programmer. You can contact Jane at jcyfung@ca.ibm.com.

 
