Chapter 11: Embedding Non-XML Data

Not all the data in the world is XML. In fact, it is safe to say that most of the world's accumulated data is not XML. A great deal of it is stored in plain text, HTML, and Microsoft Word, to name just three common non-XML formats. In theory, given enough interest and money, most of this data could be rewritten as XML, but not all of it should be. Encoding an image in XML, for example, would be extremely inefficient.

XML provides three constructs that are generally used for working with non-XML data: notations, unparsed external entities, and processing instructions. Notations describe the format of non-XML data, unparsed external entities provide links to the actual location of the non-XML data, and processing instructions give information about how to view the data.

The material in this chapter is somewhat controversial. Although everything described here conforms to the XML 1.0 specification, not everyone agrees that these constructs should be used. You can certainly write XML documents without any notations or unparsed external entities, and with only a few simple processing instructions. You may want to skip this chapter and return to it when you find you need this material.

The main contents of this chapter are as follows:

* Notations

* Unparsed external entities

* Processing instructions

* Conditional sections in DTDs

11.1 Notations

The first problem you encounter when working with non-XML data in an XML document is identifying the format of the data and telling the XML application how to read and display it. For example, it would be inappropriate to try to draw an MP3 sound file on the screen.

To a limited extent, you can solve this problem within a single application by using only a fixed set of tags for particular kinds of external entities. For example, if all pictures are embedded through an IMAGE element and all sounds through an AUDIO element, it is not hard to develop a browser that knows how to handle those two elements. This is essentially the approach HTML takes. However, it prevents document authors from creating new tags that describe their content more precisely, such as a PERSON element that happens to have a PHOTO attribute pointing to a JPEG image of that person.

Furthermore, no application understands all possible file formats. Most web browsers can read and display GIF, JPEG, and PNG images, and perhaps a few other formats, but they can do nothing with EPS files, TIFF files, FITS files, or any of the hundreds of other common and uncommon image formats. The dialog box in Figure 11-1 is probably all too familiar.

Figure 11-1 What happens when Netscape Navigator is unable to recognize a file type

Ideally, you want documents to tell the application what format an external entity is in, so that you do not have to rely on the application recognizing the file type from a magic number or a potentially unreliable file name extension. Furthermore, if the application cannot process the format itself, you would like to give it a hint about which program can display the image.

Notations provide a partial (though not always well supported) solution to this problem. A notation describes the format of non-XML data. A NOTATION declaration in the DTD specifies a particular data type. The DTD declares notations at the same level as elements, attributes, and entities. Each notation declaration contains a name and an external identifier, using the following syntax:

<!NOTATION name SYSTEM "externalID">

Here name is an identifier for this particular format used in the document, and externalID is a human-intelligible string that identifies the notation. For example, the GIF image format might be identified by its MIME type:

<!NOTATION GIF SYSTEM "image/gif">

You can also use a PUBLIC identifier instead of a SYSTEM identifier. To do so, you must provide both a public ID and a URL. For example:

<!NOTATION GIF PUBLIC
  "-//IETF//NONSGML Media Type image/gif//EN"
  "http://www.isi.edu/in-notes/iana/assignments/media-types/image/gif">

There is a lot of debate about exactly what makes a good external identifier. MIME types such as image/gif or text/html are one possibility. Another suggestion is a URL or other locator for a standards document, such as http://www.w3.org/TR/REC-html40/. A third option is the name of an official international standard, such as ISO 8601 for representing dates and times. In some cases, an ISBN or a Library of Congress catalog number for the paper documentation may be more appropriate. There are many other choices as well.

Which one you choose may depend on the expected life span of your document. For example, if you use an unusual format, you do not want to rely on a URL that changes from month to month. If you expect your document to still be of interest in 100 years, consider identifiers that are likely to still be meaningful in 100 years, rather than a technology that may last only a decade.
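As a rough illustration of what an application actually receives from notation declarations like the ones above, the following sketch uses Python's standard xml.parsers.expat module, whose NotationDeclHandler callback reports each notation's name with its system and public identifiers. The sample document and the collect_notations helper are made up for this example, and other parsers may expose notations differently.

```python
import xml.parsers.expat

# Hypothetical sample document: one notation with a PUBLIC identifier
# and one with only a SYSTEM identifier.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE DOCUMENT [
  <!ELEMENT DOCUMENT ANY>
  <!NOTATION GIF PUBLIC
    "-//IETF//NONSGML Media Type image/gif//EN"
    "http://www.isi.edu/in-notes/iana/assignments/media-types/image/gif">
  <!NOTATION LATEX SYSTEM "/usr/local/bin/latex">
]>
<DOCUMENT/>
"""


def collect_notations(xml_text):
    """Return {notation name: (public id, system id)} for a document."""
    notations = {}

    def on_notation(name, base, system_id, public_id):
        # public_id is None when the declaration uses SYSTEM only.
        notations[name] = (public_id, system_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.Parse(xml_text, True)
    return notations


notations = collect_notations(SAMPLE)
for name, (public_id, system_id) in notations.items():
    print(name, public_id, system_id)
```

The parser only reports the identifiers. Deciding what "image/gif" or "/usr/local/bin/latex" means is left entirely to the application.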

You can also use notations to describe data that is embedded directly in your document. For example, consider the following DATE element:

<DATE>05-07-06</DATE>

What day exactly does 05-07-06 represent? Is it May 7, 1906 or July 5, 1906? The answer depends on whether you read the date in the American or the European format. It could even be May 7, 2006 or July 5, 2006, or perhaps May 7 of the year 6, at the height of the Roman Empire in the West and the Han dynasty in China. It is also possible that the date is not in the Common Era at all but is given in the Jewish, Muslim, or Chinese calendar. Without more information, you cannot determine its true meaning.

To avoid this kind of confusion, the ISO 8601 standard defines a precise way of representing dates. In this scheme, July 5, 2006 is written as 20060705 or, in XML, as follows:

<DATE>20060705</DATE>

This format does not match anybody's expectations. It is equally confusing to everybody and therefore does not favor any one culture (although it is still biased toward the traditional Western calendar).

Notations are declared in the DTD and used as attribute values to describe the format of non-XML data embedded in an XML document. To continue the date example, Listing 11-1 defines two possible notations for dates: ISO 8601 and the conventional U.S. format. A FORMAT attribute of type NOTATION is then added to each DATE element to describe the structure of that particular element.

Listing 11-1: DATE elements in ISO 8601 and conventional U.S. formats

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE SCHEDULE [
  <!NOTATION ISODATE SYSTEM
    "http://www.iso.ch/cate/d15903.html">
  <!NOTATION USDATE SYSTEM
    "[...].html">
  <!ELEMENT SCHEDULE (APPOINTMENT*)>
  <!ELEMENT APPOINTMENT (NOTE, DATE, TIME?)>
  <!ELEMENT NOTE (#PCDATA)>
  <!ELEMENT DATE (#PCDATA)>
  <!ELEMENT TIME (#PCDATA)>
  <!ATTLIST DATE FORMAT NOTATION (ISODATE | USDATE) #IMPLIED>
]>
<SCHEDULE>
  <APPOINTMENT>
    <NOTE>Deliver presents</NOTE>
    <DATE FORMAT="USDATE">12-25-1999</DATE>
  </APPOINTMENT>
  <APPOINTMENT>
    <NOTE>Party like it's 1999</NOTE>
    <DATE FORMAT="ISODATE">19991231</DATE>
  </APPOINTMENT>
</SCHEDULE>

Notations cannot force authors to use the format the notation describes. For that, you need some kind of schema language in addition to basic XML. However, notations are sufficient for simple applications in which you trust authors to describe their dates correctly.
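One way an application might act on the FORMAT attribute is sketched below in Python. The mapping from the two notation names of Listing 11-1 to strptime patterns is an assumption of this example (nothing in XML defines it), and the read_dates helper and the DTD-less sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Assumed mapping from notation names to strptime patterns.
DATE_PATTERNS = {
    "ISODATE": "%Y%m%d",   # ISO 8601 basic format, e.g. 19991231
    "USDATE": "%m-%d-%Y",  # conventional U.S. order, e.g. 12-25-1999
}

SCHEDULE = """<SCHEDULE>
  <APPOINTMENT>
    <NOTE>Deliver presents</NOTE>
    <DATE FORMAT="USDATE">12-25-1999</DATE>
  </APPOINTMENT>
  <APPOINTMENT>
    <NOTE>Party like it's 1999</NOTE>
    <DATE FORMAT="ISODATE">19991231</DATE>
  </APPOINTMENT>
</SCHEDULE>"""


def read_dates(xml_text):
    """Return (note, date) pairs, using FORMAT to choose the pattern."""
    results = []
    for appointment in ET.fromstring(xml_text).iter("APPOINTMENT"):
        date_element = appointment.find("DATE")
        pattern = DATE_PATTERNS[date_element.get("FORMAT")]
        parsed = datetime.strptime(date_element.text.strip(), pattern).date()
        results.append((appointment.findtext("NOTE"), parsed))
    return results


appointments = read_dates(SCHEDULE)
for note, day in appointments:
    print(day.isoformat(), note)
```

Because FORMAT is declared #IMPLIED, a DATE element may legally omit it. This sketch raises a KeyError in that case, and a real application would have to decide on a default or reject the element.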

11.2 Unparsed External Entities

XML is not an ideal format for all data, particularly non-text data. For example, you could store each pixel of a bitmapped image as an XML element, like this:

<PIXEL X="..." Y="..." COLOR="FF5E32"/>

This is hardly a good idea, though. Anything remotely like it would cause image files to balloon to enormous proportions. Because it is not practical to encode all data in XML, XML documents must be able to refer to data that is not XML and probably never will be.

A typical web page may include GIF and JPEG images, Java applets, ActiveX controls, various kinds of sound, and so forth. In XML, these blocks of non-XML data are called unparsed entities because the XML processor does not attempt to understand them. At most, the processor informs the application of the entity's existence and provides the application with the entity's name and possibly (though not necessarily) its content.

HTML pages embed non-HTML entities through a variety of custom tags. Pictures are included with the <IMG> tag, whose SRC attribute provides the URL of the image file. Java applets are included with the <APPLET> tag, whose CODE and CODEBASE attributes point to the file and directory where the applet is stored. The <OBJECT> tag uses its CODEBASE attribute to refer to the URI where the object's data can be found. In each case, a particular predefined tag represents a particular kind of content, and a predefined attribute contains the URL of that content.

XML applications can work this way but do not have to, and most do not unless they are deliberately trying to maintain some level of backward compatibility with HTML. Instead, XML applications refer to such content through unparsed external entities. An unparsed external entity provides a link to the actual location of the non-XML data. A particular element in the document then uses an attribute of type ENTITY to associate itself with that entity.

11.2.1 Declaring Unparsed Entities

Recall from Chapter 9 that an external entity declaration looks something like this:

<!ENTITY SIG SYSTEM "http://metalab.unc.edu/xml/signature.xml">

However, this form is acceptable only if the external entity the URL names is itself XML. If the external entity is not XML, you have to specify the entity's type with the NDATA keyword. For example, to associate the GIF file logo.gif with the name LOGO, you would place this ENTITY declaration in the DTD:

<!ENTITY LOGO SYSTEM "logo.gif" NDATA GIF>

The final name in the declaration, GIF in this example, must be the name of a notation declared in the DTD. The notation associates the name GIF with an external identifier for the format, such as a MIME type, an ISO standard, or the URL of a specification of the format. For example, the notation for GIF might look like this:

<!NOTATION GIF SYSTEM "image/gif">

As usual, you can use either absolute or relative URLs to point to the external entity, whichever is convenient. For example:

<!ENTITY LOGO SYSTEM "logo.gif" NDATA GIF>

<!ENTITY LOGO SYSTEM "/xml/logo.gif" NDATA GIF>

<!ENTITY LOGO SYSTEM "./logo.gif" NDATA GIF>

11.2.2 Embedding Unparsed Entities

You cannot simply embed an unparsed entity at an arbitrary location in the document with a general entity reference, as you can with a parsed entity. For example, Listing 11-2 is an invalid XML document because LOGO is an unparsed entity. If LOGO were a parsed entity, this example would be valid.

Listing 11-2: An invalid XML document that tries to embed an unparsed entity with a general entity reference

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DOCUMENT [
  <!ELEMENT DOCUMENT ANY>
  <!ENTITY LOGO SYSTEM "logo.gif" NDATA GIF>
  <!NOTATION GIF SYSTEM "image/gif">
]>
<DOCUMENT>
  &LOGO;
</DOCUMENT>

To embed an unparsed entity, rather than using a general entity reference such as &LOGO;, you declare an element that serves as a placeholder for the unparsed entity (IMAGE, for example). Then you declare an attribute of that element (SOURCE, for example) with type ENTITY. The value of the SOURCE attribute is simply the name of the unparsed entity. Listing 11-3 demonstrates.

Listing 11-3: A valid XML document that correctly embeds an unparsed entity

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DOCUMENT [
  <!ELEMENT DOCUMENT ANY>
  <!ENTITY LOGO SYSTEM "logo.gif" NDATA GIF>
  <!NOTATION GIF SYSTEM "image/gif">
  <!ELEMENT IMAGE EMPTY>
  <!ATTLIST IMAGE SOURCE ENTITY #REQUIRED>
]>
<DOCUMENT>
  <IMAGE SOURCE="LOGO"/>
</DOCUMENT>

It is up to the application reading the XML document to recognize the unparsed entity and display it. An application may also choose not to display the entity, just as a web browser does not load images when the user has turned image loading off.
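To make the application's side of this concrete, here is a minimal sketch, again assuming Python's xml.parsers.expat, of how a program might follow an ENTITY attribute back to the entity's system identifier and its notation. The find_images helper is hypothetical, and the sample document follows the pattern of Listing 11-3 with an assumed DOCUMENT root element.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE DOCUMENT [
  <!ELEMENT DOCUMENT ANY>
  <!ENTITY LOGO SYSTEM "logo.gif" NDATA GIF>
  <!NOTATION GIF SYSTEM "image/gif">
  <!ELEMENT IMAGE EMPTY>
  <!ATTLIST IMAGE SOURCE ENTITY #REQUIRED>
]>
<DOCUMENT>
  <IMAGE SOURCE="LOGO"/>
</DOCUMENT>
"""


def find_images(xml_text):
    """Return (entity system id, notation system id) per IMAGE element."""
    entities = {}   # entity name -> (system id, notation name)
    notations = {}  # notation name -> system id (a MIME type here)
    sources = []    # entity names named by IMAGE elements, in order

    def on_unparsed_entity(name, base, system_id, public_id, notation_name):
        entities[name] = (system_id, notation_name)

    def on_notation(name, base, system_id, public_id):
        notations[name] = system_id

    def on_start(tag, attributes):
        if tag == "IMAGE":
            sources.append(attributes["SOURCE"])

    parser = xml.parsers.expat.ParserCreate()
    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.NotationDeclHandler = on_notation
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)

    # Resolve only after parsing: a notation may be declared after the
    # entity that uses it, as it is in this document.
    return [(entities[s][0], notations[entities[s][1]]) for s in sources]


images = find_images(DOCUMENT)
print(images)
```

The parser never opens logo.gif. It hands over names and identifiers, and fetching or displaying the file remains the application's decision.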

These examples show empty elements as containers for unparsed entities, but that is not required. For example, imagine an XML-based corporate ID system that a security guard uses to look up people entering a building. The PERSON element might have NAME, PHONE, OFFICE, and EMPLOYEE_ID children and a PHOTO attribute of type ENTITY, as shown in Listing 11-4.

Listing 11-4: A non-empty PERSON element with a PHOTO attribute of type ENTITY

<?xml version="1.0" standalone="no"?>
<!DOCTYPE PERSON [
  <!ELEMENT PERSON (NAME, EMPLOYEE_ID, PHONE, OFFICE)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT EMPLOYEE_ID (#PCDATA)>
  <!ELEMENT PHONE (#PCDATA)>
  <!ELEMENT OFFICE (#PCDATA)>
  <!NOTATION JPEG SYSTEM "image/jpeg">
  <!ENTITY ROGER SYSTEM "rogers.jpg" NDATA JPEG>
  <!ATTLIST PERSON PHOTO ENTITY #REQUIRED>
]>
<PERSON PHOTO="ROGER">
  <NAME>Jim Rogers</NAME>
  <EMPLOYEE_ID>...</EMPLOYEE_ID>
  <PHONE>...</PHONE>
  <OFFICE>RH 415A</OFFICE>
</PERSON>

This example may seem a little contrived. In practice, you would be better off making an empty PHOTO element with a SOURCE attribute a child of PERSON rather than making the photo an attribute of PERSON. Furthermore, you would probably separate the DTD into internal and external subsets. The external subset, shown in Listing 11-5, declares the elements, notations, and attributes. These are the parts likely to be shared among many different documents. The entity, however, changes from document to document, so it is better placed in the internal DTD subset of each document, as shown in Listing 11-6.

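Listing 11-5: The external DTD subset person.dtd

<!ELEMENT PERSON (NAME, EMPLOYEE_ID, PHONE, OFFICE, PHOTO)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT EMPLOYEE_ID (#PCDATA)>
<!ELEMENT PHONE (#PCDATA)>
<!ELEMENT OFFICE (#PCDATA)>
<!ELEMENT PHOTO EMPTY>
<!NOTATION JPEG SYSTEM "image/jpeg">
<!ATTLIST PHOTO SOURCE ENTITY #REQUIRED>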

Listing 11-6: A document with a non-empty PERSON element and an internal DTD subset

<?xml version="1.0" standalone="no"?>
<!DOCTYPE PERSON [
  <!ENTITY % PERSON_DTD SYSTEM "person.dtd">
  %PERSON_DTD;
  <!ENTITY ROGER SYSTEM "rogers.jpg" NDATA JPEG>
]>
<PERSON>
  <NAME>Jim Rogers</NAME>
  <EMPLOYEE_ID>...</EMPLOYEE_ID>
  <PHONE>...</PHONE>
  <OFFICE>RH 415A</OFFICE>
  <PHOTO SOURCE="ROGER"/>
</PERSON>

11.2.3 Embedding Multiple Unparsed Entities

On rare occasions, a single attribute may need to refer to more than one unparsed entity, perhaps even an indefinite number of them. To allow this, declare the attribute of the placeholder element with type ENTITIES. The value of an ENTITIES attribute consists of multiple unparsed entity names separated by white space. Each name refers to an external non-XML data source, and each must be declared in the DTD. For example, you might use this to write a slide show element that rotates through different pictures. The DTD would require declarations like these:

<!ELEMENT SLIDESHOW EMPTY>
<!ATTLIST SLIDESHOW SOURCES ENTITIES #REQUIRED>
<!NOTATION JPEG SYSTEM "image/jpeg">
<!ENTITY CHARM SYSTEM "charm.jpg" NDATA JPEG>
<!ENTITY MARJORIE SYSTEM "marjorie.jpg" NDATA JPEG>
<!ENTITY POSSUM SYSTEM "possum.jpg" NDATA JPEG>
<!ENTITY BLUE SYSTEM "blue.jpg" NDATA JPEG>

Then, in the document where you want the slide show to appear, you can insert the following markup:

<SLIDESHOW SOURCES="CHARM MARJORIE POSSUM BLUE"/>

Once again, this is not a universal formula that all (or even any) XML processors automatically understand. It is simply one technique that browsers and other applications may or may not adopt for embedding non-XML data in documents.
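An application that does adopt this technique has to split the ENTITIES value itself and look up each name. The sketch below shows one way to do so with Python's xml.parsers.expat. The slide_files helper is hypothetical, and the sample document is a self-contained version of the slide show declarations.

```python
import xml.parsers.expat

SLIDESHOW = """<?xml version="1.0"?>
<!DOCTYPE SLIDESHOW [
  <!ELEMENT SLIDESHOW EMPTY>
  <!ATTLIST SLIDESHOW SOURCES ENTITIES #REQUIRED>
  <!NOTATION JPEG SYSTEM "image/jpeg">
  <!ENTITY CHARM SYSTEM "charm.jpg" NDATA JPEG>
  <!ENTITY MARJORIE SYSTEM "marjorie.jpg" NDATA JPEG>
  <!ENTITY POSSUM SYSTEM "possum.jpg" NDATA JPEG>
  <!ENTITY BLUE SYSTEM "blue.jpg" NDATA JPEG>
]>
<SLIDESHOW SOURCES="CHARM MARJORIE POSSUM BLUE"/>
"""


def slide_files(xml_text):
    """Return the system ids named by SLIDESHOW's SOURCES, in order."""
    system_ids = {}  # entity name -> system id
    names = []       # entity names listed in the SOURCES attribute

    def on_unparsed_entity(name, base, system_id, public_id, notation_name):
        system_ids[name] = system_id

    def on_start(tag, attributes):
        if tag == "SLIDESHOW":
            # An ENTITIES value is a white-space-separated list of names.
            names.extend(attributes["SOURCES"].split())

    parser = xml.parsers.expat.ParserCreate()
    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return [system_ids[name] for name in names]


slides = slide_files(SLIDESHOW)
print(slides)
```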

11.3 Processing Instructions

Comments are often abused to support proprietary extensions to HTML, such as server-side includes, browser-specific scripting languages, database templates, and many other items outside the scope of the HTML standard. The advantage of using comments for these purposes is that other systems simply ignore the extraneous data they do not understand. The disadvantage is that a document stripped of its comments may no longer be the same document, and that comments intended purely as documentation may be misinterpreted as input to these proprietary extensions. To avoid this abuse of comments, XML provides the processing instruction, an explicit mechanism for embedding information in a file that is intended for proprietary applications rather than for the XML parser or browser. Among other uses, processing instructions can provide additional information about how to view unparsed external entities.

A processing instruction is a string of text between <? and ?> marks. The only syntax required of the text inside the processing instruction is that it begin with an XML name, followed by white space, followed by data. The XML name may be either the actual name of the application (for example, latex) or the name of a notation in the DTD that points to the application (for example, LATEX), where LATEX is declared in the DTD like this:

<!NOTATION LATEX SYSTEM "/usr/local/bin/latex">

It may even be some other name that the application recognizes. The details tend to be very specific to the application for which the processing instruction is intended. Indeed, most applications that rely on processing instructions impose more structure on the contents of the instruction. For example, consider this processing instruction used in IBM's Bean Markup Language:

<?bmlpi register demos.calculator.EventSourceText2Int?>

The name of the application this instruction is intended for is bmlpi. The data passed to the application is the string register demos.calculator.EventSourceText2Int, which includes a fully qualified Java class name. It tells the application named bmlpi to use the Java class demos.calculator.EventSourceText2Int to convert action events to integers. If bmlpi encounters this processing instruction while reading the document, it loads the class demos.calculator.EventSourceText2Int and uses it to convert events to integers from then on.

If this sounds fairly specific and detailed, that is because it is. Processing instructions are not part of the general structure of the document. They provide extra, specific information for one particular application, not for every application that reads the document. Other applications that encounter this instruction while reading the document simply skip it.

A processing instruction can be placed almost anywhere in an XML document except inside a tag or a CDATA section. It can appear in the prolog, in the DTD, in the content of an element, or even after the closing document tag. Because processing instructions are not elements, they do not affect the tree structure of the document. You do not need to open or close processing instructions, or worry about how they nest inside other elements. Processing instructions are not tags, and they do not delimit elements.

You are already familiar with one example of a processing instruction, the xml-stylesheet processing instruction used to bind a style sheet to a document:

<?xml-stylesheet type="text/xsl" href="baseball.xsl"?>

Although processing instructions of this kind appear in the document's prolog, processing instructions in general can appear anywhere in the document. Because they are not elements, they do not need to be declared as children of the element that contains them.

Processing instructions whose names begin with the string xml are reserved for uses defined in the XML specification. Otherwise, you are free to use any name, and any string of text other than the closing string ?>, inside a processing instruction. For example, the following are both perfectly valid processing instructions:

<?acrobat document="passport.pdf"?>

<?Dave remember to replace this one?>

Remember that an XML processor does not do anything with processing instructions itself. It merely passes them along to the application, which decides what to do with them. Most applications simply ignore processing instructions they do not understand.
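The hand-off from processor to application can be sketched with Python's xml.parsers.expat, whose ProcessingInstructionHandler callback receives each instruction's target name and data string. The sample document and the idea of an application that understands only the acrobat target are assumptions of this example.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="baseball.xsl"?>
<DOCUMENT>
  <?acrobat document="passport.pdf"?>
  <?Dave remember to replace this one?>
</DOCUMENT>
"""


def collect_instructions(xml_text):
    """Return every processing instruction as a (target, data) pair."""
    instructions = []

    def on_instruction(target, data):
        # The parser does not interpret the data; it just passes it on.
        instructions.append((target, data))

    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_instruction
    parser.Parse(xml_text, True)
    return instructions


instructions = collect_instructions(DOCUMENT)

# Pretend this application only understands the "acrobat" target and
# silently skips everything else.
understood = [data for target, data in instructions if target == "acrobat"]
print(understood)
```

Note that the XML declaration on the first line is not reported through this callback, even though it looks like a processing instruction.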

Sometimes knowing the type of an unparsed external entity is not enough. You may also need to know which program to run to view the entity and what parameters to pass to that program. You can use a processing instruction to provide this information. Because processing instructions can contain fairly arbitrary data, it is relatively easy for them to carry instructions that determine what action the external program listed in the notation should take.

Such a processing instruction can range from the simple name of a program that can view the file to several kilobytes of configuration information. Of course, the application and the document author must agree on how to determine which processing instructions apply to which unparsed external entities. Listing 11-7 shows one scheme that uses a processing instruction and a PDF notation to try to pass the PDF version of a physics paper to Acrobat Reader for display.

Listing 11-7: Embedding a PDF document in XML

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE PAPER [
  <!NOTATION PDF PUBLIC
    "-//IETF//NONSGML Media Type application/pdf//EN"
    "[...]application/pdf">
  <!ELEMENT PAPER (TITLE, AUTHOR+, JOURNAL, DATE_RECEIVED, [...])>
  <!ATTLIST PAPER CONTENTS ENTITY #IMPLIED>
  <!ENTITY PRLTAO000081000024005270000001 SYSTEM
    "[...]PRLTAO000081000024005270000001"
    NDATA PDF>
  <!ELEMENT AUTHOR (#PCDATA)>
  <!ELEMENT JOURNAL (#PCDATA)>
  <!ELEMENT YEAR (#PCDATA)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT DATE_RECEIVED (#PCDATA)>
  [...]
]>
<?PDF acroread?>
<PAPER CONTENTS="PRLTAO000081000024005270000001">
  <TITLE>Do Naked Singularities Generically Occur in
  Generalized Theories of Gravity?</TITLE>
  <AUTHOR>Kengo Maeda</AUTHOR>
  <AUTHOR>Takashi Torii</AUTHOR>
  <AUTHOR>Makoto Narita</AUTHOR>
  <JOURNAL>Physical Review Letters</JOURNAL>
  <DATE_RECEIVED>August 1998</DATE_RECEIVED>
  [...]
</PAPER>

As always, remember that not every processor will treat this example the way you intend. In fact, most will not. Nevertheless, it is one scheme worth considering for how an application might support PDF files and other non-XML media types.
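One possible reading of this scheme, sketched in Python with xml.parsers.expat, is that a processing instruction whose target matches a notation name tells the application which program to use for entities of that notation. That convention is an assumption of this sketch, not something XML defines. The shortened document, the PAPER_PDF entity name, and the "paper.pdf" system identifier are made up, and the code only builds the command it would run rather than launching anything.

```python
import xml.parsers.expat

# Hypothetical, shortened stand-in for the PDF example.
PAPER = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE PAPER [
  <!NOTATION PDF SYSTEM "application/pdf">
  <!ELEMENT PAPER (TITLE)>
  <!ATTLIST PAPER CONTENTS ENTITY #IMPLIED>
  <!ENTITY PAPER_PDF SYSTEM "paper.pdf" NDATA PDF>
  <!ELEMENT TITLE (#PCDATA)>
]>
<?PDF acroread?>
<PAPER CONTENTS="PAPER_PDF">
  <TITLE>Do Naked Singularities Generically Occur in
  Generalized Theories of Gravity?</TITLE>
</PAPER>
"""


def viewer_commands(xml_text):
    """Return [program, system id] pairs; nothing is actually launched."""
    viewers = {}   # notation name -> program named by a processing instruction
    entities = {}  # entity name -> (system id, notation name)
    contents = []  # entity names found in CONTENTS attributes

    def on_instruction(target, data):
        viewers[target] = data.strip()

    def on_unparsed_entity(name, base, system_id, public_id, notation_name):
        entities[name] = (system_id, notation_name)

    def on_start(tag, attributes):
        if "CONTENTS" in attributes:
            contents.append(attributes["CONTENTS"])

    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_instruction
    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)

    commands = []
    for name in contents:
        system_id, notation_name = entities[name]
        if notation_name in viewers:  # skip formats with no known viewer
            commands.append([viewers[notation_name], system_id])
    return commands


commands = viewer_commands(PAPER)
print(commands)
```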

11.4 Conditional Sections in DTDs

When developing DTDs and documents, you may need to comment out parts of the DTD that are not yet reflected in the documents. In addition to using comments directly, you can omit a particular group of declarations in the DTD by wrapping it in an IGNORE directive. The syntax is as follows:

<![ IGNORE [
  declarations that are ignored
]]>

As usual, white space does not really affect the syntax, but for readability you should keep the opening <![ IGNORE [ and the closing ]]> on separate lines.

You can ignore any declaration or group of declarations (elements, entities, attributes, or even other IGNORE blocks), but you must ignore entire declarations. The IGNORE construct must completely enclose each declaration it removes from the DTD. You cannot ignore just part of a declaration (for example, the NDATA GIF in an unparsed entity declaration).

You can also mark a particular section of declarations as included, that is, as not ignored. The syntax of the INCLUDE directive is the same as that of IGNORE, but with a different keyword:

<![ INCLUDE [
  declarations that are included
]]>

When an INCLUDE is inside an IGNORE, the INCLUDE and its declarations are ignored. When an IGNORE is inside an INCLUDE, the declarations inside the IGNORE are still ignored. In other words, an INCLUDE never overrides an IGNORE.

Given these rules, you may wonder why INCLUDE exists at all. No DTD would change if every INCLUDE block were simply removed, leaving only its contents. INCLUDE appears to be completely superfluous. However, there is one neat trick that combines parameter entity references with both IGNORE and INCLUDE and that you cannot do with IGNORE alone. First, define a parameter entity as follows:

<!ENTITY % fulldtd "IGNORE">

You can then ignore declarations by wrapping them in the following construct:

<![ %fulldtd; [
  declarations
]]>

Because the value of the parameter entity reference %fulldtd; is IGNORE, the declarations are ignored. Now suppose you make a one-word change, from IGNORE to INCLUDE, in the declaration of fulldtd:

<!ENTITY % fulldtd "INCLUDE">

All the IGNORE blocks are immediately converted to INCLUDE blocks. In effect, you have a switch that turns blocks of declarations on or off.

This example uses only one switch, fulldtd. You can use that switch in multiple IGNORE/INCLUDE blocks in the DTD. You can also have several different switches that turn different groups of IGNORE/INCLUDE blocks on or off according to different conditions.

This capability is particularly useful when you design DTDs for inclusion in other DTDs. By changing the value of the parameter entity switch, the including DTD can change the behavior of the DTD it embeds.
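The switch can be watched in action with the following sketch, which assumes Python's xml.parsers.expat built with DTD support (the usual case). Conditional sections are only allowed in the external DTD subset, so the example keeps a small made-up DTD in a string instead of a file and overrides its fulldtd default from the document's internal subset. The NOTE and DRAFT element names and the declared_elements helper are hypothetical.

```python
import xml.parsers.expat as expat

# Stand-in for an external DTD file. Its own default for the switch is
# IGNORE; a declaration in a document's internal subset takes precedence.
EXTERNAL_DTD = """\
<!ENTITY % fulldtd "IGNORE">
<!ELEMENT NOTE (#PCDATA)>
<![%fulldtd;[
<!ELEMENT DRAFT (#PCDATA)>
]]>
"""


def declared_elements(switch_value):
    """Return the element names the parser sees declared for one setting."""
    document = (
        '<?xml version="1.0"?>\n'
        '<!DOCTYPE NOTE SYSTEM "note.dtd" [\n'
        f'  <!ENTITY % fulldtd "{switch_value}">\n'
        ']>\n'
        '<NOTE/>'
    )
    seen = []
    parser = expat.ParserCreate()
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.ElementDeclHandler = lambda name, model: seen.append(name)

    def load_external(context, base, system_id, public_id):
        # Feed the in-memory DTD instead of fetching system_id.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(EXTERNAL_DTD, True)
        return 1

    parser.ExternalEntityRefHandler = load_external
    parser.Parse(document, True)
    return seen


print(declared_elements("IGNORE"))
print(declared_elements("INCLUDE"))
```

The DRAFT declaration is reported only when the document sets fulldtd to INCLUDE, which is the one-word switch described above.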

11.5 Summary

In this chapter, you learned how to integrate non-XML data into XML documents through notations, unparsed external entities, and processing instructions. In particular, you learned the following concepts:

* Notations describe the format of non-XML data.

* Unparsed external entities are containers for non-XML text and data.

* Unparsed external entities are included in documents through attributes of type ENTITY or ENTITIES.

* Processing instructions contain instructions that are passed along unchanged from the processor to the end application.

* INCLUDE and IGNORE blocks specify whether the declarations they enclose are considered when the document is parsed.

You will see more examples of documents with DTDs in later parts of this book, but the basic syntax and usage of DTDs have now been covered. Part III begins the discussion of style languages for XML, starting in the next chapter with Cascading Style Sheets, Level 1.
