XML
Just knowing that something is in an XML format doesn't mean a program can understand the input merely by running a generic XML parser over it. It may be entertaining to try cramming spectral data from Jupiter's satellites into your company's accounting program, but it will annoy the accountants. It would be like cutting an Apache configuration file and pasting it into a graphics program. Open data formats are great, but a standard syntax does not imply that every program can understand all data.
XML is a poor human-machine interface
Until now, I have discussed data formats only for program-to-program conversations. Subject to the caveat in the previous section, XML should be a safe bet for most program-to-program data format needs. But what about human-to-computer conversations, such as programs, specifications, and initialization files? In this section, I want to convince you that humans should not have to write or fully understand XML. Compared with the many existing domain-specific languages that provide a good interface, XML is probably about as far from natural human language as you can get.
My argument is simple: humans have an innate ability to apply structure to strings of words (sentences), so adding markup only makes text harder for us to read and more laborious to type. The problem is that most programmers have little experience designing and parsing computer languages. Rather than spend time designing and parsing a human-friendly language, programmers take the quickest path to a language specification and implementation: "Oh, use XML. Done." That's fine, but I want programmers to realize that when they take that shortcut, they are providing a poor interface. Don't believe me? For the remainder of this article, I'll compare human-friendly languages one-to-one with their unnatural XML-structured equivalents.
Let's start with a simple arithmetic expression. Humans have used a specialized syntax for arithmetic for at least 1,000 years. Which is easier to read and write, that syntax or the XML one?
Mathematical syntax:
3+4*5
XML syntax:
<add>
  <int>3</int>
  <mult>
    <int>4</int><int>5</int>
  </mult>
</add>
Indeed, 3+4*5 is much easier to read and write. Humans have deliberately constructed specialized, domain-specific languages so that they can describe problems effectively (note how many specialized programming languages exist, such as PostScript, Perl, and Mathematica). The XML version above is a parse tree (structure) representation of the expression; remember sentence diagrams? Before processing it, a language parser converts its input into a parse tree, because an explicit parse tree is easier for a program to handle than the implicit sentence structure that humans find easy to understand. Requiring explicitly structured input spares the program a specialized parser, but it places a heavy burden on the user.
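To make the trade-off concrete, here is a minimal sketch in Python of the kind of parser that lets the human type 3+4*5 while the program still gets the explicit tree. It is only an illustration: the function names parse_expression and to_xml are invented for this example, and it handles nothing beyond non-negative integers, + and *.

```python
import re
import xml.etree.ElementTree as ET


def parse_expression(text):
    """Parse sums and products of integers into an explicit parse tree."""
    # Illustration only: reject anything other than digits, '+', '*' and spaces.
    if re.fullmatch(r"[\d+*\s]*", text) is None:
        raise ValueError("only integers, '+' and '*' are supported")
    tokens = re.findall(r"\d+|[+*]", text)
    pos = 0

    def parse_int():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise ValueError("expected an integer")
        node = ET.Element("int")
        node.text = tokens[pos]
        pos += 1
        return node

    def parse_product():
        # '*' binds tighter than '+', so products are parsed one level down.
        nonlocal pos
        left = parse_int()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            node = ET.Element("mult")
            node.extend([left, parse_int()])
            left = node
        return left

    def parse_sum():
        nonlocal pos
        left = parse_product()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            node = ET.Element("add")
            node.extend([left, parse_product()])
            left = node
        return left

    tree = parse_sum()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree


def to_xml(text):
    """Serialize the parse tree of an expression as an XML string."""
    return ET.tostring(parse_expression(text), encoding="unicode")


print(to_xml("3+4*5"))
```

The operator precedence that a human reader applies without thinking lives in the two nested functions, which is exactly the work the XML form pushes onto whoever types the input.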
Lest you think the expression is too simple to be a fair example, consider the custom query language I designed and implemented to extract data from the jGuru object database. Listing 1 is a simple query.
Listing 1. A concise query in non-XML format
query type Person props (Email,FirstName,LastName) where "EID>100"
Humans, of course, prefer to type a simple one-line query rather than the equivalent XML in Listing 2.
Listing 2. The same query in XML format
<query>
  <type>Person</type>
  <props>
    <prop id="Email"/>
    <prop id="FirstName"/>
    <prop id="LastName"/>
  </props>
  <cond>
    <gt>
      <prop id="EID"/>
      <int>100</int>
    </gt>
  </cond>
</query>
Naturally, a specialized parser converts the query into an in-memory tree structure much like the XML. My point is that humans should type the simple query and let the computer do the work of making the structure explicit. Note that the query results, a set of objects, are sent back to the client as serialized XML data. That is program-to-program communication, and so an entirely appropriate use of XML.
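As a rough sketch of that division of labor (and not the parser jGuru actually used), the following Python turns a one-line query shaped like Listing 1 into a tree shaped like Listing 2. The function name query_to_xml and the regular expression are assumptions made for this example, and the only condition supported is a single "property > integer" comparison.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical grammar for this sketch:
#   query type NAME props (A,B,...) where "PROP>INT"
QUERY = re.compile(
    r'\s*query\s+type\s+(\w+)\s+props\s*\(([^)]*)\)\s+where\s+"(\w+)\s*>\s*(\d+)"\s*$',
    re.IGNORECASE,
)


def query_to_xml(text):
    """Convert a one-line query into an XML string with an explicit structure."""
    match = QUERY.match(text)
    if match is None:
        raise ValueError("unsupported query: " + text)
    type_name, props, cond_prop, cond_value = match.groups()

    query = ET.Element("query")
    ET.SubElement(query, "type").text = type_name

    # One <prop id="..."/> element per requested property.
    props_node = ET.SubElement(query, "props")
    for name in props.split(","):
        ET.SubElement(props_node, "prop", id=name.strip())

    # The where clause becomes <cond><gt>...</gt></cond>.
    gt = ET.SubElement(ET.SubElement(query, "cond"), "gt")
    ET.SubElement(gt, "prop", id=cond_prop)
    ET.SubElement(gt, "int").text = cond_value
    return ET.tostring(query, encoding="unicode")


print(query_to_xml('query type Person props (Email,FirstName,LastName) where "EID>100"'))
```

The person typing the query never sees the tree; it exists only inside the program, where explicit structure is an advantage rather than a burden.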
What about bigger specifications? They, too, are easier to understand in a human-oriented language. Consider the course descriptor files we use at jGuru to combine various course modules into a complete course. Listing 3 shows a brief descriptor for the JavaIntro course.
Listing 3. Brief non-XML description of Java Introduction course
course JavaIntroCourse {
  title = "Java Language Essentials"
  caption = "Core features of the Java programming language"
  mmlvers = "2"
  content {
    intro = "intro.mml"
    variables = {useapplets="false", gentm="true", genibm="true"}
    modules = {"JavaIntro.mod", "vsCOBOL.mod"}
  }
}
There's a lot of information in the description, but it's really just a bunch of assignment statements and string lists. Even non-programmers can grasp its basic meaning. I suspect that most non-programmers would struggle to understand the equivalent XML in Listing 4.
Listing 4. The XML equivalent of Listing 3
<course>
  <title>Java Language Essentials</title>
  <caption>Core features of the Java programming language</caption>
  <mmlvers>2</mmlvers>
  <content>
    <intro>intro.mml</intro>
    <variables>
      <var id="useapplets">false</var>
      <var id="gentm">true</var>
      <var id="genibm">true</var>
    </variables>
    <modules>
      <module>JavaIntro.mod</module>
      <module>vsCOBOL.mod</module>
    </modules>
  </content>
</course>
Of course, even for programmers, the XML version is harder to read than the traditional one. There is so much XML "noise" in the way that the data doesn't jump out at you. Yes, XML becomes easier to read with experience (I even got good at reading hexadecimal memory dumps when I wrote device drivers for industrial robots), but which would you rather type? You can get used to anything, but why get used to hard-to-read, computer-friendly data when there is a more human-friendly alternative?
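Because the descriptor in Listing 3 is little more than assignments and string lists, a parser for it can stay small. The Python below is a sketch under assumptions, not the real jGuru tool: the grammar is inferred from Listing 3 alone, and the names tokenize and parse_course are invented for this example. It builds nested dictionaries and lists, the in-memory equivalent of the tree in Listing 4.

```python
import re

# A token is a quoted string, a word, or one of the punctuation marks { } = ,
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(\w+)|([{}=,]))')


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        string, word, punct = match.groups()
        if string is not None:
            tokens.append(("str", string))
        elif word is not None:
            tokens.append(("id", word))
        else:
            tokens.append((punct, punct))
        pos = match.end()
    return tokens


def parse_course(text):
    """Return (course_name, nested dict) for a descriptor like Listing 3."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise ValueError(f"expected {kind} at token {pos}")
        pos += 1
        return tokens[pos - 1][1]

    def parse_block():
        # '{' then "name = value" assignments or nested "name { ... }" blocks.
        take("{")
        block = {}
        while peek() != "}":
            name = take("id")
            if peek() == "{":
                block[name] = parse_block()
            else:
                take("=")
                block[name] = parse_value()
        take("}")
        return block

    def parse_value():
        if peek() == "str":
            return take("str")
        # Otherwise a braced list of strings or of key="value" pairs.
        take("{")
        items = []
        while peek() != "}":
            if peek() == "id":
                key = take("id")
                take("=")
                items.append((key, take("str")))
            else:
                items.append(take("str"))
            if peek() == ",":
                take(",")
        take("}")
        if items and all(isinstance(item, tuple) for item in items):
            return dict(items)
        return items

    if take("id").lower() != "course":
        raise ValueError("expected 'course'")
    name = take("id")
    return name, parse_block()
```

Calling parse_course on the text of Listing 3 yields the course name plus a dictionary whose "content" entry holds the variables as a dictionary and the modules as a list. Writing this once is the price of sparing every author of a course file the markup.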
It is worth mentioning that we use an XML-like markup language, similar to HTML, to write the actual course module text, because we need to embed structure in English prose in a way that is clearly distinguishable from the prose itself. Without markup, the module parser could not tell course content apart from the various tags and so on. The raw module source is sometimes hard to read (again because of the XML "noise"), but no other kind of embedded domain-specific language would work there.
For many specification problems, a natural language such as English provides the most natural writing interface. Unfortunately, natural language is ambiguous and extremely difficult to recognize. With a little effort, however, you can define a subset that is well defined yet still concise. Consider an adventure game language in which you might say:
tease the nice velociraptor
Anyone who speaks English can parse that sentence, even without ever having heard of a velociraptor. By contrast, how would you like to play the game if you had to type the content shown in Listing 5?
Listing 5. XML representation of "tease the nice velociraptor"
<command>
  <verb>tease</verb>
  <object>
    <article>the</article>
    <nounmodifier>
      <adjective>nice</adjective>
      <noun>velociraptor</noun>
    </nounmodifier>
  </object>
</command>
Wow! That makes the point rather convincingly. Humans never need or want structural markup to make sentences easier to understand. When typing, I want to say "tease the nice velociraptor". On the other hand, when the game records my command history, it might store the commands in XML so that it never has to parse them again.
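A game could do that tagging itself. The Python below is a minimal sketch, assuming a hypothetical toy grammar of "verb, optional article, any adjectives, noun"; the function name command_to_xml and the choice to omit the nounmodifier element when there are no adjectives are assumptions of this example. Its output mirrors the structure of Listing 5.

```python
import xml.etree.ElementTree as ET

ARTICLES = {"the", "a", "an"}


def command_to_xml(sentence):
    """Tag a typed command as: verb [article] adjective* noun."""
    words = sentence.lower().split()
    if len(words) < 2:
        raise ValueError("expected at least a verb and a noun")
    verb, rest = words[0], words[1:]

    command = ET.Element("command")
    ET.SubElement(command, "verb").text = verb
    obj = ET.SubElement(command, "object")

    if rest[0] in ARTICLES:
        ET.SubElement(obj, "article").text = rest[0]
        rest = rest[1:]
    if not rest:
        raise ValueError("missing noun")

    # The last word is taken as the noun; anything before it is an adjective.
    *adjectives, noun = rest
    parent = ET.SubElement(obj, "nounmodifier") if adjectives else obj
    for adjective in adjectives:
        ET.SubElement(parent, "adjective").text = adjective
    ET.SubElement(parent, "noun").text = noun
    return ET.tostring(command, encoding="unicode")


print(command_to_xml("tease the nice velociraptor"))
```

The player types four words; the history file gets the markup, and nobody has to read it.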
I'll end my argument with an analogy to another human-machine interface: human speech. It is extremely difficult for a computer program to look at a digitized speech signal (a stream of numbers) and extract the sequence of English words. Humans, on the other hand, have evolved over millions of years to understand speech effortlessly, so we find it a particularly satisfying interface. Any departure from the natural way of speaking makes the interface less effective. Imagine having to "mark up" your speech to give a simple recognizer additional information. Unfortunately, early commercial speech recognition programs required exactly that: you had to pause between spoken words! Those pauses removed the uncertainty about word boundaries, one of the biggest problems, by making you explicitly say the equivalent of:
<word>stupid</word><word>computer</word>
Using XML tags in a human-machine language is like pausing between spoken words, and just as unpleasant.
Conclusion
Although XML is an important and complex standard format, it is just another data format. In most cases, it makes sense to store or send data in XML format, with some exceptions: