Improvements in XML compression and transmission performance

Source: Internet
Author: User
XML is a text markup format designed for clarity and ease of use, with no regard for brevity. Like any design, it has weaknesses. One of them is the cost of translating application data to and from its XML representation. This overhead can be a major part of the total processing cost for many applications, especially those that exchange large amounts of data and do relatively little processing internally. XML documents also tend to be large compared with other forms of data representation, which matters whenever bandwidth or storage space is limited. This article discusses some of the issues involved in non-text representations of XML and describes several approaches being developed for this purpose.
The tradeoff between bandwidth and processing when transferring XML documents

Document size

The XML representation of data tends to be much larger than a binary representation of the same data, for two main reasons. First, the text representation of a simple data value is usually larger than the binary representation of the same value. Second, XML is a text markup format designed for clarity and interoperability rather than compactness. Once this redundancy is combined with the fairly long element and attribute names often used in XML applications, the markup in an XML document can be much larger than the actual data it carries. Whitespace added for formatting increases the document size further.

A larger document means that transmitting the XML representation requires more bandwidth than the equivalent binary representation. Larger size also means higher processing cost, because the overhead of communication is largely proportional to the amount of data.

Processing overhead

XML also requires more processing than a simple binary data representation. On input, an XML processor must recognize the many character combinations that represent different forms of markup in the document text. It must also verify that each document is well-formed XML, so it has to check state transitions as it processes tokens. Although namespaces are optional, their use is becoming more common. Supporting them requires tracking namespace declarations and their associated prefixes, and resolving the prefixes used on element and attribute names in the markup. Finally, the text data in the XML may need to be converted into typed binary values before the receiving application can use it. XML document output has similar problems.
Both input and output also involve converting to or from the specific character encoding used for the document text, which adds complexity. XML processors are usually designed to handle many possible encodings, so even if two applications always exchange documents in one predetermined encoding, they still pay the cost of that generality.

The limits of going beyond text

XML is defined only as text, so strictly speaking nothing other than text can be called XML. On the other hand, applications that use XML to exchange data may care more about the data being passed than about a strict XML representation. Depending on how closely you want to adhere to XML's text nature, you can choose from a range of techniques to reduce document size, improve processing speed, or both.

Text first: general text transformations

The technique that adheres most strictly to XML's text nature is a general text-based transformation. This kind of transformation mainly addresses document size. Text compression algorithms have been the subject of a great deal of research for many years and are now very mature. Any such algorithm can readily be applied to the text representation of an XML document. These algorithms are unlikely to improve processing speed, because they add a layer of transformation at both ends of the data flow on top of ordinary XML text processing: compressing the text on the sending side and decompressing it on the receiving side.

Data first: formats tailored to a specific XML application

A format designed specifically for one XML application is the opposite extreme from text-based compression. Such a format may amount to a pure binary representation of the data, and it can both shrink the data and reduce processing overhead compared with text XML.
The main disadvantage of this approach is that the format must be tailored to the document structure used by the application, and the sender and receiver must agree on that structure in advance and implement matching encoders and decoders. Most approaches to application-specific encoding are based on an XML Schema description of the documents being exchanged. The type and structure information in the schema is used to generate custom encoder/decoder code, which may include objects that hold the document's data and interact directly with the encoder/decoder. Fast Web Services is one example of schema-derived custom encoding, based on the ASN.1 structured data representation.

This kind of schema-based encoding is difficult to compare with methods that handle generic XML documents. To test it on a variety of documents, you must first define a schema for each document and then generate the encoder/decoder code for that schema. Current implementations generally cannot convert between this data representation and standard XML text, so you would also need to write a custom conversion program that translates each standard text XML document into a form the encoding can use. For these reasons, my test results do not include examples of this approach, but I will return to schema-based encodings after presenting those results.
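As a rough illustration of the data-first extreme, the sketch below encodes one record both as XML text and in a hand-written binary layout using Python's `struct` module. The record fields (`id`, `temperature`, `ok`), the element names, and the layout string are hypothetical, chosen only for this example. A schema-derived encoder of the kind described above would be generated from the schema instead of written by hand.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical fixed layout that sender and receiver agree on in advance:
# little-endian unsigned 32-bit id, 64-bit float, one boolean byte.
RECORD_LAYOUT = "<Id?"


def encode_binary(rec_id, temperature, ok):
    """Pack the three fields into the agreed binary layout."""
    return struct.pack(RECORD_LAYOUT, rec_id, temperature, ok)


def decode_binary(data):
    """Unpack a binary record back into (id, temperature, ok)."""
    return struct.unpack(RECORD_LAYOUT, data)


def encode_xml(rec_id, temperature, ok):
    """Serialize the same three fields as a small XML document."""
    root = ET.Element("reading")
    ET.SubElement(root, "id").text = str(rec_id)
    ET.SubElement(root, "temperature").text = repr(temperature)
    ET.SubElement(root, "ok").text = "true" if ok else "false"
    return ET.tostring(root, encoding="utf-8")


def decode_xml(data):
    """Parse the XML form and convert the text back into typed values."""
    root = ET.fromstring(data)
    return (
        int(root.findtext("id")),
        float(root.findtext("temperature")),
        root.findtext("ok") == "true",
    )


if __name__ == "__main__":
    record = (123456, 21.5, True)
    print("binary bytes:", len(encode_binary(*record)))
    print("XML bytes:   ", len(encode_xml(*record)))
```

The binary form is compact and needs no text-to-number conversion, but it is only readable by a receiver that already knows `RECORD_LAYOUT`, which is exactly the advance agreement the approach depends on.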
XML Compression

When thinking about compressing documents, one usually thinks first of common compression algorithms such as Lempel-Ziv and Huffman, and of the common utilities that implement variations on them. On Unix-like platforms, the first utility that comes to mind is usually gzip; on other platforms zip is more common (using utilities such as PKZIP, Info-ZIP, and WinZip). gzip turns out to compress consistently better than zip, though only by a small margin. These utilities do reduce the size of XML files substantially. However, it also turns out that considerably better compression ratios can be obtained by two techniques, used either alone or in combination.

The first technique is to use the Burrows-Wheeler compression algorithm instead of the sequential Lempel-Ziv algorithms.
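A quick way to compare the two algorithm families is Python's standard library, where `zlib` implements the DEFLATE method (Lempel-Ziv plus Huffman coding) used by gzip and zip, and `bz2` wraps the Burrows-Wheeler-based bzip2. The sample document below is synthetic and invented for this sketch, so the numbers it prints say nothing about the files discussed in this article.

```python
import bz2
import zlib


def make_sample_xml(n_records=2000):
    """Build a synthetic, repetitive XML document (hypothetical data)."""
    rows = []
    for i in range(n_records):
        rows.append(
            "<entry><host>10.0.%d.%d</host><status>%d</status>"
            "<bytes>%d</bytes></entry>"
            % (i % 7, i % 251, 200 if i % 5 else 404, 1000 + (i * 37) % 9000)
        )
    return ("<log>\n" + "\n".join(rows) + "\n</log>\n").encode("utf-8")


def compressed_sizes(data):
    """Return the original, DEFLATE, and bzip2 sizes in bytes."""
    return {
        "original": len(data),
        "zlib (gzip/zip family)": len(zlib.compress(data, 9)),
        "bz2 (Burrows-Wheeler)": len(bz2.compress(data, 9)),
    }


if __name__ == "__main__":
    for name, size in compressed_sizes(make_sample_xml()).items():
        print("%-24s %8d bytes" % (name, size))
```

To try it on a real document, read the file in binary mode and pass the bytes to `compressed_sizes`.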

The second technique is to take advantage of the very specific structure of XML documents to produce more compressible representations.

For comparison purposes, this article obtains or creates four base documents. The first is Shakespeare's play Hamlet as an XML document (see Resources). Its markup includes tags such as <PERSONA>, <SPEAKER>, and <LINE>, which map quite naturally onto the typographic forms one might encounter in a printed copy. To compare how XML markup contributes to document size and compressibility, I derived a document Hamlet.txt from Hamlet.xml simply by deleting all the XML tags and keeping their contents. This derivation is not reversible and is strictly a loss of information.
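A minimal sketch of such a tag-stripping derivation follows, using `xml.etree.ElementTree` to concatenate the character data of every element. This is not necessarily how Hamlet.txt was produced, and the small inline fragment (including the <SPEECH> wrapper) is a made-up stand-in for the real file.

```python
import xml.etree.ElementTree as ET


def strip_markup(xml_text):
    """Return only the character data of an XML document, dropping all tags."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


# Hypothetical fragment for illustration; not copied from Hamlet.xml.
SAMPLE = (
    "<SPEECH>\n"
    "<SPEAKER>BERNARDO</SPEAKER>\n"
    "<LINE>Who's there?</LINE>\n"
    "</SPEECH>"
)

if __name__ == "__main__":
    print(strip_markup(SAMPLE))
```

Because the tag names are discarded, nothing in the output says which words were a speaker name and which were a line, which is why the derivation cannot be reversed.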

The other two files are an Apache weblog file (a compact set of line-oriented records) and an XML document created from it. Because the source document is a log file, no information is lost in the transformation, so the original-format document can be recreated from the XML.

Reversible transformations

Although XML is a fairly inefficient representation, bzip2 slightly reduces this inefficiency by regrouping strings. At heart, an XML document is a mixture of quite distinct parts: different kinds of tags, attributes, and element bodies. A standard compression program has less work to do if each relatively homogeneous set of things is gathered up and grouped closely together in a transformed file. For example, if each ...

    import sys
    import xml.sax
    from xml.sax.handler import ContentHandler


    def emit(text):
        # Write through the binary stream so the control characters used as
        # delimiters are not altered by newline translation.
        sys.stdout.buffer.write(text.encode('utf-8'))


    class StructExtractor(ContentHandler):
        """Create a special structure/content form of an XML document"""

        def startDocument(self):
            self.taglist = []
            self.contentdct = {}
            self.state = []        # stack for tag state
            self.newstate = 0      # flag for continuing chars in same elem
            self.struct = []       # compact document structure

        def endDocument(self):
            emit('\n'.join(self.taglist))    # write out the taglist first
            emit(chr(0))                     # section delimiter \x00
            emit(''.join(self.struct))       # write out the structure list
            emit(chr(0))                     # section delimiter \x00
            for tag in self.taglist:         # write all content lists
                emit(chr(2).join(self.contentdct[tag]))
                emit(chr(1))                 # delimiter between content types

        def startElement(self, name, attrs):
            # attrs is ignored: attributes are not recorded by this handler
            if name not in self.taglist:
                self.taglist.append(name)
                self.contentdct[name] = []
                if len(self.taglist) > 253:
                    raise ValueError("more than 253 tags encountered")
            self.state.append(name)    # push current tag
            self.newstate = 1          # chars go to new item
            # single char to indicate tag
            self.struct.append(chr(self.taglist.index(name) + 3))

        def endElement(self, name):
            self.state.pop()              # pop current tag off stack
            self.newstate = 1             # chars go to new item
            self.struct.append(chr(1))    # \x01 is endtag in struct

        def characters(self, ch):
            currstate = self.state[-1]
            if self.newstate:             # either add new chars to state item
                self.contentdct[currstate].append(ch)
                self.newstate = 0
                self.struct.append(chr(2))    # \x02 content placeholder in struct
            else:                         # or append the chars to current item
                self.contentdct[currstate][-1] += ch


    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        handler = StructExtractor()
        parser.setContentHandler(handler)
        parser.parse(sys.stdin.buffer)    # byte stream; the parser detects the encoding

Using SAX rather than DOM makes this transformation quite fast, even though speed was not the main consideration in developing it.

The reverse transformation, struct2xml.py:

    import sys
    from xml.sax.saxutils import escape


    def struct2xml(s):
        tags, struct, content = s.split(chr(0))
        taglist = tags.split('\n')          # all the tags
        contentlist = []                    # list-of-lists of content items
        for block in content.split(chr(1)):
            contents = block.split(chr(2))
            contents.reverse()              # pop off content items from end
            contentlist.append(contents)
        state = []                          # stack for tag state
        skeleton = []                       # templatized version of XML
        for c in struct:
            i = ord(c)
            if i >= 3:                      # start of element
                i -= 3                      # adjust for struct tag index offset
                tag = taglist[i]            # spell out the tag from taglist
                state.append(tag)           # push current tag
                skeleton.append('<%s>' % tag)     # insert the element start tag
            elif i == 1:                    # end of element
                tag = state.pop()           # pop current tag off stack
                skeleton.append('</%s>' % tag)    # insert the element end tag
            elif i == 2:                    # insert element content
                tag = state[-1]
                item = contentlist[taglist.index(tag)].pop()
                skeleton.append(escape(item))     # re-escape &, <, > in the text
            else:
                raise ValueError("unexpected structure tag: ord(%d)" % i)
        return ''.join(skeleton)


    if __name__ == '__main__':
        # Read bytes and decode explicitly so the control-character
        # delimiters are not changed by newline translation.
        data = sys.stdin.buffer.read().decode('utf-8')
        sys.stdout.write(struct2xml(data))
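The idea the transformation relies on, that values of the same kind placed next to each other give a compressor more to work with, can be explored on toy data. The sketch below serializes the same hypothetical log-like records in two orders, record by record and field by field, and prints the `zlib`-compressed size of each. The records, separators, and function names are invented for this example, and the outcome depends on the data, so no particular result is promised.

```python
import zlib

# Hypothetical log-like records: (host, status, path). Not real data.
RECORDS = [
    ("10.0.0.%d" % (i % 17), "200" if i % 9 else "404", "/page/%d.html" % (i % 23))
    for i in range(3000)
]


def interleaved(records):
    """Serialize record by record, mixing the three field types together."""
    return "\n".join("\t".join(r) for r in records).encode("utf-8")


def grouped(records):
    """Serialize field by field, so values of one type sit next to each other."""
    columns = zip(*records)
    return "\x00".join("\n".join(col) for col in columns).encode("utf-8")


def ungroup(data):
    """Invert grouped(): rebuild the list of record tuples."""
    columns = [block.split("\n") for block in data.decode("utf-8").split("\x00")]
    return list(zip(*columns))


if __name__ == "__main__":
    a, b = interleaved(RECORDS), grouped(RECORDS)
    print("interleaved:", len(a), "->", len(zlib.compress(a, 9)))
    print("grouped:    ", len(b), "->", len(zlib.compress(b, 9)))
```

Both serializations carry the same information and `ungroup` restores the original records, so the regrouping is reversible in the same sense as the structure/content transformation above.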
