Compress XML files for efficient transmission

Source: Internet
Author: User
Binary XML has generated a lot of discussion. One reason is the need for more compact transmission formats, especially for Web services. The existing, ready-to-use solution is compression. This tip shows how to use compression to prepare XML files for transmission in Web services.
In discussions of XML, calls for a binary XML have long been heard. Because it is text-based, and because of the rules needed to make it friendly to international text, XML is very verbose. An equivalent binary format would be much more compact. Long ago (2000), in the article "XML the future of EDI?" (see references), I demonstrated converting part of an ANSI X12 EDI purchase order transaction (a binary form) into XML. The resulting XML message was more than eight times longer than the original EDI message (results from some other XML/EDI pilot projects suggest a factor of only about three). This verbosity poses some problems for storing XML, but at least storage is very cheap today. Transmission capacity is generally more constrained, so the loudest calls for binary XML come from users who use XML as a message transmission format, including some Web service users.

One way to make XML more compact is to re-express it in a binary format. The leading candidate is ISO/ITU ASN.1, a data transmission standard that predates XML. Recent updates to ASN.1 provide some XML-related capabilities that can re-express an XML format in a specialized form such as the ASN.1 Packed Encoding Rules, which define a very compact binary encoding. OASIS UBL is one example: that project uses the ASN.1 approach to compress XML data.

Compressing SOAP Encoding
If you need to transmit XML in a Web service, you may find that the payload is too large. In that case, you can apply one of several text compression options to the XML content. Listing 1 is the XML/EDI example from the article mentioned above.

Listing 1. Example XML document for Web service exchange
<?xml version="1.0" encoding="UTF-8"?>
<Purchaseorder version="4010">
  <Purchaseorderheader>
    <Transactionsetheader x12.id="850">
      <Transactionsetidcode code="850" type="regxph" text="yourobjectname"/>
      <Transactionsetcontrolnumber>12345</Transactionsetcontrolnumber>
    </Transactionsetheader>
    <Beginningsegment>
      <Purposetypecode code="00 original"/>
      <Ordertypecode code="Sa stand-alone order"/>
      <Purchaseordernumber>ret8999</Purchaseordernumber>
      <Purchaseorderdate>19981201</Purchaseorderdate>
    </Beginningsegment>
    <Admincommunicationscontact>
      <Contactfunctioncode code="oc order contact"/>
      <Contactname>Obi anozie</Contactname>
    </Admincommunicationscontact>
  </Purchaseorderheader>
  <Purchaseorderdetail>
    <Name1informationloop>
      <Name>
        <Entityidentifiercode code="by buying party"/>
        <Entityname>Internet retailer Inc.</Entityname>
        <Identificationcodequalifier code="91 assigned by seller"/>
        <Identificationcode>ret8999</Identificationcode>
      </Name>
      <Name>
        <Entityidentifiercode code="St ship to"/>
        <Entityname>Internet retailer Inc.</Entityname>
      </Name>
      <Addressinformation>123 Via way</Addressinformation>
      <Geographiclocation>
        <Cityname>Milwaukee</Cityname>
        <Stateprovincecode>wi</Stateprovincecode>
        <Postalcode>53202</Postalcode>
      </Geographiclocation>
    </Name1informationloop>
    <Baselineitemdata>
      <Quantityordered>100</Quantityordered>
      <Unit code="EA each"/>
      <Unitprice>1.23</Unitprice>
      <Pricebasis code="We wholesale price per each"/>
      <Productidqualifier code="mg manufacturer part number"/>
      <Productid description="fuzzy dice">co633</Productid>
    </Baselineitemdata>
  </Purchaseorderdetail>
</Purchaseorder>

The original EDI example is only 200 bytes long, and the XML version is 1721 bytes long.

The well-known PKZIP routine compresses this XML file to 832 bytes.

The GNU gzip routine compresses the file to 707 bytes.

The bzip2 routine compresses the file to 748 bytes.

None of these compressed forms is as compact as the specialized EDI format, but the EDI format is not easy to understand. Bzip2 is known for compressing better than gzip (at the cost of slower compression), but I have observed that the results above are not an isolated case: gzip often does better than bzip2 on XML.

Most platforms and languages now provide compression libraries, including at least the PKZIP and GNU gzip compression algorithms. You can compress the data programmatically before calling the Web service.
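
As a rough illustration of compressing programmatically, the following Python sketch measures one payload under the standard library's zlib, gzip, and bz2 modules. The function name compressed_sizes and the sample payload are made up for this example, and the sizes it reports will not match the figures quoted above, which come from the author's own file and tools.

```python
import bz2
import gzip
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of data before and after each compressor."""
    return {
        "original": len(data),
        "zlib": len(zlib.compress(data, 9)),  # DEFLATE, the usual ZIP method
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
    }


# Hypothetical stand-in payload; substitute the bytes of your own document.
line_item = (
    b"<Baselineitemdata><Quantityordered>100</Quantityordered>"
    b'<Unit code="EA each"/></Baselineitemdata>\n'
)
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n<Purchaseorder>\n'
    + line_item * 20
    + b"</Purchaseorder>\n"
)

for name, size in compressed_sizes(sample).items():
    print(f"{name:>8}: {size} bytes")
```

Running the same comparison on a representative message from your own service is the simplest way to decide which routine to use.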

Be sure to analyze whether canonicalization (c14n) helps compression in your specific case. C14n is a standard method for generating a physical representation of an XML document, called the canonical form, that accounts for the minor variations XML syntax permits without changing the meaning. As a rough rule of thumb, if the XML is edited by hand, so that attribute order and whitespace usage may vary, c14n may improve compression of large documents. However, if the XML is machine-generated or uses a large number of empty elements, c14n may hurt. The preceding example is closer to the latter. I used the c14n module from the PyXML project for canonicalization. The Python code is as follows:

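>>> from xml.dom import minidom
>>> from xml.dom.ext import c14n  # provided by the PyXML package
>>> doc = minidom.parse('listing1.xml')
>>> # With no output argument, the canonical form is returned as a string
>>> c14n.Canonicalize(doc)
>>> f = open('listing1-canonical.xml', 'w')
>>> # With an output argument, the canonical form is written to that file
>>> c14n.Canonicalize(doc, output=f)
>>> f.close()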

The resulting file listing1-canonical.xml is 1867 bytes, and 714 bytes after gzip compression. That is 146 bytes more than the original uncompressed text and 7 bytes more than the original after gzip. The main reason is that c14n writes empty elements in their most verbose form. For example, the following line:

<Unit code="EA each"/>

becomes:

<Unit code="EA each"></Unit>
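
To check the effect on your own documents, a sketch along the following lines may help. It assumes Python 3.8 or later and uses xml.etree.ElementTree.canonicalize from the standard library, which implements Canonical XML 2.0 and is not the PyXML module used above, so byte counts can differ from the ones quoted. The function name canonical_report and the sample fragment are made up for this example.

```python
import gzip
import xml.etree.ElementTree as ET


def canonical_report(xml_text: str) -> dict:
    """Compare raw and canonical forms of a document, plain and gzipped."""
    canonical = ET.canonicalize(xml_text)  # C14N 2.0, returned as a string
    raw_bytes = xml_text.encode("utf-8")
    canonical_bytes = canonical.encode("utf-8")
    return {
        "canonical": canonical,
        "raw_size": len(raw_bytes),
        "canonical_size": len(canonical_bytes),
        "raw_gzip": len(gzip.compress(raw_bytes)),
        "canonical_gzip": len(gzip.compress(canonical_bytes)),
    }


# Hypothetical fragment with several empty elements, as in Listing 1.
fragment = (
    "<Baselineitemdata>"
    '<Unit code="EA each"/>'
    '<Pricebasis code="We wholesale price per each"/>'
    '<Productidqualifier code="mg manufacturer part number"/>'
    "</Baselineitemdata>"
)

report = canonical_report(fragment)
print(report["canonical"])
print(report["raw_size"], "->", report["canonical_size"], "bytes uncompressed")
print(report["raw_gzip"], "->", report["canonical_gzip"], "bytes after gzip")
```

Each empty element gains an explicit end tag, which is why the uncompressed canonical form of a fragment like this one is larger than the input.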

There are two ways to bundle XML compressed by routines such as gzip into SOAP:

Use some form of attachment mechanism.
Use base64 encoding for the message body content.
Base64 represents binary documents using only common text characters. Ready-made libraries for it exist on virtually every platform. There is even a W3C XML Schema data type for base64-encoded data. If you set up the Web service correctly, your tools can perform base64 encoding and decoding automatically. Unfortunately, base64 partially offsets the gain from compression: base64-encoded data is larger than the original, at a ratio of about 4:3. After base64 encoding, the gzip-compressed form of Listing 1 is 957 bytes.
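
The base64 route can be sketched in Python as follows. The function names and the CompressedOrder element are hypothetical, chosen only for illustration; the envelope namespace is the SOAP 1.1 one. As a check on the 4:3 ratio, 707 compressed bytes need 4 × ceil(707 / 3) = 944 base64 characters, and the remaining gap to the 957 bytes quoted above would be consistent with line breaks added by the encoder, though that is an assumption.

```python
import base64
import gzip

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace


def pack_for_soap(xml_bytes: bytes) -> str:
    """Gzip the XML, then base64-encode it so it is safe as element text."""
    return base64.b64encode(gzip.compress(xml_bytes)).decode("ascii")


def unpack_from_soap(encoded: str) -> bytes:
    """Reverse pack_for_soap: base64-decode, then gunzip."""
    return gzip.decompress(base64.b64decode(encoded))


def wrap_in_envelope(encoded: str) -> str:
    """Place the encoded payload in a minimal envelope.

    CompressedOrder is a hypothetical element name, not part of any standard.
    """
    return (
        f'<soap:Envelope xmlns:soap="{SOAP_NS}"><soap:Body>'
        f"<CompressedOrder>{encoded}</CompressedOrder>"
        "</soap:Body></soap:Envelope>"
    )


# Hypothetical stand-in payload; substitute the bytes of your own document.
order = b'<Purchaseorder version="4010">' + b"<Unit/>" * 50 + b"</Purchaseorder>"

encoded = pack_for_soap(order)
envelope = wrap_in_envelope(encoded)
print(len(order), "bytes raw,", len(encoded), "base64 characters")
```

The receiver applies unpack_from_soap to the element text to recover the original XML bytes.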

Conclusion
Generally, if you apply gzip to an XML file and base64-encode the compressed result, the resulting payload is usually only about half the size of the original file when transmitted in a SOAP message. This may meet your need to save space in XML Web services. If not, take a good look at ASN.1.
