Using SAX and XNI to detect the encoding of XML documents



XML is defined in terms of Unicode characters. When a document is transmitted or stored on a modern computer, those Unicode characters must be stored as bytes and then decoded by the parser. Many encoding schemes can do this: UTF-8, UTF-16, ISO-8859-1, Cp1252, and SJIS.
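To make the distinction between characters and bytes concrete, here is a minimal Java sketch that encodes the same text in several charsets and prints the bytes as hex. The class name EncodingBytesDemo and the hex method are hypothetical names chosen for illustration, and only charsets from StandardCharsets are used because those are available on every Java platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows that one character maps to different bytes
// depending on the encoding scheme.
class EncodingBytesDemo {
    /** Encodes text in the given charset and returns the bytes as spaced hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }
}
```

For the single character U+00E9 (written "\u00E9" in Java source), hex returns "C3 A9" with StandardCharsets.UTF_8, "E9" with StandardCharsets.ISO_8859_1, and "00 E9" with StandardCharsets.UTF_16BE. A parser has to know which scheme was used before it can turn the bytes back into characters.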

Usually, but not always, you don't really care about the underlying encoding. The XML parser converts whatever document it reads into Unicode strings and character arrays, and the program operates on the decoded strings. This article discusses the less common situations in which you really do care about the underlying encoding.

The most common scenario is preserving the input encoding so that the output uses the same one.

Alternatively, you may store the document in a database as a string or character large object (CLOB) without parsing it.

Similarly, some systems that transmit XML documents over HTTP do not read the whole document, but they still need to set the HTTP Content-Type header to specify the correct encoding. In this case, you need to know how the document is encoded.
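As a small illustration of the HTTP scenario, the following Java sketch builds a Content-Type header value from an encoding name that has already been detected. The class name ContentTypeHeader, the method forXml, and the choice of the application/xml media type are assumptions made for this example; a real system would use whatever media type fits its documents.

```java
// Hypothetical helper: formats a Content-Type value for an XML document.
class ContentTypeHeader {
    /**
     * Returns a header value such as "application/xml; charset=ISO-8859-1".
     * If the encoding is unknown, the charset parameter is left out rather
     * than guessed.
     */
    static String forXml(String encoding) {
        if (encoding == null || encoding.isEmpty()) {
            return "application/xml";
        }
        return "application/xml; charset=" + encoding;
    }
}
```

The point of the sketch is only that the charset parameter has to come from somewhere, and that somewhere is the document itself.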

In most cases, you know the encoding of a document that you write yourself. However, if you did not write the document and merely received it from somewhere else (for example, from an Atom feed), the best approach is to use a streaming API, such as the Simple API for XML (SAX), the Streaming API for XML (StAX), System.Xml.XmlReader, or the Xerces Native Interface (XNI). You can also use a tree API, such as the Document Object Model (DOM). However, tree APIs need to read the entire document, even though the first 100 bytes (or fewer) are usually enough to determine the encoding. A streaming API can read only as much as it needs and stop parsing as soon as it has the answer, which is more efficient.

SAX

Currently, most SAX parsers, including the SAX parser bundled with Sun's Java™ Software Development Kit (JDK) 6, can be used to detect the encoding. The technique is not difficult to implement, but it is not obvious. It can be summarized as follows:

In the setDocumentLocator method, cast the Locator argument to Locator2.

Save the Locator2 object in a field.

In the startDocument method, call the getEncoding() method of the Locator2 field.

(Optional) If you have all the information you need, you can throw a SAXException to end parsing early.
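The steps above can be sketched in Java roughly as follows. This is a minimal sketch, not the article's original listing: the class name EncodingSniffer, the detect method, and the stopped flag are hypothetical names introduced here, and the parser is obtained through the standard JAXP SAXParserFactory. Parsers are not required to supply a Locator2, so the sketch checks the type before casting and returns null when the encoding is unknown.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name and the detect() method are illustrative.
class EncodingSniffer extends DefaultHandler {
    private Locator2 locator;  // step 2: the saved Locator2
    private String encoding;   // null means "unknown"
    private boolean stopped;   // true once parsing was aborted on purpose

    @Override
    public void setDocumentLocator(Locator locator) {
        // Step 1: only cast when the parser really supplies a Locator2.
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
    }

    @Override
    public void startDocument() throws SAXException {
        // Step 3: ask the saved locator for the encoding.
        if (locator != null) {
            encoding = locator.getEncoding();
        }
        // Step 4 (optional): nothing else is needed, so stop parsing here.
        stopped = true;
        throw new SAXException("Encoding read; stopping early");
    }

    /** Returns the encoding name the parser reports, or null if unknown. */
    static String detect(InputStream in)
            throws IOException, SAXException, ParserConfigurationException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EncodingSniffer handler = new EncodingSniffer();
        reader.setContentHandler(handler);
        try {
            // Pass raw bytes, not a Reader, so the parser does the decoding itself.
            reader.parse(new InputSource(in));
        } catch (SAXException e) {
            if (!handler.stopped) {
                throw e; // a genuine parse error, not the deliberate early exit
            }
        }
        return handler.encoding;
    }
}
```

A caller would pass an open byte stream, for example EncodingSniffer.detect(new FileInputStream("feed.xml")), where feed.xml is a placeholder file name. Which name a parser reports this early in the parse (the declared encoding or one inferred from the first bytes) is worth verifying against the parser you actually deploy.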
