XML is defined in terms of Unicode characters. When a document is transmitted or stored on a modern computer, those Unicode characters must be encoded as bytes, which the parser then decodes. Many encoding schemes can do this, including UTF-8, UTF-16, ISO-8859-1, Cp1252, and SJIS.
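To make the character-to-byte relationship concrete, here is a minimal Java sketch that shows the bytes a piece of text produces under a given encoding. The class name EncodingBytesDemo and the helper hex are made up for this illustration; only String.getBytes(Charset) and the StandardCharsets constants come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper (not from the article): shows how the same
// characters turn into different bytes under different encodings.
class EncodingBytesDemo {

    // Encodes the text with the given charset and returns the bytes
    // as space-separated uppercase hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // the single character U+00E9 (e with acute accent)
        System.out.println("UTF-8:      " + hex(text, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + hex(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-16BE:   " + hex(text, StandardCharsets.UTF_16BE));
    }
}
```

For the character U+00E9, hex returns C3 A9 in UTF-8, E9 in ISO-8859-1, and 00 E9 in UTF-16BE, which is why a parser has to know the encoding before it can decode anything.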
Usually, but not always, you don't need to care about the underlying encoding. The XML parser converts the document, whatever encoding it is written in, into Unicode strings and character arrays, and your program operates on the decoded strings. This article addresses the less common situations in which you really do care about the underlying encoding.
The most common such scenario is preserving the input encoding for the output.
Alternatively, you might store the document in a database as a string or Character Large Object (CLOB) without parsing it.
Similarly, some systems that transmit XML documents over HTTP do not read the documents in full, but still need to set the HTTP Content-type header to specify the correct encoding. In that case, you need to know how the document is encoded.
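As a rough illustration of the HTTP scenario, the sketch below builds a Content-type header value from an encoding name that has already been detected. The class XmlContentType, the method headerValue, and the choice of the application/xml media type are assumptions made for this example; the article does not say which media type such a system would use.

```java
// Illustrative helper (not from the article): builds a Content-type
// header value once the document's encoding is known.
class XmlContentType {

    // Returns "application/xml; charset=<encoding>", or plain
    // "application/xml" when no encoding name is available.
    static String headerValue(String encoding) {
        if (encoding == null || encoding.trim().isEmpty()) {
            // Assumption: with no known encoding, omit the charset
            // parameter rather than guess one.
            return "application/xml";
        }
        return "application/xml; charset=" + encoding.trim();
    }
}
```

For example, headerValue("ISO-8859-1") returns application/xml; charset=ISO-8859-1.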
In most cases, you know the encoding of the documents you write yourself. However, if you did not write the document and merely received it from somewhere else (for example, from an Atom feed), the best approach is to use a streaming API such as the Simple API for XML (SAX), the Streaming API for XML (StAX), System.Xml.XmlReader, or the Xerces Native Interface (XNI). You can also use a tree API such as the Document Object Model (DOM). However, tree APIs must read the entire document, even though the first 100 bytes (or fewer) are usually enough to determine the encoding. A streaming API can read only as much as it needs and stop parsing as soon as it has the answer, which is more efficient.
SAX
Most current SAX parsers, including the one bundled with Sun's Java™ Software Development Kit (JDK) 6, let you detect the encoding. The technique is not hard to implement, but it is not obvious either. It can be summarized as follows:
In the setDocumentLocator method, cast the Locator argument to Locator2.
Save the Locator2 object in a field.
In the startDocument method, call the getEncoding() method of the Locator2 field.
(Optional) If that is all the information you need, you can throw a SAXException to end parsing early.
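The steps above can be sketched in Java roughly as follows. This is a minimal illustration rather than the article's own listing: the class name SaxEncodingDetector, the detect method, and the stopped flag are hypothetical, and obtaining the XMLReader through the JAXP SAXParserFactory is an assumption. The sketch returns null if the parser does not supply a Locator2 or reports no encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that follows the four steps listed above.
class SaxEncodingDetector extends DefaultHandler {

    private Locator2 locator;  // step 2: the Locator2 saved in a field
    private String encoding;   // result; stays null if unavailable
    private boolean stopped;   // true once we abort parsing on purpose

    @Override
    public void setDocumentLocator(Locator locator) {
        // Step 1: cast the Locator to Locator2, if the parser provides one.
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
    }

    @Override
    public void startDocument() throws SAXException {
        // Step 3: ask the saved Locator2 for the encoding.
        if (locator != null) {
            encoding = locator.getEncoding();
        }
        // Step 4 (optional): end parsing early; nothing else is needed.
        stopped = true;
        throw new SAXException("Encoding read; stopping early");
    }

    // Parses just far enough to read the encoding of the given stream.
    static String detect(InputStream in)
            throws IOException, SAXException, ParserConfigurationException {
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SaxEncodingDetector handler = new SaxEncodingDetector();
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(in));
        } catch (SAXException e) {
            // Swallow only the exception we threw ourselves.
            if (!handler.stopped) {
                throw e;
            }
        }
        return handler.encoding;
    }
}
```

A caller would pass a byte stream, never a Reader, so that the parser sees the raw bytes: for example, SaxEncodingDetector.detect(new FileInputStream("feed.xml")), where feed.xml is a placeholder file name. Tracking the stopped flag, rather than catching every SAXException silently, keeps genuine parse errors visible.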