Resolving XML parsing errors caused by a mismatch between the declared and actual file encoding

Source: Internet
Author: User

1. Symptoms

I ran into an XML file that, when parsed in C++, reported the following error:

Fatal Error at file 'D:/test2.xml', line 1, column 40
Message: An exception occurred! Type:UTFDataFormatException, Message:invalid byte 2 (?) of a 2-byte sequence.

Parsing the same file in Java reports the following error:

Invalid byte 1 of 1-byte UTF-8 sequence.
java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

2. Program code

The contents of the Java program are as follows:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlTest {
    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            Document doc = docBuilder.parse("D:/test2.xml");
            Element root = doc.getDocumentElement();
            System.out.println("Root-->" + root.getTagName()); // root node name
        } catch (Exception e) {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
    }
}

The C++ program uses Xerces-C. The code snippet is as follows:

XercesDOMParser *parser = NULL;
DOMTreeErrorReporter *errReporter = NULL;
DOMDocument *document = NULL;
DOMNode *domRootNode = NULL;
try {
    // Initialize
    parser = new XercesDOMParser;
    errReporter = new DOMTreeErrorReporter();
    parser->setErrorHandler(errReporter);
    parser->parse("D:/test2.xml");
    if (errReporter->getSawErrors()) {
        printf("XML format error\n");
        delete errReporter;
        delete parser;
        return -1;
    }
} catch (const OutOfMemoryException& outEx) {
} catch (const XMLException& e) {
} catch (const DOMException& domE) {
} catch (...) {
}

3. Cause analysis

The contents of test2.xml are as follows (the CharacterName values are Chinese text in the original file and are shown here in translation):

<?xml version="1.0" encoding="UTF-8"?>
<InterBOSS>
  <OrderMemberInfo>
    <OrderNumber/>
    <ProductID>1</ProductID>
    <OrderSource>1</OrderSource>
    <ProductOrderMembers>
      <ProductOrderMember>
        <ww>1</ww>
        <Action>1</Action>
        <MemberTypeID>1</MemberTypeID>
        <EffDate>20090512161449</EffDate>
        <Extends>
          <Extend>
            <CharacterID>95105</CharacterID>
            <CharacterName>positioning mode</CharacterName>
            <CharacterValue>a</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95108</CharacterID>
            <CharacterName>terminal ID</CharacterName>
            <CharacterValue>TEST</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95106</CharacterID>
            <CharacterName>terminal plate number</CharacterName>
            <CharacterValue>123456</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95107</CharacterID>
            <CharacterName>terminal model</CharacterName>
            <CharacterValue>TEST</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95109</CharacterID>
            <CharacterName>terminal type</CharacterName>
            <CharacterValue>2</CharacterValue>
          </Extend>
        </Extends>
      </ProductOrderMember>
    </ProductOrderMembers>
  </OrderMemberInfo>
</InterBOSS>

I saved this file locally and opened it in IE. IE also failed to parse it and could not display it properly.

After changing the declaration to encoding="GBK", IE opened the file without any problem.

Running the Java and C++ programs again, both parsed the file normally and reported no errors.

So the problem is in the file itself: its bytes are GBK, but its declaration says UTF-8. The simplest solution is to correct the encoding declaration in the file.
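The mismatch can also be confirmed in code rather than by opening the file in a browser. The following is a minimal sketch, not part of the original programs: the class name EncodingCheck and the method isValidUtf8 are made up for illustration. It runs a strict UTF-8 decoder over the raw bytes and reports whether they are well-formed. The two bytes 0xB6 0xA8 in the sample stand in for a double-byte GBK character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    /** Returns true only if the whole byte array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = "<a>1</a>".getBytes(StandardCharsets.US_ASCII);
        // 0xB6 cannot start a UTF-8 sequence, so a strict decoder rejects it.
        byte[] notUtf8 = {'<', 'a', '>', (byte) 0xB6, (byte) 0xA8, '<', '/', 'a', '>'};
        System.out.println(isValidUtf8(ascii));   // prints "true"
        System.out.println(isValidUtf8(notUtf8)); // prints "false"
    }
}
```

A check like this only says that the bytes are not UTF-8. It does not say which encoding they really are; that still has to be known or guessed (GBK in this case).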

But sometimes that is not practical.

In my case, someone sent me a malformed file. I called them and explained that the file's encoding was wrong, but they would not accept that and asked why no other vendor had complained. Either way, they refused to change it. Since the program runs automatically, fixing each file by hand is obviously not realistic.

Fortunately, code can handle this. Many so-called bugs that look unsolvable at first can be worked around in the program after careful analysis.

4. Analysis and solution

Since the contents of the XML file are not actually encoded in the encoding that the declaration specifies, we can read the contents of the file, re-encode them to match the declaration, and parse the re-encoded content. That solves the problem.
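The idea can be sketched in a few lines. This is an illustrative sketch rather than the author's program: the class name TranscodeAndParse and the method parseWithActualEncoding are hypothetical, and it assumes the real encoding (GBK here) is known and that the JDK in use ships the GBK charset. It decodes the whole byte array in one step, so a multi-byte character is never cut in half at a read-buffer boundary.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class TranscodeAndParse {
    /** Decodes raw with its real charset, re-encodes as UTF-8, then parses. */
    static Document parseWithActualEncoding(byte[] raw, Charset actual) throws Exception {
        String text = new String(raw, actual);                 // bytes -> characters
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);   // characters -> UTF-8 bytes
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(utf8));  // now matches encoding="UTF-8"
    }

    public static void main(String[] args) throws Exception {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<root><name>\u7ec8\u7aef</name></root>";
        byte[] raw = xml.getBytes(gbk);       // simulates a file that lies about its encoding
        Document doc = parseWithActualEncoding(raw, gbk);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

For a real file, the byte array could come from java.nio.file.Files.readAllBytes on the file's path instead of the in-memory sample used here.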

Leaving the file untouched, the Java program is modified as follows:

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlTest {
    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            File file = new File("D:/test2.xml");
            if (!file.exists()) {
                System.out.println(file.getAbsolutePath() + " file does not exist");
                return;
            }
            // Read the file first
            StringBuffer strBuffer = new StringBuffer();
            FileInputStream fileIn = new FileInputStream(file);
            byte[] buf = new byte[1024];
            int len = 0;
            while ((len = fileIn.read(buf)) > 0) {
                // Decodes with the platform default charset (presumably GBK on the
                // author's machine). Note: decoding chunk by chunk can split a
                // multi-byte character that straddles a buffer boundary.
                String str = new String(buf, 0, len);
                strBuffer.append(str);
            }
            fileIn.close();
            System.out.print(strBuffer.toString());
            // Re-encode as UTF-8 so the bytes match the encoding declaration
            InputStream stream = new ByteArrayInputStream(strBuffer.toString().getBytes("UTF-8"));
            Document doc = docBuilder.parse(stream);
            Element root = doc.getDocumentElement();
            System.out.println("Root-->" + root.getTagName()); // root node name
        } catch (Exception e) {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
    }
}

The program now runs correctly and prints the root node name.

The program first reads the contents of the file, decodes them into a string, re-encodes that string as UTF-8, and wraps the resulting bytes in an input stream for parsing.
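A related variation, offered only as a sketch and not as the author's method, is to hand the parser characters instead of bytes. Per the SAX InputSource documentation, when an InputSource carries a character stream the parser reads it directly and disregards the encoding declaration. The class name ReaderParse and its parse method are hypothetical, and GBK is again assumed to be both the real encoding and available in the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ReaderParse {
    /** Parses in by decoding it with the given charset, ignoring the XML declaration. */
    static Document parse(InputStream in, Charset actual) throws Exception {
        Reader reader = new InputStreamReader(in, actual); // we choose the decoding
        InputSource source = new InputSource(reader);      // character stream, not bytes
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }

    public static void main(String[] args) throws Exception {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<root><name>\u7ec8\u7aef</name></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(gbk));
        Document doc = parse(in, gbk);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Compared with re-encoding, this avoids building a second byte array, but it hard-codes the real encoding in the caller in exactly the same way.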

The C++ program works the same way: it first reads the contents of the file, then uses ICU to convert them to UTF-8, and wraps the result in an in-memory input source for parsing.

The relevant part of the C++ program is as follows:

XercesDOMParser *parser = NULL;
DOMTreeErrorReporter *errReporter = NULL;
DOMDocument *document = NULL;
DOMNode *domRootNode = NULL;
try {
    // Initialize
    parser = new XercesDOMParser;
    errReporter = new DOMTreeErrorReporter();
    parser->setErrorHandler(errReporter);

    // 1. Read the file
    FILE *fReadFile = NULL;
    if ((fReadFile = fopen(fullFileName.c_str(), "r")) == NULL) {
        printf("Error reading file: file does not exist or no read permission!\n");
        return -1;
    }
    AISTD string strbuf; // Stores the contents of the file
    char line[G_LINE_SIZE];
    int read_len = 0;
    while (!feof(fReadFile)) {
        char *s = fgets(line, G_LINE_SIZE, fReadFile);
        if (s == NULL) {
            if (ferror(fReadFile)) {
                errDesc = strerror(errno);
                fclose(fReadFile);
                return -1;
            } else
                break;
        }
        strbuf += line;
    }
    fclose(fReadFile);

    // 2. Re-encode the contents of the file
    UnicodeString unicodeStr(strbuf.c_str()); // Convert to Unicode using the default codepage
    UConverter *conv = NULL;
    UErrorCode status = U_ZERO_ERROR;
    conv = ucnv_open("utf-8", &status); // Open the conversion service
    if (U_FAILURE(status)) {
        errDesc = "Failed to create ICU converter handle.";
        return -1;
    }
    // Get the target string
    uint32_t len = 0, len2 = 0;
    len2 = unicodeStr.length();
    uint32_t xmlLen = unicodeStr.length() * 3;
    char* xmlbuf = new char[xmlLen + 1]; // +1 leaves room for the terminating 0
    UChar* uchar_buf = unicodeStr.getBuffer(unicodeStr.length()); // Get the Unicode buffer address
    len = ucnv_fromUChars(conv, xmlbuf, xmlLen, uchar_buf, len2, &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "status == %s\n", u_errorName(status));
        return -1;
    }
    xmlbuf[len] = 0; // Terminate the encoded XML message
    MemBufInputSource *memBufIS = new MemBufInputSource((const XMLByte*)xmlbuf, strlen(xmlbuf), "strbuf", false);
    // Start parsing the message
    parser->parse(*memBufIS);
    if (errReporter->getSawErrors()) {
        printf("XML malformed\n");
        delete errReporter;
        delete parser;
        return -1;
    }
    // Get the document object
    document = parser->getDocument();
    // Get the DOM root node
    domRootNode = (DOMNode*)document->getDocumentElement();
    delete []xmlbuf;
} catch (otl_exception& p) {
} catch (const OutOfMemoryException& outEx) {
} catch (const XMLException& e) {
} catch (const DOMException& domE) {
} catch (...) {
}

This code uses several ICU functions; readers who have not used ICU before can consult its documentation.
