1. Error symptoms
I ran into an XML file that produced the following error when parsed in C++:
Fatal Error at file 'D:/test2.xml', line 1, column 40
Message: An exception occurred! Type:UTFDataFormatException, Message:invalid byte 2 (?) of a 2-byte sequence.
Parsing the same file in Java reports the following error:
Invalid byte 1 of 1-byte UTF-8 sequence.
java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
2. Program code
The Java program is as follows:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlTest {
    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            Document doc = docBuilder.parse("D:/test2.xml");
            Element root = doc.getDocumentElement();
            System.out.println("Root-->" + root.getTagName()); // root node name
        } catch (Exception e) {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
    }
}
The C++ program uses Xerces-C. The code snippet is as follows:
XercesDOMParser *parser = NULL;
DOMTreeErrorReporter *errReporter = NULL;
DOMDocument *document = NULL;
DOMNode *domRootNode = NULL;
try {
    // Initialize the parser and attach the error reporter
    parser = new XercesDOMParser;
    errReporter = new DOMTreeErrorReporter();
    parser->setErrorHandler(errReporter);
    parser->parse("D:/test2.xml");
    if (errReporter->getSawErrors()) {
        printf("XML format error\n");
        delete errReporter;
        delete parser;
        return -1;
    }
} catch (const OutOfMemoryException& outEx) {
} catch (const XMLException& e) {
} catch (const DOMException& domE) {
} catch (...) {
}
3. Cause analysis
The contents of test2.xml are as follows (the CharacterName values were Chinese text in the original file and appear here in translation):
<?xml version="1.0" encoding="UTF-8"?>
<InterBOSS>
  <OrderMemberInfo>
    <OrderNumber/>
    <ProductID>1</ProductID>
    <OrderSource>1</OrderSource>
    <ProductOrderMembers>
      <ProductOrderMember>
        <ww>1</ww>
        <Action>1</Action>
        <MemberTypeID>1</MemberTypeID>
        <EffDate>20090512161449</EffDate>
        <Extends>
          <Extend>
            <CharacterID>95105</CharacterID>
            <CharacterName>positioning mode</CharacterName>
            <CharacterValue>a</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95108</CharacterID>
            <CharacterName>Terminal id</CharacterName>
            <CharacterValue>TEST</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95106</CharacterID>
            <CharacterName>Terminal plate number</CharacterName>
            <CharacterValue>123456</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95107</CharacterID>
            <CharacterName>Terminal model</CharacterName>
            <CharacterValue>TEST</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95109</CharacterID>
            <CharacterName>terminal type</CharacterName>
            <CharacterValue>2</CharacterValue>
          </Extend>
        </Extends>
      </ProductOrderMember>
    </ProductOrderMembers>
  </OrderMemberInfo>
</InterBOSS>
I saved this file locally and opened it in IE. IE also failed to parse it and could not display it properly.
After I changed the declaration to encoding="GBK", IE opened the file without any problem.
Running the Java and C++ programs again, both worked normally and reported no errors.
So the problem is in the file itself: its bytes are GBK, but the declaration says UTF-8. The simplest fix is to correct the encoding attribute in the file.
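The mismatch can be reproduced without the original file. The following is a minimal sketch, not the code from this post: it builds a tiny document in memory, stores it as GBK bytes, and checks whether the JDK's built-in JAXP parser accepts it under each declaration. The class name `DeclarationMismatchDemo`, its helper methods, and the two-character sample text are assumptions made for illustration, and the sketch assumes the JVM ships the GBK charset.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;

final class DeclarationMismatchDemo {
    /** Builds a small document with the given declared encoding, stored as GBK bytes. */
    static byte[] gbkBytes(String declaredEncoding) {
        // \u7ec8\u7aef is a two-character Chinese word, written as escapes
        // so the source file itself stays plain ASCII.
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<CharacterName>\u7ec8\u7aef</CharacterName>";
        return xml.getBytes(Charset.forName("GBK"));
    }

    /** Returns true if the JAXP parser accepts the bytes, false on any parse failure. */
    static boolean parses(byte[] data) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(data));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("declared UTF-8: " + parses(gbkBytes("UTF-8")));
        System.out.println("declared GBK:   " + parses(gbkBytes("GBK")));
    }
}
```

Only the declaration differs between the two inputs; the element content is byte-for-byte the same, which is why editing the encoding attribute alone is enough to make the file parse.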
But sometimes that is not an option.
That was my situation: someone sent me a badly encoded file. I called to tell them the file format was wrong, but they would not admit it and asked why no other vendor had complained. I was speechless. In any case, they were not going to change it. Since my program runs automatically, fixing each file by hand is clearly impractical, and patching the file contents with a separate script is not an attractive option either.
Rest assured, programs are "omnipotent": most so-called bugs and seemingly impossible problems can be solved in code after careful analysis.
4. Analysis and solution
Since the bytes of the XML file are not in the encoding that the declaration specifies, we can read the raw contents of the file, decode them with the encoding they were really written in, re-encode the text in the declared encoding (UTF-8), and then parse the re-encoded content.
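As a compact illustration of that idea, here is a sketch that works on an in-memory byte array instead of a file. It first tries a strict UTF-8 decode, so correctly encoded files pass through unchanged, and only falls back to another charset when the bytes are malformed. The class `ReencodeThenParse`, its method names, and the choice of GBK as the fallback are assumptions for illustration. The approach only makes sense when the declaration says UTF-8, as it does in test2.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ReencodeThenParse {
    /** Decodes raw as strict UTF-8; if the bytes are malformed, decodes with the fallback charset. */
    static String decode(byte[] raw, Charset fallback) {
        try {
            // A fresh decoder reports malformed input instead of silently replacing it.
            return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }

    /** Re-encodes the text as UTF-8 so the bytes match the declaration, then parses. */
    static Document parse(byte[] raw, Charset fallback) {
        try {
            byte[] utf8 = decode(raw, fallback).getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(utf8));
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Extend><CharacterName>\u7ec8\u7aef</CharacterName></Extend>";
        // Simulate the bad file: UTF-8 in the declaration, GBK on disk.
        Document doc = parse(xml.getBytes(gbk), gbk);
        System.out.println("Root-->" + doc.getDocumentElement().getTagName());
    }
}
```

The strict check matters because the program runs unattended: files from senders who do encode correctly must not be re-decoded as GBK.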
Leaving the file untouched, the Java program is modified as follows:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlTest {
    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            File file = new File("D:/test2.xml");
            if (!file.exists()) {
                System.out.println(file.getAbsolutePath() + " file does not exist");
                return;
            }
            // Read the raw bytes of the file first
            ByteArrayOutputStream rawBytes = new ByteArrayOutputStream();
            FileInputStream fileIn = new FileInputStream(file);
            try {
                byte[] buf = new byte[1024];
                int len = 0;
                while ((len = fileIn.read(buf)) > 0) {
                    rawBytes.write(buf, 0, len);
                }
            } finally {
                fileIn.close();
            }
            // Decode once, after the whole file is read, so a multi-byte character
            // is never split across two reads. The charset is named explicitly
            // instead of relying on the platform default (new String(buf, 0, len)).
            String content = new String(rawBytes.toByteArray(), "GBK");
            System.out.print(content);
            // Re-encode as UTF-8 so the bytes match the XML declaration
            ByteArrayInputStream stream = new ByteArrayInputStream(content.getBytes("UTF-8"));
            Document doc = docBuilder.parse(stream);
            Element root = doc.getDocumentElement();
            System.out.println("Root-->" + root.getTagName()); // root node name
        } catch (Exception e) {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
    }
}
The program runs correctly and prints the root node name.
It first reads the contents of the file, then converts them to UTF-8, and wraps the result in a byte stream for parsing.
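A related variant, not the approach used in this post, skips the re-encoding step by handing the parser a character stream. The SAX `InputSource` documentation states that when a character stream is supplied, the parser reads it directly and disregards the encoding declaration in the document. The sketch below assumes the real encoding is known; the class `ReaderParse` and method `parseAs` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ReaderParse {
    /** Parses raw bytes, decoding them with the given charset regardless of the XML declaration. */
    static Document parseAs(byte[] raw, Charset actual) {
        try {
            // The Reader does the decoding, so the parser never sees the raw bytes.
            Reader reader = new InputStreamReader(new ByteArrayInputStream(raw), actual);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(reader));
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Extend><CharacterName>\u7ec8\u7aef</CharacterName></Extend>";
        Document doc = parseAs(xml.getBytes(gbk), gbk);
        System.out.println("Root-->" + doc.getDocumentElement().getTagName());
    }
}
```

For a file on disk, the `ByteArrayInputStream` would be replaced by a `FileInputStream`. The trade-off is that the charset must be right for every file, whereas re-encoding after a strict UTF-8 check also handles senders whose files are correct.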
The C++ program works the same way: it reads the contents of the file, uses ICU to convert them to UTF-8, and wraps the result in an in-memory input source for parsing.
The relevant part of the C++ program is as follows:
XercesDOMParser *parser = NULL;
DOMTreeErrorReporter *errReporter = NULL;
DOMDocument *document = NULL;
DOMNode *domRootNode = NULL;
try {
    // Initialize the parser and attach the error reporter
    parser = new XercesDOMParser;
    errReporter = new DOMTreeErrorReporter();
    parser->setErrorHandler(errReporter);

    // 1. Read the file
    FILE *fReadFile = NULL;
    if ((fReadFile = fopen(fullFileName.c_str(), "r")) == NULL) {
        printf("Error reading file, file not present or no read permission!\n");
        return -1;
    }
    AISTD string strBuf;  // stores the contents of the file
    char line[G_LINE_SIZE];
    while (!feof(fReadFile)) {
        char *s = fgets(line, G_LINE_SIZE, fReadFile);
        if (s == NULL) {
            if (ferror(fReadFile)) {
                errDesc = strerror(errno);  // ferror() is only a flag; errno (<cerrno>) holds the error code
                fclose(fReadFile);
                return -1;
            } else
                break;
        }
        strBuf += line;
    }
    fclose(fReadFile);

    // 2. Convert the encoding of the file contents
    // This constructor decodes with ICU's default converter (the platform codepage)
    UnicodeString unicodeStr(strBuf.c_str());  // get Unicode text
    UConverter *conv = NULL;
    UErrorCode status = U_ZERO_ERROR;
    conv = ucnv_open("utf-8", &status);  // open the conversion service
    if (U_FAILURE(status)) {
        errDesc = "failed to create ICU converter handle.";
        return -1;
    }
    // Get the target string
    uint32_t len = 0, len2 = 0;
    len2 = unicodeStr.length();
    uint32_t xmlLen = unicodeStr.length() * 3;
    char *xmlBuf = new char[xmlLen + 1];  // +1 leaves room for the terminating 0
    const UChar *uchar_buf = unicodeStr.getBuffer();  // get the Unicode buffer address
    len = ucnv_fromUChars(conv, xmlBuf, xmlLen, uchar_buf, len2, &status);
    ucnv_close(conv);
    if (U_FAILURE(status)) {
        fprintf(stderr, "status == %s\n", u_errorName(status));
        delete[] xmlBuf;
        return -1;
    }
    xmlBuf[len] = 0;  // terminate the encoded XML message
    MemBufInputSource *memBufIS = new MemBufInputSource(
        (const XMLByte*)xmlBuf, strlen(xmlBuf), "strBuf", false);
    // Start parsing the message
    parser->parse(*memBufIS);
    if (errReporter->getSawErrors()) {
        printf("XML malformed\n");
        delete errReporter;
        delete parser;
        return -1;
    }
    // Get the document object
    document = parser->getDocument();
    // Get the DOM root node
    domRootNode = (DOMNode*)document->getDocumentElement();
    delete memBufIS;
    delete[] xmlBuf;
} catch (otl_exception& p) {
} catch (const OutOfMemoryException& outEx) {
} catch (const XMLException& e) {
} catch (const DOMException& domE) {
} catch (...) {
}
The code uses a few ICU functions. Readers who have not used ICU before can consult its documentation.