Resolving XML parsing errors caused by a mismatch between the declared and the actual file encoding

Last Update:2018-07-27 Source: Internet

Author: User



1. Symptoms

I ran into an XML file that, when parsed in C++, reported the following error:

Fatal Error at file 'D:/test2.xml', line 1, column 40
Message: An exception occurred! Type:UTFDataFormatException, Message:invalid byte 2 (?) of a 2-byte sequence.

Parsing the same file in Java reports the following error:

Invalid byte 1 of 1-byte UTF-8 sequence.
java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

2. Program code

The Java program is as follows:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlTest {
    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            Document doc = docBuilder.parse("D:/test2.xml");
            Element root = doc.getDocumentElement();
            System.out.println("Root-->" + root.getTagName()); // root node name
        } catch (Exception e) {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
    }
}

The C++ program uses Xerces-C++ to parse the same file.

3. Cause analysis

The contents of test2.xml are as follows (the CharacterName values are Chinese text in the original file and appear here in translation):

<?xml version="1.0" encoding="UTF-8"?>
<InterBOSS>
  <OrderMemberInfo>
    <OrderNumber/>
    <ProductID>1</ProductID>
    <OrderSource>1</OrderSource>
    <ProductOrderMembers>
      <ProductOrderMember>
        <ww>1</ww>
        <Action>1</Action>
        <MemberTypeID>1</MemberTypeID>
        <EffDate>20090512161449</EffDate>
        <Extends>
          <Extend>
            <CharacterID>95105</CharacterID>
            <CharacterName>positioning mode</CharacterName>
            <CharacterValue>a</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95108</CharacterID>
            <CharacterName>terminal ID</CharacterName>
            <CharacterValue>TEST</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95106</CharacterID>
            <CharacterName>terminal plate number</CharacterName>
            <CharacterValue>123456</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95107</CharacterID>
            <CharacterName>terminal model</CharacterName>
            <CharacterValue>TEST</CharacterValue>
          </Extend>
          <Extend>
            <CharacterID>95109</CharacterID>
            <CharacterName>terminal type</CharacterName>
            <CharacterValue>2</CharacterValue>
          </Extend>
        </Extends>
      </ProductOrderMember>
    </ProductOrderMembers>
  </OrderMemberInfo>
</InterBOSS>

I saved this file locally and opened it in IE. IE could not parse it either, and the file was not displayed properly.

After changing the declaration to encoding="GBK", IE opened the file without any problem.

Running the Java and C++ programs again, both worked normally and reported no errors.
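The observation above suggests that the bytes in the file are simply not valid UTF-8. A minimal sketch of how this could be checked in Java is shown below; the class name Utf8Check and the sample bytes are illustrative assumptions, not part of the original programs. It uses a strict decoder that reports malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
final class Utf8Check {

    /** Returns true only if the whole byte array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = "<a>1</a>".getBytes(StandardCharsets.US_ASCII);
        // Assumed sample: two high bytes, as a double-byte encoding such as GBK
        // would produce. 0xD6 looks like a 2-byte UTF-8 lead byte, but 0xD5 is
        // not a valid continuation byte.
        byte[] doubleByte = {'<', 'a', '>', (byte) 0xD6, (byte) 0xD5, '<', '/', 'a', '>'};
        System.out.println(isValidUtf8(ascii));      // prints true
        System.out.println(isValidUtf8(doubleByte)); // prints false
    }
}
```

Pure ASCII content passes this check in any ASCII-compatible encoding, so a file like this one only fails once it contains non-ASCII text.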

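So the problem lies in the file itself: its content is not encoded the way its declaration says (the declaration says UTF-8, but the content is GBK). The simplest solution is to correct the encoding attribute in the file's XML declaration.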
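As a rough illustration of that declaration fix, the sketch below rewrites only the encoding pseudo-attribute of the XML declaration and leaves every other byte untouched; the caller could then write the result back to disk. The class XmlDeclPatcher and its method are hypothetical names, and the approach assumes an ASCII-compatible encoding (it would not work for UTF-16).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
final class XmlDeclPatcher {

    // Matches the encoding pseudo-attribute of an XML declaration that sits
    // at the very start of the data; group 2 is the quote character used.
    private static final Pattern DECL_ENCODING = Pattern.compile(
            "\\A(<\\?xml\\b[^>]*?\\bencoding\\s*=\\s*)([\"'])[^\"']*\\2");

    /** Returns a copy with the declared encoding replaced, or the input itself if there is no declaration. */
    static byte[] patchDeclaredEncoding(byte[] xml, String newEncoding) {
        // ISO-8859-1 maps each byte to exactly one char and back, so the
        // non-ASCII bytes of the document body survive the round trip unchanged.
        String text = new String(xml, StandardCharsets.ISO_8859_1);
        Matcher m = DECL_ENCODING.matcher(text);
        if (!m.find()) {
            return xml;
        }
        String patched = m.group(1) + m.group(2) + newEncoding + m.group(2)
                + text.substring(m.end());
        return patched.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] original = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>1</a>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] patched = patchDeclaredEncoding(original, "GBK");
        System.out.println(new String(patched, StandardCharsets.US_ASCII));
    }
}
```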

But sometimes modifying the file itself is not an option.

In my case, another party sent me the faulty file. I called them and explained that the file format had a problem, but they would not admit it and asked why no other vendor had complained, which left me speechless. In any case, they refused to change anything. Since my program runs automatically, fixing each file by hand is obviously not realistic.

Rest assured, programs are "omnipotent": most so-called bugs and seemingly impossible problems can, after careful analysis, be solved in code.

4. Solution

Since the content of the XML file is not encoded the way the declaration specifies, we can read the content of the file, convert it to the declared encoding, and then parse the converted content. That solves the problem.

Leaving the file untouched, the Java program is modified as follows:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlTest {
    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            File file = new File("D:/test2.xml");
            if (!file.exists()) {
                System.out.println(file.getAbsolutePath() + " file does not exist");
                return;
            }
            // Read the whole file as raw bytes first. Decoding each 1024-byte
            // chunk separately could split a multi-byte character in two.
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            FileInputStream fileIn = new FileInputStream(file);
            try {
                byte[] buf = new byte[1024];
                int len;
                while ((len = fileIn.read(buf)) > 0) {
                    raw.write(buf, 0, len);
                }
            } finally {
                fileIn.close();
            }
            // Decode with the file's real encoding. The charset is named
            // explicitly here rather than relying on the platform default,
            // which is GBK only on a GBK-locale system.
            String xml = new String(raw.toByteArray(), "GBK");
            System.out.print(xml);
            // Re-encode as UTF-8 so the bytes match the declaration
            InputStream stream = new ByteArrayInputStream(xml.getBytes("UTF-8"));
            Document doc = docBuilder.parse(stream);
            Element root = doc.getDocumentElement();
            System.out.println("Root-->" + root.getTagName()); // root node name
        } catch (Exception e) {
            System.out.println(e.getMessage());
            e.printStackTrace();
        }
    }
}

The program now runs correctly and prints the root node name.

It first reads the content of the file, converts it to UTF-8, and wraps the result in an in-memory byte stream for parsing.
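A possible variation, not used in the original program, is to hand the parser a character stream instead of re-encoding the bytes. According to the SAX InputSource documentation, a parser given a character stream reads it directly and disregards the encoding declaration. The sketch below assumes the caller knows the real charset; the class name CharsetOverrideParser is made up for illustration, and ISO-8859-1 stands in for the real encoding in the demo so that it runs on any JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, named for illustration only.
final class CharsetOverrideParser {

    /** Parses XML bytes using the given charset, whatever the declaration claims. */
    static Document parse(byte[] data, Charset actual)
            throws ParserConfigurationException, SAXException, IOException {
        // The Reader does the byte-to-character decoding, so the parser
        // never has to trust the declared encoding.
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), actual);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // The declaration says UTF-8, but the bytes are ISO-8859-1:
        // the same kind of mismatch as in the article.
        byte[] data = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><r>\u00e9</r>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(data, StandardCharsets.ISO_8859_1);
        System.out.println("Root-->" + doc.getDocumentElement().getTagName());
    }
}
```

For the file discussed here, the call would presumably be parse(bytes, Charset.forName("GBK")), provided the JVM ships the GBK charset.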

The C++ program works the same way: it first reads the content of the file, converts it to UTF-8 with ICU, and wraps the result in an in-memory input source for parsing.

The relevant part of the C++ program is as follows:

XercesDOMParser *parser = NULL;
DOMTreeErrorReporter *errReporter = NULL;
DOMDocument *document = NULL;
DOMNode *domRootNode = NULL;
try {
    // Initialize the parser and its error handler
    parser = new XercesDOMParser;
    errReporter = new DOMTreeErrorReporter();
    parser->setErrorHandler(errReporter);

    // 1. Read the file
    FILE *fReadFile = NULL;
    if ((fReadFile = fopen(fullFileName.c_str(), "r")) == NULL) {
        printf("Error reading file: file does not exist or is not readable!\n");
        return -1;
    }
    AISTD string strBuf; // holds the contents of the file
    char line[G_LINE_SIZE];
    while (!feof(fReadFile)) {
        char *s = fgets(line, G_LINE_SIZE, fReadFile);
        if (s == NULL) {
            if (ferror(fReadFile)) {
                // strerror() expects errno (from <cerrno>); ferror() only returns a flag
                errDesc = strerror(errno);
                fclose(fReadFile);
                return -1;
            } else
                break;
        }
        strBuf += line;
    }
    fclose(fReadFile);

    // 2. Convert the contents of the file
    // This constructor decodes the bytes with ICU's default converter
    // (the platform code page) into UTF-16.
    UnicodeString unicodeStr(strBuf.c_str());
    UErrorCode status = U_ZERO_ERROR;
    UConverter *conv = ucnv_open("utf-8", &status); // open the UTF-8 converter
    if (U_FAILURE(status)) {
        errDesc = "Failed to create the ICU converter handle.";
        return -1;
    }
    // Encode the UTF-16 text as UTF-8
    int32_t len2 = unicodeStr.length();
    int32_t xmlLen = len2 * 3 + 1; // at most 3 UTF-8 bytes per UTF-16 unit, plus the terminator
    char *xmlBuf = new char[xmlLen];
    const UChar *ucharBuf = unicodeStr.getBuffer(); // read-only pointer to the UTF-16 buffer
    int32_t len = ucnv_fromUChars(conv, xmlBuf, xmlLen, ucharBuf, len2, &status);
    ucnv_close(conv);
    if (U_FAILURE(status)) {
        fprintf(stderr, "status == %s\n", u_errorName(status));
        delete[] xmlBuf;
        return -1;
    }
    xmlBuf[len] = 0; // terminate the encoded XML message

    // Start parsing the message from memory
    MemBufInputSource *memBufIS = new MemBufInputSource(
        (const XMLByte *)xmlBuf, len, "strBuf", false);
    parser->parse(*memBufIS);
    if (errReporter->getSawErrors()) {
        printf("XML malformed\n");
        delete memBufIS;
        delete[] xmlBuf;
        delete errReporter;
        delete parser;
        return -1;
    }
    // Get the document object (it is owned by the parser)
    document = parser->getDocument();
    // Get the DOM root node
    domRootNode = (DOMNode *)document->getDocumentElement();
    delete memBufIS;
    delete[] xmlBuf;
} catch (otl_exception &p) {
} catch (const OutOfMemoryException &outEx) {
} catch (const XMLException &e) {
} catch (const DOMException &domE) {
} catch (...) {
}

Some ICU functions are used here; readers who have not used ICU before can consult its documentation.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
