TinyXML fails to parse an XML file encoded in UTF-8 if the file contains the following two strings: "<name> Humanities </name>" and "<name> Courier </name>".
Analyzing the code shows that the failure comes from the following code:
Function in the tinyxmlparser.cpp file: const char* TiXmlBase::ReadText()

    int len;
    char cArr[4] = { 0, 0, 0, 0 };
    p = GetChar( p, cArr, &len, encoding );
    if ( len == 1 )
        (*text) += cArr[0];    // more efficient
    else
        text->append( cArr, len );
A preliminary analysis suggests that the problem lies in how UTF-8 strings are parsed.
The parser uses the following table to determine how many bytes each character occupies:
    const int TiXmlBase::utf8ByteTable[256] =
    {
        //  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x00
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x10
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x20
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x30
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x40
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x50
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x60
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x70  End of ASCII range
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x80  0x80 to 0xc1 invalid
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0x90
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0xa0
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  // 0xb0
            1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  // 0xc0  0xc2 to 0xdf 2 byte
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  // 0xd0
            3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  // 0xe0  0xe0 to 0xef 3 byte
            4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1   // 0xf0  0xf0 to 0xf4 4 byte, 0xf5 and higher invalid
    };
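To make the role of the table easier to follow, here is a minimal sketch of the same lookup written as range checks, together with a simplified stand-in for the GetChar-and-append step. The names utf8SequenceLength and appendOneChar are hypothetical and are not part of TinyXML; the sketch also ignores entity handling (such as "&amp;"), which the real GetChar has to deal with.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not TinyXML API): number of bytes implied by a
// UTF-8 lead byte. It mirrors the table above, including the choice to
// treat invalid lead bytes as single-byte characters.
int utf8SequenceLength(unsigned char lead) {
    if (lead < 0xC2) return 1;  // ASCII, plus 0x80..0xC1 (invalid as lead bytes)
    if (lead < 0xE0) return 2;  // 0xC2..0xDF
    if (lead < 0xF0) return 3;  // 0xE0..0xEF
    if (lead < 0xF5) return 4;  // 0xF0..0xF4
    return 1;                   // 0xF5 and higher: invalid
}

// Hypothetical stand-in for the GetChar + append step: copies one
// character starting at p into out and returns the position after it.
const char* appendOneChar(const char* p, std::string& out) {
    assert(p != nullptr);
    int len = utf8SequenceLength(static_cast<unsigned char>(*p));
    int copied = 0;
    // Stop early at the terminating NUL so we never read past the buffer.
    while (copied < len && p[copied] != '\0') {
        out += p[copied];
        ++copied;
    }
    return p + copied;
}
```

Note that the length is taken from the lead byte alone. The bytes that follow are copied without checking that they are continuation bytes, so a lead byte that does not match the real encoding of the text would cause the following bytes to be consumed as part of the same character.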
This table can be found by searching on Google. The root cause of the parsing failure is still unknown.
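One way to narrow the problem down is to check whether the bytes of the failing strings are actually well-formed UTF-8, since the table only inspects lead bytes. The following is a minimal diagnostic sketch, not TinyXML code; isWellFormedUtf8 is a hypothetical name, and it checks only lead-byte ranges and continuation bytes, not every overlong or surrogate case. That the file may contain bytes in another encoding is an assumption to be tested, not a confirmed cause.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical diagnostic (not TinyXML API): returns true if every byte in
// [data, data + size) belongs to a sequence with a valid lead byte followed
// by the right number of continuation bytes (0x80..0xBF). On failure, the
// offset of the offending lead byte is stored in *badOffset if it is non-null.
bool isWellFormedUtf8(const char* data, std::size_t size, std::size_t* badOffset) {
    assert(data != nullptr);
    std::size_t i = 0;
    while (i < size) {
        unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t len = 0;
        if (lead < 0x80)                      len = 1;
        else if (lead >= 0xC2 && lead <= 0xDF) len = 2;
        else if (lead >= 0xE0 && lead <= 0xEF) len = 3;
        else if (lead >= 0xF0 && lead <= 0xF4) len = 4;

        bool ok = (len != 0) && (i + len <= size);  // valid lead, not truncated
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(data[i + k]);
            ok = (c & 0xC0) == 0x80;                // must be 10xxxxxx
        }
        if (!ok) {
            if (badOffset != nullptr) *badOffset = i;
            return false;
        }
        i += len;
    }
    return true;
}
```

Running such a check over the raw file contents before handing them to TinyXML would show whether the two strings are valid UTF-8 at all, and if not, at which byte offset the first malformed sequence begins.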
TinyXML parsing of UTF-8 strings