Workaround for XML exceptions such as "(hexadecimal value 0x1F) is an invalid character" when fetching XML

Source: Internet
Author: User


http://hi.baidu.com/zeratul_bb/blog/item/3e2a44cf085cf33af8dc61de.html







Recently, while building a news collector, I needed to fetch XML from many sites. Loading certain sites often failed with "(hexadecimal value 0x1F) is an invalid character", and I could not work out how to fix it. My first idea for the problem sites was this: since calling the XmlDocument object's Load() method directly did not work, use LoadXml() instead. That is, fetch the URL with HttpWebRequest, read the response stream into a string, filter out the invalid characters along the way, and then load the result as XML. But it still failed; it only solved the request timeout problem...
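The filtering step described above can be sketched as follows. This is only an assumption about what such a filter might look like, not the code used in the original post; the XmlSanitizer class and StripInvalidXmlChars method are hypothetical names. It relies on XmlConvert.IsXmlChar (available from .NET Framework 4.0) and removes only raw invalid characters; character references such as &#x1F; in the text are left untouched.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original post.
public static class XmlSanitizer
{
    // Returns a copy of text without characters that XML 1.0 does not allow.
    public static string StripInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // Surrogate halves are kept so that valid surrogate pairs survive.
            if (XmlConvert.IsXmlChar(c) || char.IsSurrogate(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

The cleaned string can then be passed to XmlDocument.LoadXml() in place of the raw response text.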



The problem was shelved for a week and was finally resolved today.



Actually, it's simple: just add one statement.



XmlDocument doc = new XmlDocument();
doc.Normalize();



Summary (from the documentation of Normalize):
Puts all XmlText nodes in the full depth of the subtree under this XmlNode into a "normal" form, in which only markup (that is, tags, comments, processing instructions, CDATA sections, and entity references) separates XmlText nodes; that is, there are no adjacent XmlText nodes.



Here is a related post from a friend:



Recently, I had a problem calling a Web service that exposes information recorded in a database. The call reported the following error message:



System.InvalidOperationException was unhandled
Message="There is an error in XML document (1, 823)."
Source="System.Xml"
Message="(hexadecimal value 0x0E) is an invalid character. Line 1, position 823."
Source="System.Xml"



When this error occurs, the server side of the Web service reports no error at all; only the client calling the Web service reports the error above.
What is the cause of this problem?
The answer is simple: the XML document returned by the Web service contains low-order, nonprinting ASCII characters.
Looking at the XML document that the Web service returns, we find a fragment like the following, which contains a low-order ASCII character in encoded form:



<Value> in the Magic world who as wind and rain </Value>



The low-order nonprinting ASCII characters that cause these problems are the following:
#x0-#x8 (ASCII 0-8)
#xB-#xC (ASCII 11-12)
#xE-#x1F (ASCII 14-31)
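
The three ranges above can be turned into a small check. The following is a minimal sketch; the LowOrderChars class and its method names are hypothetical and do not come from the original post. Tab (9), line feed (10) and carriage return (13) are deliberately not flagged, since they are missing from the list above.

```csharp
// Hypothetical helper, not part of the original post.
public static class LowOrderChars
{
    // True for the ranges listed above: 0x00-0x08, 0x0B-0x0C, 0x0E-0x1F.
    public static bool IsLowOrderNonPrinting(char c)
    {
        return c <= '\u0008'
            || c == '\u000B' || c == '\u000C'
            || (c >= '\u000E' && c <= '\u001F');
    }

    // Returns the index of the first such character in text, or -1 if there is none.
    public static int IndexOfFirst(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (IsLowOrderNonPrinting(text[i])) return i;
        }
        return -1;
    }
}
```

A check like this can be run on data before it is serialized, to locate the offending character rather than learning about it from the client's exception.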



Here is a console program that demonstrates the problem.
For simplicity, instead of creating a Web service, it XML-serializes a class to a file and then deserializes the file to read it back.
A low-order, nonprinting ASCII character is placed in the Value property of this class.
If you run this console program, it throws an exception: "There is an error in XML document (3, 12)."


using System;
using System.Xml.Serialization;
using System.IO;
using System.Text;
using System.Globalization;

namespace TextSerialize
{
    [Serializable]
    public class MyClass
    {
        public string Value { get; set; }
    }

    class Program
    {
        static void Main(string[] args)
        {
            string fileName = "d://1.txt";
            MyClass c = new MyClass();
            // Put ASCII 14 (0x0E), a low-order nonprinting character, into the value.
            c.Value = string.Format("who is in the magic {0} world", Convert.ToChar(14));
            SaveAsXML(c, fileName, Encoding.UTF8);
            // Deserializing the file throws InvalidOperationException here.
            object o = ConvertFileToObject(fileName, typeof(MyClass), Encoding.UTF8);
            MyClass d = o as MyClass;
            if (d != null) Console.WriteLine(d.Value);
            else Console.WriteLine("null");
            Console.ReadLine();
        }

        /// <summary>
        /// Serialization
        /// </summary>
        /// <param name="objectToConvert"></param>
        /// <param name="path"></param>
        /// <param name="encoding"></param>
        public static void SaveAsXML(object objectToConvert, string path, Encoding encoding)
        {
            if (objectToConvert != null)
            {
                Type t = objectToConvert.GetType();
                XmlSerializer ser = new XmlSerializer(t);
                using (StreamWriter writer = new StreamWriter(path, false, encoding))
                {
                    ser.Serialize(writer, objectToConvert);
                    writer.Close();
                }
            }
        }

        /// <summary>
        /// Deserialization
        /// </summary>
        /// <param name="path"></param>
        /// <param name="objectType"></param>
        /// <param name="encoding"></param>
        /// <returns></returns>
        public static object ConvertFileToObject(string path, Type objectType, Encoding encoding)
        {
            object convertedObject = null;
            if (!string.IsNullOrEmpty(path))
            {
                XmlSerializer ser = new XmlSerializer(objectType);
                using (StreamReader reader = new StreamReader(path, encoding))
                {
                    convertedObject = ser.Deserialize(reader);
                    reader.Close();
                }
            }
            return convertedObject;
        }
    }
}


The Web service problem described above is the same as the one in the demo program.



When the content we serialize contains low-order nonprinting ASCII characters, .NET serializes it without complaint and automatically converts those characters to &#x-encoded character references (they cannot appear raw in XML).



However, deserialization fails if the content to be deserialized contains such an &#x-encoded character reference (one that maps to a low-order, nonprinting ASCII character), because XML 1.0 does not allow these characters even in escaped form.






How do we solve this problem?



The most thorough solution, of course, is to modify the deserialization code so that these characters no longer cause errors. But that code is very often outside our control, so this approach is frequently not feasible.
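
Where the deserializing code is under our control, one possible form of that change is to deserialize through an XmlReader whose XmlReaderSettings.CheckCharacters property is false, which relaxes the check on character references. This is a minimal sketch under that assumption; the Item class and the LenientXml helper are hypothetical names, not from the original post.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical data class standing in for the type being deserialized.
public class Item
{
    public string Value { get; set; }
}

// Hypothetical helper, not part of the original post.
public static class LenientXml
{
    public static T Deserialize<T>(string xml)
    {
        // CheckCharacters = false lets references such as &#xE; through.
        var settings = new XmlReaderSettings { CheckCharacters = false };
        var serializer = new XmlSerializer(typeof(T));
        using (var stringReader = new StringReader(xml))
        using (var xmlReader = XmlReader.Create(stringReader, settings))
        {
            return (T)serializer.Deserialize(xmlReader);
        }
    }
}
```

For example, LenientXml.Deserialize<Item>("<Item><Value>a&#xE;b</Value></Item>") reads the value with the control character in place instead of throwing. The resulting document is still not well-formed XML 1.0, so other consumers may continue to reject it.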



The next option is to eliminate these troublesome characters.



The solution I give here is to preprocess these characters before the content is serialized, and to reverse the process after it is deserialized.
For the sake of demonstration, the logic here converts low-order nonprinting ASCII characters to &#x-encoded text, and converts that &#x-encoded text back into low-order nonprinting ASCII characters.
You can build further processing logic on top of the functions provided here. The code for the two functions is as follows:


/// <summary>
/// Replaces the low-order ASCII characters in a string with #x-encoded text.
/// Converts ASCII 0 - 8   -> #x0 - #x8
/// Converts ASCII 11 - 12 -> #xB - #xC
/// Converts ASCII 14 - 31 -> #xE - #x1F
/// </summary>
/// <param name="tmp"></param>
/// <returns></returns>
public static string ReplaceLowOrderASCIICharacters(string tmp)
{
    StringBuilder info = new StringBuilder();
    foreach (char cc in tmp)
    {
        int ss = (int)cc;
        // Upper bound is 31 (0x1F); 32 is the space character and must stay as-is.
        if (((ss >= 0) && (ss <= 8)) || ((ss >= 11) && (ss <= 12)) || ((ss >= 14) && (ss <= 31)))
            info.AppendFormat("&#x{0:X};", ss);
        else info.Append(cc);
    }
    return info.ToString();
}

/// <summary>
/// Replaces #x-encoded text in a string with the low-order ASCII characters.
/// Converts #x0 - #x8  -> ASCII 0 - 8
/// Converts #xB - #xC  -> ASCII 11 - 12
/// Converts #xE - #x1F -> ASCII 14 - 31
/// </summary>
/// <param name="input"></param>
/// <returns></returns>
public static string GetLowOrderASCIICharacters(string input)
{
    if (string.IsNullOrEmpty(input)) return string.Empty;
    int pos, startIndex = 0, len = input.Length;
    if (len <= 4) return input;
    StringBuilder result = new StringBuilder();
    while ((pos = input.IndexOf("&#x", startIndex)) >= 0)
    {
        bool needReplace = false;
        string rOldV = string.Empty, rNewV = string.Empty;
        // The longest encoded form, "&#x1F;", is 6 characters, so look for ';' within that window.
        int le = (len - pos < 6) ? len - pos : 6;
        int p = input.IndexOf(";", pos, le);
        if (p > 0)
        {
            rOldV = input.Substring(pos, p - pos + 1);
            // Calculate the corresponding low-order character
            short ss;
            if (short.TryParse(rOldV.Substring(3, p - pos - 3), NumberStyles.AllowHexSpecifier, null, out ss))
            {
                if (((ss >= 0) && (ss <= 8)) || ((ss >= 11) && (ss <= 12)) || ((ss >= 14) && (ss <= 31)))
                {
                    needReplace = true;
                    rNewV = Convert.ToChar(ss).ToString();
                }
            }
            pos = p + 1;
        }
        else pos += le;
        string part = input.Substring(startIndex, pos - startIndex);
        if (needReplace) result.Append(part.Replace(rOldV, rNewV));
        else result.Append(part);
        startIndex = pos;
    }
    result.Append(input.Substring(startIndex));
    return result.ToString();
}




With these two functions, the Main function of our demo program is changed to the following code, and the error no longer occurs.


static void Main(string[] args)
{
    Console.WriteLine(GetLowOrderASCIICharacters("123456&#x50000"));
    Console.WriteLine(GetLowOrderASCIICharacters("123456&#x5"));
    Console.WriteLine(GetLowOrderASCIICharacters("&#x5"));
    Console.WriteLine(GetLowOrderASCIICharacters("0123 456789"));
    Console.WriteLine(GetLowOrderASCIICharacters("/f"));
    Console.WriteLine(GetLowOrderASCIICharacters(" =-1"));
    Console.WriteLine(GetLowOrderASCIICharacters(" "));
    Console.WriteLine(GetLowOrderASCIICharacters(" "));
    string fileName = "d://1.txt";
    MyClass c = new MyClass();
    c.Value = string.Format("who is in the magic {0} world", Convert.ToChar(14));
    // Encode the low-order characters before serializing.
    c.Value = ReplaceLowOrderASCIICharacters(c.Value);
    SaveAsXML(c, fileName, Encoding.UTF8);
    object o = ConvertFileToObject(fileName, typeof(MyClass), Encoding.UTF8);
    MyClass d = o as MyClass;
    if (d != null)
    {
        // Decode them again after deserializing.
        d.Value = GetLowOrderASCIICharacters(d.Value);
        Console.WriteLine(d.Value);
    }
    else Console.WriteLine("null");
    Console.ReadLine();
}

Summary: low-order nonprinting ASCII characters frequently cause problems in our systems, and this group of characters must be handled specially.

