Filter invalid XML characters

Source: Internet
Author: User

Characters supported by XML

Character range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

In other words, XML supports any Unicode character except those in the surrogate blocks, FFFE, and FFFF. The production also excludes the control characters below #x20 other than tab (#x9), line feed (#xA), and carriage return (#xD).

The ranges 0xD800 to 0xDBFF (high surrogates) and 0xDC00 to 0xDFFF (low surrogates) are called the surrogate blocks.

The surrogate blocks are used to represent supplementary characters, that is, characters in the range [#x10000-#x10FFFF].

Supplementary characters are characters that cannot be expressed in a single 16-bit code unit. Unicode was originally designed as a fixed-width 16-bit character encoding, but 65,536 code values cannot represent all the characters that are or have been in use in the world. The Unicode standard was therefore extended to allow up to 1,112,064 characters. Those beyond the first 65,536 code points are the supplementary characters, and UTF-16 encodes each of them as a pair of surrogates.
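
As a rough illustration of the surrogate mechanism described above, the following Java sketch splits a supplementary code point into a high and a low surrogate and combines them again. The class name SurrogateDemo, its method names, and the sample code point U+1F600 are assumptions chosen for this example; they do not come from the original article.

```java
final class SurrogateDemo {
    // Split a supplementary code point (0x10000..0x10FFFF) into a surrogate pair.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));     // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));    // bottom 10 bits
        return new char[] { high, low };
    }

    // Combine a high and a low surrogate back into the code point.
    static int fromSurrogates(char high, char low) {
        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        // prints "U+1F600 -> D83D DE00"
        System.out.printf("U+1F600 -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        // prints "back -> U+1F600"
        System.out.printf("back -> U+%X%n", fromSurrogates(pair[0], pair[1]));
    }
}
```

The standard library offers the same conversions through Character.toChars and Character.toCodePoint; the arithmetic is spelled out here only to show why the two surrogate blocks are reserved.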

 

The characters to be filtered in XML fall into two types:

The first type is characters that are not allowed to appear in XML at all, because they lie outside the range defined by the XML specification.

The second type is characters that XML uses for its own markup. If the content contains these characters, they must be replaced with the corresponding escape sequences.

 

First type of characters

For the first type, the W3C XML specification tells us which characters are not allowed to appear in an XML document.
The allowed characters in XML are "#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]". Therefore, we can filter out any character outside this range.
Among the control characters, the ranges to be filtered are:
0x00-0x08
0x0b-0x0c
0x0e-0x1f
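
The three ranges above can be removed with a single regular expression. The following Java sketch is one possible way to do it; the class name ControlCharFilter and the method name strip are hypothetical, and the sketch assumes that invalid characters should simply be dropped rather than replaced.

```java
import java.util.regex.Pattern;

final class ControlCharFilter {
    // The three control-character ranges listed above. Tab (0x09),
    // line feed (0x0A) and carriage return (0x0D) are deliberately kept.
    private static final Pattern INVALID =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    // Remove every character in the three ranges from the input.
    static String strip(String text) {
        return INVALID.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // 0x1E is a control character that XML does not allow.
        String dirty = "card" + (char) 0x1E + " number\t42\n";
        // prints "true"
        System.out.println(strip(dirty).equals("card number\t42\n"));
    }
}
```

This sketch only covers the three control-character ranges. Unpaired surrogates, FFFE, and FFFF are also outside the allowed range and would need a separate check.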

Second type of characters
For the second type, there are five characters in total, as shown below:
Character              HTML entity    Character code
Ampersand (&)          &amp;          &#38;
Single quote (')       &apos;         &#39;
Double quote (")       &quot;         &#34;
Greater-than sign (>)  &gt;           &#62;
Less-than sign (<)     &lt;           &#60;
We only need to replace these five characters.
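
A minimal Java sketch of this replacement, assuming the named entities from the table are used. The class name XmlEscaper and the method name escape are hypothetical. The input is scanned once, character by character, so an ampersand that is part of an entity just written is never escaped a second time.

```java
final class XmlEscaper {
    // Replace each of the five markup characters with its named entity.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                case '>':  out.append("&gt;");   break;
                case '<':  out.append("&lt;");   break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a &lt; b &amp;&amp; &quot;c&quot; &gt; &apos;d&apos;
        System.out.println(escape("a < b && \"c\" > 'd'"));
    }
}
```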

Related code:
 
The Replace method of the Regex class in .NET can be used to replace the characters in the three ranges in a string, for example:
// requires: using System.Text.RegularExpressions;
string content = "as fas fasfadfasdfasdf <234234546456";
// Replace every character in the three invalid ranges with "*"
content = Regex.Replace(content, "[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "*");
Response.Write(content);
 
PowerBuilder 8 (PB8) can filter the characters in this range as follows:
string content = "as fas fasfadfasdfasdf <234234546456"
long vi, vpos, vlen
int i_count_eliminate = 30
char i_spechar_eliminate[] = {"~001", "~002", &
    "~003", "~004", "~005", "~006", "~007", &
    "~008", "~011", "~012", "~014", "~015", &
    "~016", "~017", "~018", "~019", "~020", &
    "~021", "~022", "~023", "~024", "~025", &
    "~026", "~027", "~028", "~029", "~030", &
    "~031", "", ""} // The characters to be eliminated, which will be replaced with an empty string
for vi = 1 to i_count_eliminate
    vpos = 1
    vlen = lenw(i_spechar_eliminate[vi])
    do while true
        // Find the next occurrence of the current character
        vpos = posw(content, i_spechar_eliminate[vi], vpos)
        if vpos < 1 then exit
        // Remove it from the string
        content = replacew(content, vpos, vlen, "")
    loop
next
 
In C++ with the STL string class, this can be handled as follows:
#include <string>

// Escape the five XML markup characters and drop the control
// characters that XML does not allow.
std::string filter_xml_marks(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.length(); i++)
    {
        const char c = in[i];
        if (c == '&')
        {
            out += "&amp;";
            continue;
        }
        else if (c == '\'')
        {
            out += "&apos;";
            continue;
        }
        else if (c == '\"')
        {
            out += "&quot;";
            continue;
        }
        else if (c == '<')
        {
            out += "&lt;";
            continue;
        }
        else if (c == '>')
        {
            out += "&gt;";
            continue;
        }
        // Skip the control characters outside the XML character range
        else if ((c >= 0x00 && c <= 0x08) || (c >= 0x0b && c <= 0x0c) || (c >= 0x0e && c <= 0x1f))
            continue;

        out += c;
    }

    return out;
}

xmlcheck counts the invalid XML characters in an XML file.

Usage: xmlcheck FILENAME

import java.io.*;

public class xmlcheck {

    /**
     * @author lxn
     */
    public static void main(String[] args) throws IOException {

        if (args.length == 0)
        {
            System.out.println("Usage: xmlcheck FILENAME");
            return;
        }

        File xmlFile = new File(args[0]);
        if (!xmlFile.exists())
        {
            System.out.println("File does not exist");
            return;
        }

        // Open the XML file (FileReader uses the platform default encoding)
        BufferedReader in = new BufferedReader(new FileReader(xmlFile));
        String s;
        StringBuilder xmlSb = new StringBuilder();
        // Read the XML file into a string
        while ((s = in.readLine()) != null)
            xmlSb.append(s).append("\n");
        in.close();
        String xmlString = xmlSb.toString();
        // Sample call without invalid characters
        // int i = checkCharacterData("<?xml version=\"1.0\" encoding=\"GBK\"?><CC>card number</CC>");
        // Sample call with an invalid character (0x1E)
        // int i = checkCharacterData("<?xml version=\"1.0\" encoding=\"GBK\"?><CC>\u001e card number</CC>");

        int errorChar = checkCharacterData(xmlString);
        System.out.println("This XML file contains " + errorChar + " invalid character(s).");
    }

    // Count the characters in the string that are not valid in XML
    public static int checkCharacterData(String text) {
        int errorChar = 0;
        if (text == null) {
            return errorChar;
        }
        char[] data = text.toCharArray();
        for (int i = 0, len = data.length; i < len; i++) {
            char c = data[i];
            int result = c;
            // First check whether the code unit is in the surrogate blocks.
            // A supplementary character is encoded as two code units:
            // the first comes from the high-surrogate range (0xD800 to 0xDBFF),
            // the second from the low-surrogate range (0xDC00 to 0xDFFF).
            if (result >= 0xD800 && result <= 0xDBFF) {
                // Decode the surrogate pair
                int high = c;
                if (i + 1 < len && data[i + 1] >= 0xDC00 && data[i + 1] <= 0xDFFF) {
                    int low = data[i + 1];
                    // The algorithm defined by Unicode maps a valid pair to the
                    // supplementary range 0x10000 to 0x10FFFF; isXMLCharacter
                    // then checks that the result lies in that range.
                    result = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
                    i++;
                }
                // Otherwise the high surrogate is unpaired: result stays in the
                // surrogate range and is counted as invalid below.
            }
            if (!isXMLCharacter(result)) {
                errorChar++;
            }
        }
        return errorChar;
    }

    private static boolean isXMLCharacter(int c) {
        // Check the code point against the character ranges of the XML specification
        if (c <= 0xD7FF) {
            if (c >= 0x20) return true;
            else {
                if (c == '\n') return true;
                if (c == '\r') return true;
                if (c == '\t') return true;
                return false;
            }
        }
        if (c < 0xE000) return false;
        if (c <= 0xFFFD) return true;
        if (c < 0x10000) return false;
        if (c <= 0x10FFFF) return true;
        return false;
    }

}
