Recently we received a requirement to extract the data stored in InfoPath forms. There is a catch in this process: files uploaded through InfoPath's file attachment control have extra data added to them by the control, which corrupts the binary layout of the original file. A program with poor fault tolerance will fail on such a file. For example, if I write an attached Word file to disk and open it, Word 2003 reports an error outright, while Word 2007 reports the error and then repairs the file automatically. Either way, this is not behavior we want.
Here we go!
First, find out what data the naughty InfoPath has put into the file. Write a program that compares the unprocessed file with the processed file byte by byte. When a byte does not match, it scans ahead in the processed file for the next byte that does.
// Requires: using System; using System.IO;
byte[] src = File.ReadAllBytes(@"C:\Source"); // the original, unprocessed file
byte[] dst = File.ReadAllBytes(@"C:\dest");   // the file after InfoPath processed it
int destPtr = 0;
for (int srcPtr = 0; srcPtr < src.Length && destPtr < dst.Length; srcPtr++, destPtr++)
{
    if (dst[destPtr] != src[srcPtr])
    {
        Console.WriteLine("Position {0} of the source file does not match position {1} of the target file. The mismatched byte is {2}.", srcPtr, destPtr, dst[destPtr]);
        // Skip ahead in the target until a byte equal to the current source byte turns up.
        for (; destPtr < dst.Length; destPtr++)
        {
            if (dst[destPtr] == src[srcPtr])
            {
                Console.WriteLine("Found the corresponding byte at position {0}.", destPtr);
                Console.Read();
                break;
            }
        }
    }
    else
    {
        Console.WriteLine("Byte {0} of the source matches byte {1} of the target.", srcPtr, destPtr);
    }
}
Console.Read();
After testing, I found that the new file is 58 bytes larger than the old one. So my guess was right: InfoPath really is tampering with the file!
Let's look at the results again:
Fortunately, the data is only added at the beginning of the file, as a header. What remains is to analyze the format of that header. Headers like this are generally variable-length, so once the format is known we can compute the header length dynamically and recover the actual file.
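The observation that the extra bytes sit only at the front can be checked more directly than with a byte-by-byte walk. The following is a minimal sketch, not the original program: the class name `PrefixCheck` and the method `ExtraPrefixLength` are made up for illustration, and the sample bytes in `Main` are in-memory stand-ins for the two files.

```csharp
using System;

static class PrefixCheck
{
    // Returns the number of extra leading bytes in 'stored' if 'stored' ends with
    // an exact copy of 'original'; returns -1 otherwise.
    public static int ExtraPrefixLength(byte[] original, byte[] stored)
    {
        int extra = stored.Length - original.Length;
        if (extra < 0) return -1;
        for (int i = 0; i < original.Length; i++)
        {
            if (stored[extra + i] != original[i]) return -1;
        }
        return extra;
    }

    static void Main()
    {
        // Hypothetical data: four original bytes, and the same bytes with four bytes in front.
        byte[] original = { 0xD0, 0xCF, 0x11, 0xE0 };
        byte[] stored = { 0xC7, 0x49, 0x46, 0x41, 0xD0, 0xCF, 0x11, 0xE0 };
        Console.WriteLine(ExtraPrefixLength(original, stored)); // prints 4
    }
}
```

For the two files in this article, a result of 58 would confirm that nothing was changed apart from the added header.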
I extracted the 58 bytes and decoded them as Unicode, which revealed the file name of the uploaded file. That neatly explains something else: the InfoPath XML file has no node holding the file name, yet InfoPath can still display it.
Looking through the official InfoPath blog article http://blogs.msdn.com/infopath/archive/2004/03/18/92221.aspx
I found this passage:
· BYTE[4]: Signature (based on the signature for PNG):
(Decimal) 199 73 70 65
(Hexadecimal) C7 49 46 41
(ASCII C Notation) \307 I F A
The first byte is chosen as a non-ASCII value to reduce the probability that a text file may be misrecognized as a file attachment. The rest identifies the file as an InfoPath file attachment.
· DWORD: Size of the header
· DWORD: IP version
· DWORD: dwReserved
· DWORD: File size
· DWORD: Size of file name buffer
· File name buffer: variable size
Note that the fixed part of the header has six fields: the four-byte signature in BYTE[4], followed by five DWORDs. A DWORD is a double word, and one word is two bytes, so a DWORD is four bytes. Now the problem is clear. The first 4 × (1 + 4) = 20 bytes hold the signature and the first four DWORDs, and the size of the file name buffer is stored in bytes 20 through 23, which makes the fixed part 24 bytes long.
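To make the layout concrete, here is a minimal sketch that reads every field of the header. The class `AttachmentHeader` and its member names are hypothetical. The sketch also rests on two assumptions: that the DWORDs are little-endian, and that the last DWORD counts two-byte characters. The second assumption matches the 58 bytes found earlier, since 24 + 2 × 17 = 58.

```csharp
using System;
using System.Text;

// Hypothetical container for the fields named in the quoted passage.
sealed class AttachmentHeader
{
    public const int FixedSize = 24; // 4-byte signature + five 4-byte DWORDs

    public uint HeaderSize, Version, Reserved, FileSize, NameChars;
    public string FileName = "";

    // Number of bytes that precede the real file content.
    public int TotalLength { get { return FixedSize + (int)NameChars * 2; } }

    public static AttachmentHeader Parse(byte[] data)
    {
        if (data.Length < FixedSize)
            throw new ArgumentException("Too short to hold the fixed header.");
        if (data[0] != 0xC7 || data[1] != 0x49 || data[2] != 0x46 || data[3] != 0x41)
            throw new ArgumentException("Signature C7 49 46 41 not found.");

        // BitConverter follows the machine's byte order; little-endian is assumed here.
        var h = new AttachmentHeader
        {
            HeaderSize = BitConverter.ToUInt32(data, 4),
            Version = BitConverter.ToUInt32(data, 8),
            Reserved = BitConverter.ToUInt32(data, 12),
            FileSize = BitConverter.ToUInt32(data, 16),
            NameChars = BitConverter.ToUInt32(data, 20),
        };
        // Compare before doubling so a corrupt length cannot overflow.
        if (h.NameChars > (uint)(data.Length - FixedSize) / 2)
            throw new ArgumentException("File name buffer runs past the end of the data.");

        h.FileName = Encoding.Unicode.GetString(data, FixedSize, (int)h.NameChars * 2).TrimEnd('\0');
        return h;
    }

    static void Main()
    {
        // Hypothetical sample: the header of a 3-byte file named "a.doc".
        byte[] name = Encoding.Unicode.GetBytes("a.doc\0"); // 6 characters, 12 bytes
        byte[] data = new byte[FixedSize + name.Length + 3];
        new byte[] { 0xC7, 0x49, 0x46, 0x41 }.CopyTo(data, 0);
        BitConverter.GetBytes(3u).CopyTo(data, 16); // file size
        BitConverter.GetBytes(6u).CopyTo(data, 20); // name length in characters
        name.CopyTo(data, FixedSize);

        AttachmentHeader h = Parse(data);
        Console.WriteLine("{0}: content starts at offset {1}", h.FileName, h.TotalLength);
        // prints "a.doc: content starts at offset 36"
    }
}
```

Reading the file size field as well gives a cheap sanity check: if the layout is understood correctly, it should equal the number of bytes left after the header.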
Okay, now we have to find the variable-length header and remove it from our file! The real content starts at offset 24 + the length of the file name buffer in bytes.
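// Requires: using System; using System.Text;
// The DWORD at offset 20 counts characters; each one takes 2 bytes in the buffer.
// Reading the whole DWORD (little-endian) also handles names longer than 255 characters.
int nameBufferLen = BitConverter.ToInt32(dst, 20) * 2;
int headLength = 24 + nameBufferLen;
// Decode only the file name buffer and drop any trailing null terminator.
string name = Encoding.Unicode.GetString(dst, 24, nameBufferLen).TrimEnd('\0');
byte[] realContent = new byte[dst.Length - headLength];
Array.Copy(dst, headLength, realContent, 0, realContent.Length);
realContent is the content of our actual file!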
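The extraction step trusts that the data really starts with an attachment header. As a more defensive variant, here is a sketch that checks the signature and the buffer bounds first and reports failure instead of throwing. The names `AttachmentExtractor` and `TryExtract` are hypothetical, and the same little-endian and two-bytes-per-character assumptions apply.

```csharp
using System;
using System.Text;

static class AttachmentExtractor
{
    const int FixedSize = 24; // signature plus five DWORDs

    // Returns false (leaving the outputs empty) when 'stored' does not look like an
    // InfoPath attachment, so the caller can fall back to using the bytes unchanged.
    public static bool TryExtract(byte[] stored, out string fileName, out byte[] content)
    {
        fileName = "";
        content = new byte[0];

        if (stored.Length < FixedSize) return false;
        if (stored[0] != 0xC7 || stored[1] != 0x49 || stored[2] != 0x46 || stored[3] != 0x41)
            return false;

        // Name length in characters; compare before doubling to avoid overflow.
        uint nameChars = BitConverter.ToUInt32(stored, 20);
        if (nameChars > (uint)(stored.Length - FixedSize) / 2) return false;

        int nameBytes = (int)nameChars * 2;
        int headLength = FixedSize + nameBytes;

        fileName = Encoding.Unicode.GetString(stored, FixedSize, nameBytes).TrimEnd('\0');
        content = new byte[stored.Length - headLength];
        Array.Copy(stored, headLength, content, 0, content.Length);
        return true;
    }
}
```

When the call succeeds, the content array can be written to disk with File.WriteAllBytes, using the recovered file name if that is wanted.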
Enjoy InfoPath!