.NET file read/write and the character-encoding BOM
Problem description:
I recently ran into the following problem: a UTF-8 encoded XML file was uploaded to the server and then parsed with XmlDocument. Parsing failed with a file format error, and the message showed a stray "?" at the start of the content. Inspecting the uploaded stream showed that its first three bytes were EF, BB, and BF, which is the BOM of UTF-8. What is a BOM? How is it used correctly? How can problems like this be avoided? Read on.
1. What is a byte order mark (BOM)?
All data stored in a computer is binary, and it only becomes meaningful once you know the format it is stored in. A so-called text file is simply binary data that is converted to text through a specific character encoding. Most text editors can edit text files in several encodings. How does an editor know which encoding to apply to the raw bytes? One answer is the byte order mark: a short byte sequence at the very start of the file that identifies the encoding. In this article we use BOM for this term.
Below are the BOMs of the commonly used Unicode encodings:
UTF-8: EF BB BF
UTF-16 big endian: FE FF
UTF-16 little endian: FF FE
UTF-32 big endian: 00 00 FE FF
UTF-32 little endian: FF FE 00 00
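The list above can be turned into a small lookup over the leading bytes of a buffer. The following is only a sketch: the names BomSniffer, DetectBom, and StartsWith are made up for illustration and are not part of .NET. It also assumes that a buffer starting with FF FE 00 00 is UTF-32 rather than UTF-16 text that happens to begin with a NUL character, which is ambiguous in principle.

```csharp
using System;

public static class BomSniffer
{
    // Returns a label for the BOM found at the start of the buffer,
    // or null when no known BOM matches.
    // UTF-32 little endian (FF FE 00 00) is tested before UTF-16 little
    // endian (FF FE) because the shorter mark is a prefix of the longer one.
    public static string DetectBom(byte[] data)
    {
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32 big endian";
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32 little endian";
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (StartsWith(data, 0xFE, 0xFF)) return "UTF-16 big endian";
        if (StartsWith(data, 0xFF, 0xFE)) return "UTF-16 little endian";
        return null;
    }

    // True when data begins with exactly the given prefix bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

For example, DetectBom(new byte[] { 0xEF, 0xBB, 0xBF, 0x61 }) returns "UTF-8", and a buffer that starts directly with text bytes returns null.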
2. The Encoding class and the BOM in .NET
In .NET we usually obtain an encoding object from the static properties of the Encoding class (Encoding.UTF8, Encoding.Unicode, and so on). Encodings obtained this way emit a BOM by default, provided the encoding has one.
To disable the BOM for a given encoding, construct the encoding object yourself:
// requires: using System.Text;
Encoding utf8NoBom = new UTF8Encoding(false);
Encoding utf16NoBom = new UnicodeEncoding(false, false);
Encoding utf32NoBom = new UTF32Encoding(false, false);
The GetPreamble method of the Encoding class returns the BOM that the current encoding emits (an empty array if it emits none).
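As a quick illustration of GetPreamble, the sketch below prints the preamble of a few encodings as hex. PreambleDemo and PreambleHex are illustrative names introduced here, not library members.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Formats an encoding's preamble (its BOM) as hex, or "<none>" when empty.
    public static string PreambleHex(Encoding enc)
    {
        byte[] bom = enc.GetPreamble();
        return bom.Length == 0 ? "<none>" : BitConverter.ToString(bom).Replace("-", " ");
    }

    public static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));             // EF BB BF
        Console.WriteLine(PreambleHex(new UTF8Encoding(false)));   // <none>
        Console.WriteLine(PreambleHex(Encoding.Unicode));          // FF FE
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode)); // FE FF
        Console.WriteLine(PreambleHex(Encoding.UTF32));            // FF FE 00 00
    }
}
```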
3. File read/write and the BOM
When writing text, the default encoding of the StreamWriter class and of the File.WriteAllText method is UTF-8 without a BOM.
Of course, another encoding can be passed to the constructor or method, built in the same way as described above. For example:
using System;
using System.IO;
using System.Text;

class Program
{
    public static void Main()
    {
        Encoding utf32BigBom = new UTF32Encoding(true, true);     // big endian, with BOM
        Encoding utf32LitBom = new UTF32Encoding(false, true);    // little endian, with BOM
        Encoding utf32LitNoBom = new UTF32Encoding(false, false); // little endian, no BOM
        var content = "abcde";
        WriteAndPrint(content, utf32BigBom);
        WriteAndPrint(content, utf32LitBom);
        WriteAndPrint(content, utf32LitNoBom);
    }

    static void WriteAndPrint(string content, Encoding enc)
    {
        var path = Path.GetTempFileName();
        File.WriteAllText(path, content, enc);
        PrintBytes(File.ReadAllBytes(path));
    }

    static void PrintBytes(byte[] bytes)
    {
        if (bytes == null || bytes.Length == 0)
        {
            Console.WriteLine("<no value>");
            return; // nothing to print; also avoids enumerating a null array
        }
        foreach (var b in bytes)
            Console.Write("{0:X2} ", b);
        Console.WriteLine();
    }
}
Output:
00 00 FE FF 00 00 00 61 00 00 00 62 00 00 00 63 00 00 00 64 00 00 00 65
FF FE 00 00 61 00 00 00 62 00 00 00 63 00 00 00 64 00 00 00 65 00 00 00
61 00 00 00 62 00 00 00 63 00 00 00 64 00 00 00 65 00 00 00
As the output shows, 00 00 FE FF is the BOM of UTF-32 big endian, FF FE 00 00 is the BOM of UTF-32 little endian, and the third row is the raw UTF-32 little endian data without a BOM.
When reading text, a StreamReader constructed from a file path or a Stream object automatically uses the BOM to determine the character encoding. You can also specify an encoding manually. This matters most for text that has no BOM and is not UTF-8, which cannot be read correctly unless the encoding is specified.
File.ReadAllText behaves the same way. Readers who look at the source of File.ReadAllText in Reflector may notice that it reads the file with a UTF-8 StreamReader. What it actually calls is this StreamReader constructor:
public StreamReader(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks, int bufferSize) { /* content omitted */ }
Although a specific encoding is passed in, the detectEncodingFromByteOrderMarks argument is true, so StreamReader still detects the BOM automatically and reads the file accordingly. The encoding that was passed in is used only when no BOM is found.
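To see the detection at work, the sketch below opens a file with a fallback encoding and BOM detection enabled, then reports which encoding the reader ended up using. DetectDemo and ReadWithDetection are hypothetical names used only for this illustration. It relies on the three-argument StreamReader(string, Encoding, bool) constructor and the CurrentEncoding property.

```csharp
using System;
using System.IO;
using System.Text;

public static class DetectDemo
{
    // Reads a file with BOM detection enabled and reports the encoding used.
    // The fallback encoding applies only when the file has no recognizable BOM.
    public static string ReadWithDetection(string path, Encoding fallback, out Encoding used)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding reflects the detected encoding only after the first read.
            used = reader.CurrentEncoding;
            return text;
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        // UTF-16 little endian with a BOM (FF FE).
        File.WriteAllText(path, "abc", Encoding.Unicode);

        Encoding used;
        string text = ReadWithDetection(path, Encoding.UTF8, out used);
        Console.WriteLine("{0} read as {1}", text, used.WebName);
        File.Delete(path);
    }
}
```

Even though UTF-8 is passed as the fallback, the text comes back intact because the FF FE mark switches the reader to UTF-16.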
Code:
using System;
using System.IO;
using System.Text;

class Program
{
    public static void Main()
    {
        var path1 = Path.GetTempFileName();
        var path2 = Path.GetTempFileName();
        string content = "abc";
        // Write the file with the default encoding: UTF-8 without a BOM.
        File.WriteAllText(path1, content);
        // Write the file as UTF-8 with a BOM.
        File.WriteAllText(path2, content, Encoding.UTF8);
        PrintBytes(File.ReadAllBytes(path1));
        PrintBytes(File.ReadAllBytes(path2));
        Console.WriteLine(File.ReadAllText(path1));
        Console.WriteLine(File.ReadAllText(path2));
    }

    static void PrintBytes(byte[] bytes)
    {
        foreach (var b in bytes)
            Console.Write("{0:X2} ", b);
        Console.WriteLine();
    }
}
Output:
61 62 63
EF BB BF 61 62 63
abc
abc
The first file has no BOM, yet it is read without error because the default encoding is UTF-8. Other encodings are not so forgiving.
For example, the following code uses UTF-16 encoding:
using System;
using System.IO;
using System.Text;

class Program
{
    public static void Main()
    {
        var path1 = Path.GetTempFileName();
        var path2 = Path.GetTempFileName();
        string content = "abc";
        // Write file 1 as UTF-16 with a BOM.
        File.WriteAllText(path1, content, Encoding.Unicode);
        // Write file 2 as UTF-16 without a BOM.
        File.WriteAllText(path2, content, new UnicodeEncoding(false, false));
        PrintBytes(File.ReadAllBytes(path1));
        PrintBytes(File.ReadAllBytes(path2));
        // File 1: the BOM is detected automatically.
        string c1 = File.ReadAllText(path1);
        // File 2 has no BOM, so the default UTF-8 is actually used to read it.
        string c2 = File.ReadAllText(path2);
        // File 2 has no BOM; specify UTF-16 explicitly.
        string c3 = File.ReadAllText(path2, Encoding.Unicode);
        ShowContent(c1);
        ShowContent(c2);
        ShowContent(c3);
    }

    static void ShowContent(string content)
    {
        Console.WriteLine("Number of characters read: {0} content: {1}", content.Length, content);
    }

    static void PrintBytes(byte[] bytes)
    {
        foreach (var b in bytes)
            Console.Write("{0:X2} ", b);
        Console.WriteLine();
    }
}
Output:
FF FE 61 00 62 00 63 00 // file 1 is UTF-16 with BOM
61 00 62 00 63 00 // file 2 is UTF-16 without BOM
Number of characters read: 3 content: abc // file 1 read with automatic detection
Number of characters read: 6 content: a // file 2 read with automatic detection
Number of characters read: 3 content: abc // file 2 read with UTF-16 specified
Look at the fourth line of the output. Because the UTF-16 file without a BOM is read as UTF-8, every byte, including each 00 byte, is decoded as a separate character, so the original three characters are read as six (a, NUL, b, NUL, c, NUL). The displayed content stops after "a", presumably because of the NUL character that follows it.
4. How to remove the BOM
Sometimes we need to work with the binary data of a text, which means obtaining the byte array of the whole text. When the raw bytes are read, the BOM is included at the beginning, and the BOM length differs from encoding to encoding (some encodings have no BOM at all). So we need a way to filter the BOM out.
Once you know the Encoding.GetPreamble method (mentioned earlier), this is not hard.
Three approaches are shown here, covering the common scenarios.
The first returns the byte array with the BOM removed.
The second moves the stream position past the BOM, so that subsequent stream operations work directly on the bytes of the characters.
The third uses StreamReader (with detectEncodingFromByteOrderMarks set to true) to detect the BOM automatically and skip it.
using System;
using System.IO;
using System.Linq;
using System.Text;

class Program
{
    public static void Main()
    {
        var path = Path.GetTempFileName();
        // The last character is U+4E00, which is E4 B8 80 in UTF-8.
        File.WriteAllText(path, "a123一", Encoding.UTF8);
        PrintBytes(File.ReadAllBytes(path));

        // 1: get the byte array without the BOM
        PrintBytes(GetBytesWithoutBom(path, Encoding.UTF8));

        // 2: move the stream position past the BOM
        using (Stream stream = File.OpenRead(path))
        {
            SkipBom(stream, Encoding.UTF8);
            int data;
            while ((data = stream.ReadByte()) != -1)
                Console.Write("{0:X2} ", data);
            Console.WriteLine();
        }

        // 3: let StreamReader detect and skip the BOM
        using (Stream stream = File.OpenRead(path))
        {
            StreamReader reader = new StreamReader(stream, Encoding.UTF8, true);
            char[] cs = new char[64];
            StringBuilder sb = new StringBuilder();
            int len = 0;
            while ((len = reader.Read(cs, 0, cs.Length)) > 0)
            {
                sb.Append(cs, 0, len);
            }
            string str = sb.ToString();
            // GetBytes encodes the characters only; it does not emit a BOM.
            byte[] bs = Encoding.UTF8.GetBytes(str);
            PrintBytes(bs);
        }
    }

    static byte[] GetBytesWithoutBom(string path, Encoding enc)
    {
        // Uses LINQ. Assumes the file really starts with enc's BOM.
        return File.ReadAllBytes(path).Skip(enc.GetPreamble().Length).ToArray();
    }

    static void SkipBom(Stream stream, Encoding enc)
    {
        // Assumes the stream really starts with enc's BOM.
        stream.Seek(enc.GetPreamble().Length, SeekOrigin.Begin);
    }

    static void PrintBytes(byte[] bytes)
    {
        foreach (var b in bytes)
            Console.Write("{0:X2} ", b);
        Console.WriteLine();
    }
}
Output:
EF BB BF 61 31 32 33 E4 B8 80
61 31 32 33 E4 B8 80
61 31 32 33 E4 B8 80
61 31 32 33 E4 B8 80
(All results are correct)
For the problem described at the beginning of the article, the best solution is to use StreamReader to read the string from the file upload stream, so that the BOM is consumed during decoding, and then parse that string with XmlDocument.
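A minimal sketch of that approach follows. XmlUploadDemo and LoadXml are illustrative names, and the MemoryStream in Main merely stands in for the real upload stream, whose exact type the article does not state. The sketch assumes the upload is UTF-8 unless a BOM says otherwise.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlUploadDemo
{
    // Decodes the stream with StreamReader, which consumes any BOM,
    // then parses the resulting string with XmlDocument.
    public static XmlDocument LoadXml(Stream upload)
    {
        using (var reader = new StreamReader(upload, Encoding.UTF8, true))
        {
            string xml = reader.ReadToEnd();
            var doc = new XmlDocument();
            doc.LoadXml(xml);
            return doc;
        }
    }

    public static void Main()
    {
        // Simulate an uploaded UTF-8 file that starts with EF BB BF.
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes("<root><item>abc</item></root>");
        using (var upload = new MemoryStream())
        {
            upload.Write(bom, 0, bom.Length);
            upload.Write(body, 0, body.Length);
            upload.Position = 0;

            XmlDocument doc = LoadXml(upload);
            Console.WriteLine(doc.DocumentElement.Name); // root
        }
    }
}
```

Passing the stream directly to XmlDocument.Load(Stream) is another option, since that overload performs its own encoding detection. The string-based route shown here matches the fix the article describes.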