When we use System.IO.StreamReader to read TXT files containing Chinese characters, we often get garbled text (StreamWriter has a similar problem when writing text files). The reason is simple: the encoding of the file does not match the encoding used by the StreamReader/StreamWriter. To solve this problem, I wrote a class that obtains the encoding of a text file, so that we can create a matching StreamReader or StreamWriter and read or write without garbled characters.
The principle is simple. When a text editor (such as the Notepad that comes with XP) saves a text file in an encoding other than the system default (GB2312 on a Chinese system), it adds a specific "byte order mark" (BOM) at the beginning of the file, similar to the "MZ" header of a PE file. The BOM therefore tells us which encoding was used when the file was generated. Programs such as Notepad do not display the BOM, but it can be read when the file is read byte by byte through a stream. My TxtFileEncoding class uses this BOM "file header" to determine the encoding used when the TXT file was generated.
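Before the class itself, here is a small sketch for inspecting the BOM bytes that .NET associates with each encoding. It relies on the standard Encoding.GetPreamble() method; the names BomDemo and PreambleHex are made up for this illustration and are not part of the original class.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Formats an encoding's BOM (its "preamble") as hyphen-separated hex bytes.
    // Hypothetical helper, used only for this illustration.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine("Unicode (UTF-16 LE): " + PreambleHex(Encoding.Unicode));          // FF-FE
        Console.WriteLine("Big-endian Unicode:  " + PreambleHex(Encoding.BigEndianUnicode)); // FE-FF
        Console.WriteLine("UTF-8:               " + PreambleHex(Encoding.UTF8));             // EF-BB-BF
    }
}
```

These are the same three byte sequences that the class below checks for.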
// Author: Yuan Xiaohui
// 2005-8-8
////////////
using System;
using System.Text;
using System.IO;
namespace Farproc.Text
{
    /// <summary>
    /// Obtains the encoding of a text file.
    /// </summary>
    public class TxtFileEncoding
    {
        public TxtFileEncoding()
        {
            //
            // TODO: add the constructor logic here
            //
        }
        /// <summary>
        /// Obtains the encoding of a text file. If no valid BOM is found at the start of the file, Encoding.Default is returned.
        /// </summary>
        /// <param name="fileName">The file name.</param>
        /// <returns>The encoding indicated by the BOM, or Encoding.Default.</returns>
        public static Encoding GetEncoding(string fileName)
        {
            return GetEncoding(fileName, Encoding.Default);
        }
        /// <summary>
        /// Obtains the encoding of a text file stream. If no valid BOM is found, Encoding.Default is returned.
        /// </summary>
        /// <param name="stream">The text file stream.</param>
        /// <returns>The encoding indicated by the BOM, or Encoding.Default.</returns>
        public static Encoding GetEncoding(FileStream stream)
        {
            return GetEncoding(stream, Encoding.Default);
        }
        /// <summary>
        /// Obtains the encoding of a text file.
        /// </summary>
        /// <param name="fileName">The file name.</param>
        /// <param name="defaultEncoding">The default encoding, returned when no valid BOM is found at the start of the file.</param>
        /// <returns>The encoding indicated by the BOM, or defaultEncoding.</returns>
        public static Encoding GetEncoding(string fileName, Encoding defaultEncoding)
        {
            // Open read-only; the using block closes the stream even if an exception is thrown
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            {
                return GetEncoding(fs, defaultEncoding);
            }
        }
        /// <summary>
        /// Obtains the encoding of a text file stream.
        /// </summary>
        /// <param name="stream">The text file stream.</param>
        /// <param name="defaultEncoding">The default encoding, returned when no valid BOM is found at the start of the file.</param>
        /// <returns>The encoding indicated by the BOM, or defaultEncoding.</returns>
        public static Encoding GetEncoding(FileStream stream, Encoding defaultEncoding)
        {
            Encoding targetEncoding = defaultEncoding;
            if (stream != null && stream.Length >= 2)
            {
                // Holds the first 4 bytes of the file stream
                byte byte1 = 0;
                byte byte2 = 0;
                byte byte3 = 0;
                byte byte4 = 0;
                // Save the current position so it can be restored afterwards
                long origPos = stream.Position;
                stream.Seek(0, SeekOrigin.Begin);
                int nByte = stream.ReadByte();
                byte1 = Convert.ToByte(nByte);
                byte2 = Convert.ToByte(stream.ReadByte());
                if (stream.Length >= 3)
                {
                    byte3 = Convert.ToByte(stream.ReadByte());
                }
                if (stream.Length >= 4)
                {
                    byte4 = Convert.ToByte(stream.ReadByte());
                }
                // Determine the encoding from the first bytes of the file stream
                // Unicode    {0xFF, 0xFE};
                // BE-Unicode {0xFE, 0xFF};
                // UTF8       {0xEF, 0xBB, 0xBF};
                if (byte1 == 0xFE && byte2 == 0xFF) // UnicodeBE
                {
                    targetEncoding = Encoding.BigEndianUnicode;
                }
                if (byte1 == 0xFF && byte2 == 0xFE && byte3 != 0xFF) // Unicode
                {
                    targetEncoding = Encoding.Unicode;
                }
                if (byte1 == 0xEF && byte2 == 0xBB && byte3 == 0xBF) // UTF8
                {
                    targetEncoding = Encoding.UTF8;
                }
                // Restore the original position
                stream.Seek(origPos, SeekOrigin.Begin);
            }
            return targetEncoding;
        }
    }
}
Because neither GB2312 nor UTF-7 has a BOM, you need to specify a default encoding, which is returned when no valid BOM is found. If anyone knows how to tell whether a TXT file is encoded in GB2312 or UTF-7, please let me know. Because the methods are all static, you do not need to create an instance with new; just call them through the class name, which makes the class very simple to use.
using System;
using Farproc.Text;
using System.Text;
using System.IO;
namespace ConsoleApplication1
{
    /// <summary>
    /// Summary of Class1.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// Main entry point of the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            //
            // TODO: add code to start the application here
            //
            string fileName = @"E:\a.txt";
            // Generate a text file in the big-endian Unicode encoding
            StreamWriter sw = new StreamWriter(fileName, false, Encoding.BigEndianUnicode); // you can try other encodings, such as Encoding.GetEncoding("gb2312") or UTF8
            sw.Write("this is a string");
            sw.Close();
            // Read
            Encoding fileEncoding = TxtFileEncoding.GetEncoding(fileName, Encoding.GetEncoding("gb2312")); // get the encoding of the TXT file
            Console.WriteLine("this text file is encoded as: " + fileEncoding.EncodingName);
            StreamReader sr = new StreamReader(fileName, fileEncoding); // use this encoding to create the StreamReader
            // The alternative below lets the system detect the encoding automatically, but it does not tell us the file's encoding here:
            // sr.CurrentEncoding still reports Unicode (UTF-8) at this point, because detection only runs on the first read.
            // StreamReader sr = new StreamReader(fileName, true);
            // Console.WriteLine("this text file is encoded as: " + sr.CurrentEncoding.EncodingName);
            Console.WriteLine("the content of this text file is: " + sr.ReadToEnd());
            sr.Close();
            Console.ReadLine();
        }
    }
}
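On the open question of files without a BOM: the sketch below is not a way to tell GB2312 from UTF-7, and it is not part of the original class. It shows a different, commonly used heuristic for a related case, namely checking whether BOM-less bytes are valid UTF-8 by decoding them with a strict UTF8Encoding that throws on invalid input. The names Utf8Heuristic and LooksLikeUtf8 are hypothetical. Pure ASCII content passes the check too, so a positive result is only a hint.

```csharp
using System;
using System.Text;

public static class Utf8Heuristic
{
    // Returns true when data decodes as strict UTF-8 without errors.
    // Hypothetical helper: a heuristic, not a proof of the file's encoding.
    public static bool LooksLikeUtf8(byte[] data)
    {
        // Second argument: throw on invalid byte sequences instead of substituting '?'
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetCharCount(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] utf8Bytes = { 0xE4, 0xB8, 0xAD }; // the character U+4E2D encoded as UTF-8
        byte[] gb2312Bytes = { 0xD6, 0xD0 };     // the same character encoded as GB2312
        Console.WriteLine(LooksLikeUtf8(utf8Bytes));   // True
        Console.WriteLine(LooksLikeUtf8(gb2312Bytes)); // False
    }
}
```

When the check fails, falling back to the caller-supplied default encoding (as GetEncoding does) remains a reasonable choice.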
A string in .NET is always Unicode, so the question of encoding only applies to the TXT file itself. For a byte[], you can only convert it to a string, and from there to a byte[] in another encoding, if you already know its encoding. The one exception is when you read the entire TXT file through a stream into a byte[]: you can then determine the encoding from its first few bytes. For a fragment taken from the middle of a file, there is nothing we can do :)
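As a minimal sketch of that byte[] case, assuming the whole file has already been read into memory: detect the BOM from the first few bytes, decode to a string, then re-encode with another encoding. The names ByteArrayEncoding, Detect and Transcode are hypothetical. The BOM checks mirror the three sequences used by TxtFileEncoding, without its extra third-byte test for little-endian Unicode.

```csharp
using System;
using System.Text;

public static class ByteArrayEncoding
{
    // Returns the encoding indicated by a BOM at the start of data,
    // or defaultEncoding when no known BOM is present.
    public static Encoding Detect(byte[] data, Encoding defaultEncoding)
    {
        if (data == null || data.Length < 2) return defaultEncoding;
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return Encoding.UTF8;
        if (data[0] == 0xFE && data[1] == 0xFF) return Encoding.BigEndianUnicode;
        if (data[0] == 0xFF && data[1] == 0xFE) return Encoding.Unicode;
        return defaultEncoding;
    }

    // Decodes data with the detected encoding (skipping its BOM, if present)
    // and re-encodes the resulting string with the target encoding.
    public static byte[] Transcode(byte[] data, Encoding defaultEncoding, Encoding target)
    {
        if (data == null) throw new ArgumentNullException("data");
        Encoding source = Detect(data, defaultEncoding);
        byte[] bom = source.GetPreamble();
        bool hasBom = bom.Length > 0 && data.Length >= bom.Length;
        for (int i = 0; hasBom && i < bom.Length; i++)
        {
            if (data[i] != bom[i]) hasBom = false;
        }
        int offset = hasBom ? bom.Length : 0;
        string text = source.GetString(data, offset, data.Length - offset);
        return target.GetBytes(text);
    }

    public static void Main()
    {
        // Big-endian Unicode BOM followed by the letter 'A'
        byte[] input = { 0xFE, 0xFF, 0x00, 0x41 };
        byte[] output = Transcode(input, Encoding.Default, Encoding.UTF8);
        Console.WriteLine(BitConverter.ToString(output)); // 41
    }
}
```

Note that target.GetBytes does not write a BOM of its own; if the result is saved to a file, the preamble has to be written separately or the file written through a StreamWriter.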