[C#] Solving garbled characters when reading and writing TXT files containing Chinese characters

Last Update:2018-12-06 Source: Internet

Author: User

When we use System.IO.StreamReader to read TXT files containing Chinese characters, we often get garbled characters (StreamWriter has a similar problem when writing text files). The reason is simple: the encoding of the file does not match the encoding used by the StreamReader/StreamWriter. To solve this, I wrote a class that obtains the encoding of a text file, so that we can create a matching StreamReader and StreamWriter for reading and writing and avoid garbled characters.

The principle is simple. When a text editor (such as the Notepad that comes with XP) saves a text file in an encoding other than the system default (GB2312 on a Chinese system), it adds a specific byte order mark (BOM) at the beginning of the TXT file, similar to the "MZ" header of a PE file. The encoding used when the file was saved can therefore be determined from the BOM. The BOM is invisible in programs such as Notepad, but it can be read when the file is read byte by byte through a stream. My TxtFileEncoding class uses this BOM "file header" to determine the encoding used when the TXT file was generated.
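To make the BOM idea concrete, here is a minimal sketch that prints the byte sequences .NET uses as BOMs and then inspects the raw leading bytes of a file. The class name BomDemo and the helper ReadLeadingBytes are hypothetical names introduced for illustration only; they are not part of the author's class.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: returns up to the first `count` bytes of a file
    // as a hex string such as "EF-BB-BF".
    public static string ReadLeadingBytes(string path, int count)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var buffer = new byte[count];
            int read = fs.Read(buffer, 0, count);
            return BitConverter.ToString(buffer, 0, read);
        }
    }

    static void Main()
    {
        // Encoding.GetPreamble() returns the BOM that an encoding writes.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble()));             // EF-BB-BF
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF

        // Write a small UTF-8 file with a BOM, then look at its raw leading bytes.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "\u4E2D\u6587", new UTF8Encoding(true));
        Console.WriteLine(ReadLeadingBytes(path, 3)); // EF-BB-BF
        File.Delete(path);
    }
}
```

The text is written with Unicode escapes so that the sketch does not depend on the encoding of the source file itself.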

// Author: Yuan Xiaohui
// 2005-8-8
////////////

using System;
using System.Text;
using System.IO;

namespace Farproc.Text
{
    /// <summary>
    /// Gets the encoding (Encoding) of a text file.
    /// </summary>
    public class TxtFileEncoding
    {
        public TxtFileEncoding()
        {
            // TODO: add the constructor logic here
        }

        /// <summary>
        /// Gets the encoding of a text file. If no valid BOM can be found at the start of the file, Encoding.Default is returned.
        /// </summary>
        /// <param name="fileName">The file name.</param>
        /// <returns>The detected encoding, or Encoding.Default.</returns>
        public static Encoding GetEncoding(string fileName)
        {
            return GetEncoding(fileName, Encoding.Default);
        }

        /// <summary>
        /// Gets the encoding of a text file stream.
        /// </summary>
        /// <param name="stream">The text file stream.</param>
        /// <returns>The detected encoding, or Encoding.Default.</returns>
        public static Encoding GetEncoding(FileStream stream)
        {
            return GetEncoding(stream, Encoding.Default);
        }

        /// <summary>
        /// Gets the encoding of a text file.
        /// </summary>
        /// <param name="fileName">The file name.</param>
        /// <param name="defaultEncoding">The default encoding, returned when no valid BOM can be found at the start of the file.</param>
        /// <returns>The detected encoding, or defaultEncoding.</returns>
        public static Encoding GetEncoding(string fileName, Encoding defaultEncoding)
        {
            // The using block closes the stream even if detection throws
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            {
                return GetEncoding(fs, defaultEncoding);
            }
        }

        /// <summary>
        /// Gets the encoding of a text file stream.
        /// </summary>
        /// <param name="stream">The text file stream.</param>
        /// <param name="defaultEncoding">The default encoding, returned when no valid BOM can be found at the start of the file.</param>
        /// <returns>The detected encoding, or defaultEncoding.</returns>
        public static Encoding GetEncoding(FileStream stream, Encoding defaultEncoding)
        {
            Encoding targetEncoding = defaultEncoding;
            if (stream != null && stream.Length >= 2)
            {
                // Holds the first 4 bytes of the file stream
                byte byte1 = 0;
                byte byte2 = 0;
                byte byte3 = 0;
                byte byte4 = 0;

                // Save the current seek position, then rewind to the start
                long origPos = stream.Position;
                stream.Seek(0, SeekOrigin.Begin);

                int nByte = stream.ReadByte();
                byte1 = Convert.ToByte(nByte);
                byte2 = Convert.ToByte(stream.ReadByte());
                if (stream.Length >= 3)
                {
                    byte3 = Convert.ToByte(stream.ReadByte());
                }
                if (stream.Length >= 4)
                {
                    byte4 = Convert.ToByte(stream.ReadByte());
                }

                // Determine the encoding from the leading bytes of the file stream
                // Unicode (UTF-16 LE) = { 0xFF, 0xFE }
                // Big-endian Unicode (UTF-16 BE) = { 0xFE, 0xFF }
                // UTF-8 = { 0xEF, 0xBB, 0xBF }
                if (byte1 == 0xFE && byte2 == 0xFF) // UTF-16 BE
                {
                    targetEncoding = Encoding.BigEndianUnicode;
                }
                // The third-byte check is kept from the original code
                if (byte1 == 0xFF && byte2 == 0xFE && byte3 != 0xFF) // UTF-16 LE
                {
                    targetEncoding = Encoding.Unicode;
                }
                if (byte1 == 0xEF && byte2 == 0xBB && byte3 == 0xBF) // UTF-8
                {
                    targetEncoding = Encoding.UTF8;
                }

                // Restore the seek position
                stream.Seek(origPos, SeekOrigin.Begin);
            }
            return targetEncoding;
        }
    }
}

Because neither GB2312 nor UTF-7 uses a BOM, you need to specify a default encoding, which is returned when no valid BOM is found. If anyone knows how to tell a GB2312-encoded TXT file from a UTF-7-encoded one, please let me know. Because the class exposes only static methods, you do not need to create an instance with new; call the methods directly through the class name, which makes it very simple to use.
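The sketch below does not answer the GB2312-versus-UTF-7 question. It shows a different, related heuristic that can help when choosing the default encoding for BOM-less files: checking whether the bytes are well-formed UTF-8 by decoding them with a strict UTF8Encoding. The names Utf8Check and LooksLikeUtf8 are hypothetical, and this check is an addition for illustration, not part of the author's class.

```csharp
using System;
using System.Text;

static class Utf8Check
{
    // Hypothetical helper: returns true when the bytes form a well-formed UTF-8 sequence.
    public static bool LooksLikeUtf8(byte[] data)
    {
        // Second argument (throwOnInvalidBytes) = true: the decoder throws on
        // malformed input instead of silently substituting replacement characters.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u4E2D\u6587"); // "中文" as UTF-8
        byte[] gb2312Bytes = { 0xD6, 0xD0, 0xCE, 0xC4 };           // "中文" as GB2312
        Console.WriteLine(LooksLikeUtf8(utf8Bytes));   // True
        Console.WriteLine(LooksLikeUtf8(gb2312Bytes)); // False
    }
}
```

Pure ASCII content also passes this check, and some legacy-encoded byte sequences can happen to be valid UTF-8, so treat the result as a hint rather than proof.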

using System;
using Farproc.Text;
using System.Text;
using System.IO;

namespace ConsoleApplication1
{
    /// <summary>
    /// Summary of Class1.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// Main entry point of the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            // TODO: add code to start the application
            string fileName = @"E:\a.txt";

            // Generate a text file in big-endian Unicode encoding
            StreamWriter sw = new StreamWriter(fileName, false, Encoding.BigEndianUnicode); // you can try other encodings, such as Encoding.GetEncoding("gb2312") or Encoding.UTF8
            sw.Write("this is a string");
            sw.Close();

            // Read
            Encoding fileEncoding = TxtFileEncoding.GetEncoding(fileName, Encoding.GetEncoding("gb2312")); // get the encoding of the TXT file
            Console.WriteLine("this text file is encoded as: " + fileEncoding.EncodingName);
            StreamReader sr = new StreamReader(fileName, fileEncoding); // use this encoding to create the StreamReader

            // The alternative below lets StreamReader detect the encoding from the BOM automatically,
            // but sr.CurrentEncoding does not tell us the file's encoding at this point:
            // until the first read it is always Unicode (UTF-8).
            // StreamReader sr = new StreamReader(fileName, true);
            // Console.WriteLine("this text file is encoded as: " + sr.CurrentEncoding.EncodingName);

            Console.WriteLine("the content of this text file is: " + sr.ReadToEnd());
            sr.Close();
            Console.ReadLine();
        }
    }
}
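The commented-out alternative in the sample above relies on StreamReader's own BOM detection. As a minimal sketch, assuming the documented behavior that detection only happens on the first read, the reader can be made to inspect the BOM with a Peek call before CurrentEncoding is queried. The names ReaderDetection and DetectWithReader are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class ReaderDetection
{
    // Hypothetical helper: lets StreamReader detect the encoding from the BOM.
    // `fallback` is used when the file has no BOM.
    public static Encoding DetectWithReader(string path, Encoding fallback)
    {
        using (var sr = new StreamReader(path, fallback, true))
        {
            // Peek fills the reader's buffer, which is when the BOM is inspected;
            // before any read, CurrentEncoding still reports the fallback.
            sr.Peek();
            return sr.CurrentEncoding;
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "\u4E2D\u6587", Encoding.BigEndianUnicode);

        Encoding detected = DetectWithReader(path, Encoding.UTF8);
        Console.WriteLine(detected.CodePage); // 1201 (UTF-16 BE)

        File.Delete(path);
    }
}
```

As with the BOM check in TxtFileEncoding, a file without a BOM simply yields the fallback encoding.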

A string in .NET is always Unicode, so encoding detection only makes sense for TXT files. A byte[] can be converted to a string, and from there to a byte[] in another encoding, only if you already know its encoding. The one exception is to read the entire TXT file into a byte[] through a stream and then determine the encoding from its first few bytes. For an arbitrary fragment, there is nothing we can do :)
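A short sketch of the byte[] case, assuming the source encoding is already known to be big-endian Unicode: Encoding.Convert transcodes the bytes directly, and GetString turns them into a .NET string. The names Transcode and Utf16BeToUtf8 are hypothetical.

```csharp
using System;
using System.Text;

static class Transcode
{
    // Hypothetical helper: converts bytes known to be UTF-16 BE into UTF-8 bytes.
    public static byte[] Utf16BeToUtf8(byte[] utf16BeBytes)
    {
        return Encoding.Convert(Encoding.BigEndianUnicode, Encoding.UTF8, utf16BeBytes);
    }

    static void Main()
    {
        // "中文" as UTF-16 BE bytes (GetBytes does not add a BOM).
        byte[] source = Encoding.BigEndianUnicode.GetBytes("\u4E2D\u6587");
        Console.WriteLine(BitConverter.ToString(source)); // 4E-2D-65-87

        // Knowing the source encoding, decode to a string...
        string text = Encoding.BigEndianUnicode.GetString(source);
        Console.WriteLine(text.Length); // 2

        // ...or transcode straight to another encoding.
        byte[] utf8 = Utf16BeToUtf8(source);
        Console.WriteLine(BitConverter.ToString(utf8)); // E4-B8-AD-E6-96-87
    }
}
```

If the wrong source encoding is passed, the call still succeeds but produces garbled text, which is exactly the problem this article describes.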

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
