[C#] Fixing garbled text when reading and writing TXT files that contain Chinese characters

Author: Shaohui (all rights reserved). Date: 2005-8-8

When we read a TXT file containing Chinese characters with System.IO.StreamReader, the text often comes out garbled (writing text files with StreamWriter has a similar problem). The reason is simple: the encoding of the file and the encoding used by the StreamReader/StreamWriter do not match. To solve this, I wrote a class that detects the encoding of a text file, so that we can create a matching StreamReader or StreamWriter and read and write without garbled text.

The principle is simple. When a text editor (such as Notepad on Windows XP) saves a text file in an encoding other than the system default (GB2312 on Chinese systems), it adds a specific byte order mark (BOM) at the beginning of the file, much like the "MZ" header at the start of a PE file. A program reading the file can then use the BOM to determine which encoding the file was written in. The BOM is invisible when the file is opened in Notepad or a similar editor, but it can be read by reading the stream byte by byte. My TxtFileEncoding class uses this BOM "file header" to determine the encoding the TXT file was written in.
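To make the BOM visible, here is a minimal sketch that writes a small file and prints its first bytes in hex. The class name BomDemo, the helper HeadAsHex, and the temporary file are assumptions made for illustration; they are not part of the author's class.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Returns up to `count` leading bytes of a file as a hex string such as "FF-FE".
    public static string HeadAsHex(string path, int count)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            byte[] head = new byte[count];
            int read = fs.Read(head, 0, count);
            return BitConverter.ToString(head, 0, read);
        }
    }

    public static void Main()
    {
        // The preamble of an encoding is the BOM that writers put at the start of a file.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble()));             // EF-BB-BF

        string path = Path.GetTempFileName(); // hypothetical scratch file
        using (StreamWriter sw = new StreamWriter(path, false, Encoding.Unicode))
        {
            sw.Write("A");
        }

        // The BOM (FF FE) comes first, followed by "A" in little-endian UTF-16 (41 00).
        Console.WriteLine(HeadAsHex(path, 4)); // FF-FE-41-00

        File.Delete(path);
    }
}
```

These three byte sequences are the ones the detection code checks for.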

// Author: Shaohui
// 2005-8-8

using System;
using System.Text;
using System.IO;

namespace Farproc.Text
{

    /// <summary>
    /// Gets the encoding used by a text file.
    /// </summary>
    public class TxtFileEncoding
    {
        public TxtFileEncoding()
        {
            //
            // TODO: Add constructor logic here
            //
        }

        /// <summary>
        /// Gets the encoding of a text file. If no valid BOM is found at the head of the file, Encoding.Default is returned.
        /// </summary>
        /// <param name="fileName">File name.</param>
        /// <returns>The encoding indicated by the BOM, or Encoding.Default.</returns>
        public static Encoding GetEncoding(string fileName)
        {
            return GetEncoding(fileName, Encoding.Default);
        }

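        /// <summary>
        /// Gets the encoding of a text file stream. If no valid BOM is found, Encoding.Default is returned.
        /// </summary>
        /// <param name="stream">Text file stream.</param>
        /// <returns>The encoding indicated by the BOM, or Encoding.Default.</returns>
        public static Encoding GetEncoding(FileStream stream)
        {
            return GetEncoding(stream, Encoding.Default);
        }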

        /// <summary>
        /// Gets the encoding of a text file.
        /// </summary>
        /// <param name="fileName">File name.</param>
        /// <param name="defaultEncoding">Default encoding. Returned when no valid BOM can be read from the head of the file.</param>
        /// <returns>The encoding indicated by the BOM, or defaultEncoding.</returns>
        public static Encoding GetEncoding(string fileName, Encoding defaultEncoding)
        {
            // Open read-only; the using statement closes the stream even if detection throws.
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            {
                return GetEncoding(fs, defaultEncoding);
            }
        }

        /// <summary>
        /// Gets the encoding of a text file stream.
        /// </summary>
        /// <param name="stream">Text file stream.</param>
        /// <param name="defaultEncoding">Default encoding. Returned when no valid BOM can be read from the head of the file.</param>
        /// <returns>The encoding indicated by the BOM, or defaultEncoding.</returns>
        public static Encoding GetEncoding(FileStream stream, Encoding defaultEncoding)
        {
            Encoding targetEncoding = defaultEncoding;
            if (stream != null && stream.Length >= 2)
            {
                // Holds the first 4 bytes of the file stream
                byte byte1 = 0;
                byte byte2 = 0;
                byte byte3 = 0;
                byte byte4 = 0;

                // Save the current seek position (seeking 0 bytes from Current returns it unchanged)
                long origPos = stream.Seek(0, SeekOrigin.Current);
                stream.Seek(0, SeekOrigin.Begin);

                int nByte = stream.ReadByte();
                byte1 = Convert.ToByte(nByte);
                byte2 = Convert.ToByte(stream.ReadByte());
                if (stream.Length >= 3)
                {
                    byte3 = Convert.ToByte(stream.ReadByte());
                }
                if (stream.Length >= 4)
                {
                    byte4 = Convert.ToByte(stream.ReadByte());
                }

                // Determine the encoding from the first 4 bytes of the file stream
                // Unicode    {0xFF, 0xFE};
                // BE-Unicode {0xFE, 0xFF};
                // UTF8     = {0xEF, 0xBB, 0xBF};
                if (byte1 == 0xFE && byte2 == 0xFF) // UnicodeBE
                {
                    targetEncoding = Encoding.BigEndianUnicode;
                }
                if (byte1 == 0xFF && byte2 == 0xFE && byte3 != 0xFF) // Unicode
                {
                    targetEncoding = Encoding.Unicode;
                }
                if (byte1 == 0xEF && byte2 == 0xBB && byte3 == 0xBF) // UTF8
                {
                    targetEncoding = Encoding.UTF8;
                }

                // Restore the seek position
                stream.Seek(origPos, SeekOrigin.Begin);
            }
            return targetEncoding;
        }

}

}

Because neither GB2312 nor UTF7 files carry a BOM, you need to specify a default encoding, which is returned when no valid BOM is found. If anyone knows a way to tell GB2312 and UTF7 encoded TXT files apart, please let me know.

Because the methods are static, they are simple to call: use the class name directly, without creating an instance with new.
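For comparison, the same fallback idea can be sketched with the framework's own StreamReader constructor that takes both a fallback encoding and a BOM-detection flag. This is only an illustration, not the author's method: the names FallbackDemo and ReadWithFallback are made up here, and Encoding.ASCII stands in for the GB2312 default so that the sketch does not depend on which code pages the runtime provides. Note that CurrentEncoding is queried only after a read, because detection happens on the first read.

```csharp
using System;
using System.IO;
using System.Text;

public static class FallbackDemo
{
    // Reads the whole file, letting StreamReader look for a BOM and
    // use `fallback` when none is found. Returns the encoding actually used.
    public static Encoding ReadWithFallback(string path, Encoding fallback, out string text)
    {
        using (StreamReader sr = new StreamReader(path, fallback, true))
        {
            text = sr.ReadToEnd();      // BOM detection happens on the first read
            return sr.CurrentEncoding;  // meaningful only after reading
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName(); // hypothetical scratch file
        string text;

        // A file written with a BOM: the BOM wins over the fallback.
        File.WriteAllText(path, "abc", Encoding.BigEndianUnicode);
        Encoding withBom = ReadWithFallback(path, Encoding.ASCII, out text);
        Console.WriteLine(withBom.CodePage + " " + text); // 1201 abc

        // A file written without a BOM: the fallback encoding is used.
        File.WriteAllText(path, "abc", new UTF8Encoding(false));
        Encoding noBom = ReadWithFallback(path, Encoding.ASCII, out text);
        Console.WriteLine(noBom.CodePage + " " + text); // 20127 abc

        File.Delete(path);
    }
}
```

Code page 1201 is big-endian UTF-16 and 20127 is US-ASCII, so the second file falls through to the supplied default exactly as a BOM-less GB2312 file would.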

using System;
using Farproc.Text;
using System.Text;
using System.IO;

namespace ConsoleApplication1
{

    /// <summary>
    /// Summary description of Class1.
    /// </summary>
    class Class1
    {

        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            //
            // TODO: Add code here to start the application
            //
            string fileName = @"E:\a.txt";

            // Generate a text file in big-endian Unicode encoding
            StreamWriter sw = new StreamWriter(fileName, false, Encoding.BigEndianUnicode); // You can try other encodings, such as Encoding.GetEncoding("GB2312") or UTF8
            sw.Write("This is a string");
            sw.Close();

            // Read
            Encoding fileEncoding = TxtFileEncoding.GetEncoding(fileName, Encoding.GetEncoding("GB2312")); // Get the encoding of this TXT file
            Console.WriteLine("This text file is encoded as: " + fileEncoding.EncodingName);
            StreamReader sr = new StreamReader(fileName, fileEncoding); // Create a StreamReader with this encoding

            // The following approach lets the system detect the text file's encoding automatically,
            // but we cannot get the file's encoding from it: queried here, before anything has been read,
            // sr.CurrentEncoding is always Unicode (UTF-8)
            // StreamReader sr = new StreamReader(fileName, true);
            // Console.WriteLine("This text file is encoded as: " + sr.CurrentEncoding.EncodingName);

            Console.WriteLine("The content of this text file is: " + sr.ReadToEnd());
            sr.Close();
            Console.ReadLine();
        }

}

}

A .NET string is always Unicode, so encoding detection only applies to the bytes of a TXT file. A byte[] can be converted to a string, or to a byte[] in another encoding, only if you know its encoding. The one exception is a byte[] that holds an entire TXT file read in through a stream: its encoding can still be judged from the first few bytes. For fragments, there is nothing we can do.
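As a small illustration of converting a byte[] once its encoding is known, the sketch below uses Encoding.Convert. The class name ConvertDemo and the helper Reencode are assumptions made for this example.

```csharp
using System;
using System.Text;

public static class ConvertDemo
{
    // Re-encodes bytes from a known source encoding into a target encoding.
    public static byte[] Reencode(byte[] data, Encoding from, Encoding to)
    {
        return Encoding.Convert(from, to, data);
    }

    public static void Main()
    {
        string original = "\u4E2D\u6587"; // the two characters of the word "Chinese"

        // GetBytes produces the encoded characters only, without a BOM.
        byte[] utf16 = Encoding.Unicode.GetBytes(original);
        byte[] utf8 = Reencode(utf16, Encoding.Unicode, Encoding.UTF8);
        Console.WriteLine(BitConverter.ToString(utf8)); // E4-B8-AD-E6-96-87

        // Decoding with the right encoding restores the string ...
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == original);              // True
        // ... while guessing the wrong one yields different (garbled) text.
        Console.WriteLine(Encoding.BigEndianUnicode.GetString(utf16) == original); // False
    }
}
```

The last line shows why a fragment without a known encoding is a dead end: the bytes decode without error under the wrong encoding, just to the wrong characters.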


