Tutorial on correctly reading Chinese-encoded files in .NET (C#)

Source: Internet
Author: User

First, if you are not familiar with character encodings or the BOM, it is recommended to read this article first: .NET (C#): Character encoding (Encoding) and byte order mark (BOM).

Chinese text encodings fall into two main categories:

1. ANSI extended encodings, such as GB2312, GBK, and GB18030. These encodings have no BOM. (The newer standards, GBK and GB18030, are backward compatible with GB2312.)

2. Unicode encodings, such as UTF-8, UTF-16, and UTF-32. Files in these encodings may or may not begin with a BOM.

Some Unicode encodings also have a byte order (endianness) issue: they come in little-endian and big-endian variants, and each variant has a different BOM. UTF-16 is one example; UTF-8 has no byte order issue.
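
To make the BOM and byte order concrete, here is a minimal sketch that prints the BOM bytes the built-in .NET Unicode encodings report through Encoding.GetPreamble(). The class name BomBytesDemo and the helper Hex are illustrative only and are not part of the original article.

```csharp
using System;
using System.Text;

class BomBytesDemo
{
    // Illustrative helper: formats an encoding's BOM (preamble) as hex, e.g. "EF-BB-BF".
    public static string Hex(Encoding enc)
    {
        return BitConverter.ToString(enc.GetPreamble());
    }

    static void Main()
    {
        Console.WriteLine("UTF-8:     " + Hex(new UTF8Encoding(true)));           // EF-BB-BF
        Console.WriteLine("UTF-16 LE: " + Hex(new UnicodeEncoding(false, true))); // FF-FE
        Console.WriteLine("UTF-16 BE: " + Hex(new UnicodeEncoding(true, true)));  // FE-FF
        Console.WriteLine("UTF-32 LE: " + Hex(new UTF32Encoding(false, true)));   // FF-FE-00-00
        Console.WriteLine("UTF-32 BE: " + Hex(new UTF32Encoding(true, true)));    // 00-00-FE-FF
    }
}
```

Note that the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, which matters when checking BOMs by hand.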









With the basics covered, let's return to the topic: how to open a Chinese text file correctly. The first thing to confirm is: do your Unicode-encoded files contain a BOM?

If they always include a BOM, everything is easy. When we find a BOM, we know the exact encoding. When no BOM is found, the file is not Unicode, and opening it with the system default ANSI extended Chinese encoding is fine.

But if a Unicode file may lack a BOM (obviously, you cannot guarantee that every Unicode file a user gives you has one), you have to work out from the raw bytes whether it is GBK, UTF-8, or something else. This requires an encoding detection algorithm (search for "charset|encoding detection"). Such algorithms are not 100% accurate, which is why Windows Notepad has the "Bush hid the facts" bug and why Chrome sometimes shows garbled text when you browse the web. In my experience, Notepad++ detects encodings quite accurately.

There are many encoding detection implementations, such as this project: https://code.google.com/p/ude
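
As a rough illustration of what "determining the encoding from the raw bytes" can mean, the sketch below uses one simple and common heuristic: try to decode the bytes as strict UTF-8, and fall back to another encoding if that fails. This is not the algorithm used by the ude project or by Notepad++; the names Utf8Sniffer, LooksLikeUtf8, and Guess are hypothetical.

```csharp
using System;
using System.Text;

static class Utf8Sniffer
{
    // Strict decoder: throws on invalid byte sequences instead of substituting U+FFFD.
    static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Returns true if the bytes decode as UTF-8 without any invalid sequence.
    public static bool LooksLikeUtf8(byte[] data)
    {
        try
        {
            StrictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Picks UTF-8 when the bytes are valid UTF-8, otherwise the caller's fallback (for example GBK).
    public static Encoding Guess(byte[] data, Encoding fallback)
    {
        return LooksLikeUtf8(data) ? Encoding.UTF8 : fallback;
    }
}
```

The heuristic has clear limits. Pure ASCII content is valid in both UTF-8 and GBK, so the answer does not matter there, and the check should be given the whole file, because a buffer cut in the middle of a multi-byte character looks invalid. A short GBK text can also happen to form valid UTF-8, so this is a guess and not a guarantee.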







If every Unicode file comes with a BOM, no third-party library is needed. There are still a few points that need explaining.

The problem is that the text-reading methods in .NET (the File class and StreamReader) read as UTF-8 by default, so opening a GBK text file directly in .NET without specifying the encoding is bound to produce garbled text!
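
The effect can be reproduced without any file. In the minimal sketch below, the four bytes are "中文" encoded in GBK/GB2312, and they are decoded as UTF-8, which is what happens when no encoding is specified and no BOM is present. The class name DefaultUtf8Demo is illustrative only.

```csharp
using System;
using System.Text;

class DefaultUtf8Demo
{
    // The two characters "中文" encoded as GBK/GB2312 bytes.
    public static readonly byte[] GbkBytes = { 0xD6, 0xD0, 0xCE, 0xC4 };

    static void Main()
    {
        // Decode the GBK bytes as if they were UTF-8.
        string decoded = Encoding.UTF8.GetString(GbkBytes);

        Console.WriteLine(decoded == "中文");           // False
        Console.WriteLine(decoded.Contains("\uFFFD")); // True: invalid bytes became U+FFFD
    }
}
```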






First, the most effective solution is to read the text with the system default ANSI extended encoding, that is, the system's default non-Unicode encoding. Reference code:

// Output the system default non-Unicode encoding
Console.WriteLine(Encoding.Default.EncodingName);

// Open the file using the system default non-Unicode encoding
var fileContent = File.ReadAllText(@"C:\test.txt", Encoding.Default);

On a Simplified Chinese Windows system, the output should be:

Simplified Chinese (GB2312)
<text content omitted>

This approach is actually not limited to Simplified Chinese, because it uses whatever non-Unicode encoding the system is configured with.
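
One caveat, added here as an editorial note: the behavior described above is that of the .NET Framework on Windows. On .NET Core and .NET 5 or later, Encoding.Default is UTF-8, and code page encodings such as GB2312 are only available after registering CodePagesEncodingProvider. The sketch below assumes such a runtime (on older .NET Framework versions the provider type requires the System.Text.Encoding.CodePages package); the class name LegacyCodePageDemo and the helper GetGb2312 are hypothetical.

```csharp
using System;
using System.Text;

class LegacyCodePageDemo
{
    // Registers the code page provider (needed on .NET Core / .NET 5+) and returns GB2312.
    public static Encoding GetGb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB2312");
    }

    static void Main()
    {
        Encoding gb2312 = GetGb2312();

        // Round-trip two Chinese characters through GB2312.
        byte[] bytes = gb2312.GetBytes("中文");
        Console.WriteLine(BitConverter.ToString(bytes));    // D6-D0-CE-C4
        Console.WriteLine(gb2312.GetString(bytes) == "中文"); // True
    }
}
```

With the provider registered, an explicit Encoding.GetEncoding("GB2312") can stand in for Encoding.Default when the file is known to be Simplified Chinese.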






Of course, you can also specify an encoding manually, such as GBK. But if you open a Unicode file with GBK specified, will the file still open correctly? The answer is yes. The reason is that .NET detects the BOM by default when it opens a file. If a BOM is found, .NET uses the encoding the BOM indicates; if there is no BOM, it uses the encoding the user specified; and if the user specified none, it uses UTF-8.

This BOM auto-detection behavior can be controlled in the StreamReader constructor through the detectEncodingFromByteOrderMarks parameter. However, it cannot be set in the corresponding methods of the File class (for example, File.ReadAllText).

For example, the following code reads files in three ways:

1. GB2312 encoding, with BOM detection, to read GB2312 text

2. GB2312 encoding, with BOM detection, to read Unicode text

3. GB2312 encoding, without BOM detection, to read Unicode text

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        var gb2312 = Encoding.GetEncoding("GB2312");

        // GB2312 encoding, BOM detection on: read GB2312 text
        ReadFile("gbk.txt", gb2312, true);

        // GB2312 encoding, BOM detection on: read Unicode text
        ReadFile("unicode.txt", gb2312, true);

        // GB2312 encoding, BOM detection off: read Unicode text
        ReadFile("unicode.txt", gb2312, false);
    }

    // Read text via StreamReader
    static void ReadFile(string path, Encoding enc, bool detectEncodingFromByteOrderMarks)
    {
        using (var sr = new StreamReader(path, enc, detectEncodingFromByteOrderMarks))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }
}

Output:

a Liu
a Liu
???

The third line is garbled.

As shown above, opening a Unicode file with GB2312 specified also succeeds. Because the BOM auto-detection parameter is true, when the file has a BOM, .NET recognizes it as a Unicode file and opens it with the Unicode encoding the BOM indicates. If there is no BOM, the specified encoding is used to open the file. GB2312-encoded text obviously has no BOM, so you must specify GB2312; otherwise .NET parses the file with the default UTF-8 encoding and the result is unreadable. The third line is garbled because BOM auto-detection is false: .NET reads a Unicode text file that has a BOM directly with the specified GB2312 encoding, which obviously fails.
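
The same behavior can be checked in memory, without creating any files. The sketch below is an illustration under a few assumptions: it uses ISO-8859-1 in place of GB2312 as the "wrong" specified encoding, so that it runs without registering extra code pages, and the names BomDetectionDemo, Decode, and Utf16WithBom are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class BomDetectionDemo
{
    // Decodes the bytes through StreamReader, the same way ReadFile does for a file.
    public static string Decode(byte[] data, Encoding enc, bool detectBom)
    {
        using (var sr = new StreamReader(new MemoryStream(data), enc, detectBom))
        {
            return sr.ReadToEnd();
        }
    }

    // Builds a UTF-16 little-endian BOM followed by the UTF-16 little-endian text.
    public static byte[] Utf16WithBom(string text)
    {
        return Encoding.Unicode.GetPreamble().Concat(Encoding.Unicode.GetBytes(text)).ToArray();
    }

    static void Main()
    {
        byte[] data = Utf16WithBom("a刘");

        // Stand-in for a user-specified encoding that does not match the file.
        Encoding wrong = Encoding.GetEncoding("ISO-8859-1");

        Console.WriteLine(Decode(data, wrong, true) == "a刘");  // True: the BOM overrides the specified encoding
        Console.WriteLine(Decode(data, wrong, false) == "a刘"); // False: the bytes are decoded as ISO-8859-1
    }
}
```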






Of course, you can also check for the BOM yourself and, if there is none, fall back to a default encoding to open the text. I covered this in an earlier article (.NET (C#): Sensing encoding from a file).

Code:

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        PrintText("gb2312.txt");
        PrintText("unicode.txt");
    }

    // Detect the encoding from the file, then print its content
    static void PrintText(string path)
    {
        var enc = GetEncoding(path, Encoding.GetEncoding("GB2312"));
        using (var sr = new StreamReader(path, enc))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }

    /// <summary>
    /// Tries to determine the character encoding of a file from its BOM.
    /// </summary>
    /// <param name="file">File path</param>
    /// <param name="defEnc">Default encoding to return when there is no BOM</param>
    /// <returns>null if the file cannot be read; otherwise the encoding indicated by the BOM, or the default encoding if there is no BOM.</returns>
    static Encoding GetEncoding(string file, Encoding defEnc)
    {
        using (var stream = File.OpenRead(file))
        {
            // Is the stream readable?
            if (!stream.CanRead)
                return null;

            // Buffer that holds the candidate BOM bytes
            var bom = new byte[4];

            // Number of bytes actually read
            int readc = stream.Read(bom, 0, 4);

            if (readc >= 2)
            {
                // The 4-byte BOMs are checked first, because the UTF-32 LE BOM
                // starts with the same two bytes as the UTF-16 LE BOM.
                if (readc >= 4)
                {
                    // UTF-32, big-endian
                    if (CheckBytes(bom, 4, 0x00, 0x00, 0xFE, 0xFF))
                        return new UTF32Encoding(true, true);

                    // UTF-32, little-endian
                    if (CheckBytes(bom, 4, 0xFF, 0xFE, 0x00, 0x00))
                        return new UTF32Encoding(false, true);
                }

                // UTF-8
                if (readc >= 3 && CheckBytes(bom, 3, 0xEF, 0xBB, 0xBF))
                    return new UTF8Encoding(true);

                // UTF-16, big-endian
                if (CheckBytes(bom, 2, 0xFE, 0xFF))
                    return new UnicodeEncoding(true, true);

                // UTF-16, little-endian
                if (CheckBytes(bom, 2, 0xFF, 0xFE))
                    return new UnicodeEncoding(false, true);
            }

            // No BOM found
            return defEnc;
        }
    }

    // Helper: checks whether the first count bytes equal the given values
    static bool CheckBytes(byte[] bytes, int count, params int[] values)
    {
        for (int i = 0; i < count; i++)
            if (bytes[i] != values[i])
                return false;
        return true;
    }
}

In the code above, GetEncoding returns a UTF-16 encoding for the Unicode text file (more precisely, the big-endian or little-endian UTF-16 encoding indicated by the BOM), and returns the default GB2312 encoding for a file without a BOM.
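
As a variation on the hand-written byte checks, the BOM values can also be taken from the encodings themselves through Encoding.GetPreamble(). The sketch below is one possible way to do that, not the article's method; BomTable and FromBom are hypothetical names. The candidates are listed with the 4-byte BOMs first, for the same reason the checks above test UTF-32 before UTF-16.

```csharp
using System;
using System.Linq;
using System.Text;

static class BomTable
{
    // Ordered so that FF FE 00 00 (UTF-32 LE) is tested before FF FE (UTF-16 LE).
    static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(true, true),    // 00 00 FE FF
        new UTF32Encoding(false, true),   // FF FE 00 00
        new UTF8Encoding(true),           // EF BB BF
        new UnicodeEncoding(true, true),  // FE FF
        new UnicodeEncoding(false, true), // FF FE
    };

    // Returns the encoding whose BOM prefixes 'head', or 'fallback' if none matches.
    public static Encoding FromBom(byte[] head, Encoding fallback)
    {
        foreach (var enc in Candidates)
        {
            byte[] bom = enc.GetPreamble();
            if (head.Length >= bom.Length && head.Take(bom.Length).SequenceEqual(bom))
                return enc;
        }
        return fallback;
    }
}
```

For example, passing the first four bytes of a file together with a GB2312 fallback would give the same result as GetEncoding for the cases discussed here.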






Related Posts:

.NET (C#): Sensing encoding from a file
.NET (C#): Character encoding (Encoding) and byte order mark (BOM)
.NET (C#): Using the System.Text.Decoder class to process "stream text"
.NET (C#): Talking about the assembly manifest resource and the RESX resource