Tutorial on correctly reading Chinese-encoded files in .NET (C#)

Last Update:2017-04-24 Source: Internet

Author: User

Tags readfile


If you are not familiar with character encodings or the BOM, it is recommended to read this article first: .NET (C#): Character encoding (Encoding) and byte order mark (BOM).

Chinese encodings fall into two main categories:

1. ANSI extended encodings, such as GB2312, GBK and GB18030. These encodings have no BOM. (The newer standards, GBK and GB18030, are backward compatible with GB2312.)

2. Unicode encodings, such as UTF-8, UTF-16 and UTF-32. Files in these encodings may or may not start with a BOM.

In addition, some Unicode encodings have a byte order (endianness) issue: they come in little-endian and big-endian variants, and each variant has a different BOM. UTF-16 is one example; UTF-8 has no byte order issue.
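To make the BOM differences concrete, the following minimal sketch asks the built-in encodings for their preamble (the BOM bytes). The class and method names (BomTable, PreambleHex, Print) are made up for this illustration; Encoding.GetPreamble and BitConverter.ToString are standard .NET calls.

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Returns the BOM (preamble) bytes of an encoding as hex, e.g. "EF BB BF".
    // Encodings without a BOM yield an empty string.
    public static string PreambleHex(Encoding enc)
    {
        return BitConverter.ToString(enc.GetPreamble()).Replace('-', ' ');
    }

    // Prints the BOM of each common Unicode encoding.
    public static void Print()
    {
        Console.WriteLine("UTF-8:     " + PreambleHex(new UTF8Encoding(true)));
        Console.WriteLine("UTF-16 LE: " + PreambleHex(new UnicodeEncoding(false, true)));
        Console.WriteLine("UTF-16 BE: " + PreambleHex(new UnicodeEncoding(true, true)));
        Console.WriteLine("UTF-32 LE: " + PreambleHex(new UTF32Encoding(false, true)));
        Console.WriteLine("UTF-32 BE: " + PreambleHex(new UTF32Encoding(true, true)));
    }
}
```

Calling BomTable.Print() lists one BOM per line. Note that the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, so any hand-written check has to test the longer sequence first.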

With the basics covered, let's return to the topic: how to open a Chinese text file correctly. The first thing to confirm is whether your Unicode files contain a BOM.

If they always do, everything is easy. When a BOM is found, it tells us the exact encoding. When no BOM is found, the file is not Unicode, and opening it with the system default ANSI extended (Chinese) encoding is fine.

If a Unicode file may lack a BOM (and obviously you cannot guarantee that every Unicode file a user hands you has one), you have to determine from the raw bytes whether it is GBK, UTF-8 or something else. This requires an encoding detection algorithm (search for "charset|encoding detection"). Such algorithms are not 100% accurate, which is why Windows Notepad has the "Bush hid the facts" bug and why Chrome sometimes shows garbled pages. In my personal experience, Notepad++'s encoding detection is quite accurate.

There are many encoding detection implementations, such as this project: https://code.google.com/p/ude
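As a rough illustration of what detection without a BOM can look like, here is a minimal sketch of one common heuristic: check whether the bytes are strictly valid UTF-8, and fall back to the ANSI encoding if they are not. This is not the algorithm used by the ude project or by Notepad++, and the names Utf8Sniffer and IsValidUtf8 are invented for this example.

```csharp
using System.Text;

public static class Utf8Sniffer
{
    // Returns true if the bytes form strictly valid UTF-8.
    // The second constructor argument (throwOnInvalidBytes) makes the
    // decoder throw instead of silently substituting U+FFFD.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A caller could read the file with File.ReadAllBytes, decode as UTF-8 when IsValidUtf8 returns true, and otherwise use the default Chinese encoding. The heuristic can still guess wrong: pure ASCII passes as UTF-8 (harmlessly), and a short GBK text may by chance also be a valid UTF-8 byte sequence.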

If your Unicode files always come with a BOM, no third-party library is needed. There are, however, a few points that need explaining.

The problem is that the text-reading methods in .NET (the File class and StreamReader) read as UTF-8 by default, so a GBK text file opened directly with .NET (without specifying the encoding) is bound to come out garbled.
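The effect can be reproduced without any file. The sketch below decodes three hard-coded bytes as UTF-8: the letter "a" followed by 0xC1 0xF5, which is the two-byte GB2312/GBK code of one Chinese character (刘). The class name GarbledDemo is made up for the illustration.

```csharp
using System.Text;

public static class GarbledDemo
{
    // "a" followed by the GB2312/GBK two-byte code 0xC1 0xF5.
    public static readonly byte[] GbkBytes = { 0x61, 0xC1, 0xF5 };

    // Decodes the GBK bytes the way .NET does when no encoding is specified.
    // 0xC1 and 0xF5 are not valid UTF-8, so the decoder substitutes the
    // replacement character U+FFFD and the Chinese character is lost.
    public static string DecodeAsUtf8()
    {
        return Encoding.UTF8.GetString(GbkBytes);
    }
}
```

The returned string still starts with "a", but the rest consists of U+FFFD replacement characters, which a console typically shows as "?" or a box.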

The most effective solution here is to read the text with the system default ANSI extended encoding, that is, the system default non-Unicode encoding. Reference code:

    // Print the system default non-Unicode encoding
    Console.WriteLine(Encoding.Default.EncodingName);

    // Open the file with the system default non-Unicode encoding
    var fileContent = File.ReadAllText(@"C:\test.txt", Encoding.Default);

On a Simplified Chinese Windows system the output should be:

    Simplified Chinese (GB2312)
    <text content omitted>

This approach is not limited to Simplified Chinese: it uses whatever the system default code page is. Note that this describes .NET Framework. On .NET Core and .NET 5 or later, Encoding.Default is always UTF-8, and code page encodings such as GB2312 must first be registered through CodePagesEncodingProvider.

Of course, you can also specify an encoding explicitly, such as GBK. But if you open a Unicode file with GBK specified, will it still open correctly? The answer is yes. The reason is that .NET detects the BOM automatically when opening a file and, if one is found, uses the encoding it indicates. If there is no BOM, the user-specified encoding is used; if the user specified none, UTF-8 is used.

This automatic BOM detection can be controlled in the StreamReader constructor through the detectEncodingFromByteOrderMarks parameter.

It cannot be controlled in the corresponding methods of the File class (for example, File.ReadAllText).

For example, the following code uses, in turn:

GB2312 encoding with automatic BOM detection to read GB2312 text

GB2312 encoding with automatic BOM detection to read Unicode text

GB2312 encoding without BOM detection to read Unicode text

    using System;
    using System.IO;
    using System.Text;

    class Program
    {
        static void Main()
        {
            var gb2312 = Encoding.GetEncoding("GB2312");

            // GB2312 specified, BOM detection on, reading GB2312 text
            ReadFile("gbk.txt", gb2312, true);

            // GB2312 specified, BOM detection on, reading Unicode text
            ReadFile("unicode.txt", gb2312, true);

            // GB2312 specified, BOM detection off, reading Unicode text
            ReadFile("unicode.txt", gb2312, false);
        }

        // Read text via StreamReader and print it
        static void ReadFile(string path, Encoding enc, bool detectEncodingFromByteOrderMarks)
        {
            using (var sr = new StreamReader(path, enc, detectEncodingFromByteOrderMarks))
            {
                Console.WriteLine(sr.ReadToEnd());
            }
        }
    }

Output:

    A Liu
    a Liu
    ???

The third line is garbled.

As shown above, opening a Unicode file with GB2312 specified also succeeds. Because the automatic BOM detection parameter is true, .NET finds the BOM, recognizes the file as Unicode and opens it with the corresponding Unicode encoding. If there is no BOM, the specified encoding is used. GB2312 text has no BOM, so you must specify GB2312 explicitly; otherwise .NET parses the file as UTF-8 by default and cannot produce readable results. The third line is garbled because automatic BOM detection is false: .NET reads a Unicode text file that has a BOM directly with the specified GB2312 encoding, which obviously fails.
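To check which encoding StreamReader actually settled on, its CurrentEncoding property can be inspected after the first read. The following is a minimal sketch; the class and method names (BomAwareRead, ReadAll) and the out parameter are made up for this illustration.

```csharp
using System.IO;
using System.Text;

public static class BomAwareRead
{
    // Reads the whole file and reports which encoding StreamReader used.
    // "fallback" is the encoding applied when no BOM is found or when
    // BOM detection is switched off.
    public static string ReadAll(string path, Encoding fallback, bool detectBom, out Encoding used)
    {
        using (var sr = new StreamReader(path, fallback, detectBom))
        {
            string text = sr.ReadToEnd();
            // CurrentEncoding is only reliable after the first read,
            // because that is when the BOM is examined.
            used = sr.CurrentEncoding;
            return text;
        }
    }
}
```

In the scenario above, fallback would be Encoding.GetEncoding("GB2312"). For a file with a UTF-8 BOM, used reports UTF-8 when detectBom is true and the fallback encoding when it is false.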

Of course, you can also check for the BOM yourself and, if there is none, open the text with a default encoding of your choice. I covered this in a previous article (.NET (C#): Sensing encoding from a file).

Code:

    using System;
    using System.IO;
    using System.Text;

    class Program
    {
        static void Main()
        {
            PrintText("gb2312.txt");
            PrintText("unicode.txt");
        }

        // Detect the encoding of the file and print its content
        static void PrintText(string path)
        {
            var enc = GetEncoding(path, Encoding.GetEncoding("GB2312"));
            using (var sr = new StreamReader(path, enc))
            {
                Console.WriteLine(sr.ReadToEnd());
            }
        }

        /// <summary>
        /// Tries to determine the character encoding of a file from its BOM.
        /// </summary>
        /// <param name="file">File path</param>
        /// <param name="defEnc">Default encoding returned when there is no BOM</param>
        /// <returns>null if the file cannot be read; otherwise the encoding determined by the BOM, or the default encoding when there is no BOM.</returns>
        static Encoding GetEncoding(string file, Encoding defEnc)
        {
            using (var stream = File.OpenRead(file))
            {
                // Is the stream readable?
                if (!stream.CanRead)
                    return null;

                // Buffer for the BOM
                var bom = new byte[4];
                // Number of bytes actually read
                int readc = stream.Read(bom, 0, 4);

                if (readc >= 2)
                {
                    // UTF-32 must be checked before UTF-16: the UTF-32 LE BOM
                    // starts with the same two bytes as the UTF-16 LE BOM
                    if (readc >= 4)
                    {
                        // UTF-32, big-endian
                        if (CheckBytes(bom, 4, 0x00, 0x00, 0xFE, 0xFF))
                            return new UTF32Encoding(true, true);
                        // UTF-32, little-endian
                        if (CheckBytes(bom, 4, 0xFF, 0xFE, 0x00, 0x00))
                            return new UTF32Encoding(false, true);
                    }

                    // UTF-8
                    if (readc >= 3 && CheckBytes(bom, 3, 0xEF, 0xBB, 0xBF))
                        return new UTF8Encoding(true);

                    // UTF-16, big-endian
                    if (CheckBytes(bom, 2, 0xFE, 0xFF))
                        return new UnicodeEncoding(true, true);

                    // UTF-16, little-endian
                    if (CheckBytes(bom, 2, 0xFF, 0xFE))
                        return new UnicodeEncoding(false, true);
                }

                // No BOM found
                return defEnc;
            }
        }

        // Helper: checks whether the first count bytes equal the given values
        static bool CheckBytes(byte[] bytes, int count, params int[] values)
        {
            for (int i = 0; i < count; i++)
                if (bytes[i] != values[i])
                    return false;
            return true;
        }
    }

In the code above, for the Unicode text file GetEncoding returns a UTF-16 encoding (more precisely, the big-endian or little-endian variant, as indicated by the BOM), while for a file without a BOM it returns the default GB2312 encoding.

.NET (C#): Sensing encoding from a file

.NET (C#): Character encoding (Encoding) and byte order mark (BOM)

.NET (C#): Using the System.Text.Decoder class to process "stream text"

.NET (C#): Talking about the assembly manifest resource and the RESX resource

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
