About String Encoding in .NET Development

Source: Internet
Author: User

Note:

Section 13.2.1, "Serialization and Streams", in the basics volume of ".NET 4.0 Object-Oriented Programming" introduces how to serialize an object to a stream.

This article covers the serialization of string objects. The key is how strings are encoded to and decoded from binary data so that they can be saved through a FileStream or sent to another computer through a NetworkStream.

A complaint:

Writing articles in CSDN's online editor is a daunting task. The CSDN web server often reports "internal errors" when a document is submitted, so the layout of this article is poor. Sorry. The cnblogs (Blog Park) system is more stable, so readers can visit http://www.cnblogs.com/bitfan/archive/2010/11/25/1887590.html to see a better-formatted version of the same article.

 


========================================================== ======================================

1. Introduction

In real-world development, you often need to write strings to text files or read strings from them. In .NET applications, StreamReader and StreamWriter are usually used for this. For example, the following code writes the fileContent string to the file named FileName:

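// Requires: using System; using System.IO;
static void WriteFileUseStreamWriter(String fileContent, String FileName)
{
    // The using block closes (and flushes) the writer when it ends
    using (StreamWriter writer = new StreamWriter(FileName))
    {
        writer.Write(fileContent);
    }
}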

If you read this file back with the related classes in the .NET base class library (such as StreamReader, or the File class used below), everything works as expected:

WriteFileUseStreamWriter ("China AB", "test.txt ");
Console. WriteLine (File. ReadAllText ("test.txt"); // output: "China AB"

In most cases we work on a Chinese-language Windows system, and a file written by one .NET program is usually read by another .NET program. As a result, many .NET programmers may not have noticed that a character encoding issue is hidden here. In certain situations, this issue causes trouble.

See Figure 1:

 


Figure 1 Encodings supported by Notepad

By default, Windows notepad saves files in ANSI encoding mode. 1. If the text content is "China AB" and notepad saves it as "test.txt" in ASNI mode, the following code will "strike" (see Figure 2 ):

Console.WriteLine(File.ReadAllText("test.txt"));


Figure 2 Chinese characters are garbled

As Figure 2 shows, when the File.ReadAllText method reads the "test.txt" file, the English characters are displayed normally, but the Chinese characters are garbled.
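The garbling can be reproduced without Notepad. The following is a minimal sketch, not the author's code: it hard-codes the ANSI bytes that this article lists later for the two Chinese characters ("D6 D0 B9 FA"), appends the bytes for "ab", and decodes them as UTF-8. The class name GarbledReadDemo and the method DecodeAsUtf8 are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text;

static class GarbledReadDemo
{
    // Decodes raw bytes as UTF-8, which is what happens when a file
    // without a BOM is read with UTF-8 as the assumed encoding.
    public static string DecodeAsUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }

    static void Main()
    {
        // ANSI bytes for the two Chinese characters, followed by "ab" (61 62).
        byte[] ansiBytes = { 0xD6, 0xD0, 0xB9, 0xFA, 0x61, 0x62 };
        string text = DecodeAsUtf8(ansiBytes);

        // Print code points instead of glyphs so the result does not
        // depend on the console font or output encoding.
        foreach (char ch in text)
            Console.Write("U+{0:X4} ", (int)ch);
        Console.WriteLine();
    }
}
```

The bytes that are not valid UTF-8 are replaced with U+FFFD (the replacement character), while the ASCII bytes for "ab" decode unchanged. This matches the behavior described above: English characters survive and Chinese characters do not.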

2. Understanding character encoding

We can make an experiment, use NotePad to save the Chinese and English character strings of "China AB" in different encoding methods into multiple ". txt" files, and then directly view their binary content:

 

Figure 3 Comparison of character encoding

 


Figure 3 shows the different binary data obtained by "China AB" in four encoding methods (ANSI, UTF8, Unicode, and Unicode Big Endian.

Take the English character "a" as an example. ANSI and UTF-8 both encode it as the single byte "61", whereas the encoding that Notepad calls "Unicode" widens it to two bytes, or 16 bits ("61 00" or "00 61"). This encoding is therefore known as UTF-16.

UTF-16 comes in two variants, Big Endian and Little Endian. The only difference between them is that the byte order is reversed: Little Endian encodes "a" as "61 00", while Big Endian encodes it as "00 61".

Now look at the Chinese text. The word "中国" ("China") consists of two Chinese characters. Its ANSI encoding (on Simplified Chinese Windows, ANSI means the GBK code page) is "D6 D0 B9 FA", four bytes in total, so each Chinese character occupies two bytes. Its UTF-8 encoding is "E4 B8 AD E5 9B BD", six bytes in total, so each Chinese character occupies three bytes. This shows that UTF-8 is a variable-length encoding, which uses 1 to 4 bytes to represent a character.

In addition, we can see that the UTF-8 and Unicode files (both Big Endian and Little Endian) begin with a few marker bytes. Placed at the start of a text file, these bytes are known as the BOM (Byte Order Mark) and indicate the encoding of the text. The BOM values of the common character encodings in .NET programs are as follows:

 

Encoding                BOM value
UTF-8                   EF BB BF
UTF-16 big endian       FE FF
UTF-16 little endian    FF FE
UTF-32 big endian       00 00 FE FF
UTF-32 little endian    FF FE 00 00
 


 

 

With this background, we can detect the encoding of a piece of text from its BOM and then decode the binary data stream correctly. The following code checks whether the binary text data is UTF-8 encoded:

 
// Open the file and read its binary data (FilePath is the path of the file to check)
byte[] FileContents = File.ReadAllBytes(FilePath);
int filelength = FileContents.Length;
// Check for the UTF-8 BOM; the length check avoids indexing past the end of a short file
if (filelength >= 3 && FileContents[0] == 0xef && FileContents[1] == 0xbb && FileContents[2] == 0xbf)
{
    // Decode the string as UTF-8, skipping the three bytes occupied by the BOM
    string content = Encoding.UTF8.GetString(FileContents, 3, filelength - 3);
    Console.WriteLine(content);
}

Other encodings can be handled by following the same pattern.
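As one way to extend the UTF-8 check to the other BOM values listed earlier, here is a minimal sketch. BomSniffer, StartsWith, and DetectBom are hypothetical names introduced for this example. They are not part of the .NET base class library.

```csharp
using System;

static class BomSniffer
{
    // Returns true when data begins with every byte of bom.
    static bool StartsWith(byte[] data, params byte[] bom)
    {
        if (data.Length < bom.Length) return false;
        for (int i = 0; i < bom.Length; i++)
            if (data[i] != bom[i]) return false;
        return true;
    }

    // Returns a label for the BOM found at the start of data,
    // or null when no known BOM is present.
    public static string DetectBom(byte[] data)
    {
        // UTF-32 LE is tested before UTF-16 LE because its BOM
        // (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32 little endian";
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32 big endian";
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (StartsWith(data, 0xFF, 0xFE)) return "UTF-16 little endian";
        if (StartsWith(data, 0xFE, 0xFF)) return "UTF-16 big endian";
        return null;
    }

    static void Main()
    {
        byte[] sample = { 0xEF, 0xBB, 0xBF, 0x61, 0x62 };
        Console.WriteLine(DetectBom(sample) ?? "no BOM found");
    }
}
```

A BOM check like this is only a heuristic. A file without a BOM (such as an ANSI file) returns null, and a little-endian UTF-16 file whose first character is U+0000 would be reported as UTF-32, so the caller still needs a fallback encoding.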

3. Character encoding classes in the .NET base class library

The Encoding class used in the preceding code is the core type for character encoding and decoding in .NET. Figure 4 shows its properties:

 

Figure 4 Encoding type

As Figure 4 shows, the Encoding type provides encoders and decoders for UTF-8, Unicode, and other encodings. Encoding and decoding are done by calling its Get methods (such as GetBytes and GetChars). The following is sample code:

 
// Encoding
Byte [] bytes = Encoding. UTF8.GetBytes ("China AB ");
Foreach (byte value in bytes)
Console. Write ("{0}", value. ToString ("x"); // convert to hexadecimal
Console. WriteLine ();
// Decoding
Char [] chars = Encoding. UTF8.GetChars (bytes );
Foreach (char ch in chars)
Console. Write ("{0}", ch );
The running result is as follows:

 

Figure 5 encoding and decoding

Note that the binary values above do not include a BOM.

In fact, StreamWriter in .NET encodes strings as UTF-8 by default, but it does not write the UTF-8 BOM ("ef bb bf") into the binary stream. The following is the declaration of one of the StreamWriter constructors:

public StreamWriter(string path) : this(path, false, UTF8NoBOM, 0x400)
{
}

Similarly, the File.ReadAllText() method internally uses UTF-8 to read the string from the specified file:

public static string ReadAllText(string path)
{
    //......
    return InternalReadAllText(path, Encoding.UTF8);
}
Because the default encoding method is the same, StreamWriter and F are used together.
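The default behavior described above can be observed by inspecting the bytes that StreamWriter actually writes. The following is a hedged sketch: the class DefaultEncodingDemo and its two helper methods are hypothetical names, and a temporary file stands in for "test.txt".

```csharp
using System;
using System.IO;
using System.Text;

static class DefaultEncodingDemo
{
    // Writes text with StreamWriter's default encoding and returns the raw file bytes.
    public static byte[] WriteWithDefault(string path, string text)
    {
        using (StreamWriter writer = new StreamWriter(path))
        {
            writer.Write(text);
        }
        return File.ReadAllBytes(path);
    }

    // Writes text with an explicit Encoding.UTF8, which emits a BOM,
    // and returns the raw file bytes.
    public static byte[] WriteWithUtf8Bom(string path, string text)
    {
        using (StreamWriter writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.Write(text);
        }
        return File.ReadAllBytes(path);
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Default constructor: UTF-8 without a BOM.
            Console.WriteLine(BitConverter.ToString(WriteWithDefault(path, "ab")));
            Console.WriteLine(File.ReadAllText(path));

            // Explicit Encoding.UTF8: the file starts with EF-BB-BF.
            Console.WriteLine(BitConverter.ToString(WriteWithUtf8Bom(path, "ab")));
            Console.WriteLine(File.ReadAllText(path));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

In both cases File.ReadAllText returns the original text, because it decodes as UTF-8 and skips a BOM when one is present.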
