Preface
As we all know, computers can only work with binary digits, that is, 0 and 1. Every character shown on the screen is ultimately stored as binary. Converting text into bytes according to a fixed set of rules so that it can be stored on a computer is called character encoding; the reverse process is called decoding. Many character encodings are in use today. The same sequence of bytes can produce different results depending on which encoding is used to decode it, and sometimes the result is garbled text. This is why a web page or a text file sometimes looks garbled, and why switching to the correct encoding makes it readable again.
All characters in the CLR are represented as 16-bit Unicode. The Encoding class in the CLR converts between bytes and characters. For more information about character encoding, see "Charset and character encoding (charset & encoding)".
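To make the "same bytes, different result" point concrete, here is a minimal sketch. The class name MojibakeDemo and the helper Transcode are made up for this illustration; only Encoding.GetBytes, Encoding.GetString, and Encoding.GetEncoding come from the framework. It encodes a string with UTF-8 and then decodes the bytes once with UTF-8 and once with ISO-8859-1 (Latin-1).

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: encode with one encoding, decode with another.
    public static string Transcode(string text, Encoding encodeWith, Encoding decodeWith)
    {
        byte[] bytes = encodeWith.GetBytes(text);
        return decodeWith.GetString(bytes);
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "caf\u00E9"; // "cafe" with an acute accent on the e

        // Same encoding on both sides: the text survives the round trip.
        string roundTrip = Transcode(original, Encoding.UTF8, Encoding.UTF8);
        Console.WriteLine(roundTrip == original);

        // Mismatched decoder: the two UTF-8 bytes of U+00E9 (0xC3 0xA9)
        // are read as two separate Latin-1 characters, U+00C3 and U+00A9.
        string garbled = Transcode(original, Encoding.UTF8, latin1);
        Console.WriteLine(garbled.Length); // one character longer than the original
    }
}
```

The bytes never change between the two calls; only the decoder's interpretation of them does.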
About Encoding
The Encoding class in the CLR lives in the System.Text namespace. It is an abstract class, so it cannot be instantiated directly. Its main derived classes are ASCIIEncoding, UnicodeEncoding, UTF32Encoding, UTF7Encoding, and UTF8Encoding. You can pick whichever one suits your encoding and decoding needs. You can also obtain an encoding through the static properties ASCII, Unicode, UTF32, UTF7, and UTF8 of Encoding. Here Unicode means the 16-bit encoding (UTF-16). The following code shows both approaches: using a static property and instantiating a subclass.
// Option 1: use the static property.
Encoding encodingUTF8 = Encoding.UTF8;
// Option 2: instantiate the subclass (true = emit a BOM).
Encoding encodingUTF8Instance = new UTF8Encoding(true);
Some simple descriptions of these types are as follows:
ASCII: Encodes 16-bit characters as ASCII. Only characters whose value is less than 0x0080 can be converted, each into a single byte, so one character corresponds to one byte. When all characters are in the ASCII range (0x00 ~ 0x7F) you can use this encoding. It is very fast and is suitable for English text. It is also very limited: Chinese characters cannot be represented and come out garbled. Corresponds to ASCIIEncoding in the CLR.
UTF-16: Each 16-bit character is encoded as 2 bytes. The character is not altered and no compression is involved. Performance is very good, because characters in the CLR are already 16-bit Unicode. Corresponds to UnicodeEncoding in the CLR.
UTF-32: Every character is encoded as 4 bytes. In terms of memory it is not an efficient scheme, because every character takes 4 bytes, so it is rarely used for encoding and decoding files and network streams. Corresponds to UTF32Encoding in the CLR.
UTF-8: Characters below 0x0080 are compressed into one byte, identical to ASCII. Characters in the range 0x0080 ~ 0x07FF are converted into 2 bytes, which suits European and Middle Eastern languages. Characters at 0x0800 or above are converted into 3 bytes, which covers East Asian characters. Surrogate pairs are converted into 4 bytes. It is a very popular encoding and well suited to the Internet, but it is less efficient than UTF-16 for characters at or above 0x0800. Corresponds to UTF8Encoding in the CLR.
UTF-7: This encoding is typically used with older systems that can only handle 7-bit values. It has been deprecated by the Unicode Consortium. Corresponds to UTF7Encoding in the CLR.
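The byte sizes described above can be checked with Encoding.GetByteCount. The following is a small sketch; the class name ByteCountDemo, the helper Sizes, and the four sample characters are choices made for this illustration. It prints how many bytes UTF-8, UTF-16, and UTF-32 need for one character from each range, and shows what ASCII does with a character it cannot represent.

```csharp
using System;
using System.Text;

static class ByteCountDemo
{
    // Hypothetical helper: byte counts for s in UTF-8, UTF-16, and UTF-32 (no BOM included).
    public static int[] Sizes(string s)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(s),
            Encoding.Unicode.GetByteCount(s),
            Encoding.UTF32.GetByteCount(s),
        };
    }

    static void Main()
    {
        // U+0041 (ASCII), U+00E9 (Latin), U+4E2D (CJK), U+1F600 (a surrogate pair)
        string[] samples = { "A", "\u00E9", "\u4E2D", "\uD83D\uDE00" };
        foreach (string s in samples)
        {
            int[] n = Sizes(s);
            Console.WriteLine("U+{0:X4}: UTF-8={1}, UTF-16={2}, UTF-32={3}",
                char.ConvertToUtf32(s, 0), n[0], n[1], n[2]);
        }

        // ASCII cannot represent U+4E2D; by default the encoder substitutes '?' (0x3F).
        byte[] ascii = Encoding.ASCII.GetBytes("\u4E2D");
        Console.WriteLine("ASCII bytes for U+4E2D: {0}", BitConverter.ToString(ascii));
    }
}
```

The CJK sample is where UTF-8 (3 bytes) loses to UTF-16 (2 bytes), which is the trade-off mentioned in the UTF-8 description.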
In terms of performance, if your code needs the same encoding in several places, Microsoft recommends obtaining it through the static properties rather than creating a new instance each time. Internally, each property is implemented as a lazily created singleton:
public static Encoding UTF8
{
    get
    {
        if (utf8Encoding == null)
        {
            utf8Encoding = new UTF8Encoding(true);
        }
        return utf8Encoding;
    }
}
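A quick way to observe the singleton behaviour is to compare references. This is only a sketch; the class and method names are invented for the illustration. It checks that two reads of Encoding.UTF8 return the same object, while two calls to the UTF8Encoding constructor do not.

```csharp
using System;
using System.Text;

static class SharedInstanceDemo
{
    // Two reads of the static property should yield the very same object.
    public static bool StaticPropertyIsShared()
    {
        return object.ReferenceEquals(Encoding.UTF8, Encoding.UTF8);
    }

    // Each constructor call allocates a separate object.
    public static bool NewInstancesAreShared()
    {
        return object.ReferenceEquals(new UTF8Encoding(true), new UTF8Encoding(true));
    }

    static void Main()
    {
        Console.WriteLine(StaticPropertyIsShared());
        Console.WriteLine(NewInstancesAreShared());
    }
}
```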
If you know the code page or the name of an encoding, you can call the static methods GetEncoding(int codePage) and GetEncoding(string name) of Encoding to obtain it. For example, gb2312 is commonly used for Simplified Chinese and its code page is 936, so either of the following works:
// Look up by name.
Encoding encodingByName = Encoding.GetEncoding("gb2312");
// Look up by code page.
Encoding encodingByCodePage = Encoding.GetEncoding(936);
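The sketch below checks that looking up an encoding by name and by code page lands on the same encoding. The class name LookupDemo and the helper SameEncoding are hypothetical. It uses "utf-8" and 65001 rather than gb2312 and 936 so that it runs everywhere: on .NET Core and later, legacy code pages such as 936 are, as far as I know, only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package, whereas on the .NET Framework they are built in.

```csharp
using System;
using System.Text;

static class LookupDemo
{
    // Hypothetical helper: do the name and the code page identify the same encoding?
    public static bool SameEncoding(string name, int codePage)
    {
        Encoding byName = Encoding.GetEncoding(name);
        Encoding byCodePage = Encoding.GetEncoding(codePage);
        return byName.CodePage == byCodePage.CodePage
            && byName.WebName == byCodePage.WebName;
    }

    static void Main()
    {
        Console.WriteLine(SameEncoding("utf-8", 65001)); // 65001 is the UTF-8 code page
        Console.WriteLine(SameEncoding("utf-8", 1200));  // 1200 is UTF-16 little-endian
    }
}
```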
There are currently dozens of text code pages, corresponding to different countries and languages. Each covers only part of the Unicode character set. Code page 936, for example, covers only the Simplified Chinese part; to display Traditional Chinese characters correctly you must use code page 950. For the full list, refer to the MSDN documentation or the article "C# text code pages and code page name lookup table".
The following code lists all the encodings available in the CLR.
foreach (EncodingInfo eInfo in Encoding.GetEncodings())
{
    Console.WriteLine("Encoding code page is {0}, encoding name is {1}", eInfo.CodePage, eInfo.Name);
    Console.WriteLine("Encoding display name is {0}", eInfo.DisplayName);
}
The Encoding class also has a static property Default, which returns an Encoding object as well. Which encoding it returns depends on the Control Panel -> Region and Language settings on your computer, that is, on the system's ANSI code page. For example, my computer is set to Chinese (Simplified, PRC), which corresponds to gb2312, so the following code prints gb2312. If your code will run in more than one country, avoid Encoding.Default, because it can produce garbled text; use Encoding.UTF8 instead.
Encoding encoding1 = Encoding.Default;
Console.WriteLine(encoding1.WebName);
To be continued...
The next section describes how to use Encoding, the BOM, Encoder, and Decoder.