C# Encoding (II): Encoding and BOM

Last Update:2018-12-06 Source: Internet

Author: User


Encoding usage

Encoding is easy to use. If you only need to convert between bytes and characters, the GetBytes() and GetChars() methods and their overloads will meet nearly all your needs.

GetByteCount() and its overloads return the exact number of bytes produced when a string is encoded into a byte array.

GetCharCount() and its overloads return the exact number of characters produced when a byte array is decoded into a string.
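As a minimal sketch of how these four methods fit together, the following round trip sizes each buffer with the matching count method. The class name RoundTripDemo and the method RoundTrip are hypothetical names chosen for this illustration, not part of the framework.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Hypothetical helper: encodes text to bytes, decodes it back,
    // and reports the sizes reported by the count methods.
    public static string RoundTrip(Encoding encoding, string text)
    {
        int byteCount = encoding.GetByteCount(text);   // exact bytes needed
        byte[] bytes = new byte[byteCount];
        encoding.GetBytes(text, 0, text.Length, bytes, 0);

        int charCount = encoding.GetCharCount(bytes);  // exact chars needed
        char[] chars = new char[charCount];
        encoding.GetChars(bytes, 0, bytes.Length, chars, 0);

        Console.WriteLine("{0}: {1} chars -> {2} bytes -> {3} chars",
            encoding.WebName, text.Length, byteCount, charCount);
        return new string(chars);
    }

    static void Main()
    {
        RoundTrip(Encoding.UTF8, "abc");
        RoundTrip(Encoding.Unicode, "abc");
    }
}
```

For everyday use, GetBytes(string) and GetString(byte[]) do the sizing for you; the explicit counts matter mostly when you manage your own buffers.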

Pay attention to these two methods: int GetMaxByteCount(int charCount); int GetMaxCharCount(int byteCount);

They do not return what you might expect. You might expect charCount for a single-byte encoding and charCount * 2 for a double-byte encoding, but the actual results are charCount + 1 and (charCount + 1) * 2.

            Console.WriteLine("The max byte count is {0}.", Encoding.Unicode.GetMaxByteCount(10));
            Console.WriteLine("The max byte count is {0}.", Encoding.ASCII.GetMaxByteCount(10));

The results above are 22 and 11, not 20 and 10. I found the reason in an English blog post, although my English is not good enough to fully understand what a high surrogate and a low surrogate are: http://blogs.msdn.com/b/shawnste/archive/2005/03/02/383903.aspx

For example, Encoding.GetEncoding(1252).GetMaxByteCount(1) returns 2. 1252 is a single byte code page (encoding), so generally one would expect that GetMaxByteCount(n) would return n, but it doesn't, it usually returns n + 1.

One reason for this oddity is that an Encoder could store a high surrogate on one call to GetBytes(), hoping that the next call is a low surrogate. This allows the fallback mechanism to provide a fallback for a complete surrogate pair, even if that pair is split between calls to GetBytes(). If the fallback returns a ? for each surrogate half, or if the next call doesn't have a surrogate, then 2 characters could be output for that surrogate pair. So in this case, calling Encoder.GetBytes() with a high surrogate would return 0 bytes and then following that with another call with only the low surrogate would return 2 bytes.
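To make the idea of a split surrogate pair concrete, here is a small sketch. It uses UTF-8 instead of code page 1252, so no fallback is involved and the pair encodes to 4 bytes instead of 2, but it shows the same stateful behavior: the Encoder holds the high surrogate back until the low surrogate arrives. SurrogateDemo and EncodeSplitPair are hypothetical names used only for this example.

```csharp
using System;
using System.Text;

static class SurrogateDemo
{
    // Hypothetical helper: feeds a surrogate pair to a stateful Encoder one
    // char at a time and returns the byte count produced by each call.
    public static int[] EncodeSplitPair(Encoding encoding, string pair)
    {
        Encoder encoder = encoding.GetEncoder();
        byte[] buffer = new byte[encoding.GetMaxByteCount(2)];
        char[] chars = pair.ToCharArray();

        // flush: false tells the encoder that more input may follow,
        // so it may keep a trailing high surrogate for the next call.
        int first = encoder.GetBytes(chars, 0, 1, buffer, 0, false);
        int second = encoder.GetBytes(chars, 1, 1, buffer, first, true);
        return new[] { first, second };
    }

    static void Main()
    {
        string pair = "\U0001F600"; // one code point stored as two UTF-16 chars
        Console.WriteLine("High surrogate first: {0}", char.IsHighSurrogate(pair[0]));
        Console.WriteLine("Low surrogate second: {0}", char.IsLowSurrogate(pair[1]));

        int[] counts = EncodeSplitPair(Encoding.UTF8, pair);
        Console.WriteLine("First call: {0} bytes, second call: {1} bytes",
            counts[0], counts[1]);
    }
}
```

With UTF-8 the first call should produce 0 bytes and the second 4. Because a call can emit bytes for a char received earlier, GetMaxByteCount has to budget for one more char than it is given, which is where the n + 1 comes from.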

The following code is a simple application of Encoding. Print the results and compare them with the explanations in the previous article.

        static void Output(Encoding encoding, string t)
        {
            Console.WriteLine(encoding.ToString());
            byte[] buffer = encoding.GetBytes(t);
            foreach (byte b in buffer)
            {
                Console.Write(b + "-");
            }
            string s = encoding.GetString(buffer);
            Console.WriteLine(s);
        }

        string strTest = "test my notebook A has K";
        Console.WriteLine(strTest);
        Output(Encoding.GetEncoding("gb18030"), strTest);
        Output(Encoding.Default, strTest);
        Output(Encoding.UTF32, strTest);
        Output(Encoding.UTF8, strTest);
        Output(Encoding.Unicode, strTest);
        Output(Encoding.ASCII, strTest);
        Output(Encoding.UTF7, strTest);

About BOM

BOM stands for byte order mark. It is a short byte sequence at the start of a text that identifies the text's encoding. For example, when Notepad opens a text file that begins with a BOM, it can tell which encoding was used and decode the text correctly, without garbled characters. Without a BOM, Notepad falls back to ANSI by default, which may produce garbled text. You can call an encoding's GetPreamble() method to check whether that encoding has a BOM. Currently, only the following five encodings in the CLR have a BOM.

UTF-8: EF BB BF

UTF-16 big endian: FE FF

UTF-16 little endian: FF FE

UTF-32 big endian: 00 00 FE FF

UTF-32 little endian: FF FE 00 00
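The list above can be checked directly with GetPreamble(). This is a small sketch; PreambleDemo and PreambleHex are hypothetical names introduced for the illustration.

```csharp
using System;
using System.Text;

static class PreambleDemo
{
    // Hypothetical helper: formats an encoding's BOM as hex,
    // or "(none)" when the encoding has no preamble.
    public static string PreambleHex(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble);
    }

    static void Main()
    {
        Encoding[] encodings =
        {
            Encoding.UTF8,
            Encoding.BigEndianUnicode,      // UTF-16 big endian
            Encoding.Unicode,               // UTF-16 little endian
            new UTF32Encoding(true, true),  // UTF-32 big endian
            Encoding.UTF32,                 // UTF-32 little endian
            Encoding.ASCII                  // has no BOM
        };
        foreach (Encoding e in encodings)
        {
            Console.WriteLine("{0,-10} {1}", e.WebName, PreambleHex(e));
        }
    }
}
```

An encoding without a BOM, such as ASCII, returns an empty array from GetPreamble() instead of null, so checking the length is enough.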

The encodings returned by the static properties Unicode, UTF8, and UTF32 of Encoding all emit a BOM by default. If you want to write text without a BOM (an XML file, for example), you must construct the encoding instances yourself:

        Encoding encoding = new UnicodeEncoding(false, false);      // The second parameter must be false
        Encoding encodingUtf8 = new UTF8Encoding(false);
        Encoding encodingUtf32 = new UTF32Encoding(false, false);   // The second parameter must be false
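To see the difference in the bytes actually written, here is a minimal sketch that writes the same text through a StreamWriter twice, once with Encoding.UTF8 and once with new UTF8Encoding(false). A MemoryStream stands in for a file so the example leaves nothing on disk; BomWriteDemo and Write are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class BomWriteDemo
{
    // Hypothetical helper: writes text with the given encoding and
    // returns every byte that reached the underlying stream.
    public static byte[] Write(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }

    static void Main()
    {
        byte[] withBom = Write("<a/>", Encoding.UTF8);
        byte[] withoutBom = Write("<a/>", new UTF8Encoding(false));

        Console.WriteLine(BitConverter.ToString(withBom));
        Console.WriteLine(BitConverter.ToString(withoutBom));
    }
}
```

The first line of output starts with EF-BB-BF and the second does not; the remaining bytes are identical.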

For the relationship between reading and writing text and the BOM, refer to this blog post in the garden, which I will not repeat in detail: .NET (C#): Encoding and BOM

Determine the encoding method of a text

Given a text whose encoding is unknown, how do we choose an Encoding to decode it? If a BOM is present, use it to determine which Unicode encoding applies. Without a BOM it is hard to say, and you have to judge by where the file came from. Encoding.Default is the usual choice, and it returns different values depending on the current settings of your computer. If the file comes from a friend abroad, you had better decode it as UTF-8. The following code cannot guarantee a correct result for a file without a BOM, so keep this in mind if you use it in your project.

        /// <summary>
        /// Returns the Encoding of a text file. Returns Encoding.Default if no Unicode
        /// BOM (byte order mark) is found.
        /// </summary>
        /// <param name="FileName">Path of the file to inspect.</param>
        /// <returns>The encoding whose BOM matches the start of the file, or Encoding.Default.</returns>
        public static Encoding GetFileEncoding(String FileName)
        {
            Encoding Result = null;
            FileInfo FI = new FileInfo(FileName);
            FileStream FS = null;
            try
            {
                FS = FI.OpenRead();
                // Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones:
                // FF FE 00 00 (UTF-32 LE) begins with FF FE (UTF-16 LE), so
                // testing UTF-16 first would misreport UTF-32 LE files.
                Encoding[] UnicodeEncodings =
                {
                    Encoding.UTF32,
                    new UTF32Encoding(true, true),
                    Encoding.UTF8,
                    Encoding.BigEndianUnicode,
                    Encoding.Unicode
                };
                for (int i = 0; Result == null && i < UnicodeEncodings.Length; i++)
                {
                    FS.Position = 0;
                    byte[] Preamble = UnicodeEncodings[i].GetPreamble();
                    bool PreamblesAreEqual = true;
                    // ReadByte returns -1 at end of file, which never equals a BOM byte.
                    for (int j = 0; PreamblesAreEqual && j < Preamble.Length; j++)
                    {
                        PreamblesAreEqual = Preamble[j] == FS.ReadByte();
                    }
                    if (PreamblesAreEqual)
                    {
                        Result = UnicodeEncodings[i];
                    }
                }
            }
            finally
            {
                // An IOException propagates to the caller; the stream is closed either way.
                if (FS != null)
                {
                    FS.Close();
                }
            }
            if (Result == null)
            {
                Result = Encoding.Default;
            }
            return Result;
        }
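As an alternative to comparing preambles by hand, StreamReader can detect a BOM itself when its detectEncodingFromByteOrderMarks argument is true. The sketch below is not the approach used above. It works on an in-memory byte array for brevity, and BomDetectDemo and Detect are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectDemo
{
    // Hypothetical helper: lets StreamReader inspect the leading bytes and
    // returns the encoding it settled on, or the fallback if no BOM is found.
    public static Encoding Detect(byte[] data, Encoding fallback)
    {
        using (var reader = new StreamReader(new MemoryStream(data), fallback, true))
        {
            reader.Peek(); // detection happens on the first read
            return reader.CurrentEncoding;
        }
    }

    static void Main()
    {
        byte[] utf8 = { 0xEF, 0xBB, 0xBF, 0x41 };
        byte[] utf16 = { 0xFF, 0xFE, 0x41, 0x00 };
        byte[] plain = { 0x41 };

        Console.WriteLine(Detect(utf8, Encoding.ASCII).WebName);
        Console.WriteLine(Detect(utf16, Encoding.ASCII).WebName);
        Console.WriteLine(Detect(plain, Encoding.ASCII).WebName);
    }
}
```

As with the hand-written version, this only recognizes a BOM. For data without one, the reader simply keeps the fallback encoding you passed in.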

To be continued...

The next article focuses on Encoder and Decoder.

By the way, other people's articles look so nicely formatted. Why does mine show so little formatting when I preview it while editing? It looks ugly.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
