C# Encoding (II)

Encoding usage

Encoding is easy to use. If you only need to convert between bytes and characters, the GetBytes() and GetChars() methods and their overloads will cover basically everything you need.

GetByteCount() and its overloads return the exact number of bytes produced when a string is converted into a byte array.

GetCharCount() and its overloads return the exact number of characters produced when a byte array is converted into a string.
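As a minimal sketch of how these four methods fit together (the class name ByteCharCountDemo and the helper RoundTrip are made up for illustration), the exact counts can be used to size the buffers before encoding and decoding:

```csharp
using System;
using System.Text;

public static class ByteCharCountDemo
{
    // Hypothetical helper: encodes text, decodes it again, and returns the result.
    public static string RoundTrip(Encoding encoding, string text)
    {
        // Ask for the exact byte count, then encode into a buffer of that size.
        byte[] bytes = new byte[encoding.GetByteCount(text)];
        encoding.GetBytes(text, 0, text.Length, bytes, 0);

        // Ask for the exact char count, then decode into a buffer of that size.
        char[] chars = new char[encoding.GetCharCount(bytes)];
        encoding.GetChars(bytes, 0, bytes.Length, chars, 0);
        return new string(chars);
    }

    public static void Main()
    {
        string text = "h\u00E9llo"; // 5 characters, one of them outside ASCII

        Console.WriteLine(Encoding.UTF8.GetByteCount(text));       // 6: U+00E9 takes 2 bytes in UTF-8
        Console.WriteLine(Encoding.Unicode.GetByteCount(text));    // 10: 2 bytes per character in UTF-16
        Console.WriteLine(RoundTrip(Encoding.UTF8, text) == text); // True
    }
}
```

For one-off conversions, encoding.GetBytes(text) and encoding.GetString(bytes) allocate the buffers for you; the count methods matter mostly when you manage the buffers yourself.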

Pay attention to these two methods: int GetMaxByteCount(int charCount); int GetMaxCharCount(int byteCount);

They do not behave the way you might expect. You might assume that a single-byte encoding returns charCount and a double-byte encoding returns charCount * 2, but the actual results are charCount + 1 and (charCount + 1) * 2.

            Console.WriteLine("The max byte count is {0}.", Encoding.Unicode.GetMaxByteCount(10));
            Console.WriteLine("The max byte count is {0}.", Encoding.ASCII.GetMaxByteCount(10));

The results above are 22 and 11, respectively, instead of 20 and 10. I found the reason in an English blog post (my English is not good, and I do not really understand what a high surrogate and a low surrogate are): http://blogs.msdn.com/b/shawnste/archive/2005/03/02/383903.aspx

For example, Encoding.GetEncoding(1252).GetMaxByteCount(1) returns 2. 1252 is a single byte code page (encoding), so generally one would expect that GetMaxByteCount(n) would return n, but it doesn't, it usually returns n + 1.

One reason for this oddity is that an Encoder could store a high surrogate on one call to GetBytes(), hoping that the next call is a low surrogate. This allows the fallback mechanism to provide a fallback for a complete surrogate pair, even if that pair is split between calls to GetBytes(). If the fallback returns a ? for each surrogate half, or if the next call doesn't have a surrogate, then 2 characters could be output for that surrogate pair. So in this case, calling Encoder.GetBytes() with a high surrogate would return 0 bytes and then following that with another call with only the low surrogate would return 2 bytes.
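To make the difference between the worst-case estimate and the exact size concrete, here is a small sketch (the class MaxByteCountDemo and the helper FitsWithinMax are hypothetical names) that compares GetMaxByteCount() with GetByteCount() for the same input:

```csharp
using System;
using System.Text;

public static class MaxByteCountDemo
{
    // Hypothetical helper: true when the exact size does not exceed the worst-case estimate.
    public static bool FitsWithinMax(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text) <= encoding.GetMaxByteCount(text.Length);
    }

    public static void Main()
    {
        string text = "0123456789"; // 10 ASCII characters

        // Worst-case estimates reserve room for one extra character.
        Console.WriteLine(Encoding.Unicode.GetMaxByteCount(10)); // 22
        Console.WriteLine(Encoding.ASCII.GetMaxByteCount(10));   // 11

        // Exact sizes for this particular string.
        Console.WriteLine(Encoding.Unicode.GetByteCount(text));  // 20
        Console.WriteLine(Encoding.ASCII.GetByteCount(text));    // 10

        Console.WriteLine(FitsWithinMax(Encoding.UTF8, text));   // True
    }
}
```

In practice, GetMaxByteCount() is the cheap choice for sizing a reusable buffer, while GetByteCount() is the one to use when the exact length matters.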

The following code is a simple application of Encoding. You can print the results and compare them with the previous article.

        static void Output(Encoding encoding, string t)
        {
            Console.WriteLine(encoding.ToString());
            // Encode the string and print each byte value.
            byte[] buffer = encoding.GetBytes(t);
            foreach (byte b in buffer)
            {
                Console.Write(b + "-");
            }
            // Decode the bytes again to see whether the text survives the round trip.
            string s = encoding.GetString(buffer);
            Console.WriteLine(s);
        }

        string strTest = "test my notebook A has K";
        Console.WriteLine(strTest);
        Output(Encoding.GetEncoding("gb18030"), strTest);
        Output(Encoding.Default, strTest);
        Output(Encoding.UTF32, strTest);
        Output(Encoding.UTF8, strTest);
        Output(Encoding.Unicode, strTest);
        Output(Encoding.ASCII, strTest);
        Output(Encoding.UTF7, strTest);

About BOM

BOM stands for byte order mark. It is a short byte sequence at the start of a text that identifies the text's encoding. For example, when you open a text file with Notepad and the file contains a BOM, Notepad can tell which encoding was used and decode the text correctly, without garbled characters. Without a BOM, Notepad opens the file as ANSI by default, which may produce garbled characters. You can call the GetPreamble() method of an Encoding to find out whether that encoding has a BOM. Currently, only the following five encodings in the CLR have a BOM.

UTF-8: EF BB BF

UTF-16 big endian: FE FF

UTF-16 little endian: FF FE

UTF-32 big endian: 00 00 FE FF

UTF-32 little endian: FF FE 00 00
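The byte sequences listed above can be checked directly with GetPreamble(). The following is a minimal sketch; the class PreambleDemo and the helper PreambleHex are names invented for this example:

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Hypothetical helper: formats an encoding's BOM as hex, or "(none)" if it has no BOM.
    public static string PreambleHex(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0
            ? "(none)"
            : BitConverter.ToString(preamble).Replace("-", " ");
    }

    public static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));                 // EF BB BF
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode));     // FE FF
        Console.WriteLine(PreambleHex(Encoding.Unicode));              // FF FE
        Console.WriteLine(PreambleHex(new UTF32Encoding(true, true))); // 00 00 FE FF
        Console.WriteLine(PreambleHex(Encoding.UTF32));                // FF FE 00 00
        Console.WriteLine(PreambleHex(Encoding.ASCII));                // (none)
    }
}
```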

The encodings returned by the static properties Unicode, UTF8, and UTF32 of Encoding all carry a BOM by default. If you want to write text without a BOM (an XML file, for example, where a BOM may cause problems for the consumer), you must create the instances yourself:

        Encoding encoding = new UnicodeEncoding(false, false); // The second parameter must be false
        Encoding encodingUTF8 = new UTF8Encoding(false);
        Encoding encodingUTF32 = new UTF32Encoding(false, false); // The second parameter must be false
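To see the effect of the BOM flag on actual output, here is a short sketch that writes the same text through a StreamWriter with each kind of encoding and inspects the bytes. The class BomWriteDemo and the helper WriteText are hypothetical names, and a MemoryStream stands in for a real file:

```csharp
using System;
using System.IO;
using System.Text;

public static class BomWriteDemo
{
    // Hypothetical helper: writes text through a StreamWriter and returns the raw bytes.
    public static byte[] WriteText(Encoding encoding, string text)
    {
        MemoryStream stream = new MemoryStream();
        using (StreamWriter writer = new StreamWriter(stream, encoding))
        {
            writer.Write(text);
        }
        // Disposing the writer closes the stream, but ToArray() still works afterwards.
        return stream.ToArray();
    }

    public static void Main()
    {
        // The static property carries a BOM, so three extra bytes appear first.
        Console.WriteLine(BitConverter.ToString(WriteText(Encoding.UTF8, "hi")));           // EF-BB-BF-68-69

        // An instance created with encoderShouldEmitUTF8Identifier = false writes no BOM.
        Console.WriteLine(BitConverter.ToString(WriteText(new UTF8Encoding(false), "hi"))); // 68-69
    }
}
```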

For the relationship between reading and writing text and the BOM, refer to the blog post ".NET (C#): Encoding and BOM"; I will not repeat the details here.

Determining the encoding of a text

Given a text whose encoding is unknown, how do we choose an Encoding to decode it? The answer is to use the BOM to determine which Unicode encoding it is. Without a BOM it is hard to say, and you have to guess from where the file came from. The usual choice is Encoding.Default, which on .NET Framework returns different values depending on the current settings of your computer (on .NET Core and later it is always UTF-8). If the file comes from a friend abroad, you had better decode it as UTF-8. The following code cannot guarantee a correct result for a file without a BOM, so keep this in mind if you use it in your project.

        // Requires: using System.IO; using System.Text;

        /// <summary>
        /// Return the Encoding of a text file. Return Encoding.Default if no Unicode
        /// BOM (byte order mark) is found.
        /// </summary>
        /// <param name="FileName">Path of the file to inspect.</param>
        /// <returns>The encoding whose BOM matches, or Encoding.Default.</returns>
        public static Encoding GetFileEncoding(String FileName)
        {
            Encoding Result = null;
            FileInfo FI = new FileInfo(FileName);
            FileStream FS = null;
            try
            {
                FS = FI.OpenRead();
                // UTF-32 is tested before UTF-16 because the UTF-32 little endian BOM
                // (FF FE 00 00) begins with the UTF-16 little endian BOM (FF FE).
                Encoding[] UnicodeEncodings =
                {
                    Encoding.UTF32,
                    new UTF32Encoding(true, true),
                    Encoding.UTF8,
                    Encoding.BigEndianUnicode,
                    Encoding.Unicode
                };
                for (int i = 0; Result == null && i < UnicodeEncodings.Length; i++)
                {
                    FS.Position = 0;
                    byte[] Preamble = UnicodeEncodings[i].GetPreamble();
                    bool PreamblesAreEqual = true;
                    for (int j = 0; PreamblesAreEqual && j < Preamble.Length; j++)
                    {
                        // ReadByte returns -1 at end of file, which never equals a BOM byte.
                        PreamblesAreEqual = Preamble[j] == FS.ReadByte();
                    }
                    // Alternatively, read Preamble.Length bytes into a buffer and compare
                    // with Enumerable.SequenceEqual (Array.Equals only compares references).
                    if (PreamblesAreEqual)
                    {
                        Result = UnicodeEncodings[i];
                    }
                }
            }
            finally
            {
                // No catch block: an IOException propagates to the caller
                // with its stack trace intact.
                if (FS != null)
                {
                    FS.Close();
                }
            }
            if (Result == null)
            {
                Result = Encoding.Default;
            }
            return Result;
        }
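As an alternative to comparing preambles by hand, StreamReader can perform BOM detection itself when its detectEncodingFromByteOrderMarks argument is true. The sketch below is only an illustration of that option, not a replacement for the method above; the class BomDetectDemo, the helper DetectEncoding, and the choice of fallback are assumptions made for this example:

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectDemo
{
    // Hypothetical helper: returns the encoding StreamReader detects from the BOM,
    // or the fallback when no BOM is present. Disposing the reader closes the stream.
    public static Encoding DetectEncoding(Stream stream, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(stream, fallback, true))
        {
            // Detection only happens on the first read, so trigger one.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }

    public static void Main()
    {
        byte[] utf8WithBom = { 0xEF, 0xBB, 0xBF, 0x68, 0x69 };
        byte[] utf16LeWithBom = { 0xFF, 0xFE, 0x68, 0x00 };
        byte[] noBom = { 0x68, 0x69 };

        Console.WriteLine(DetectEncoding(new MemoryStream(utf8WithBom), Encoding.ASCII).CodePage);    // 65001
        Console.WriteLine(DetectEncoding(new MemoryStream(utf16LeWithBom), Encoding.ASCII).CodePage); // 1200
        Console.WriteLine(DetectEncoding(new MemoryStream(noBom), Encoding.ASCII).CodePage);          // 20127 (the fallback)
    }
}
```

As with the hand-written version, a file without a BOM simply gets the fallback encoding, so the result is still a guess in that case.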

To be continued...

The next part focuses on Encoder and Decoder.

By the way, while editing this post I looked at other people's nicely formatted articles. Why does my post lose so much of its formatting in the preview? It looks ugly.
