Writing UTF-8 files on Microsoft platforms without a BOM header, for Unix compatibility

Source: Internet
Author: User

I ran into a problem: HTML files generated by a .NET back end showed a line of garbled text and broken styling when served on Linux. The cause is that .NET, running on the Windows platform, automatically adds a BOM header to the UTF-8 files it generates.

The key code for omitting the BOM:

System.Text.UTF8Encoding utf8 = new System.Text.UTF8Encoding(false);
StreamWriter sw = new StreamWriter(nFile, utf8);

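Comparing two generated files in a hex viewer, one written with the BOM removed and one without removal, shows the difference: the file that keeps the BOM begins with the bytes EF BB BF, which are the BOM header.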
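The same comparison can be reproduced in memory, without a hex viewer. The following is a minimal sketch, not the author's code: the class name `BomDemo` and the helpers `WriteToBytes` and `ToHex` are hypothetical names introduced for illustration. It writes the same text through a `StreamWriter` with each `UTF8Encoding` setting and returns the resulting bytes.

```csharp
using System;
using System.IO;
using System.Text;

// Illustrative helper (hypothetical names): shows which bytes a StreamWriter
// emits for a given encoding.
public static class BomDemo
{
    // Writes `text` through a StreamWriter using `encoding` and returns
    // the raw bytes that ended up in the underlying stream.
    public static byte[] WriteToBytes(string text, Encoding encoding)
    {
        using (var ms = new MemoryStream())
        {
            using (var sw = new StreamWriter(ms, encoding))
            {
                sw.Write(text);
            }
            // MemoryStream.ToArray still works after the writer closed the stream.
            return ms.ToArray();
        }
    }

    // Formats bytes as space-separated upper-case hex, e.g. "EF BB BF 68 69".
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }
}
```

For the text "hi", `BomDemo.ToHex(BomDemo.WriteToBytes("hi", new UTF8Encoding(true)))` gives "EF BB BF 68 69", while passing `new UTF8Encoding(false)` gives "68 69".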

 

// Requires: using System; using System.Configuration; using System.IO; using System.Text;
private bool FileStreamWriteFile(Model.RecommendHtml model)
{
    try
    {
        string writeUrl = ConfigurationManager.AppSettings["unix21"];
        string htmlurl = writeUrl + @"\html\" + model.ID + ".html";

        // Open (or create) the file and truncate any existing content.
        using (FileStream nFile = new FileStream(htmlurl, FileMode.OpenOrCreate, FileAccess.ReadWrite))
        {
            nFile.Seek(0, SeekOrigin.Begin);
            nFile.SetLength(0);

            // Passing false tells UTF8Encoding not to emit the BOM (EF BB BF).
            UTF8Encoding utf8 = new UTF8Encoding(false);
            using (StreamWriter sw = new StreamWriter(nFile, utf8))
            {
                sw.Write(model.RecommendContent);
            }
        }
        return true;
    }
    catch (Exception)
    {
        // Any failure is reported to the caller as false.
        return false;
    }
}

References for UTF-8 and BOM headers:

UTF-8 does not require a BOM, although the Unicode standard allows one to be used in UTF-8.
UTF-8 without a BOM is therefore the standard form, and placing a BOM in a UTF-8 file is a Microsoft habit. (Incidentally, calling little-endian UTF-16 with a BOM simply "Unicode", without further qualification, is also a Microsoft habit.)
The BOM (byte order mark) was designed for UTF-16 and UTF-32, where it marks the byte order. Microsoft uses a BOM in UTF-8 because it clearly distinguishes UTF-8 from ASCII and legacy code-page text, but such files cause problems on operating systems other than Windows.

The BOM is not actually a bad habit. It is part of the Unicode standard and has its own legitimate uses. A BOM is usually used to mark a plain Unicode character stream, so that a text-processing program reading a .txt file can easily tell which Unicode encoding it uses (UTF-8, UTF-16BE, UTF-16LE). Windows handles the BOM well because Unicode detection is built into its file-reading layers (the original text names CreateFile(), although BOM handling actually happens in the text-reading layers above that handle-level API): when a text file is opened as text, the BOM is recognized and removed automatically. The reason is historical. Windows originated in a multi-code-page environment (the ANSI environment). When Unicode was introduced, the Windows designers wanted Unicode and non-Unicode (multi-byte) text files to coexist without users having to care, so this small trick was the only option.

By contrast, Unix-like systems such as Linux spent relatively little time in multi-locale environments. The community also had enough momentum to move forward without much baggage (an aside: Microsoft's insistence on compatibility is almost obsessive; anything that breaks compatibility is disallowed, so it often ends up tying its own hands), and so it simply moved to UTF-8 in one step. There was of course a transitional period. For example, from the release of GTK+ 2.0, the first fully UTF-8 version, to the point where essentially no GTK developers were still using the multi-locale GTK+ 1.2, at least three to four years passed.

The BOM is unpopular in UNIX environments because many UNIX programs pay no attention to it. The main problem is the #! line at the start of every UNIX script. That line is parsed before the script's own interpreter runs (by the program loader or by the shell), and for compatibility reasons this parsing does not check for a BOM. When a BOM is added, it is treated as ordinary input characters and corrupts the #! marker, which causes trouble. Many modern scripting languages, such as Python, can handle a BOM in their own interpreters, but they never get the chance, because the failure happens at the #! line first; the shell ends up taking the blame undeservedly. The shell is not really at fault, because the BOM itself violates a common UNIX design principle: the data in a document must be visible. A BOM cannot be edited as a visible character in a text editor, which many UNIX developers find unsatisfactory.
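When a script or HTML file that already carries a BOM has to be used on a UNIX system, one option is to strip the three leading bytes after the fact. The following is a minimal sketch under that assumption; the class `BomStripper` and its method names are hypothetical and are not part of the original article.

```csharp
using System;
using System.IO;

// Illustrative helper (hypothetical names): removes a leading UTF-8 BOM.
public static class BomStripper
{
    // Returns true when the buffer starts with the UTF-8 BOM bytes EF BB BF.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data != null
            && data.Length >= 3
            && data[0] == 0xEF
            && data[1] == 0xBB
            && data[2] == 0xBF;
    }

    // Returns a copy of `data` without the leading BOM, or the same array
    // unchanged when no BOM is present.
    public static byte[] StripUtf8Bom(byte[] data)
    {
        if (!HasUtf8Bom(data))
        {
            return data;
        }
        var result = new byte[data.Length - 3];
        Array.Copy(data, 3, result, 0, result.Length);
        return result;
    }

    // Rewrites the file in place, but only when it actually starts with a BOM.
    public static void StripUtf8BomFromFile(string path)
    {
        byte[] data = File.ReadAllBytes(path);
        if (HasUtf8Bom(data))
        {
            File.WriteAllBytes(path, StripUtf8Bom(data));
        }
    }
}
```

This reads the whole file into memory, which is fine for small scripts and HTML pages; a streaming copy would be a better fit for very large files.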

 

Reference: http://www.cnblogs.com/findumars/p/3620078.html

===============================================================

Q: What is a BOM?

A: UTF-8 files come in two forms: with a BOM and without one.

What is a BOM? The three bytes "EF BB BF" are called the BOM, which stands for "Byte Order Mark". In UTF-8 files the BOM is often used to indicate that the file is UTF-8, but its real purpose is in UTF-16, where it indicates the order of the high and low bytes.

In UTF-16, a BOM placed at the start of the byte stream indicates the byte order (for example, FF FE means the low byte comes first). UTF-8 does not need to consider byte order, so a UTF-8 file may or may not have a BOM.

 

To write a file without the BOM signature, use the following approach:

System.Text.UTF8Encoding utf8 = new System.Text.UTF8Encoding(false);
StreamWriter stream = new StreamWriter(Server.MapPath("normren.html"), false, utf8);
stream.Write("Content");
stream.Close();

// In the past, some people rewrote the UTF-8 encoding class so that it would not emit the signature. That is unnecessary, because the framework already provides this option.
StreamWriter dout = new StreamWriter("1.html", false, new UTF8Encoding(false));
dout.Write("sdsdsd");
dout.Close();

Reference: http://blog.163.com/yanfeng_0/blog/static/6200414520096303911545/

 

===============================================================

The BOM (Byte Order Mark) is the standard mark that the UTF encoding schemes use to identify the encoding. In UTF-16 it is FF FE (little-endian) or FE FF (big-endian), and in UTF-8 it becomes EF BB BF. In UTF-8 the mark is optional because UTF-8 has no byte order, but it can still be used to detect whether a byte stream is UTF-8 encoded. Microsoft software performs this kind of detection, but some other software does not and treats the mark as ordinary characters.
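These byte sequences can be checked from .NET itself, since every `Encoding` exposes its BOM through `GetPreamble()`. The sketch below is illustrative only; the class `PreambleDemo` and the method `PreambleHex` are hypothetical names.

```csharp
using System;
using System.Text;

// Illustrative helper (hypothetical names): prints the BOM an encoding would emit.
public static class PreambleDemo
{
    // Returns the encoding's preamble (BOM) as space-separated hex,
    // or "(none)" when the encoding emits no BOM.
    public static string PreambleHex(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0
            ? "(none)"
            : BitConverter.ToString(preamble).Replace("-", " ");
    }
}
```

`PreambleHex(new UTF8Encoding(true))` returns "EF BB BF" and `PreambleHex(new UTF8Encoding(false))` returns "(none)". For UTF-16, `Encoding.Unicode` (little-endian) gives "FF FE" and `Encoding.BigEndianUnicode` gives "FE FF".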

Microsoft prepends the three bytes EF BB BF to its own UTF-8 text files. Notepad and other Windows programs use these three bytes to decide whether a text file is ASCII (ANSI) or UTF-8. This mark is only a Microsoft convention, however; other platforms do not add it to UTF-8 text files.

In other words, a UTF-8 file may or may not have a BOM. How can you tell the difference? There are three ways:

1. Open the file in UltraEdit-32, switch to hex editing mode, and check whether the file begins with EF BB BF.
2. Open it in Dreamweaver and look at the page properties to see whether "Include Unicode Signature (BOM)" is checked.
3. Open it in Windows Notepad, choose "Save As", and see whether the default encoding offered is UTF-8 or ANSI. If it is ANSI, the file has no BOM.
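The hex-editor check can also be done programmatically by reading the first few bytes of the file. The following is a minimal sketch, assuming only the UTF-8 and UTF-16 marks discussed in this article need to be recognized; the class `BomDetector` and its method names are hypothetical.

```csharp
using System;
using System.IO;

// Illustrative helper (hypothetical names): reports which BOM a stream starts with.
public static class BomDetector
{
    // Reads up to three bytes from the current position and returns
    // "UTF-8", "UTF-16 LE", "UTF-16 BE", or "none".
    // UTF-32 marks are deliberately not distinguished in this sketch.
    public static string DetectBom(Stream stream)
    {
        var head = new byte[3];
        int read = 0;
        // Stream.Read may return fewer bytes than requested, so loop.
        while (read < head.Length)
        {
            int n = stream.Read(head, read, head.Length - read);
            if (n == 0)
            {
                break; // end of stream
            }
            read += n;
        }

        if (read >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        {
            return "UTF-8";
        }
        if (read >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        {
            return "UTF-16 LE";
        }
        if (read >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        {
            return "UTF-16 BE";
        }
        return "none";
    }

    // Convenience wrapper that opens a file read-only and inspects its start.
    public static string DetectBomInFile(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return DetectBom(fs);
        }
    }
}
```

A result of "none" only means that no mark is present; it does not prove the file is ANSI, since UTF-8 without a BOM looks the same at this level.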

Reference: http://blog.163.com/result_2205/blog/static/13981945020102954023564/

Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
