UTF-8 does not require a BOM, although the Unicode standard permits one in UTF-8.
So UTF-8 without BOM is the standard form; putting a BOM in UTF-8 files is mainly a Microsoft habit (incidentally, calling little-endian UTF-16 with a BOM simply "Unicode", without further detail, is also a Microsoft habit).
The BOM (byte order mark) was designed for UTF-16 and UTF-32, where it marks the byte order. Microsoft uses the BOM in UTF-8 because it lets UTF-8 be told apart from ASCII and similar encodings, but such files can cause problems on operating systems other than Windows.
The only difference between "UTF-8" and "UTF-8 with BOM" is the BOM itself, that is, whether or not the file begins with U+FEFF.
Web pages encoded in UTF-8 should not use a BOM, or things will often go wrong. This is a small example: why this page code
See the Unicode Standard, Version 6.0, section 3.10, D95 (UTF-8 encoding scheme):
While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
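As a quick check of the passage quoted from the standard, here is a minimal Python sketch showing that U+FEFF becomes the three bytes EF BB BF when encoded as UTF-8. The helper name has_utf8_bom is made up for this illustration; codecs.BOM_UTF8 comes from the Python standard library.

```python
import codecs

# U+FEFF encoded as UTF-8 yields the three-byte signature EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")

def has_utf8_bom(data: bytes) -> bool:
    """Hypothetical helper: True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)

print(bom_bytes.hex(" "))                   # ef bb bf
print(has_utf8_bom(bom_bytes + b"hello"))   # True
print(has_utf8_bom(b"hello"))               # False
```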
---------------------------------------------------------
First, what the BOM is. I won't explain that here; Wikipedia covers it in detail: http://en.wikipedia.org/wiki/Byte_order_mark.
Using a BOM on a web page is a mistake. The BOM was not designed to support HTML and XML. To identify the text encoding, HTML has the charset attribute and XML has the encoding attribute, so there is no need to drag the BOM in. In theory a BOM can be used to identify UTF-16 encoded HTML pages, but in practice few people do that. After all, UTF-16 uses two bytes even for ASCII characters, so it is not really used for web pages.
That said, the BOM is not necessarily a bad habit. The BOM is part of the Unicode Standard and has its own proper scope. Typically, a BOM is used to mark plain-text Unicode byte streams, giving a text processor a convenient way to recognize which Unicode encoding (UTF-8, UTF-16BE, UTF-16LE) a .txt file uses when it is read. Windows handles the BOM relatively well because it integrates Unicode detection code into its API, mainly CreateFile(). When you open a text file, it automatically recognizes and strips the BOM. Windows does this for historical reasons: it grew out of a multi-code-page environment (the ANSI environment). When Unicode was introduced, the Windows designers wanted to stay compatible with both Unicode and non-Unicode (multi-byte) text files without users having to notice, and this small trick was the only way. In contrast, systems such as Linux were steeped in multi-locale environments for a shorter time, and the community itself was strong enough to travel light (a gripe: Microsoft's compatibility requirements are truly paranoid; no break in compatibility is ever allowed, so most of the time it ties its own hands), so it simply moved to UTF-8 in one step. Of course there was a transition period: from the release of the fully UTF-8 GTK+ 2.0 until basically all GTK developers had abandoned the multi-locale GTK+ 1.2 took, as I recall, at least 3-4 years.
The BOM is unpopular mainly in UNIX environments, because many UNIX programs simply ignore BOMs. The main issue is the #! on the first line of a UNIX script, which relies on the shell to parse it, and many shells do not detect the BOM for compatibility reasons. When a BOM is added, the shell treats it as ordinary character input, which breaks the #! marker and causes trouble. In fact, the interpreters of many modern scripting languages, such as Python, can handle the BOM themselves, but with the shell stuck at this point there is nothing they can do; they take the blame undeservedly. The shell is not really to blame either, because the BOM itself violates a common principle of UNIX design: data that exists in a document must be visible. The BOM cannot be edited as a visible character in a text editor, which many UNIX developers find unsatisfactory.
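To illustrate the #! problem: a program that looks for the two bytes #! at the very start of a file will not find them when a BOM comes first. The following Python sketch only imitates that check; the function name starts_with_shebang is hypothetical, and this is not how any particular shell or kernel is implemented.

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """Imitate a loader that expects '#!' as the very first two bytes."""
    return data[:2] == b"#!"

script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))           # True
print(starts_with_shebang(script_with_bom))  # False: the file starts with EF BB BF
```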
By the way, even where a scripting language can handle the BOM, using the BOM everywhere is not recommended. Each scripting language has its own Unicode handling, and Python's # -*- coding: utf-8 -*- and Perl's use utf8 are both simpler and more reliable than the BOM. The other good news is that even those who have to switch between Windows and UNIX need not despair. Fortunately, on UNIX we also have vim, that wonderful tool; even when a BOM gets in the way, the three commands set nobomb, set fileencoding=utf8 and w solve the problem.
Finally, looking back, it seems that only Windows insists on using the BOM.
P.S.: This is my 150th answer. I suddenly realize how few I have written??
P.S. 2: It occurs to me that I should explain why the vim BOM removal needs to be done under UNIX. Vim has a strange bug on Windows: it always treats UTF-16 files as binary files, while on UNIX (Linux or Mac) vim has no such problem. This issue has followed me from Vim 6.8 to Vim 7.3. I am not sure whether it is a bug in Vim or in my own .vimrc file. I would appreciate an answer from an expert.
---------------------------------------------------------
The with-BOM format puts the 3 bytes EF BB BF at the beginning of the file, mainly to identify the encoding. The BOM is essentially Windows-specific, and in web page production it causes all kinds of unexpected problems, such as an extra blank line in the output, broken PHP session or cookie functions (the "headers already sent" error), and possibly even garbled pages (those 3 bytes affect how the browser handles the page encoding), so I always recommend encoding without a BOM. I even wrote a PHP script to batch-process this problem.
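The PHP script mentioned above is not shown. As a rough idea of what such a batch job could look like, here is a sketch in Python instead; the function names strip_bom and strip_bom_in_tree and the default *.php pattern are assumptions made for this illustration.

```python
import codecs
from pathlib import Path

def strip_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from one file. Returns True if the file was changed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True

def strip_bom_in_tree(root: str, pattern: str = "*.php") -> list:
    """Strip the BOM from every matching file under root; return the changed paths."""
    return [p for p in sorted(Path(root).rglob(pattern))
            if p.is_file() and strip_bom(p)]
```

Note that the files are rewritten in place, so it is worth trying this on a copy of the directory first.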
---------------------------------------------------------
Di Qiang, doing software development
A few weeks ago, I was still worried about the BOM problem ...
As @Myongnyang said, "UTF-8 without BOM is the standard form". That is true, and files without a BOM are more common, so I personally recommend the non-BOM form in general and switching to the BOM form only if a problem appears. Windows itself saves with a BOM: if you save a UTF-8 .txt file with Notepad, it actually contains a BOM, which is worth keeping in mind. In addition, text editors name the two forms slightly differently. In EditPlus, the form with BOM is called UTF-8+ and the form without is called UTF-8; in Notepad++, the form with BOM is called standard UTF-8 and the form without is called UTF-8 without BOM.
---------------------------------------------------------
Dragon Fly, C++ programmer, likes astronomy, mathematics and psychology.
The following text is from my blog; it explains the BOM difference from another angle.
---------------------------------------------------------
Character encoding is surely every programmer's nightmare. As soon as Chinese is involved, all kinds of encoding problems turn up, and they are hard to deal with, especially on Linux, because much of the software there was developed for English-speaking countries and never considered other languages' encodings. After falling into countless encoding pits, I decided to study the problem carefully, because it is like a hurdle that always stands in front of you: every time you get there you fall, and if every time you just get up and carry on as if nothing happened, you are what people call a warrior, a real warrior. Unfortunately that is only a brute-force warrior; as intelligence-type warriors of the new era, we of course cannot fall in one place and then go on to fall in the next.
How files are stored:
Files come in their own storage formats, such as the common txt, cpp, h, c, xml, PNG and RMVB formats, as well as custom ones. Whatever the format, a file is stored on the computer's disk in binary, and for each file format there is different software to parse it. This article does not discuss how files are stored, only how they are parsed.
Text file parsing:
Text files correspond to human-readable text, so how do we get from binary to text? Since computers were invented in the United States, the first consideration was naturally how to represent English. The English alphabet has 26 letters; with special characters added there are 128 characters in total, which fit in 7 bits, that is, within one byte. This is the ASCII encoding everyone knows. The mapping is simple: one character corresponds to one byte.
But it soon turned out that the characters of non-English-speaking countries far exceed what ASCII can encode. Rather than one unified scheme, different countries came up with their own encodings; China's GB2312 is one such home-grown encoding. With every country having its own encoding, converting back and forth was too much trouble. Then a new encoding, Unicode, set out to unify them all by assigning a code to every character. But this brought two problems:
1. Many files are ASCII-encoded, and using Unicode for them is too wasteful.
2. There is no flag indicating how many bytes should be parsed as one symbol.
Then UTF, which saved the world, appeared. UTF is an implementation of Unicode, only smarter. UTF-16 uses two or four bytes per character, and UTF-32 uses four bytes. UTF-8 uses a very clever representation:
1. For a single-byte symbol, the first bit of the byte is 0, and the remaining 7 bits hold the symbol's code.
2. For an n-byte symbol, the first n bits of the first byte are set to 1 and bit n+1 is 0; each following byte begins with 10, and the remaining bits hold the symbol's code.
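The two rules can be seen directly by printing the bits of a few encoded characters. A small Python sketch (the helper name utf8_bits is chosen here for illustration only):

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))

print(utf8_bits("A"))        # 01000001                    (1 byte, leading 0)
print(utf8_bits("\u00e9"))   # 11000011 10101001           (2 bytes, U+00E9)
print(utf8_bits("\u4e2d"))   # 11100100 10111000 10101101  (3 bytes, U+4E2D)
```

In the multi-byte cases the first byte starts with as many 1 bits as there are bytes in total, and every byte after the first starts with 10.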
Different encodings put different marks at the very start of the text. Unicode (UTF-16) usually starts with two bytes, FE FF or FF FE: FE FF indicates big-endian encoding and FF FE indicates little-endian encoding. UTF-8 starts with EF BB BF. As shown, UTF-8 is self-describing, so a file does not need to carry this mark, and most programs can identify it anyway. However, some programs do not recognize the mark; PHP, for example, treats it as ordinary text instead of ignoring it. I believe many people who have seen garbled or broken PHP output have run into exactly this problem.
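The marks can be checked with a few lines of Python. This is only a sketch: the function name sniff_bom is hypothetical, the BOM constants come from the standard codecs module, and the UTF-32 marks are tested first because the UTF-32 little-endian mark (FF FE 00 00) begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# FE FF marks big-endian UTF-16, FF FE marks little-endian UTF-16.
# Longer marks come first so that FF FE 00 00 is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfhello"))   # ('utf-8', 3)
print(sniff_bom(b"\xfe\xff\x00h"))       # ('utf-16-be', 2)
print(sniff_bom(b"\xff\xfeh\x00"))       # ('utf-16-le', 2)
print(sniff_bom(b"hello"))               # (None, 0)
```

A file without any mark returns (None, 0); in that case the encoding has to come from somewhere else, such as a declaration inside the file or a default.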
Finally, how to remove or add a BOM. If you have vim, that is the best option. To remove the BOM:
set encoding=utf-8
set nobomb
To add the BOM:
set encoding=utf-8
set bomb
---------------------------------------------------------
UTF-8 with BOM is an outright rogue!!!!!!!!!
Windows always thinks it is being clever and does things nobody else can understand!!! UTF-8 does not need a BOM header~~~!!
From when I first started learning to code (I really cannot call what I do programming) until now, I have lost count of how many times the BOM header has tripped me up. Especially for someone completely self-taught like me, do you know how long it takes to find such a bug????
The difference between with and without the BOM header is just this BOM header; for details, see the expert answers above. It is an oddity peculiar to Windows. Please use UTF-8 without a BOM header!!
The bugs it produces include, but are not limited to:
A stray "锘" at the start of the text (thanks to flying for providing this; see that answer)
Blank lines in HTML
Unexplained gaps between divs
Garbled text!
If you use SSL, problems are guaranteed!!!
By the way, I also despise Sony's Memory Stick and the iPhone connector~~
Feel free to collapse this rant.
--------------------------------------------------------
UTF-8 with BOM is very fucked up and often causes puzzling problems.
---------------------------------------------------------
I always use UTF-8 without BOM; with a BOM, text often comes out garbled.
---------------------------------------------------------
Notepad++ automatically saves as UTF-8 with BOM, which is a real trap.
---------------------------------------------------------
I recommend that programmers who can work on a Mac do so, and stay away from Windows and its messed-up operating system as much as possible. Secondly, when reading a third-party file and parsing it as UTF-8, be sure to check whether the file has a BOM, for example when parsing and executing SQL files.
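For the SQL-file case, one way to make parsing BOM-tolerant in Python is the standard utf-8-sig codec, which drops a leading BOM if there is one and otherwise behaves like plain UTF-8. A sketch (the function name read_sql_text is made up for illustration):

```python
def read_sql_text(path: str) -> str:
    """Read a UTF-8 text file, silently dropping a leading BOM if there is one."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

# The same codec works on bytes that are already in memory:
raw = b"\xef\xbb\xbfSELECT 1;"
print(repr(raw.decode("utf-8")))      # '\ufeffSELECT 1;'  (BOM kept as U+FEFF)
print(repr(raw.decode("utf-8-sig")))  # 'SELECT 1;'
```

With plain utf-8 the BOM survives as an invisible U+FEFF in front of the first statement, which is the kind of thing that makes a parser reject otherwise valid input.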
---------------------------------------------------------
Whether to use a BOM in web programming I won't comment on, because there are plenty of reasons for software to fail there.
I have recently been learning cocos2d-x, which is pure C++. If non-ASCII characters such as Chinese appear in the code, compilation fails. The code was written in Xcode on the Mac and compiled with VS on Windows.
In the end, after converting all the source files to the with-BOM format, compilation passed; linking still failed, but that is not an encoding problem.
It is generally held that you should not use Chinese when writing C++ code, but often we programmers also want code that is comfortable for ourselves to read, so why on earth can't we write Chinese?
So I wrote a helloworld.cpp-style file on Windows that outputs Chinese, saved it as UTF-8 with BOM, copied it to the Mac and compiled it with g++. It compiled successfully and ran normally, and the source file also displayed correctly in Xcode.
Therefore, for programs meant to run on Windows, Mac and Linux, it is best to save the source code as UTF-8 with BOM, which is more portable. UTF-16, whether big-endian or little-endian, is not recognized by g++. Alternatively, use UTF-8 without BOM and make sure no non-ASCII characters (above 127) appear in the code.
As for the claim that UTF-8 without BOM is the standard, I think that carries some personal feeling. The real standard is that the BOM is optional. Why optional? Because there are times when going without a BOM causes errors. Take Windows, with its long history: many countries use Windows, and its files were written in the local ANSI code pages, such as GBK and GB2312 in mainland China and Big5 in Hong Kong and Taiwan. Because these encodings were developed for the local characters, files stored in them are small, so they were heavily used and exist in large numbers. Microsoft cannot ignore the files of billions of users worldwide and blindly change how text is decoded. Microsoft is also a member of the Unicode Consortium, so UTF-8 with BOM also conforms to the international standard.
Perhaps for the authors' own reasons, perhaps for efficiency, many programs cannot correctly determine whether a UTF-8 file has a BOM, which leads to all kinds of garbled text.
I do not want to say which one is the standard, nor to attack any company or group. Microsoft is not wrong to stick with the BOM, because it does so for the sake of its users. It may be inconvenient for those of us who write programs, but the largest group of computer users is not programmers.
---------------------------------------------------------
From
http://www.zhihu.com/question/20167122
UTF8 with BOM and without BOM (reprint)