I. Introduction
UTF-8 is a Unicode character encoding that is widely used in web applications. Its advantage is that it is a variable-length encoding in which each ASCII character takes only 1 byte. As a result, pages that consist largely of ASCII characters need much less network bandwidth than they would in a fixed-width encoding.
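As a small illustration of the variable-length property, the Python sketch below encodes a few sample characters and reports how many bytes each one takes in UTF-8. The sample characters and the helper name utf8_length are chosen here for illustration only; they do not come from the original article.

```python
# Sample characters chosen for illustration (written as escapes so the
# source file itself stays pure ASCII).
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes the given character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in SAMPLES:
    # Print the code point and its UTF-8 length, not the character itself,
    # so the output does not depend on the terminal's encoding.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s)")
```

ASCII characters such as "A" take one byte, while the other samples take two, three and four bytes respectively.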
The UTF-8 signature, also known as the BOM (Byte Order Mark), is the standard marker that the UTF encoding schemes use to identify an encoding. It is FF FE in little-endian UTF-16 (FE FF in big-endian) and becomes EF BB BF in UTF-8. The marker is optional in UTF-8 because UTF-8 has no byte-order ambiguity; its only use there is to help detect whether a byte stream is UTF-8 encoded. Microsoft software performs this detection, but some other software does not and treats the BOM as ordinary characters. Microsoft writes the three bytes EF BB BF at the start of its own UTF-8 text files, and Notepad and other Windows programs rely on them to decide whether a text file is ASCII or UTF-8. This is largely a Microsoft convention, though; other platforms generally do not mark UTF-8 text files this way. In other words, a UTF-8 file may or may not have a BOM.
A byte stream should contain at most one BOM, at its very beginning. If several files that each carry a signature are combined, the resulting binary stream contains multiple UTF-8 signatures, which is what causes the "root element must be well-formed" failure in XML conversion.
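The following Python sketch shows how multiple signatures can end up in one stream, assuming that the files are combined by simple byte concatenation (the article does not say how they are combined). The helper name concat_keep_first_bom and the sample XML fragments are hypothetical.

```python
import codecs
from typing import Iterable


def concat_keep_first_bom(parts: Iterable[bytes]) -> bytes:
    """Join byte strings, dropping the BOM from every part except the first."""
    out = bytearray()
    for index, part in enumerate(parts):
        if index > 0 and part.startswith(codecs.BOM_UTF8):
            part = part[len(codecs.BOM_UTF8):]
        out += part
    return bytes(out)


# Two hypothetical fragments, each saved with its own UTF-8 signature.
fragments = [
    codecs.BOM_UTF8 + b"<root>",
    codecs.BOM_UTF8 + b"</root>",
]

naive = b"".join(fragments)
print(naive.count(codecs.BOM_UTF8))                             # 2 signatures
print(concat_keep_first_bom(fragments).count(codecs.BOM_UTF8))  # 1 signature
```

With naive concatenation the second signature lands in the middle of the stream, where a parser no longer treats it as a signature.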
II. View and convert
Since a UTF-8 file may or may not have a BOM, how can we tell which is the case?
Any editor with a hexadecimal mode will do. For example, open the file in UltraEdit-32, switch to hex editing mode, and check whether the file begins with EF BB BF. If it does, the file was saved with a BOM.
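If no hex editor is at hand, the same check can be scripted. The Python sketch below reads the first bytes of a file and compares them with EF BB BF; the function names head_hex and has_utf8_bom are made up for this example.

```python
def head_hex(path: str, count: int = 3) -> str:
    """Return the first `count` bytes of the file as spaced hex, like a hex editor."""
    with open(path, "rb") as f:
        # bytes.hex() with a separator argument needs Python 3.8 or later.
        return f.read(count).hex(" ")


def has_utf8_bom(path: str) -> bool:
    """Return True if the file starts with the UTF-8 signature EF BB BF."""
    return head_hex(path, 3) == "ef bb bf"
```

Calling has_utf8_bom on a file saved with a signature returns True, and head_hex shows the raw bytes when the result needs to be checked by eye.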
When Windows's built-in Notepad saves a file as UTF-8, it adds a BOM by default.
There are many ways to convert, and common editors such as UltraEdit-32 or Notepad++ can do it. Taking UltraEdit-32 as an example: open the file, choose "Save As", and pick the desired encoding from the options listed in the "Format" column.
Dreamweaver CS3 has similar options. In Preferences, if Unicode (UTF-8) is selected as the default encoding, you can tick the "Include Unicode Signature (BOM)" option to include a byte order mark (BOM) in the document. Otherwise, no BOM is included.
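The same conversion can also be done without an editor. The Python sketch below relies on the standard "utf-8-sig" codec, which drops a leading BOM when decoding and writes one when encoding; the function name convert_utf8 is a placeholder chosen for this example, and the input is assumed to be valid UTF-8.

```python
def convert_utf8(data: bytes, with_bom: bool) -> bytes:
    """Re-encode UTF-8 bytes so that they start with a BOM, or with none."""
    # "utf-8-sig" skips one leading BOM if present and decodes plain
    # UTF-8 unchanged, so the input may be in either form.
    text = data.decode("utf-8-sig")
    return text.encode("utf-8-sig" if with_bom else "utf-8")


signed = convert_utf8(b"hello", with_bom=True)
print(signed.hex(" "))                       # starts with ef bb bf
print(convert_utf8(signed, with_bom=False))  # b'hello'
```

To convert a file, read it in binary mode, pass the bytes through convert_utf8, and write the result back in binary mode.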
III. Other knowledge
From the article at http://blog.csdn.net/thimin/archive/2007/08/03/1724393.aspx:
A so-called "Unicode" file is actually UTF-16, whose code units are identical to the Unicode code points for most characters. Conceptually, however, Unicode and UTF are two different things: Unicode defines the numeric code of each character, while the UTF encodings define how those codes are stored and transmitted. UTF-16 comes in two variants: low byte first (LE) and high byte first (BE). The official UTF encodings also include UTF-32, which likewise can be LE or BE. There is also UTF-7, which is not an official Unicode encoding and is used mainly for mail transmission. The single-byte part of UTF-8 is compatible with ASCII, which is also the lower half of ISO-8859-1. UTF-8 exists largely to accommodate old systems and library functions that cannot handle UTF-16 correctly, and for English text it also saves storage space (at the cost of using more space for some non-English characters). ASCII characters take one byte in both UTF-8 and ISO-8859-1; for all other characters, UTF-8 uses two to four bytes.
Here is a more detailed description of the BOM:
The UCS defines a character named "ZERO WIDTH NO-BREAK SPACE" whose code is FEFF. FFFE, on the other hand, is not a character in the UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the receiver then sees FE FF, the byte stream is big-endian; if it sees FF FE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
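The receiver's side of this rule can be sketched in a few lines of Python. The function name detect_utf16_byte_order is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM (FF FE 00 00) also begins with FF FE.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from its leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return "unknown"  # no BOM: the byte order must be known some other way


# Put U+FEFF in front of the text, as the recommendation describes.
message = "\ufeffA"
print(detect_utf16_byte_order(message.encode("utf-16-be")))
print(detect_utf16_byte_order(message.encode("utf-16-le")))
```

The "utf-16-be" and "utf-16-le" codecs do not add a BOM on their own, which is why the example writes U+FEFF into the text explicitly.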
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that sees a byte stream starting with EF BB BF knows that it is UTF-8 encoded.
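The byte values quoted here can be verified directly. The Python sketch below encodes the single character U+FEFF in three encodings; encode_bom is a helper name introduced only for this example.

```python
def encode_bom(encoding: str) -> bytes:
    """Encode the character U+FEFF in the given encoding."""
    return "\ufeff".encode(encoding)


for name in ("utf-8", "utf-16-be", "utf-16-le"):
    # bytes.hex() with a separator argument needs Python 3.8 or later.
    print(f"{name}: {encode_bom(name).hex(' ')}")
```

The output shows ef bb bf for UTF-8, fe ff for big-endian UTF-16 and ff fe for little-endian UTF-16.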
Windows uses the BOM to mark the encoding of text files.
PHP does not handle the BOM.
PHP was not designed with the BOM in mind: it does not skip the three BOM bytes at the start of a UTF-8 encoded file. Because only what follows <? or <?php is executed as PHP code, these three bytes are output directly. If a plug-in file has this problem, the admin page shows a white screen whether or not the plug-in is activated. If it is a template file, the three bytes are output directly and produce a small blank line at the top of the page.
English-language plug-ins and templates from abroad are generally saved as ASCII without a BOM, so it is mostly Chinese plug-ins and templates that cause this problem, often without the author knowing. The same applies when you modify a template: the output page uses UTF-8, so if you add Chinese characters you must save the file as UTF-8 for them to display correctly, and if the editor automatically adds a BOM at that point, the three bytes are output on the page. The visible effect depends on the browser, and is generally a blank line or garbled characters.
※ Note: when PHP includes a template file, browsing problems are very often caused by these three bytes.
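One way to track down affected files is to scan the plug-in or template directory for files that begin with EF BB BF. The Python sketch below is one possible approach and not part of PHP itself; the function names find_bom_files and strip_bom_in_place and the default ".php" suffix are assumptions made for this example.

```python
import codecs
import os


def find_bom_files(root: str, suffix: str = ".php") -> list:
    """Return the paths under `root` ending in `suffix` that start with a UTF-8 BOM."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                if f.read(3) == codecs.BOM_UTF8:
                    hits.append(path)
    return sorted(hits)


def strip_bom_in_place(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Running find_bom_files on the directory first and reviewing the list before calling strip_bom_in_place keeps the change deliberate, since the second function overwrites files in place.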