First, Introduction
UTF-8 is an encoding of Unicode characters that is widely used in Web applications. Its advantage is that it is a variable-length encoding in which every ASCII character takes only 1 byte, so a page made up mostly of ASCII characters uses far less network bandwidth when it is transmitted.
The UTF-8 signature, also called the BOM (Byte Order Mark), is a standard mark used in the UTF encoding schemes to identify the encoding. In UTF-16 it appears as FF FE (little-endian) or FE FF (big-endian); in UTF-8 it becomes EF BB BF. The mark is optional in UTF-8 because UTF-8 has no byte-order issue, but it can still be used to detect whether a byte stream is UTF-8 encoded. Microsoft software performs this kind of detection, while some other software does not and treats the mark as ordinary characters. Microsoft prepends the three bytes EF BB BF to the UTF-8 text files it writes, and Notepad and other Windows programs use these three bytes to decide whether a text file is ASCII or UTF-8. This is largely a Microsoft convention, however; UTF-8 text files on other platforms usually carry no such mark. In short, a UTF-8 file may or may not have a BOM.
A single BOM at the start of a file causes no problem. If several signed files are combined, however, the resulting byte stream contains multiple UTF-8 signatures, and this is what causes XML conversion to fail with the "root element must be well-formed" error.
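One way to avoid the multiple-signature problem is to drop the signature from each piece before joining them. The following is only a minimal sketch under that assumption; the function name `concat_utf8` and the sample XML fragments are made up for illustration and do not come from the original article.

```python
import codecs


def concat_utf8(chunks):
    """Join UTF-8 byte strings, dropping the signature from every chunk.

    The result carries no BOM at all; prepend codecs.BOM_UTF8 yourself
    if the consumer expects a single leading signature.
    """
    bom = codecs.BOM_UTF8  # the three bytes EF BB BF
    parts = []
    for chunk in chunks:
        if chunk.startswith(bom):
            chunk = chunk[len(bom):]
        parts.append(chunk)
    return b"".join(parts)


# Two hypothetical fragments, each saved with its own signature.
header = codecs.BOM_UTF8 + b"<?xml version='1.0'?>\n"
body = codecs.BOM_UTF8 + b"<root/>\n"
merged = concat_utf8([header, body])
print(merged)
```

Stripping every signature, rather than keeping the first one, keeps the helper simple and leaves the decision about a leading BOM to the caller.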
Second, Viewing and Converting
Since a UTF-8 file may or may not have a BOM, how can you tell which kind you have?
Use any software with a hexadecimal editing mode. For example, open the file in UltraEdit-32, switch to hex edit mode, and check whether the file begins with EF BB BF. If it does, the file has a BOM.
The Notepad program that comes with Windows adds a BOM by default when it saves a file as UTF-8.
There are many ways to convert, and common editors such as UltraEdit-32 or Notepad++ will do. Taking UltraEdit-32 as an example: after opening the file, choose "Save As" and pick a UTF-8 format with or without a BOM in the "Format" field.
DreamWeaver CS3 has similar options. In Preferences, if you choose Unicode (UTF-8) as the default encoding, you can select the "Include Unicode Signature (BOM)" option to include a byte order mark in your documents. Otherwise, documents are saved without a BOM.
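The same check can be done without a hex editor by reading the first three bytes of the file. This is a small sketch; `has_utf8_bom` and the temporary file names are illustrative names, not part of any tool mentioned above.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file at `path` starts with EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Demo on two throwaway files, one with a signature and one without.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.txt")
    without_bom = os.path.join(tmp, "without_bom.txt")
    with open(with_bom, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"hello")
    with open(without_bom, "wb") as f:
        f.write(b"hello")
    results = (has_utf8_bom(with_bom), has_utf8_bom(without_bom))

print(results)
```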
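The conversion can also be scripted. The sketch below relies on Python's standard `utf-8-sig` codec, which skips a leading signature when reading and writes one when encoding; the helper name `convert_utf8` and the file names are assumptions made for this example.

```python
import os
import tempfile


def convert_utf8(src, dst, with_bom):
    """Copy a UTF-8 text file, adding or removing the signature."""
    # "utf-8-sig" skips a leading BOM on read if one is present.
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding="utf-8-sig", newline="") as f:
        text = f.read()
    # On write, "utf-8-sig" emits a BOM first; plain "utf-8" does not.
    out_encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(dst, "w", encoding=out_encoding, newline="") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")
    bom_copy = os.path.join(tmp, "bom.txt")
    plain_copy = os.path.join(tmp, "plain.txt")
    with open(src, "wb") as f:
        f.write(b"caf\xc3\xa9\r\n")  # "café" plus a Windows line ending
    convert_utf8(src, bom_copy, with_bom=True)
    convert_utf8(bom_copy, plain_copy, with_bom=False)
    with open(bom_copy, "rb") as f:
        bom_bytes = f.read()
    with open(plain_copy, "rb") as f:
        plain_bytes = f.read()

print(bom_bytes[:3].hex(), plain_bytes[:3].hex())
```

Reading with `utf-8-sig` works whether or not the source has a signature, so the same function covers both directions of the conversion.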
Third, Other Knowledge
From the article at http://blog.csdn.net/thimin/archive/2007/08/03/1724393.aspx:
A file saved as so-called "Unicode" is actually UTF-16, which uses the same values as the Unicode code points. Conceptually, though, Unicode and UTF are different: Unicode is the scheme for representing characters as codes in memory, while a UTF is a scheme for saving and transmitting Unicode. UTF-16 comes in two byte orders, little-endian (LE, low byte first) and big-endian (BE, high byte first). UTF-32 is another official UTF, likewise divided into LE and BE. UTF-7 is a UTF that is not an official Unicode encoding and is used mainly for mail transmission. The single-byte portion of UTF-8 is compatible with ASCII (the 7-bit subset of ISO-8859-1). UTF-8 was largely forced into existence by old systems and library functions that cannot handle UTF-16 correctly, and it also saves file space on English characters (at the cost of wasting space on non-English characters). ASCII characters take one byte in both UTF-8 and ISO-8859-1, while UTF-8 uses two, three, or four bytes to represent other characters.
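The byte counts and byte orders described here are easy to check. This sketch uses four arbitrarily chosen sample characters (an ASCII letter, an accented Latin letter, a CJK character, and an emoji). Note that "é" fits in one byte in ISO-8859-1 but needs two in UTF-8, so only the ASCII range is byte-for-byte identical.

```python
# Sample characters: "A", "é", "中", and an emoji outside the BMP.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# Number of bytes each character occupies in UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The same letter in the two UTF-16 byte orders.
utf16_le = "A".encode("utf-16-le")  # low byte first
utf16_be = "A".encode("utf-16-be")  # high byte first

# "é" in ISO-8859-1 versus UTF-8.
latin1_e = "\u00e9".encode("latin-1")
utf8_e = "\u00e9".encode("utf-8")

print(utf8_lengths, utf16_le, utf16_be, latin1_e, utf8_e)
```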
A more detailed description of the BOM:
UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in an actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the receiver then sees FE FF, the byte stream is big-endian; if it sees FF FE, the byte stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that gets a byte stream beginning with EF BB BF knows that the stream is UTF-8 encoded.
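The three byte patterns can be reproduced by encoding U+FEFF directly, and a receiver can sniff them at the start of a stream. This is a simplified sketch: `sniff_bom` is a hypothetical helper that covers only UTF-8 and UTF-16 and deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# Known signatures and the encoding each one announces.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return the encoding announced by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


# Encoding U+FEFF yields exactly the BOM bytes for each scheme.
encoded = {name: "\ufeff".encode(name) for _, name in BOMS}
for name, raw in encoded.items():
    print(name, raw.hex(" "))
```

A stream without any signature returns `None`, in which case the receiver has to fall back on some other way of determining the encoding.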
Windows uses a BOM to mark the way a text file is encoded.
PHP, however, does not support the BOM.
PHP did not take the BOM into account when it was designed; that is, it does not ignore the three BOM bytes at the start of a UTF-8 encoded file. Because PHP sends anything that precedes the opening tag straight to the output, these three bytes end up in the page. One more note: this is especially likely when PHP includes template files, where the three bytes can easily cause display problems in the browser.