PHP UTF-8, Unicode, and BOM issues

First, Introduction

UTF-8 is an encoding of Unicode characters that is widely used in web applications. Its advantage is that it is a variable-length encoding that needs only 1 byte for each ASCII character, so a page made up largely of ASCII characters takes noticeably less network bandwidth to transmit than it would in a fixed-width encoding.
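As a rough illustration of the size claim, the Python sketch below encodes a mostly-ASCII HTML fragment in both UTF-8 and UTF-16 and compares the byte counts. The sample markup is invented for this example, and Python is used only because it makes the byte-level behaviour easy to see.

```python
# A small, mostly-ASCII HTML fragment (made up for this illustration).
page = "<html><body><h1>Hello</h1><p>caf\u00e9</p></body></html>"

utf8_size = len(page.encode("utf-8"))       # ASCII characters take 1 byte each
utf16_size = len(page.encode("utf-16-le"))  # every character here takes 2 bytes

print("characters:", len(page))
print("UTF-8 bytes:", utf8_size)
print("UTF-16 bytes:", utf16_size)
```

Only the single non-ASCII character costs an extra byte in UTF-8, while UTF-16 doubles the size of the whole fragment.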
The UTF-8 signature, also called the BOM (Byte Order Mark), is a standard mark used in the UTF encoding schemes to identify the encoding. In UTF-16 it is FF FE (little-endian) or FE FF (big-endian), and in UTF-8 it becomes EF BB BF. The mark is optional in UTF-8 because UTF-8 has no byte-order ambiguity, but it can still be used to detect whether a byte stream is UTF-8 encoded. Microsoft software performs this kind of detection, while some other software does not and treats the mark as ordinary characters. Microsoft tools prepend the three bytes EF BB BF to the UTF-8 text files they write, and Notepad and other Windows programs use these three bytes to decide whether a text file is ASCII or UTF-8. This is largely a Microsoft convention, though, and UTF-8 text files on other platforms usually carry no such mark. In other words, a UTF-8 file may or may not have a BOM.
A single BOM at the very start of a stream causes no problem. If several signed files are combined, however, the resulting byte stream contains multiple UTF-8 signatures, and this is what causes XML conversion to fail with the "root element must be well-formed" error.
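The following Python sketch shows how a second signature ends up in the middle of a stream when two signed files are joined, and one way to avoid it. The helper names and the two XML fragments are hypothetical, chosen only for this illustration.

```python
import codecs
from typing import Iterable

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    return data[len(BOM):] if data.startswith(BOM) else data


def concat_without_boms(parts: Iterable[bytes]) -> bytes:
    """Join byte strings, dropping the leading BOM of every part."""
    return b"".join(strip_bom(p) for p in parts)


# Two hypothetical fragments, each saved with its own signature.
header = BOM + b'<?xml version="1.0" encoding="UTF-8"?>\n'
body = BOM + b"<root/>\n"

naive = header + body                        # second BOM lands before <root/>
clean = concat_without_boms([header, body])  # no BOM anywhere in the result

print("BOMs in naive stream:", naive.count(BOM))
print("BOMs in clean stream:", clean.count(BOM))
```

This sketch removes every signature, including the first one. Keeping only the leading BOM would also be consistent with the text above.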

Second, Viewing and converting

Since a UTF-8 file may or may not have a BOM, how can you tell which kind you have?
Use any software with a hexadecimal editing mode. For example, open the file in UltraEdit-32, switch to hex edit mode, and check whether the file begins with EF BB BF. If it does, the file was saved with a BOM.
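The same check can be done without a hex editor. The Python sketch below reads the first three bytes of a file and compares them with EF BB BF. The function names and the sample file are assumptions made for this example.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path: str) -> bool:
    """Return True if the file starts with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def head_hex(path: str, count: int = 3) -> str:
    """Return the first `count` bytes of the file as spaced hex pairs."""
    with open(path, "rb") as f:
        data = f.read(count)
    return " ".join(f"{b:02X}" for b in data)


# Demonstration on a throwaway file that is written with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.php")
    with open(sample, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"<?php echo 'hi';")
    print(head_hex(sample), has_utf8_bom(sample))
```

The file must be opened in binary mode, since a text-mode reader may decode or hide the mark.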
Notepad, which ships with Windows, has traditionally added a BOM when saving a file as UTF-8.
There are many ways to convert a file. Common editors such as UltraEdit-32 or Notepad++ can do it. Taking UltraEdit-32 as an example: after opening the file, select "Save As", and the "Format" list offers choices for saving UTF-8 with or without a BOM.



In addition, Dreamweaver CS3 has similar options. In Preferences, if you choose Unicode (UTF-8) as the default encoding, you can select the "Include Unicode Signature (BOM)" option to include a byte order mark (BOM) in the document. If the option is left unselected, the document is saved without a BOM.
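For batch work, the conversion the editors perform can be approximated in a few lines. The Python sketch below rewrites a file in place with or without the signature. It assumes the file is already UTF-8, and `convert_bom` is a hypothetical helper name, not part of any editor or library.

```python
import codecs
import os
import tempfile


def convert_bom(path: str, with_bom: bool) -> None:
    """Rewrite a UTF-8 file in place, with or without the signature."""
    with open(path, "rb") as f:
        data = f.read()
    # Normalise first: drop an existing BOM so it is never duplicated.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    if with_bom:
        data = codecs.BOM_UTF8 + data
    with open(path, "wb") as f:
        f.write(data)


# Demonstration: strip the BOM from a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.php")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"<?php echo 'hi';")
    convert_bom(path, with_bom=False)
    with open(path, "rb") as f:
        print(f.read()[:5])
```

Because the function always removes an existing mark before optionally adding one, calling it twice with the same argument leaves the file unchanged.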

Third, Other knowledge
From the article at http://blog.csdn.net/thimin/archive/2007/08/03/1724393.aspx:
A file saved as so-called "Unicode" is actually UTF-16, which uses the same code values as Unicode. Conceptually, though, Unicode and UTF are different things: Unicode defines the character set and its code points, while a UTF is a scheme for storing and transmitting those code points. UTF-16 comes in two variants, low byte first (LE) and high byte first (BE). UTF-32 is another official UTF encoding, and it is likewise split into LE and BE. UTF-7, which is not an official Unicode encoding, also exists and is used mainly for mail transmission. The single-byte portion of UTF-8 is compatible with ASCII, which is also the lower half of ISO-8859-1. UTF-8 came about mainly because some older systems and library functions could not handle UTF-16 correctly, and it also saves file space on English text (at the cost of extra space for non-English characters). For ASCII characters, UTF-8 and ISO-8859-1 both use one byte, while for all other characters UTF-8 uses two to four bytes.
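The byte counts described above can be checked directly. The Python sketch below encodes four sample characters (chosen arbitrarily for this illustration) and then compares how UTF-8 and ISO-8859-1 represent a character above the ASCII range.

```python
# Sample characters: A (U+0041), e-acute (U+00E9), a CJK character (U+4E2D),
# and an emoji outside the Basic Multilingual Plane (U+1F600).
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-le"))
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len} bytes  UTF-16: {utf16_len} bytes")

# In the ASCII range the three encodings produce identical bytes.
ascii_same = "A".encode("ascii") == "A".encode("latin-1") == "A".encode("utf-8")

# Above 0x7F, ISO-8859-1 still uses one byte but UTF-8 needs two.
latin1_e = "\u00e9".encode("latin-1")
utf8_e = "\u00e9".encode("utf-8")
print(ascii_same, latin1_e, utf8_e)
```

This is why a file containing accented characters cannot be read interchangeably as ISO-8859-1 and UTF-8, even though pure ASCII files can.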

A more detailed description of the BOM, quoted from another source:
The UCS encoding contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends sending the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the recipient then receives FE FF, the byte stream is big-endian, and if it receives FF FE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can still be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that gets a byte stream beginning with EF BB BF knows that the stream is UTF-8 encoded.
Windows uses the BOM to mark how a text file is encoded.
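The relationship between U+FEFF and the three byte sequences can be reproduced in a short Python sketch. The `sniff_bom` function is a hypothetical helper written for this example. It covers only UTF-8 and UTF-16, so a UTF-32 LE stream, whose mark also begins with FF FE, would be misreported.

```python
import codecs

ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

# (leading bytes, encoding implied by them)
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


# Encoding U+FEFF in each scheme yields exactly the corresponding mark.
for bom, name in BOMS:
    encoded = ZWNBSP.encode(name)
    print(name, " ".join(f"{b:02X}" for b in encoded), encoded == bom)

print(sniff_bom(b"\xef\xbb\xbf<?php"))
print(sniff_bom(b"plain ascii"))
```

A result of `None` only means no mark was found. As noted earlier, a UTF-8 file without a BOM is perfectly valid.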

PHP does not handle the BOM either.
PHP was not designed with the BOM in mind. That is, it does not skip the three BOM bytes at the beginning of a UTF-8 encoded file. Because those bytes come before the opening <?php tag, PHP passes them through as output. This is especially troublesome when PHP is used to include templates, where these three bytes can easily cause the page to display incorrectly in the browser.
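One practical response is to scan a project for files that still carry the signature before deploying it. The Python sketch below walks a directory tree and lists such files. The function name and the default suffix list are assumptions for this example and should be adjusted to the project at hand.

```python
import codecs
import os
import tempfile


def find_bom_files(root: str, suffixes=(".php", ".html", ".tpl")):
    """Yield paths under root whose content starts with a UTF-8 BOM."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(suffixes):
                continue  # skip files that are not source or templates
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                if f.read(3) == codecs.BOM_UTF8:
                    yield path


# Demonstration on a throwaway tree with one signed and one clean file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "signed.php"), "wb") as f:
        f.write(codecs.BOM_UTF8 + b"<?php echo 'a';")
    with open(os.path.join(tmp, "clean.php"), "wb") as f:
        f.write(b"<?php echo 'b';")
    for found in find_bom_files(tmp):
        print(os.path.basename(found))
```

The files reported this way can then be re-saved without a BOM using any of the editors described in the second section.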
