PHP UTF-8, Unicode and BOM issues

I. Introduction

UTF-8 is an encoding of Unicode that is widely used in Web applications. Its advantage is that it is a variable-length encoding in which each ASCII character occupies 1 byte, so transmitting a page that consists mostly of ASCII characters saves a lot of network bandwidth.
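The variable-length property can be checked directly. The following is a minimal Python sketch; the sample characters and the helper name `utf8_length` are chosen here for illustration and are not from the original article.

```python
# Sample characters from different Unicode ranges (illustrative choices).
SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "Latin-1 letter (e with acute accent)",
    "\u4e2d": "CJK ideograph",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}


def utf8_length(ch):
    """Return how many bytes the UTF-8 encoding of `ch` occupies."""
    return len(ch.encode("utf-8"))


for ch, label in SAMPLES.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_length(ch)} byte(s)")
```

ASCII text therefore costs one byte per character, while other scripts cost two to four bytes per character.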
The UTF-8 signature, also known as the BOM (byte order mark), is the standard mark that the UTF encoding schemes use to identify the encoding. In UTF-16 it is FE FF (big-endian) or FF FE (little-endian); in UTF-8 it becomes EF BB BF. The mark is optional in UTF-8: the encoding has no byte-order ambiguity, so the BOM serves only as a hint that a byte stream is UTF-8 encoded. Microsoft software performs this check, but some software does not and treats the bytes as ordinary characters.
Microsoft prepends the three bytes EF BB BF to the UTF-8 text files it writes, and Notepad and other Windows programs rely on these three bytes to decide whether a text file is ASCII or UTF-8. This is only Microsoft's own convention, however; UTF-8 text files on other platforms usually carry no such mark. In other words, a UTF-8 file may or may not have a BOM.
A single BOM at the very start of the data causes no problem. If several signed files are combined, however, the binary stream contains multiple UTF-8 signatures, and this is the cause of the "root element must be well-formed" error that makes an XML transformation fail.
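The effect of combining several signed files can be sketched in Python as follows. The two fragments and the helper names are hypothetical; the sketch only shows that a naive concatenation leaves a stray U+FEFF character in the middle of the stream, not how any particular XML processor reacts to it.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def concat_naive(parts):
    """Join byte strings as-is, keeping every signature."""
    return b"".join(parts)


def concat_stripping_boms(parts):
    """Join byte strings after dropping a leading BOM from each part."""
    return b"".join(p[len(BOM):] if p.startswith(BOM) else p for p in parts)


# Two hypothetical files, each saved with a UTF-8 signature.
first = BOM + b"<root/>"
second = BOM + b"\n"

# "utf-8-sig" removes only a BOM at the very start of the data.
naive = concat_naive([first, second]).decode("utf-8-sig")
clean = concat_stripping_boms([first, second]).decode("utf-8")

print(repr(naive))  # the second signature survives as "\ufeff"
print(repr(clean))
```

Stripping the signature from every part before joining avoids the stray character.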

II. Viewing and converting

Since a UTF-8 file may or may not have a BOM, how do you tell the two apart?
Use a hex editor. For example, open the file in UltraEdit-32, switch to hexadecimal edit mode, and check whether the file begins with EF BB BF. If it does, the file was saved with a BOM.
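The same check can be scripted instead of done by eye in a hex editor. This is a minimal Python sketch; `head_hex`, `has_utf8_bom` and `write_temp` are names invented here, and the temporary files merely stand in for real files.

```python
import codecs
import os
import tempfile


def head_hex(path, count=3):
    """Return the first `count` bytes of a file as upper-case hex pairs."""
    with open(path, "rb") as f:
        return " ".join(f"{b:02X}" for b in f.read(count))


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 signature EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def write_temp(data):
    """Write bytes to a new temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


signed = write_temp(codecs.BOM_UTF8 + b"<?php echo 'hi';")
unsigned = write_temp(b"<?php echo 'hi';")
for p in (signed, unsigned):
    print(head_hex(p), has_utf8_bom(p))
    os.remove(p)
```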
The Notepad program that ships with Windows adds a BOM by default when you save a file as UTF-8.
There are many ways to convert a file; the common editors UltraEdit-32 and Notepad++ can both do it. Taking UltraEdit-32 as an example: open the file, choose Save As, and pick the desired encoding (UTF-8 with or without a BOM) in the Format field.



Dreamweaver CS3 has a similar option. In Preferences, if you select Unicode (UTF-8) as the default encoding, you can tick the "Include Unicode Signature (BOM)" option to include a byte order mark in the document; otherwise no BOM is written.
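For batch work, the conversion can also be scripted rather than done in an editor. The sketch below uses Python's `utf-8-sig` codec, which drops a leading BOM when reading and writes one when writing; the function name `resave_utf8` is hypothetical, and the sketch assumes the file is valid UTF-8.

```python
import codecs
import os
import tempfile


def resave_utf8(path, with_bom):
    """Rewrite a UTF-8 file in place, with or without a leading BOM."""
    # Reading as "utf-8-sig" strips a BOM if one is present.
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        text = f.read()
    target = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=target, newline="") as f:
        f.write(text)


def read_bytes(path):
    """Return the raw bytes of a file (demo helper)."""
    with open(path, "rb") as f:
        return f.read()


# Demo on a temporary file that starts out with a BOM.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"hello\r\n")

resave_utf8(demo_path, with_bom=False)
print(read_bytes(demo_path))
resave_utf8(demo_path, with_bom=True)
print(read_bytes(demo_path))
os.remove(demo_path)
```

Because the reader strips an existing signature first, saving with a BOM twice still leaves exactly one BOM.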

III. Other knowledge
From the article at http://blog.csdn.net/thimin/archive/2007/08/03/1724393.aspx:
A file saved as so-called "Unicode" is actually UTF-16. Its code units match the Unicode code points for most characters, but conceptually Unicode and UTF are two different things: Unicode is a character set that assigns a code point to each character, while a UTF defines how those code points are stored and transmitted as bytes. UTF-16 comes in two byte orders: high byte first (BE, big-endian) and low byte first (LE, little-endian). The official UTFs also include UTF-32, likewise in LE and BE forms. UTF-7 is not an official Unicode encoding form and is used mainly for mail transfer. The single-byte part of UTF-8 is compatible with ASCII (the lower half of ISO-8859-1). UTF-8 was adopted largely because older systems and library functions cannot handle UTF-16 properly, and it also saves file space for English text (at the cost of extra space for non-English characters). For ASCII characters, UTF-8 and ISO-8859-1 both use one byte; for all other characters UTF-8 uses two to four bytes.
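The byte-order and size differences described above can be made concrete with a short Python sketch. The sample characters and the helper `hex_bytes` are illustrative choices, not part of the quoted article.

```python
def hex_bytes(data):
    """Format bytes as upper-case hex pairs separated by spaces."""
    return " ".join(f"{b:02X}" for b in data)


ch = "\u4e2d"  # a CJK ideograph, code point U+4E2D

# The same code point under different encoding forms and byte orders.
for name in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "utf-8"):
    print(f"{name:10} {hex_bytes(ch.encode(name))}")

# A Latin-1 letter above 0x7F: one byte in ISO-8859-1, two bytes in UTF-8.
accented = "\u00e9"
print("iso-8859-1", hex_bytes(accented.encode("iso-8859-1")))
print("utf-8     ", hex_bytes(accented.encode("utf-8")))

# An ASCII letter is the same single byte in both.
print("ascii 'A' ", hex_bytes("A".encode("iso-8859-1")), hex_bytes("A".encode("utf-8")))
```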

A more detailed description of the BOM:
UCS defines a character called ZERO WIDTH NO-BREAK SPACE, whose code is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character ZERO WIDTH NO-BREAK SPACE before the byte stream itself. If the receiver then reads FEFF, the byte stream is big-endian; if it reads FFFE, the byte stream is little-endian. For this reason the character ZERO WIDTH NO-BREAK SPACE is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of ZERO WIDTH NO-BREAK SPACE is EF BB BF, so a receiver that sees a byte stream beginning with EF BB BF knows that the stream is UTF-8 encoded.
Windows uses a BOM to mark the encoding of a text file.
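How a receiver might use the BOM to guess the encoding can be sketched as below. The function name `sniff_bom` and the table layout are assumptions made for this example; the BOM constants come from Python's standard `codecs` module. UTF-32 is included because its little-endian BOM begins with the same two bytes as the UTF-16 one and must therefore be tested first.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE).
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None


# U+FEFF (ZERO WIDTH NO-BREAK SPACE) encoded in each form yields that form's BOM.
for name in ("utf-16-be", "utf-16-le", "utf-8"):
    encoded = "\ufeff".encode(name)
    print(name, encoded.hex().upper(), sniff_bom(encoded))
```

A missing BOM proves nothing about the encoding, so a `None` result only means the caller has to fall back on other information.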

PHP does not support the BOM.
PHP was not designed with the BOM in mind: it does not skip the three BOM bytes at the start of a UTF-8 encoded file. Because only code that follows <? or <?php is executed as PHP code, these three bytes are sent straight to the output. If a plug-in file has this problem, the admin (back-end) page displays incorrectly after the plug-in is activated or deactivated; if a template file has this problem, the three bytes are output directly and show up as a small blank line at the top of the page. Foreign English-language plug-ins and templates are generally ASCII-encoded and carry no BOM, so the problem usually arises only in domestic (Chinese) plug-ins and templates whose authors are unaware of it. Likewise, because the output page is UTF-8 encoded, a template that you edit to add Chinese characters must be saved as UTF-8 to display correctly; if the editor silently adds a BOM at that point, the three bytes appear in the page output. How they look depends on the browser, but the result is typically a blank line or garbled characters.
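Why the bytes reach the browser can be illustrated with a deliberately simplified model, written in Python rather than PHP: everything that precedes the opening tag is treated as literal output. The function `literal_prefix` is hypothetical, it handles only the `<?php` tag, and it is not how the PHP engine is implemented.

```python
import codecs

OPEN_TAG = b"<?php"


def literal_prefix(source):
    """Return the bytes that precede the first <?php tag.

    In this simplified model, these bytes are emitted as page output
    before any PHP code runs.
    """
    index = source.find(OPEN_TAG)
    return source if index == -1 else source[:index]


clean = b"<?php echo 'ok';"
signed = codecs.BOM_UTF8 + clean  # same script saved with a UTF-8 signature

print(literal_prefix(clean))   # nothing is output before the code
print(literal_prefix(signed))  # the three BOM bytes are output first
```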
※ Addendum: This is especially likely when PHP includes template files, where these three bytes easily cause display problems in the browser.
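One way to catch such files before they are included is to scan the plug-in or template directory. The following Python sketch is only one possible approach; the function name `find_bom_files`, the `.php` suffix filter and the `fix` switch are assumptions made for this example.

```python
import codecs
import os
import tempfile


def find_bom_files(root, suffix=".php", fix=False):
    """Return the sorted paths under `root` whose files start with a UTF-8 BOM.

    With fix=True each offending file is rewritten without the signature;
    by default the function only reports and changes nothing.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            if data.startswith(codecs.BOM_UTF8):
                hits.append(path)
                if fix:
                    with open(path, "wb") as f:
                        f.write(data[len(codecs.BOM_UTF8):])
    return sorted(hits)


# Demo on a throwaway directory holding one signed and one clean file.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "header.php"), "wb") as f:
        f.write(codecs.BOM_UTF8 + b"<?php echo 'header';")
    with open(os.path.join(demo_root, "footer.php"), "wb") as f:
        f.write(b"<?php echo 'footer';")
    print([os.path.basename(p) for p in find_bom_files(demo_root)])
```

Running it first in report-only mode and making a backup before passing `fix=True` is the cautious way to use a script like this.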
