A Detailed Introduction to the BOM Problem in UTF-8 Encoding


Last Update:2017-01-18 Source: Internet

Author: User

Tags php code blank page


Today while writing PHP code I ran into a very frustrating problem: of two seemingly identical files, one showed an extra blank line in IE, as can be seen at http://www.kuomart.com/blog/my_ex/bom_utf8.htm

The blank line appeared when PHP output a template imported with require('t.htm'). Both my PHP files and my htm files were written in Notepad and saved as UTF-8, and it turns out that Notepad automatically adds a BOM to the beginning of a file it saves as UTF-8. At first I tested by opening the files in Notepad, DW and EditPlus, none of which showed the BOM; only when I opened them in Windows WordPad and Zend Studio could I see the BOM bytes. I had never studied UTF-8 in depth. I only knew that it can encode many languages and that it uses three bytes for a Chinese character, whereas the GB encoding uses two. I knew nothing about the BOM.

Having run out of ideas, I asked for help on CSDN, but for a long time no expert there could solve it, and I was even told that I had posted in the wrong section (which baffled me, since this was a web development problem). With no other option I posted on PHPChina, where Aultoale finally gave a helpful answer; see http://www.phpchina.com/bbs/thread-23423-1-1.html
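The byte counts mentioned above can be checked directly. The following is a minimal sketch in Python, used here only for illustration; the sample character is an arbitrary choice and is not from the original post.

```python
# Compare how many bytes one Chinese character occupies in each encoding.
ch = "中"  # arbitrary sample character, chosen for illustration

utf8_bytes = ch.encode("utf-8")  # three bytes: e4 b8 ad
gbk_bytes = ch.encode("gbk")     # two bytes: d6 d0

print("UTF-8:", len(utf8_bytes), "bytes", utf8_bytes.hex())
print("GBK:  ", len(gbk_bytes), "bytes", gbk_bytes.hex())

# ASCII characters still take a single byte in UTF-8.
print("ASCII 'A' in UTF-8:", len("A".encode("utf-8")), "byte")
```

UTF-8 is a variable-length encoding, so the three-byte figure applies to common Chinese characters and not to every character.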

I also found the following detailed explanation on the Internet:

The UTF-8 BOM problem to watch out for in WordPress
A long time ago I ran into a problem: after installing a plug-in, clicking Activate would produce a white screen. I never understood why. My earlier workaround was to convert the file to ASCII if it contained no Chinese characters, which usually fixed it. Today, while helping my younger brother set up a blog, it happened again. After half a day of study I finally found the answer.

The Unicode specification has the concept of a BOM. BOM stands for byte order mark, a marker that indicates byte order. Here is a description of the BOM that I found:

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in an actual transmission. The UCS specification recommends that we send the character "ZERO WIDTH NO-BREAK SPACE" before sending the byte stream. If the recipient then receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
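The byte-order check described above can be sketched as follows. This is an illustrative Python example and not part of the original explanation; the function name `detect_utf16_order` and the sample text are assumptions made for the sketch.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return "unknown"  # no BOM, so the order cannot be read from the stream

# The same character U+FEFF, serialized in both byte orders.
be_mark = "\ufeff".encode("utf-16-be")  # b"\xfe\xff"
le_mark = "\ufeff".encode("utf-16-le")  # b"\xff\xfe"

# Hypothetical sample streams: a BOM followed by the text "hi".
be_stream = be_mark + "hi".encode("utf-16-be")
le_stream = le_mark + "hi".encode("utf-16-le")

print(detect_utf16_order(be_stream))
print(detect_utf16_order(le_stream))
```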

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF. So if the receiver gets a byte stream that begins with EF BB BF, it knows that the stream is UTF-8 encoded.
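A minimal Python sketch of this check follows. The helper name `has_utf8_bom` and the sample bytes are assumptions for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte stream begins with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

# U+FEFF encoded as UTF-8 gives exactly the three BOM bytes.
bom_bytes = "\ufeff".encode("utf-8")

# Hypothetical file content: a BOM followed by ordinary text.
sample = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain "utf-8" codec keeps the BOM as the character U+FEFF,
# while "utf-8-sig" strips it when decoding.
kept = sample.decode("utf-8")
stripped = sample.decode("utf-8-sig")

print(bom_bytes.hex(), has_utf8_bom(sample), len(kept), len(stripped))
```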

Windows uses a BOM to mark the encoding of a text file.

In addition, the BOM FAQ on the Unicode website introduces the BOM in detail. It is official and authoritative, but it is in English and takes more effort to read.

In UTF-8 encoded files, the BOM takes up three bytes. If you save a text file as UTF-8 with Notepad, open it with UE and switch to hexadecimal editing mode, you can see the BOM at the beginning (EF BB BF for a UTF-8 file; an editor that converts the file to UTF-16 internally may show FF FE instead). This is a good way to identify UTF-8 encoded files: software uses the BOM to tell whether a file is UTF-8, and many programs even require the file to include a BOM. However, a lot of software still cannot recognize the BOM. When I studied Firefox, I learned that early versions of Firefox did not allow a BOM in extensions, but Firefox 1.5 started to support it. It now turns out that PHP does not support the BOM either.

PHP did not take the BOM into account when it was designed, so it does not ignore the three BOM bytes at the beginning of a UTF-8 encoded file. Because only code that follows the opening <?php (or <?) tag is executed as PHP, the three bytes in front of the tag are sent straight to the output.

According to the Bo-blog wiki, Bo-blog, which is also written in PHP, is plagued by the BOM as well. One of the problems it mentions: "Because of how cookies are sent, a cookie cannot be sent from a file that begins with a BOM (PHP has already sent the headers before the cookie is sent), so logging in and out fail. All functionality that relies on cookies and sessions stops working." This should also be the reason for the blank page in the WordPress admin: if any file that is executed contains a BOM, these three bytes are sent out, and the functions that depend on cookies and sessions fail.
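The effect can be modeled roughly as follows. This Python sketch is only a simplified illustration of the behavior described above and not PHP's actual implementation: the function name `bytes_sent_before_php` is hypothetical, it looks only for the `<?php` tag, and it ignores output buffering.

```python
import codecs

def bytes_sent_before_php(source: bytes) -> bytes:
    """Return the literal bytes that precede the first <?php tag.

    Simplified model: everything outside the PHP tags is treated as
    plain output, so these bytes would be sent before any PHP code runs.
    """
    index = source.find(b"<?php")
    return source if index == -1 else source[:index]

# Hypothetical script contents, with and without a BOM.
clean_script = b"<?php setcookie('user', 'demo'); ?>"
bom_script = codecs.BOM_UTF8 + clean_script

for name, script in (("clean", clean_script), ("with BOM", bom_script)):
    early = bytes_sent_before_php(script)
    # In this model, headers (and therefore cookies) can only be sent
    # while no output has been produced yet.
    print(name, "early output:", early.hex() or "(none)",
          "cookie can be sent:", len(early) == 0)
```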

The solution is to save the file as ASCII if it contains only English (ASCII) characters. In an editor such as UE, choose File -> Convert -> UTF-8 to ASCII, or select the ASCII encoding in Save As. If the file uses DOS-format line endings, you can open it with Notepad, choose Save As, and select the ASCII encoding (shown as ANSI in Notepad). If the file contains Chinese characters, use the Save As function of UE and select "UTF-8 no BOM".
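For readers without such an editor, the same "UTF-8 without BOM" result can be sketched in a few lines of Python. The function name `strip_utf8_bom` is an assumption for this illustration; it rewrites the file in place, so work on a copy if in doubt.

```python
import codecs
from pathlib import Path

def strip_utf8_bom(path) -> bool:
    """Remove a leading UTF-8 BOM from the file at `path`.

    Returns True if a BOM was found and removed, False otherwise.
    The rest of the file is left byte-for-byte unchanged.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

A call such as `strip_utf8_bom("t.htm")` would then leave the file three bytes smaller if it started with a BOM.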

According to the Bo-blog wiki, in EditPlus you need to save the file as GB first and then save it as UTF-8. Be careful, however: all characters that are not included in GBK will be lost. If the file contains characters outside that set, it is better not to use this method. (From this point of view, UE (UltraEdit-32) is indeed much better than EditPlus; EditPlus is too lightweight.)
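The data loss mentioned above can be demonstrated with a small Python sketch. The helper `gbk_round_trip` and the sample strings are assumptions for illustration; the Korean text stands in for any characters that GBK does not contain.

```python
def gbk_round_trip(text: str) -> str:
    """Encode to GBK and back, replacing anything GBK cannot represent."""
    return text.encode("gbk", errors="replace").decode("gbk")

chinese_only = "中文"
mixed = "中文 한글"  # Chinese plus Korean Hangul, which GBK lacks

print(gbk_round_trip(chinese_only))  # survives unchanged
print(gbk_round_trip(mixed))         # the Hangul is replaced by "?"
```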

In addition, I found a way that uses the file editor provided by WordPress. This approach has no restrictions and needs no special editor; after all, we are already using WordPress. First use FTP to grant write permission on the file to be edited, then go to the WordPress admin -> Manage -> File Editor, enter the path of the file, and click Edit File. The editing interface does not show the three characters at the beginning, but that does not matter: place the cursor before the first character of the file and press Backspace. Then click Update File and refresh in FTP. The file is now 3 bytes smaller, and you are done.

Finally, this is a widespread problem. Anyone who writes their own plug-ins, edits other people's plug-ins for their own use, or modifies templates (which probably everyone needs to do) would do well to understand the above, so as not to be at a loss when a problem appears.
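When many plug-in or template files are involved, checking them one by one in an editor is tedious. The following Python sketch lists the files under a directory that begin with a UTF-8 BOM. The function name `find_bom_files` and the default list of extensions are assumptions for illustration, not something from the original article.

```python
import codecs
from pathlib import Path

def find_bom_files(root, suffixes=(".php", ".htm", ".html")):
    """Return the files under `root` that start with EF BB BF."""
    found = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            with path.open("rb") as fh:
                # Only the first three bytes are needed for the check.
                if fh.read(3) == codecs.BOM_UTF8:
                    found.append(path)
    return found
```

For example, `find_bom_files("wp-content/plugins")` would scan a plug-in directory, assuming that path exists relative to the current directory.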

The official information is available at http://www.unicode.org/faq/utf_bom.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
