The difference between UTF-8 and UTF-8 without BOM

Source: Internet
Author: User
Tags: blank page

BOM stands for byte order mark.

UCS defines a character called "ZERO WIDTH NO-BREAK SPACE", whose code point is U+FEFF. U+FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the recipient then receives FE FF, the byte stream is big-endian; if it receives FF FE, the byte stream is little-endian. That is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
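
The receiver's check described above can be sketched in a few lines of Python. This is only an illustration, not a full decoder, and the function name utf16_byte_order is made up for the example.

```python
import codecs

def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None based on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None  # no BOM: the byte order has to be known some other way

# U+FEFF serialized in each byte order
be = "\ufeff".encode("utf-16-be")
le = "\ufeff".encode("utf-16-le")
print(be.hex(), utf16_byte_order(be))   # feff big
print(le.hex(), utf16_byte_order(le))   # fffe little
```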



UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF. So if the receiver gets a byte stream that begins with EF BB BF, it knows the stream is UTF-8 encoded.
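
To see where EF BB BF comes from, here is a small Python sketch that encodes U+FEFF as UTF-8 and sniffs a byte stream for it. The helper name looks_like_utf8_with_bom is invented for this example.

```python
import codecs

# Encoding U+FEFF as UTF-8 gives the three BOM bytes.
bom = "\ufeff".encode("utf-8")
print(bom.hex(" "))            # ef bb bf
assert bom == codecs.BOM_UTF8

def looks_like_utf8_with_bom(stream: bytes) -> bool:
    """True if the stream starts with the UTF-8 encoded BOM."""
    return stream.startswith(codecs.BOM_UTF8)

data = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(looks_like_utf8_with_bom(data))   # True
print(data.decode("utf-8-sig"))         # hello
```

Note that decoding the same bytes with plain "utf-8" keeps U+FEFF as the first character of the result, while "utf-8-sig" drops it.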



In a UTF-8 encoded file, the BOM occupies three bytes. If you save a text file as UTF-8 in Notepad, open it in UltraEdit (UE), and switch to hex editing mode, you will see FF FE at the start. (The three bytes stored on disk are EF BB BF; UE shows FF FE presumably because it displays the file in its internal UTF-16 form.) The BOM is a handy way to identify a UTF-8 file: software uses it to tell whether a file is UTF-8, and some software even requires that the files it reads have a BOM. However, a lot of software still does not recognize the BOM.
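
To look at the bytes actually stored at the start of such a file, independent of any editor, something like the following Python sketch works. It assumes that Python's "utf-8-sig" codec is a fair stand-in for an editor that saves UTF-8 with a BOM; the file names are made up.

```python
import os
import tempfile

text = "hello"

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.txt")
    without_bom = os.path.join(tmp, "without_bom.txt")

    # "utf-8-sig" writes EF BB BF before the text; plain "utf-8" does not.
    with open(with_bom, "w", encoding="utf-8-sig") as f:
        f.write(text)
    with open(without_bom, "w", encoding="utf-8") as f:
        f.write(text)

    with open(with_bom, "rb") as f:
        head = f.read(3)
    size_with = os.path.getsize(with_bom)
    size_without = os.path.getsize(without_bom)

print(head.hex(" "))               # ef bb bf
print(size_with - size_without)    # 3
```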



In early versions of Firefox, extensions could not contain a BOM, but versions after Firefox 1.5 have started to support it. It has now turned out that PHP does not support the BOM either. PHP was not designed with the BOM in mind; that is, it does not ignore the three BOM bytes at the start of a UTF-8 encoded file.



According to the Bo-Blog wiki, Bo-Blog, which is also written in PHP, is plagued by the BOM as well. One of the problems mentioned there: "Because of how cookies are sent, a file that starts with a BOM cannot send cookies (PHP has already sent the headers before the cookie goes out), so the login and logout functions fail. Every feature that relies on cookies or sessions stops working." This is probably also why the WordPress admin area shows a blank page: if any executed file contains a BOM, those three bytes are sent as output, and the features that rely on cookies and sessions fail.
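
Since a single file with a BOM is enough to cause this, it helps to find the offending files first. Here is a rough Python sketch that walks a directory and reports files starting with EF BB BF. The name find_bom_files and the "*.php" default are assumptions made for the example.

```python
import codecs
import tempfile
from pathlib import Path

def find_bom_files(root, pattern="*.php"):
    """Yield paths under root whose first three bytes are the UTF-8 BOM."""
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            if f.read(3) == codecs.BOM_UTF8:
                yield path

# Demo on a throwaway directory with one BOM file and one clean file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "with_bom.php").write_bytes(codecs.BOM_UTF8 + b"<?php echo 1;")
    Path(tmp, "clean.php").write_bytes(b"<?php echo 1;")
    found = [p.name for p in find_bom_files(tmp)]

print(found)   # ['with_bom.php']
```

On a real site you would pass the site's root directory instead of the temporary one.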



The solution: if the file contains only English characters (that is, characters within the ASCII range), just save it as ASCII. In UE and similar editors, click File -> Convert -> UTF-8 to ASCII, or choose ASCII encoding in Save As. If the file uses DOS-format line endings, you can open it in Notepad, click Save As, and choose ASCII encoding. If the file contains Chinese characters, use UE's Save As function and choose "UTF-8 no BOM".
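
If no suitable editor is at hand, the same "UTF-8 no BOM" result can be had with a short script. This is a minimal Python sketch; the helper names strip_utf8_bom and is_ascii_only are made up, and the script rewrites the file in place, so try it on a copy first.

```python
import codecs
import os
import tempfile

def is_ascii_only(data: bytes) -> bool:
    """True if every byte is in the ASCII range."""
    return all(b < 0x80 for b in data)

def strip_utf8_bom(path) -> bool:
    """Rewrite path without a leading UTF-8 BOM. Return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True

# Demo: a PHP file containing two Chinese characters, saved with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.php")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "<?php echo '\u4e2d\u6587';".encode("utf-8"))
    removed = strip_utf8_bom(path)
    removed_again = strip_utf8_bom(path)
    with open(path, "rb") as f:
        result = f.read()

print(removed, removed_again)    # True False
print(result[:5])                # b'<?php'
print(is_ascii_only(result))     # False, so this file must stay UTF-8
```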



UTF-8 should never have been given a BOM; apart from telling an editor that the file is UTF-8, it is useless. In fact, there are not that many encoding formats, so an editor can work out a file's encoding from its characteristics, and even when automatic detection fails, the editor should have a place to set the encoding by hand. So I think the BOM is superfluous for UTF-8.
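
As a rough idea of how such detection can work without a BOM, here is a Python sketch that simply tries a strict UTF-8 decode and then falls back to other candidates. It is a heuristic, not how any particular editor does it; the function name guess_encoding and the "gbk" fallback are assumptions for the example.

```python
def guess_encoding(data: bytes, fallbacks=("gbk",)):
    """Guess a text encoding without relying on a BOM."""
    try:
        data.decode("utf-8")      # strict: raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        pass
    for name in fallbacks:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None

utf8_bytes = "\u4e2d\u6587".encode("utf-8")   # two Chinese characters
gbk_bytes = "\u4e2d\u6587".encode("gbk")
print(guess_encoding(utf8_bytes))   # utf-8
print(guess_encoding(gbk_bytes))    # gbk
```

Pure ASCII input is reported as "utf-8" here, which is harmless because ASCII is a subset of UTF-8.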



Only UTF-16 needs a BOM. Because it encodes characters by their Unicode code point, each character in the BMP range takes two bytes, and those two bytes have to be identified as big-endian or little-endian.
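
A quick Python illustration of why the two bytes are ambiguous without a BOM, using U+4E2D as an arbitrary BMP character:

```python
ch = "\u4e2d"                        # a BMP character, U+4E2D
be = ch.encode("utf-16-be")          # bytes 4e 2d
le = ch.encode("utf-16-le")          # bytes 2d 4e
print(be.hex(" "), "|", le.hex(" "))
print(len(be), len(le))              # 2 2

# Reading big-endian bytes as little-endian gives a different code point.
wrong = be.decode("utf-16-le")
print(hex(ord(wrong)))               # 0x2d4e

# With a BOM in front, the generic "utf-16" codec picks the right order.
print((b"\xfe\xff" + be).decode("utf-16") == ch)   # True
print((b"\xff\xfe" + le).decode("utf-16") == ch)   # True
```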



In fact, I think introducing the concept of big- and little-endian byte order into UTF was foolish in the first place, and I don't know what those standards committees were thinking. Byte order exists because of how CPUs process data: if a CPU works in big-endian order, little-endian data needs an extra conversion step, which reduces efficiency. But in real-world applications, who cares about byte order? That text encodings ended up with a notion of byte order only shows that the people who set the standards were too rigid. For UTF-16, I think that if the whole world simply agreed on one byte order, there would be no need to mark it with a BOM.



By the way, PHP does not support UTF-16 encoded files either. For example, the $ symbol also takes two bytes in UTF-16, which the PHP parser cannot handle. PHP 6 is said to introduce Unicode into its internal processing; whether it will support this is not known.
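
The byte layout behind this is easy to check in Python. The sketch below only compares bytes and does not run PHP's parser; the sample source line is made up.

```python
source = "<?php $x = 1;"
utf8 = source.encode("utf-8")
utf16 = source.encode("utf-16-le")

print("$".encode("utf-8").hex(" "))       # 24
print("$".encode("utf-16-le").hex(" "))   # 24 00

# The ASCII byte sequence "<?php" survives in UTF-8 but not in UTF-16,
# where every character is padded with a zero byte.
print(b"<?php" in utf8)     # True
print(b"<?php" in utf16)    # False
```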



Encoding is one of those things that sounds simple but is actually a hassle. Many programs have a layered notion of encoding. In MySQL, for example, the layers are client -> connection -> storage on the way in and storage -> connection -> result on the way out, and storage is further divided into system, database, table, and column. I sometimes wonder whether it really has to be this complicated. Who actually uses these features in MySQL? Unless two clients need to work in different encoding environments, there is no need for a separate client encoding. In most cases, straight binary in, binary out is fine.






