The difference between UTF-8 and UTF-8 without BOM

Last Update:2016-08-08 Source: Internet

Author: User

Tags: blank page


BOM stands for byte order mark.

UCS defines a character called ZERO WIDTH NO-BREAK SPACE, whose code point is U+FEFF. U+FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting ZERO WIDTH NO-BREAK SPACE before the byte stream itself. If the recipient then receives FE FF, the byte stream is big-endian; if it receives FF FE, the byte stream is little-endian. For this reason, ZERO WIDTH NO-BREAK SPACE is also called the BOM.
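The byte-order check described above can be sketched in a few lines of Python. This is only an illustration: the helper name detect_utf16_byte_order is hypothetical, and the sketch assumes the stream is known to be UTF-16 in the first place.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from its leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"                           # no BOM at the start

# U+FEFF followed by "hi", serialized in each byte order
be_stream = "\ufeffhi".encode("utf-16-be")
le_stream = "\ufeffhi".encode("utf-16-le")

print(detect_utf16_byte_order(be_stream))  # big-endian
print(detect_utf16_byte_order(le_stream))  # little-endian
```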

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of ZERO WIDTH NO-BREAK SPACE is EF BB BF. So if the receiver gets a byte stream that begins with EF BB BF, it knows the stream is UTF-8 encoded.
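As a quick check of the three bytes mentioned above, the following Python sketch encodes U+FEFF as UTF-8 and tests a byte stream for that prefix. The function name has_utf8_bom is an assumption made for this example.

```python
import codecs

# Encoding U+FEFF as UTF-8 yields the three bytes EF BB BF.
bom = "\ufeff".encode("utf-8")
assert bom == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte stream begins with the UTF-8 BOM (hypothetical helper)."""
    return data.startswith(codecs.BOM_UTF8)

print(has_utf8_bom(b"\xef\xbb\xbfhello"))  # True
print(has_utf8_bom(b"hello"))              # False
```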

In a UTF-8 encoded file, the BOM occupies three bytes. If you save a text file as UTF-8 in Notepad and then open it in a hex editor, you will see EF BB BF at the start of the file. (UltraEdit (UE) may show FF FE in its hex view instead, because it can convert the file to UTF-16 internally when opening it.) The BOM is a convenient way to identify a UTF-8 file: software can check for it to decide whether a file is UTF-8, and a lot of software even requires the files it reads to carry a BOM. However, there is still plenty of software that does not recognize the BOM.
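The difference between software that recognizes the BOM and software that does not can be illustrated with Python's two UTF-8 codecs. This is a sketch, not a description of any particular editor: "utf-8-sig" stands in for a BOM-aware reader and plain "utf-8" for one that treats the BOM as ordinary content.

```python
# "utf-8-sig" writes the three-byte BOM before the text, as Notepad does.
raw = "hello".encode("utf-8-sig")
print(len(raw))                      # 8  (3 BOM bytes + 5 text bytes)

# A reader that does not recognize the BOM keeps it as a leading U+FEFF.
bom_unaware = raw.decode("utf-8")
print(len(bom_unaware))              # 6
print(bom_unaware[0] == "\ufeff")    # True

# A BOM-aware reader strips it and returns only the text.
bom_aware = raw.decode("utf-8-sig")
print(len(bom_aware))                # 5
```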

Extensions for early versions of Firefox could not contain a BOM, but versions after Firefox 1.5 support it. PHP, it turns out, does not support the BOM either. PHP did not take the BOM into account when it was designed; that is, it does not ignore the three BOM bytes at the beginning of a UTF-8 encoded file.

As the Bo-Blog wiki notes, Bo-Blog, which is also written in PHP, is plagued by the BOM as well. One of the problems it mentions: "Because of how cookies are sent, a cookie cannot be sent from a file that begins with a BOM (PHP has already sent the headers by the time the cookie would go out), so login and logout stop working. Every feature that relies on cookies or sessions fails." This is probably also why blank pages appear in the WordPress admin area: if any executed file contains a BOM, those three bytes are sent as page output, and the features that depend on cookies and sessions break.
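One way to track down the offending files is to scan a project for PHP files whose first three bytes are the BOM. The sketch below does this in Python; the function name find_bom_files and the ".php" suffix filter are assumptions for illustration and are not part of Bo-Blog, WordPress, or PHP.

```python
import codecs
import os

def find_bom_files(root: str, suffix: str = ".php") -> list:
    """Return sorted paths under root whose first three bytes are the UTF-8 BOM."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # Only the first three bytes matter, so avoid reading the whole file.
            with open(path, "rb") as f:
                if f.read(3) == codecs.BOM_UTF8:
                    hits.append(path)
    return sorted(hits)

if __name__ == "__main__":
    # Scan the current directory and list every PHP file that starts with a BOM.
    for path in find_bom_files("."):
        print(path)
```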

The solution: if the file contains only English characters (that is, only characters in the ASCII range), save it as ASCII. In UE (UltraEdit) and similar editors, choose File -> Convert -> UTF-8 to ASCII, or select ASCII encoding in the Save As dialog. If the file uses DOS-style line endings, you can also open it in Notepad, choose Save As, and select ASCII encoding (Notepad labels it ANSI). If the file contains Chinese characters, use UE's Save As function and select "UTF-8 no BOM".
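When no suitable editor is at hand, the same result as "UTF-8 no BOM" can be approximated with a short script that removes the three leading bytes and leaves the rest of the file untouched. This is a minimal sketch; the function name strip_utf8_bom is hypothetical, and it rewrites the file in place, so try it on a copy first.

```python
import codecs

def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file at path.

    Returns True if a BOM was found and removed, False if the file was left as-is.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the three BOM bytes, byte for byte.
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Because the file is handled as raw bytes, line endings and any non-ASCII text after the BOM are preserved exactly.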

UTF-8 should never have had a BOM added to it; apart from telling an editor that a file is UTF-8, it is useless. In fact, an editor can determine a file's encoding from its characteristics, since there are not that many encodings to tell apart, and even when automatic detection fails, the editor should offer a place to set the encoding manually. So I think the BOM is superfluous for UTF-8.

Only UTF-16 needs a BOM. Because it encodes Unicode code points directly as 16-bit units, each character in the BMP range takes two bytes, and a reader needs to know whether those bytes are big-endian or little-endian.
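A small Python sketch shows why the byte order matters for UTF-16: the same two bytes decode to different characters depending on which order the reader assumes, and a leading BOM removes the ambiguity. The byte values are chosen only as an example.

```python
data = b"\x4e\x2d"  # two bytes with no BOM

# Without a BOM the reader has to guess, and the two guesses disagree.
as_big = data.decode("utf-16-be")     # U+4E2D
as_little = data.decode("utf-16-le")  # U+2D4E
print(hex(ord(as_big)), hex(ord(as_little)))  # 0x4e2d 0x2d4e

# With a BOM in front, the generic "utf-16" codec picks the order itself
# and strips the BOM from the decoded text.
print((b"\xfe\xff" + data).decode("utf-16") == as_big)     # True
print((b"\xff\xfe" + data).decode("utf-16") == as_little)  # True
```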

In fact, I think it is foolish for UTF-8 to be tied to the concept of big and little byte order at all, and I don't know what the standards committees were thinking. Byte order exists because of how CPUs process data: if a CPU is big-endian, then little-endian data needs an extra conversion step, which costs efficiency. But in real-world applications, who cares about byte order? Text encodings only have a notion of byte order because the people who set the standards were too rigid. For UTF-16, I think that if the whole world simply agreed on one byte order, there would be no need to mark it with a BOM.

Incidentally, PHP does not support UTF-16 encoded files either. A symbol such as $, for example, also takes two bytes in UTF-16, and the PHP parser cannot parse it. It is not known whether PHP 6, if it introduces Unicode into its internal processing, will support this.
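The two-byte encoding of $ is easy to confirm. The Python sketch below only compares byte sequences; it does not model PHP's parser, and the opening tag check at the end is an added illustration of why a byte-oriented parser would not find what it expects in a UTF-16 file.

```python
# "$" is one byte in UTF-8 but two bytes in UTF-16, in either byte order.
print("$".encode("utf-8"))      # b'$'
print("$".encode("utf-16-le"))  # b'$\x00'
print("$".encode("utf-16-be"))  # b'\x00$'

# The same happens to every ASCII character, so the byte sequence "<?php"
# never appears in a UTF-16 encoded script.
script = "<?php $x = 1;".encode("utf-16-le")
print(b"<?php" in script)       # False
```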

Encoding sounds simple, but in practice it is a tedious subject. Many programs have a layered notion of encoding. MySQL, for instance, divides it into client -> connection -> storage on the way in and storage -> connection -> result on the way out, and storage is further divided into system, database, table, and column. I sometimes wonder whether it really needs to be this complicated. Who actually uses these MySQL features? Unless two clients need to operate in different encoding environments, there is no need to treat the client encoding separately. In most cases, straight binary in, binary out is fine.

The above describes the differences between UTF-8 and UTF-8 without BOM. I hope it is helpful to readers interested in PHP tutorials.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
