Differences between UTF-8 and UTF-8 without BOM

Source: Internet
Author: User
This article introduces the differences between UTF-8 with a BOM and UTF-8 without a BOM; readers interested in PHP may find it useful. BOM stands for Byte Order Mark.

UCS defines a character named "zero width no-break space", whose code point is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "zero width no-break space" before the byte stream itself. That way, if the receiver sees FE FF, the byte stream is big-endian; if it sees FF FE, the byte stream is little-endian. For this reason the character "zero width no-break space" is also called the BOM.
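The byte-order check described above can be sketched in Python. This is only an illustration: the function name `detect_utf16_byte_order` is made up for this example, while `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are constants from the standard library.

```python
import codecs

def detect_utf16_byte_order(data: bytes):
    """Return the byte order implied by a leading UTF-16 BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return None

# U+FEFF followed by the letter "A", serialized in both byte orders.
big = "\ufeffA".encode("utf-16-be")
little = "\ufeffA".encode("utf-16-le")

print(big.hex(), detect_utf16_byte_order(big))        # feff0041 big-endian
print(little.hex(), detect_utf16_byte_order(little))  # fffe4100 little-endian
```

The same code point produces mirrored byte pairs, which is what lets the receiver infer the sender's byte order.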

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to mark the encoding. The UTF-8 encoding of the character "zero width no-break space" is EF BB BF. So if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8 encoded.
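A short Python sketch of this, assuming nothing beyond the standard library: it encodes U+FEFF as UTF-8, checks a byte string for the resulting prefix, and shows how the `utf-8` and `utf-8-sig` codecs treat the prefix differently. The helper name `has_utf8_bom` is invented for the example.

```python
import codecs

# The UTF-8 serialization of U+FEFF is the three-byte BOM.
utf8_bom = "\ufeff".encode("utf-8")

def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

sample = utf8_bom + "hello".encode("utf-8")

print(utf8_bom.hex())                    # efbbbf
print(has_utf8_bom(sample))              # True
print(repr(sample.decode("utf-8")))      # '\ufeffhello' (BOM kept as a character)
print(repr(sample.decode("utf-8-sig")))  # 'hello' (BOM consumed by the codec)
```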

In a UTF-8 encoded file, the BOM occupies three bytes. If you save a text file as UTF-8 in Notepad, open it with UE (UltraEdit), and switch to hexadecimal editing mode, you can see the BOM at the start of the file. (On disk the UTF-8 BOM is EF BB BF; FF FE is the UTF-16 little-endian BOM, which a hex view may show if the editor converts the file to UTF-16 internally.) This is a handy way to identify a UTF-8 file: software uses the BOM to recognize that a file is UTF-8, and some programs even require that the files they read carry a BOM. However, plenty of software still cannot recognize the BOM.

In early versions of Firefox, extension files could not contain a BOM, but versions from Firefox 1.5 onward support it. PHP, it turns out, does not support the BOM either. PHP did not take the BOM into account in its design, so it does not ignore the three BOM bytes at the beginning of a UTF-8 encoded file.

The Bo-Blog wiki shows that Bo-Blog, which is also written in PHP, is troubled by the BOM as well. It mentions a further problem: "Restricted by the cookie sending mechanism, in files that begin with a BOM the cookie cannot be sent (because PHP has already sent the header before the cookie is sent), so the login and logout functions stop working. All functions that depend on cookies and sessions stop working." This is probably also why a blank page appears in the WordPress admin area: if any executed file contains a BOM, those three bytes are sent as output, which breaks everything that relies on cookies and sessions.
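One way to track down the offending files is to scan a project for PHP sources whose first three bytes are the BOM. The following is a minimal sketch, not a tool from the article; `find_bom_files` and the demo file names are hypothetical.

```python
import codecs
import tempfile
from pathlib import Path

def find_bom_files(root, pattern="*.php"):
    """Return sorted paths under root whose first three bytes are the UTF-8 BOM."""
    hits = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(3) == codecs.BOM_UTF8:
                hits.append(path)
    return sorted(hits)

# Demo on a throwaway directory containing two hypothetical PHP files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "with_bom.php").write_bytes(codecs.BOM_UTF8 + b"<?php setcookie('a', '1');")
    Path(tmp, "clean.php").write_bytes(b"<?php setcookie('a', '1');")
    demo_hits = [p.name for p in find_bom_files(tmp)]

print(demo_hits)  # ['with_bom.php']
```

Only the first three bytes of each file are read, so the scan stays cheap even on large trees.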

Solution: If the file contains only English (that is, ASCII) characters, save it as ASCII. In the UE editor, click File > Convert > UTF-8 to ASCII, or select ASCII encoding in the Save dialog. If the file uses DOS-style line endings, you can also open it in Notepad, click Save As, and select ASCII encoding. If the file contains Chinese characters, use UE's Save As function and select "UTF-8 without BOM".
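Where no suitable editor is at hand, the same result as "UTF-8 without BOM" can be had by dropping the first three bytes of the file. The sketch below assumes the file is small enough to read into memory; `strip_utf8_bom` and the file name `page.php` are made up for the example.

```python
import codecs
import tempfile
from pathlib import Path

def strip_utf8_bom(path):
    """Rewrite the file without its leading UTF-8 BOM. Return True if one was removed."""
    target = Path(path)
    data = target.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, file is left untouched
    target.write_bytes(data[len(codecs.BOM_UTF8):])
    return True

# Demo: a hypothetical PHP file with Chinese text, saved with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, "page.php")
    sample.write_bytes(codecs.BOM_UTF8 + "<?php echo '中文';".encode("utf-8"))
    first_pass = strip_utf8_bom(sample)   # removes the BOM
    second_pass = strip_utf8_bom(sample)  # no BOM left to remove
    cleaned = sample.read_bytes()

print(first_pass, second_pass, cleaned[:5])  # True False b'<?php'
```

The rest of the file is copied byte for byte, so non-ASCII content such as Chinese text is unaffected.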

Apart from telling the editor that the file is UTF-8, the BOM serves no purpose and should not be added. In fact, an editor can usually determine a file's encoding from characteristic features of a handful of encodings. Even when a file cannot be recognized automatically, the editor should offer a place to set the encoding manually. So I think the BOM is superfluous for UTF-8.

A BOM is needed for UTF-16. Because UTF-16 encodes Unicode code points directly, each character in the BMP range takes two bytes, and the byte order (big-endian or little-endian) has to be identified.
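To see why the byte order has to be known, consider the same two bytes read both ways. This small Python sketch uses the bytes 4E 2D as an arbitrary example; the variable names are illustrative only.

```python
# The same two bytes decode to different code points depending on byte order.
raw = bytes([0x4E, 0x2D])

as_big_endian = raw.decode("utf-16-be")     # U+4E2D
as_little_endian = raw.decode("utf-16-le")  # U+2D4E

print(f"U+{ord(as_big_endian):04X}")     # U+4E2D
print(f"U+{ord(as_little_endian):04X}")  # U+2D4E

# With a leading BOM, Python's generic "utf-16" codec picks the order itself
# and removes the BOM from the decoded text.
with_bom = b"\xfe\xff" + raw
print(with_bom.decode("utf-16") == as_big_endian)  # True
```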

In fact, I think it was foolish to introduce the concept of byte order into the UTF encodings, and I don't know what the standards committee was thinking. Byte order matters because of how a CPU processes data: a big-endian CPU needs an extra conversion step to handle little-endian data, which costs efficiency. But in practice, who cares about byte order in text? That text encoding ended up with a notion of byte order only shows that those who set the standards were too rigid. For UTF-16, I think that if everyone simply agreed on one byte order, there would be no need for a BOM to mark it.

PHP also does not support UTF-16 encoded files. For example, the $ symbol takes two bytes in UTF-16, which the PHP parser cannot handle. I don't know whether this will be supported once Unicode is introduced into PHP 6's internal processing.
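The byte counts are easy to check in Python. The sketch below compares how `$` and the `<?php` opening tag are serialized in UTF-8 and in UTF-16 (little-endian is chosen arbitrarily); it only illustrates the byte layout and says nothing about PHP's internals.

```python
dollar_utf8 = "$".encode("utf-8")       # b'$'
dollar_utf16 = "$".encode("utf-16-le")  # b'$\x00'

open_tag_utf8 = "<?php".encode("utf-8")
open_tag_utf16 = "<?php".encode("utf-16-le")

print(len(dollar_utf8), len(dollar_utf16))  # 1 2
print(open_tag_utf16)                       # b'<\x00?\x00p\x00h\x00p\x00'

# A parser scanning for the ASCII bytes of "<?php" finds them in the UTF-8
# file but not in the UTF-16 one, because every character is padded with 00.
print(b"<?php" in open_tag_utf8)   # True
print(b"<?php" in open_tag_utf16)  # False
```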

Encoding is a simple but tedious problem. Many programs layer their encoding settings. MySQL, for example, distinguishes client -> connection -> storage on the way in and storage -> connection -> result on the way out, and storage is further divided into system, database, table, and column levels. I sometimes wonder whether it really needs to be this complicated. Who actually uses these MySQL features? Unless two clients have to work in different encoding environments, there is no need to set the client encoding separately. In most cases it is better to simply pass binary in and binary out.

