Issues with PHP UTF-8, Unicode, and BOM

Source: Internet
Author: User
Tags: ultraedit
A common problem: once a file is saved as UTF-8 with a BOM, the PHP script fails at run time, or, when the output is read through a file stream and converted to XML, the parser reports "The markup in the document following the root element must be well-formed".

I. Introduction

UTF-8 is a Unicode character encoding that is widely used in web applications. Its advantage is that it is a variable-length encoding in which every ASCII character takes a single byte, so pages made up mostly of ASCII characters need noticeably less bandwidth to transmit.
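The variable-length property is easy to see by encoding a few characters. The following is a minimal Python sketch; the sample characters and the helper name `utf8_length` are chosen only for illustration.

```python
# Sample characters: ASCII, Latin-1 range, CJK, and one outside the BMP.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

Only the ASCII character fits in one byte; the others take two, three, and four bytes respectively.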
The UTF-8 signature, also known as the BOM (Byte Order Mark), is the standard marker that the UTF encoding schemes use to identify the encoding. In UTF-16 it is the byte pair FF FE (little-endian) or FE FF (big-endian); in UTF-8 it becomes the three bytes EF BB BF. The marker is optional in UTF-8, because UTF-8 is byte-oriented and has no byte-order ambiguity, so its only use there is to help detect whether a byte stream is UTF-8 encoded. Microsoft software performs this detection, but some software does not and treats the BOM as ordinary characters. Microsoft writes the three bytes EF BB BF at the start of its own UTF-8 text files, and Notepad and other Windows programs rely on them to decide whether a text file is ASCII or UTF-8. This is largely a Microsoft convention, however; other platforms generally do not mark UTF-8 text files this way. In other words, a UTF-8 file may or may not have a BOM.
A byte stream should contain at most one BOM, at its very beginning. If several files are each saved with a signature and are then combined into one output, the resulting byte stream contains multiple UTF-8 signatures, and this is what causes the "root element must be well-formed" failure during XML conversion.
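The effect of combining several signed files can be reproduced on raw bytes. This is a sketch under stated assumptions: `strip_bom` is a hypothetical helper, and the two fragments stand in for files that were each saved with a signature.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def strip_bom(data: bytes) -> bytes:
    """Drop one leading UTF-8 signature, if present."""
    return data[len(BOM):] if data.startswith(BOM) else data

# Two hypothetical fragments, each saved "with signature".
part_a = BOM + b"<root>"
part_b = BOM + b"</root>"

naive = part_a + part_b                        # second BOM ends up mid-stream
clean = strip_bom(part_a) + strip_bom(part_b)  # no signature left at all

print(naive.count(BOM))
print(clean)
```

Stripping the signature from each piece before joining keeps stray BOM bytes out of the middle of the stream.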

II. View and convert

Since a UTF-8 file may or may not have a BOM, how do we tell the two apart?
Any editor with a hexadecimal mode will do. For example, open the file in UltraEdit-32, switch to hex editing mode, and check whether the file begins with EF BB BF. If it does, the file was saved with a BOM.
When the Notepad application built into Windows saves a file as UTF-8, it includes a BOM by default.
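The hex-editor check can also be scripted. Here is a minimal Python sketch; the function name `has_utf8_bom` and the throwaway file `demo.php` are assumptions made for the example.

```python
import codecs
import os
import tempfile

def has_utf8_bom(path: str) -> bool:
    """Return True if the file's first three bytes are EF BB BF."""
    with open(path, "rb") as fh:
        return fh.read(3) == codecs.BOM_UTF8

# Demo on a temporary file written with a signature.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.php")
    with open(demo, "wb") as fh:
        fh.write(codecs.BOM_UTF8 + b"<?php echo 'hi';")
    print(has_utf8_bom(demo))
```

The file is opened in binary mode so that no decoding step can hide the signature.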
There are many ways to convert a file; common editors such as UltraEdit-32 and Notepad++ can do it. Taking UltraEdit-32 as an example: after opening the file, choose "Save As" and select the desired encoding in the "Format" column.



Dreamweaver CS3 has a similar option. In Preferences, if Unicode (UTF-8) is selected as the default encoding, you can tick "Include Unicode Signature (BOM)" to write a byte order mark into the document; otherwise no BOM is included.
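The same conversion can be done without an editor. The sketch below is not what UltraEdit or Dreamweaver do internally; it is one way to get the same result in Python, and `resave_utf8` is a hypothetical helper name. It relies on Python's `utf-8-sig` codec, which skips a leading BOM when reading and writes one when writing.

```python
def resave_utf8(path: str, with_bom: bool) -> None:
    """Rewrite a UTF-8 file with or without the leading signature."""
    # "utf-8-sig" accepts input with or without a BOM.
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding="utf-8-sig", newline="") as fh:
        text = fh.read()
    # On write, "utf-8-sig" emits a BOM and plain "utf-8" does not.
    target = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=target, newline="") as fh:
        fh.write(text)
```

Calling `resave_utf8(path, with_bom=False)` on a PHP source file corresponds to saving it without the signature in an editor.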


I learned from this article:
What Windows calls a "Unicode" file is actually UTF-16, whose code units are identical to the Unicode code points for characters in the Basic Multilingual Plane. Conceptually, though, Unicode and UTF are different things: Unicode is the character set that assigns a code point to each character, while a UTF defines how those code points are stored and transmitted. UTF-16 comes in two byte orders: low byte first (little-endian, LE) and high byte first (big-endian, BE). The official UTF encodings also include UTF-32, which likewise comes in LE and BE forms. Outside the Unicode standard proper there is also UTF-7, used mainly for mail transmission. The single-byte part of UTF-8 is compatible with ASCII, which is also the lower half of ISO-8859-1. UTF-8 came about largely because some older systems and library functions cannot handle UTF-16 correctly, and for English text it also saves storage space (at the cost of extra space for non-English characters). For ASCII characters, UTF-8 and ISO-8859-1 both use one byte; for all other characters UTF-8 uses two to four bytes.
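The byte-order and size differences are easiest to see side by side. A small Python sketch follows; the helper `encoded_hex` and the two-character sample string are illustrative choices.

```python
def encoded_hex(text: str, encoding: str) -> str:
    """Encode text and return the bytes as space-separated hex."""
    return text.encode(encoding).hex(" ")

# 'A' is ASCII; U+00E9 is in ISO-8859-1 but outside ASCII.
sample = "A\u00e9"

for name in ("utf-16-le", "utf-16-be", "iso-8859-1", "utf-8"):
    print(f"{name:<10} {encoded_hex(sample, name)}")
```

The two UTF-16 rows contain the same byte pairs in swapped order, and U+00E9 takes one byte in ISO-8859-1 but two in UTF-8.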

More background on the BOM:
UCS contains a character named "ZERO WIDTH NO-BREAK SPACE", whose code point is U+FEFF. U+FFFE, by contrast, is not a character in UCS, so it should never appear in real data. The UCS specification recommends transmitting ZERO WIDTH NO-BREAK SPACE before the byte stream itself. If the receiver then sees the bytes FE FF, the stream is big-endian; if it sees FF FE, the stream is little-endian. This is why ZERO WIDTH NO-BREAK SPACE is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but the BOM can be used to indicate the encoding. The UTF-8 encoding of ZERO WIDTH NO-BREAK SPACE is EF BB BF, so a receiver that sees a byte stream starting with EF BB BF knows that it is UTF-8.
Windows uses the BOM to mark the encoding of text files.
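The receiver-side rule can be sketched as a small lookup on the first bytes of a stream. This is an illustrative Python sketch rather than a full encoding detector: `sniff_bom` is a hypothetical name, and UTF-16 LE data whose first character after the BOM is U+0000 would be misread as UTF-32 LE.

```python
import codecs

# Longer signatures are tested first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

encoding, skip = sniff_bom(b"\xef\xbb\xbfhello")
print(encoding, skip)
```

A caller would typically decode `data[skip:]` with the returned encoding, and fall back to a default when no signature is found.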

PHP does not handle the BOM either.
PHP was designed without the BOM in mind; that is, it does not skip the three BOM bytes at the start of a UTF-8-encoded file. Because it must be in
