Issues with UTF-8, Unicode, and BOM

Source: Internet
Author: User
Tags ultraedit
A common problem is that once a file is saved with a BOM, script execution fails, or reading the file through a file stream and converting the data to XML fails with the error "Markup in the document following the root element must be well-formed."
I. Introduction
UTF-8 is a Unicode character encoding that is widely used in web applications.
The advantage of UTF-8 is that it is a variable-length encoding: ASCII characters take just 1 byte each, so web pages whose content is mostly ASCII save a good deal of network bandwidth.
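As a rough illustration of the variable-length property, the following Python sketch counts the bytes UTF-8 needs for a few characters. The sample characters and the helper name utf8_length are chosen here for illustration only and do not come from the original text.

```python
# Sample characters (an arbitrary choice for this sketch):
# ASCII letter, accented Latin letter, CJK ideograph, emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

Only the ASCII character fits in a single byte; the others need two, three, and four bytes respectively.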
The UTF-8 signature is also called the BOM (byte order mark). It is the standard mark used to identify the encoding in the UTF encoding schemes: the character U+FEFF, which is stored as FE FF in big-endian UTF-16 (FF FE in little-endian) and becomes EF BB BF in UTF-8. The mark is optional in UTF-8, because UTF-8 has no byte-order issue; its only use there is to help detect whether a byte stream is UTF-8 encoded. Microsoft software performs this kind of detection, but some other software does not and treats the BOM as ordinary characters. Microsoft puts the three bytes EF BB BF at the start of the text files it saves as UTF-8, and Notepad on Windows relies on those three bytes to decide whether a text file is ASCII (ANSI) or UTF-8.
However, this is only Microsoft's convention; other platforms generally do not mark UTF-8 text files this way. In other words, a UTF-8 file may or may not have a BOM.
A stream should contain only one BOM, at its very beginning. If several files that each carry a signature are joined together, the resulting binary stream contains multiple UTF-8 signatures, and that is what makes XML conversion fail with the "Markup in the document following the root element must be well-formed" error.
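The multiple-signature situation can be reproduced in a few lines. The sketch below is a minimal Python illustration, not the original author's code: the helper concat_utf8 and the two sample byte strings are hypothetical, and the fix shown (dropping the leading BOM from each piece before joining) is one possible approach.

```python
import codecs


def concat_utf8(chunks):
    """Join UTF-8 byte strings, dropping a leading BOM from each one."""
    out = bytearray()
    for chunk in chunks:
        if chunk.startswith(codecs.BOM_UTF8):
            chunk = chunk[len(codecs.BOM_UTF8):]
        out += chunk
    return bytes(out)


# Two hypothetical file contents, each saved with its own signature.
part_a = codecs.BOM_UTF8 + b"<root>"
part_b = codecs.BOM_UTF8 + b"</root>"

naive = part_a + part_b                 # both signatures end up in the stream
clean = concat_utf8([part_a, part_b])   # no signature left

# The "utf-8-sig" codec removes only the BOM at the very start,
# so the second one survives as a U+FEFF character inside the text.
print(repr(naive.decode("utf-8-sig")))
print(repr(clean.decode("utf-8")))
```

The stray U+FEFF in the middle of the naively joined text is the kind of unexpected character that a strict parser may reject.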
II. View and convert
Since a UTF-8 file may or may not have a BOM, how can we tell the two apart?
Use any editor with a hexadecimal mode. For example, open the file in UltraEdit-32, switch to hex editing mode, and check whether the file begins with EF BB BF. If it does, the file was saved with a BOM.
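Where no hex editor is at hand, the same check can be scripted. The following Python sketch shows the first bytes of a file in hex and tests for the signature; the helper names head_hex and has_utf8_bom and the throwaway demo file are assumptions made for this example.

```python
import os
import tempfile

UTF8_BOM = b"\xef\xbb\xbf"


def head_hex(path, n=8):
    """Return the first n bytes of a file as spaced uppercase hex."""
    with open(path, "rb") as f:
        return f.read(n).hex(" ").upper()


def has_utf8_bom(path):
    """Return True if the file starts with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == UTF8_BOM


# Demo on a throwaway file that is written with a BOM.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(UTF8_BOM + b"hello")

demo_hex = head_hex(tmp.name)
demo_flag = has_utf8_bom(tmp.name)
os.remove(tmp.name)

print(demo_hex)
print(demo_flag)
```

The file is opened in binary mode on purpose: text mode would decode the bytes first and could hide the signature.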
The Notepad built into Windows includes a BOM by default when you save a file as UTF-8.
There are many ways to convert; common editors such as UltraEdit-32 or Notepad++ will do. Taking UltraEdit-32 as an example: after opening the file, select "Save as" and choose the UTF-8 variant you need (with or without a BOM) in the "Format" column.
In addition, Dreamweaver CS3 has similar options. In "Preferences", if Unicode (UTF-8) is selected as the default encoding, you can check the "Include Unicode Signature (BOM)" option to include the byte order mark (BOM) in the document. Otherwise, no BOM is included.
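The conversion can also be done without an editor. The sketch below is one possible Python approach, relying on the standard "utf-8-sig" codec, which skips a leading BOM when reading and writes one when used for output. The function name remove_bom and the temporary files are made up for this illustration.

```python
import os
import tempfile


def remove_bom(src, dst):
    """Copy a UTF-8 text file from src to dst, leaving out a leading BOM."""
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding="utf-8-sig", newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


# Demo: write a file with a BOM, convert it, and inspect the raw bytes.
workdir = tempfile.mkdtemp()
with_bom = os.path.join(workdir, "with_bom.txt")
without_bom = os.path.join(workdir, "without_bom.txt")

with open(with_bom, "wb") as f:
    f.write(b"\xef\xbb\xbfhello\r\n")

remove_bom(with_bom, without_bom)

with open(without_bom, "rb") as f:
    converted = f.read()

for path in (with_bom, without_bom):
    os.remove(path)
os.rmdir(workdir)

print(converted)
```

To go the other way and add a BOM, the output file could be opened with encoding="utf-8-sig" instead.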
III. Other knowledge
A so-called "Unicode" file is actually a UTF-16 file, whose code units match the Unicode code points for most characters. Conceptually, however, Unicode and UTF are two different things: Unicode is the character set that assigns a code point to each character, while the UTF encodings define how those code points are stored and transmitted. UTF-16 comes in two byte orders, low byte first (LE) and high byte first (BE). The official UTF encodings also include UTF-32, which likewise has LE and BE forms. There is also UTF-7, which is not an official Unicode encoding and is mainly used for mail transmission. The single-byte part of UTF-8 is compatible with ASCII, which is also the lower half of ISO-8859-1. UTF-8 exists largely to accommodate older systems and library functions that cannot handle UTF-16 properly, and for English text it also saves storage space (at the cost of extra space for some non-English characters). ASCII characters take one byte in both UTF-8 and ISO-8859-1; for all other characters UTF-8 uses two, three, or four bytes.
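The byte-order and length differences described above can be seen by encoding the same characters under several schemes. This is a small Python sketch; the helper hex_bytes and the sample characters are chosen for illustration.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as spaced uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


# The letter "A" (U+0041) in different encodings.
print("UTF-16 LE :", hex_bytes("A", "utf-16-le"))   # low byte first
print("UTF-16 BE :", hex_bytes("A", "utf-16-be"))   # high byte first
print("UTF-32 BE :", hex_bytes("A", "utf-32-be"))
print("UTF-8     :", hex_bytes("A", "utf-8"))

# A non-ASCII Latin-1 character (U+00E9): one byte in ISO-8859-1,
# two bytes in UTF-8.
print("ISO-8859-1:", hex_bytes("\u00e9", "iso-8859-1"))
print("UTF-8     :", hex_bytes("\u00e9", "utf-8"))

# A CJK character (U+4E2D): three bytes in UTF-8.
print("UTF-8     :", hex_bytes("\u4e2d", "utf-8"))
```

The explicit "-le" and "-be" codec names are used so that Python does not prepend a BOM of its own.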
Here is a more detailed description of BOM:
UCS contains a character named "Zero Width No-Break Space", whose code is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "Zero Width No-Break Space" before the byte stream itself. That way, if the receiver sees FE FF, the byte stream is big-endian; if it sees FF FE, the byte stream is little-endian. This is why the character "Zero Width No-Break Space" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "Zero Width No-Break Space" is EF BB BF, so a recipient that receives a byte stream starting with EF BB BF knows that it is UTF-8 encoded.
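The receiver-side logic for these signatures can be sketched as follows. This is a minimal Python example limited to the UTF-8 and UTF-16 marks discussed here; the function name sniff_bom is hypothetical, and a missing BOM simply yields None rather than a guess.

```python
import codecs

# Signature bytes paired with the encoding they indicate.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the encoding indicated by a leading BOM, or None if absent."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None


# U+FEFF is the character behind all three signatures.
bom_char = "\ufeff"
print(bom_char.encode("utf-8").hex(" ").upper())
print(bom_char.encode("utf-16-be").hex(" ").upper())
print(bom_char.encode("utf-16-le").hex(" ").upper())

print(sniff_bom(b"\xef\xbb\xbfhi"))
print(sniff_bom(b"hi"))
```

A result of None does not mean the data is not UTF-8, only that it carries no signature.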
Windows uses BOM to mark the encoding of text files.
PHP does not handle the BOM.
PHP's design did not take the BOM into account, so it does not skip the three BOM bytes at the start of a UTF-8 encoded file. Only what follows <? or <?php is executed as PHP code, so the three bytes in front of it are sent straight to the output. If a plug-in file has this problem, a white screen appears after you activate or deactivate the plug-in on the back-end page. If it is a template file, the three bytes are output directly, producing a small blank line at the top of the page.
English-language plug-ins and templates from abroad are generally saved as ASCII without a BOM; only domestic (Chinese) plug-ins and templates tend to have the problem, often without the author knowing. The same applies when you modify a template: because the output page uses UTF-8, a template to which you add Chinese characters must be saved as UTF-8 to display correctly, and if the editor automatically adds a BOM at that point, the three bytes are output on the page. How this looks depends on the browser; usually it is a blank line or garbled characters.
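When many plug-in or template files are involved, checking them one by one is tedious. The following Python sketch walks a directory tree and lists the files that start with the UTF-8 signature. The function name find_bom_files, the ".php" suffix default, and the demo file names are assumptions made for this example.

```python
import codecs
import os
import tempfile


def find_bom_files(root, suffix=".php"):
    """Return sorted paths under root that end in suffix and start with a BOM."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                if f.read(3) == codecs.BOM_UTF8:
                    hits.append(path)
    return sorted(hits)


# Demo with throwaway files: only a.php is a PHP file that carries a BOM.
with tempfile.TemporaryDirectory() as root:
    demo_files = {
        "a.php": codecs.BOM_UTF8 + b"<?php echo 'hi';",
        "b.php": b"<?php echo 'hi';",
        "c.txt": codecs.BOM_UTF8 + b"plain text",
    }
    for name, content in demo_files.items():
        with open(os.path.join(root, name), "wb") as f:
            f.write(content)
    found = [os.path.basename(p) for p in find_bom_files(root)]

print(found)
```

The files reported this way are the ones to re-save without a BOM.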
