UTF byte order and BOM
UTF-8 uses a single byte as its code unit, so it has no byte order problem. UTF-16 uses two bytes as its code unit, so before interpreting UTF-16 text you must first know the byte order of each code unit. For example, the Unicode code point of "奎" is 594E and that of "乙" is 4E59. If we receive the UTF-16 byte stream "59 4E", is it "奎" or "乙"?
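The ambiguity can be reproduced in a few lines of Python. This is only an illustrative sketch: it decodes the same two bytes, 59 4E, once under each byte order, using Python's built-in "utf-16-be" and "utf-16-le" codecs. The variable names are made up for the example.

```python
# The same two bytes, interpreted under both possible byte orders.
data = bytes([0x59, 0x4E])

as_big_endian = data.decode("utf-16-be")     # code unit read as 0x594E
as_little_endian = data.decode("utf-16-le")  # code unit read as 0x4E59

print(f"big endian:    U+{ord(as_big_endian):04X}")
print(f"little endian: U+{ord(as_little_endian):04X}")
```

Without extra information, nothing in the two bytes themselves says which reading is the intended one.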
The method the Unicode specification recommends for marking byte order is the BOM. Here BOM stands for Byte Order Mark, not "bill of materials". The BOM is a small but clever trick:
UCS contains a character named "ZERO WIDTH NO-BREAK SPACE", whose code point is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.
In this way, if the receiver gets FE FF first, the byte stream is big endian; if it gets FF FE first, the byte stream is little endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
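The receiver's check can be sketched as follows. The function name utf16_byte_order is hypothetical, and the sketch only looks at the first two bytes, assuming the stream is UTF-16 in the first place.

```python
def utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on the first two bytes."""
    if data[:2] == b"\xfe\xff":
        return "big"      # U+FEFF arrived high byte first
    if data[:2] == b"\xff\xfe":
        return "little"   # U+FEFF arrived low byte first
    return "unknown"      # no BOM: the byte order must be known some other way


print(utf16_byte_order(b"\xfe\xff\x59\x4e"))
print(utf16_byte_order(b"\xff\xfe\x4e\x59"))
```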
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (the reader can verify this with the encoding method described earlier). So if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8 encoded.
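The verification can also be done mechanically. The sketch below assumes the standard three-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx) for code points from U+0800 to U+FFFF. The helper utf8_three_bytes is a hypothetical name, and it handles only that range.

```python
import codecs


def utf8_three_bytes(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes.

    Sketch only: surrogates (U+D800..U+DFFF) are not rejected here.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte range is handled in this sketch")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


bom = utf8_three_bytes(0xFEFF)
print(" ".join(f"{b:02X}" for b in bom))
print(bom == codecs.BOM_UTF8)  # compare with the constant in the standard library
```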
Windows uses the BOM to mark the encoding of text files.
On Windows, the simplest way to convert a file's encoding is to use the built-in Notepad program, notepad.exe. After opening the file, click "Save As" in the "File" menu. A dialog box appears with an "Encoding" drop-down list at the bottom.
There are four options: ANSI, Unicode, Unicode big endian, and UTF-8.
1) ANSI is the default encoding. For English files this is ASCII; for Simplified Chinese files it is GB2312 (only on Simplified Chinese versions of Windows; Traditional Chinese versions use Big5).
2) Unicode refers to the UCS-2 encoding, that is, the character's Unicode code point stored directly in two bytes. This option uses the little endian format.
3) Unicode big endian corresponds to the previous option, but stores the bytes in big endian format. In the next section, I will explain the meanings of little endian and big endian.
4) UTF-8 is the encoding method mentioned in the previous section.
After choosing an encoding, click "Save" and the file is converted to that encoding immediately.
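The same kind of conversion can be scripted. The following is a minimal sketch, not a description of what Notepad does internally. The function convert_encoding is hypothetical, and "gb2312" and "utf-8-sig" (UTF-8 with a leading BOM) are Python codec names picked here as stand-ins for the ANSI option on a Simplified Chinese system and the UTF-8 option.

```python
import tempfile
from pathlib import Path


def convert_encoding(src, dst, from_enc, to_enc):
    """Read src as from_enc and write the same text to dst as to_enc."""
    raw = Path(src).read_bytes()
    text = raw.decode(from_enc)             # bytes -> text
    Path(dst).write_bytes(text.encode(to_enc))  # text -> bytes


# Demo on a throwaway file containing one GB2312-encoded character (D1 CF).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "ansi.txt"
    dst = Path(tmp) / "utf8.txt"
    src.write_bytes(b"\xd1\xcf")
    convert_encoding(src, dst, "gb2312", "utf-8-sig")
    converted = dst.read_bytes()

print(" ".join(f"{b:02X}" for b in converted))
```

Working on raw bytes rather than text-mode files keeps line endings untouched, so only the character encoding changes.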
Little endian and big endian
As mentioned in the previous section, Unicode code points can be stored directly in UCS-2 format. Take the Chinese character "严" as an example. Its Unicode code point is 4E25, which needs two bytes of storage: one byte is 4E and the other is 25. If 4E is stored first and 25 second, that is big endian; if 25 is stored first and 4E second, that is little endian.
These two odd names come from Gulliver's Travels by the English writer Jonathan Swift. In the book, a civil war breaks out in Lilliput, the country of little people, over whether an egg should be cracked open from the big end (Big-Endian) or from the little end (Little-Endian). Six wars were fought over the question; one emperor lost his life and another lost his throne.
Therefore, storing the first (high) byte first is "big endian", and storing the second (low) byte first is "little endian".
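As a small illustration, Python's built-in int.to_bytes produces both layouts for the code point 4E25. The variable names are just for the example.

```python
code_point = 0x4E25

big = code_point.to_bytes(2, byteorder="big")        # high byte 4E first
little = code_point.to_bytes(2, byteorder="little")  # low byte 25 first

print("big endian:   ", " ".join(f"{b:02X}" for b in big))
print("little endian:", " ".join(f"{b:02X}" for b in little))

# Reading the bytes back with the matching byte order recovers the value.
assert int.from_bytes(big, "big") == int.from_bytes(little, "little") == code_point
```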
Naturally, a question arises: how does a computer know which byte order a file uses?
As defined in the Unicode specification, a character indicating the byte order is placed at the beginning of each file. This character is named "ZERO WIDTH NO-BREAK SPACE" and is represented as FEFF. That is exactly two bytes, and FF is 1 greater than FE.
If the first two bytes of a text file are FE FF, the file is big endian. If the first two bytes are FF FE, the file is little endian.
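Python's generic "utf-16" codec applies this rule when decoding: it reads the leading BOM, picks the byte order accordingly, and drops the BOM from the result. The sketch below feeds it two hand-built byte strings, each holding the code point 4E25 after a BOM. The variable names are illustrative.

```python
big_endian_file = b"\xfe\xff\x4e\x25"     # BOM FE FF, then 4E 25
little_endian_file = b"\xff\xfe\x25\x4e"  # BOM FF FE, then 25 4E

# The "utf-16" codec uses the BOM to choose the byte order and strips it.
a = big_endian_file.decode("utf-16")
b = little_endian_file.decode("utf-16")
print(a == b, f"U+{ord(a):04X}")

# With an explicit byte order the BOM is not consumed; it stays in the
# text as the character U+FEFF.
kept = big_endian_file.decode("utf-16-be")
print([f"U+{ord(ch):04X}" for ch in kept])
```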
Example
The following is an example.
Open Notepad (notepad.exe) and create a new text file whose content is the single character "严". Save it four times, using the ANSI, Unicode, Unicode big endian, and UTF-8 encodings in turn.
Then use the hex mode of the text editor UltraEdit to see how each file is actually stored.
1) ANSI: the file is two bytes, "D1 CF", which is exactly the GB2312 encoding of "严". This also implies that GB2312 is stored high byte first, in big endian fashion.
2) Unicode: the file is four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code point is 4E25.
3) Unicode big endian: the file is four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.
4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8; the last three, "E4 B8 A5", are the UTF-8 encoding of "严", stored in the same order as the encoding itself.
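The four byte sequences can be reproduced without Notepad or a hex editor. In this sketch the mapping from Notepad's option names to Python codecs is an assumption: "gb2312" stands in for ANSI on a Simplified Chinese system, and the BOMs are added explicitly or through "utf-8-sig". The helper hex_bytes is a hypothetical name.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


char = "\u4e25"  # the character with code point 4E25

files = {
    "ANSI": char.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + char.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + char.encode("utf-16-be"),
    "UTF-8": char.encode("utf-8-sig"),  # "utf-8-sig" prepends EF BB BF
}

for name, data in files.items():
    print(f"{name}: {hex_bytes(data)}")
```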