Note: For more information about Unicode, see UTF-8, UTF-16, UTF-32 & BOM.
In the names UTF-8, UTF-16, and UTF-32, the number is the size of the code unit in bits: 8, 16, or 32 bits, that is, 1, 2, or 4 bytes. When a code unit spans more than one byte, byte order comes into play. UTF-8 uses a single byte as its code unit, so it has no byte-order problem.
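To make the byte-order point concrete: UTF-16 and UTF-32 each have two possible BOMs, one per byte order, while UTF-8 has only one. The following is a rough sketch that guesses the encoding from the leading bytes of a file. The function name `bom_type` is made up for this illustration, and it assumes `head -c` is available. Note that the bytes FF FE 00 00 are ambiguous (they could also be a UTF-16LE BOM followed by a NUL character), so treat the result as a heuristic.

```shell
# Print the encoding implied by the BOM at the start of a file,
# or "none" if the file does not start with a known BOM.
# bom_type is a hypothetical helper name used only in this sketch.
bom_type() {
    # Hex of the first four bytes without separators, e.g. "efbbbf3c".
    head4=$(head -c 4 "$1" | od -An -tx1 | tr -d '[:space:]')
    case "$head4" in
        efbbbf*)  echo "UTF-8" ;;
        fffe0000) echo "UTF-32LE" ;;   # must be tested before UTF-16LE
        0000feff) echo "UTF-32BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        feff*)    echo "UTF-16BE" ;;
        *)        echo "none" ;;
    esac
}
```

Only the UTF-8 case has a single pattern; the 16-bit and 32-bit encodings need two patterns each because their code units can be stored in either byte order.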
The main advantage of UTF-8 is its compatibility with ASCII, but that advantage is lost once a BOM is added, because the file no longer begins with plain ASCII bytes. A BOM can also cause other problems; for example, the following errors may be caused by one:
Shell: #!/bin/sh: No such file or directory
PHP: Warning: Cannot modify header information - headers already sent
Before discussing in detail how to detect and remove the BOM in UTF-8 files, let's warm up with an example:
shell> curl -s http://phone.jb51.net/ | head -1 | sed -n l
\357\273\277<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional\
//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r$
As shown above, the first three bytes are 357, 273, and 277, which is the BOM written in octal.
shell> curl -s http://phone.jb51.net/ | head -1 | hexdump -C
00000000  ef bb bf 3c 21 44 4f 43  54 59 50 45 20 68 74 6d  |...<!DOCTYPE htm|
00000010  6c 20 50 55 42 4c 49 43  20 22 2d 2f 2f 57 33 43  |l PUBLIC "-//W3C|
00000020  2f 2f 44 54 44 20 58 48  54 4d 4c 20 31 2e 30 20  |//DTD XHTML 1.0 |
00000030  54 72 61 6e 73 69 74 69  6f 6e 61 6c 2f 2f 45 4e  |Transitional//EN|
00000040  22 20 22 68 74 74 70 3a  2f 2f 77 77 77 2e 77 33  |" "http://www.w3|
00000050  2e 6f 72 67 2f 54 52 2f  78 68 74 6d 6c 31 2f 44  |.org/TR/xhtml1/D|
00000060  54 44 2f 78 68 74 6d 6c  31 2d 74 72 61 6e 73 69  |TD/xhtml1-transi|
00000070  74 69 6f 6e 61 6c 2e 64  74 64 22 3e 0d 0a        |tional.dtd">..|
As shown above, the first three bytes are EF, BB, and BF, which is the BOM written in hexadecimal.
Note: The example relies on a third-party website, so it may not always be available.
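Since the remote page may be gone, the same observation can be reproduced locally. The sketch below fabricates a single line that begins with a BOM and dumps its first three bytes in octal and in hex with `od`. The helper names (`sample_line`, `first3_octal`, `first3_hex`) and the sample content are assumptions made for this illustration, not the real page.

```shell
# Emit one sample line that starts with the UTF-8 BOM (octal 357 273 277).
# The line content is invented for this sketch.
sample_line() {
    printf '\357\273\277<!DOCTYPE html>\r\n'
}

# Print the first three bytes of standard input in octal.
first3_octal() {
    head -c 3 | od -An -to1 | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Print the first three bytes of standard input in hexadecimal.
first3_hex() {
    head -c 3 | od -An -tx1 | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

sample_line | first3_octal   # prints: 357 273 277
echo
sample_line | first3_hex     # prints: ef bb bf
echo
```

Both dumps describe the same three bytes; only the number base differs.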
In real project development you may face hundreds of thousands of text files. If a BOM sneaks into a few of them, it is hard to notice. If you have no UTF-8 text files with a BOM at hand to experiment with, you can create a few with vi. The relevant commands are as follows:
# Set UTF-8 encoding
:set fileencoding=utf-8
# Add BOM
:set bomb
# Remove BOM
:set nobomb
# Query BOM
:set bomb?
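When many test files are needed, a non-interactive counterpart to `:set bomb` may be handy. The following is a minimal sketch, not part of vi: `add_bom` is a hypothetical helper name, and it assumes the file is already UTF-8 encoded and that `mktemp` and `head -c` are available.

```shell
# Prepend the UTF-8 BOM to a file unless it already starts with one.
# add_bom is a hypothetical helper; it assumes the content is valid UTF-8.
add_bom() {
    file=$1
    lead=$(head -c 3 "$file" | od -An -tx1 | tr -d '[:space:]')
    if [ "$lead" = "efbbbf" ]; then
        return 0                    # already has a BOM, nothing to do
    fi
    tmp=$(mktemp) || return 1
    # Write the BOM followed by the original bytes, then copy the result
    # back into place so the original file keeps its permissions.
    { printf '\357\273\277'; cat "$file"; } > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Calling `add_bom` twice on the same file adds only one BOM, because the second call sees the existing mark and returns early.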
How do you detect a BOM in UTF-8 files?
shell> grep -r -I -l $'^\xEF\xBB\xBF' /path
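The grep pattern above matches any line that begins with the three BOM bytes, and the `$'...'` quoting needs a shell such as bash. As a stricter and more portable alternative, here is a sketch that inspects only the first three bytes of each file. The names `has_bom` and `list_bom_files` are invented for this example, and the loop assumes file names do not contain newlines.

```shell
# Succeed (exit status 0) only if the file starts with the UTF-8 BOM.
# has_bom is a hypothetical helper name.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d '[:space:]')" = "efbbbf" ]
}

# Print every regular file under a directory that starts with a BOM.
# Assumes file names contain no newline characters.
list_bom_files() {
    find "$1" -type f | while IFS= read -r f; do
        if has_bom "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

A call such as `list_bom_files /path` would then play the same role as the grep command, except that a BOM sequence in the middle of a file is not reported.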
How do you remove the BOM from UTF-8 files?
shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | xargs sed -i '1s/^\xEF\xBB\xBF//'
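The `sed -i` form relies on GNU sed. Where that is not available, the BOM can be dropped by copying the file from its fourth byte onward. The following is a minimal sketch under that assumption; `strip_bom` is a hypothetical helper name, and `mktemp` and `head -c` are assumed to exist on the system.

```shell
# Remove a leading UTF-8 BOM from a file; files without one are left untouched.
# strip_bom is a hypothetical helper name.
strip_bom() {
    file=$1
    lead=$(head -c 3 "$file" | od -An -tx1 | tr -d '[:space:]')
    [ "$lead" = "efbbbf" ] || return 0   # no BOM, nothing to do
    tmp=$(mktemp) || return 1
    # tail -c +4 copies everything from the fourth byte on, skipping the BOM.
    tail -c +4 "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Because only the first three bytes are skipped, every line after the first is preserved unchanged, which is worth verifying on a copy before running any BOM removal over a whole tree.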
Recommendation: If you use SVN, you can add a check like the following to the pre-commit hook to keep BOMs out of the repository:
#!/bin/bash
# bash rather than sh, because the $'...' quoting below is not understood by every /bin/sh
REPOS="$1"
TXN="$2"
SVNLOOK=/usr/bin/svnlook
# Paths that were updated (U) or added (A) in this transaction
FILES=`$SVNLOOK changed -t "$TXN" "$REPOS" | awk '/^[UA]/ {print $2}'`
for FILE in $FILES; do
    # Reject the commit if any line of the file starts with the UTF-8 BOM
    if $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | grep -q $'^\xEF\xBB\xBF'; then
        echo "Byte Order Mark found in $FILE" 1>&2
        exit 1
    fi
done
This article uses many shell commands. For reasons of space they are not explained in detail; if any of them are unfamiliar, please look them up.