Note: For background on Unicode, see "UTF-8, UTF-16, UTF-32 & BOM".
In UTF-8/16/32, the number in the name is the size of the code unit in bits: the code units are 8, 16, and 32 bits wide, that is, 1, 2, and 4 bytes. A code unit that spans more than one byte raises the question of byte order. A UTF-8 code unit is a single byte, so UTF-8 has no byte-order issue.
The main advantage of UTF-8 is that it is compatible with ASCII, but adding a BOM throws that advantage away. A BOM can also cause problems of its own; for example, errors like the following may be caused by one:
Shell: #!/bin/sh: No such file or directory
PHP: Warning: Cannot modify header information - headers already sent
Before discussing how to detect and remove the BOM in UTF-8 files, let's warm up with an example:
shell> curl -s http://phone.jb51.net/ | head -1 | sed -n l
\357\273\277<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional\
//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r$
As shown above, the first three bytes are 357, 273, 277, which is the BOM written in octal.
shell> curl -s http://phone.jb51.net/ | head -1 | hexdump -C
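00000000  ef bb bf 3c 21 44 4f 43  54 59 50 45 20 68 74 6d  |...<!DOCTYPE htm|
00000010  6c 20 50 55 42 4c 49 43  20 22 2d 2f 2f 57 33 43  |l PUBLIC "-//W3C|
00000020  2f 2f 44 54 44 20 58 48  54 4d 4c 20 31 2e 30 20  |//DTD XHTML 1.0 |
00000030  54 72 61 6e 73 69 74 69  6f 6e 61 6c 2f 2f 45 4e  |Transitional//EN|
00000040  22 20 22 68 74 74 70 3a  2f 2f 77 77 77 2e 77 33  |" "http://www.w3|
00000050  2e 6f 72 67 2f 54 52 2f  78 68 74 6d 6c 31 2f 44  |.org/TR/xhtml1/D|
00000060  54 44 2f 78 68 74 6d 6c  31 2d 74 72 61 6e 73 69  |TD/xhtml1-transi|
00000070  74 69 6f 6e 61 6c 2e 64  74 64 22 3e 0d 0a        |tional.dtd">..|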
As shown above, the first three bytes are EF, BB, BF, which is the BOM written in hexadecimal.
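The two dumps show the same three bytes in different bases. As a quick check, the sketch below prints each byte in both notations; the function name bom_bytes is made up for illustration.

```shell
# Print each BOM byte in hexadecimal and in octal.
# bom_bytes is a hypothetical helper name, not part of any tool.
bom_bytes() {
    for byte in 0xEF 0xBB 0xBF; do
        printf '%X = %o\n' "$byte" "$byte"
    done
}

bom_bytes
```

This prints "EF = 357", "BB = 273" and "BF = 277", one pair per line, matching the sed and hexdump output above.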
Note: The example uses a page from a third-party website, so there is no guarantee that it will always be available.
In real project development you may be dealing with hundreds of text files, and if a few of them contain a BOM it is hard to spot. If you have no UTF-8 text files with a BOM to experiment with, you can create some with vi. The relevant commands are:
# Set UTF-8 encoding
:set fileencoding=utf-8
# Add the BOM
:set bomb
# Remove the BOM
:set nobomb
# Query the BOM setting
:set bomb?
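Test files can also be produced without an editor. The following is a minimal sketch, assuming a POSIX shell; the function names make_bom_file and make_plain_file are hypothetical. It writes the BOM with octal escapes, which printf supports portably.

```shell
# Write a UTF-8 file that starts with a BOM.
# $1 = output path, $2 = one line of text (both chosen by the caller).
make_bom_file() {
    printf '\357\273\277%s\n' "$2" > "$1"
}

# Write the same content without a BOM, for comparison.
make_plain_file() {
    printf '%s\n' "$2" > "$1"
}
```

For example, make_bom_file with_bom.txt 'hello' creates a file that is three bytes longer than the one created by make_plain_file plain.txt 'hello'.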
How do you detect the BOM in UTF-8 files?
shell> grep -r -I -l $'^\xEF\xBB\xBF' /path
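Note that the grep pattern matches the three bytes at the start of any line, not only at the start of the file. If that distinction matters, a stricter check can look at the first three bytes only. The following is a sketch under that assumption; has_bom is a hypothetical name.

```shell
# Succeed (exit status 0) only if the first three bytes of the file
# named by $1 are EF BB BF. od prints them as lowercase hex, and tr
# removes the spaces and newline that od adds.
has_bom() {
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]
}
```

It can be used in a loop such as: for f in /path/*; do has_bom "$f" && echo "$f"; done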
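How do you delete the BOM from UTF-8 files? The substitution is limited to line 1, which is where a BOM belongs, so the rest of each file is left untouched (this assumes GNU sed):
shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | xargs sed -i '1s/^\xEF\xBB\xBF//'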
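The one-liner above relies on GNU sed for in-place editing and \x escapes. Where that is not available, the same result can be had with standard tools. This is a minimal sketch, not a hardened implementation; strip_bom is a hypothetical name, and it assumes mktemp is installed.

```shell
# Remove a leading BOM from the file named by $1, if it has one.
strip_bom() {
    # Leave the file alone unless it really starts with EF BB BF.
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ] || return 0
    tmp=$(mktemp) || return 1
    # tail -c +4 copies everything from the fourth byte onward;
    # writing back with cat keeps the original file's permissions.
    tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Calling strip_bom on a file that has no BOM leaves it unchanged, so it is safe to run repeatedly.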
Recommendation: If you use SVN, you can add code like the following to the pre-commit hook so that files containing a BOM are rejected before they reach the repository:
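#!/bin/bash
# bash, because the $'...' quoting below is not supported by every /bin/sh
REPOS="$1"
TXN="$2"
SVNLOOK=/usr/bin/svnlook
# Paths updated (U) or added (A) in this transaction
FILES=$($SVNLOOK changed -t "$TXN" "$REPOS" | awk '/^[UA]/ {print $2}')
for FILE in $FILES; do
    # Reject the commit if any line of the file starts with the BOM bytes
    if $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | grep -q $'^\xEF\xBB\xBF'; then
        echo "Byte Order Mark found in $FILE" 1>&2
        exit 1
    fi
done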
This article uses quite a few shell commands. For reasons of length I have not explained each one; if any of them is unfamiliar, please look it up.