Detecting and removing the BOM in UTF-8 encoded files

Last Update:2017-01-18 Source: Internet

Author: User

Tags curl


Note: For detailed background on Unicode, see "UTF-8, UTF-16, UTF-32 & BOM".

In UTF-8, UTF-16, and UTF-32, the number in the name is the size of the code unit in bits: 8, 16, or 32 bits, that is, 1, 2, or 4 bytes. When a code unit spans more than one byte, the order in which those bytes are stored (the byte order) matters. A UTF-8 code unit is a single byte, so UTF-8 has no byte-order problem.
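
As a rough illustration of why byte order matters only for the multi-byte code units, the sketch below classifies a file by its leading BOM bytes. The function name bom_type is hypothetical, and the byte patterns are the standard BOM encodings of U+FEFF. UTF-16 and UTF-32 each need two patterns (big-endian and little-endian), while UTF-8 needs only one.

```shell
# bom_type FILE: report which Unicode BOM, if any, FILE starts with.
# (bom_type is an illustrative name, not a standard utility.)
bom_type() {
    # First four bytes as lowercase hex with no separators, e.g. "efbbbf41".
    head4=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
    case $head4 in
        efbbbf*)  echo "UTF-8" ;;
        0000feff) echo "UTF-32BE" ;;
        fffe0000) echo "UTF-32LE" ;;  # must be tested before UTF-16LE
        feff*)    echo "UTF-16BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        *)        echo "none" ;;
    esac
}
```

One known ambiguity: the bytes FF FE 00 00 could also be a UTF-16LE BOM followed by a NUL character, so the result is a heuristic rather than a proof.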

The main advantage of UTF-8 is its compatibility with ASCII, but a BOM at the start of a file defeats that advantage. A BOM can also cause real problems; for example, errors like the following may be caused by a BOM:

Shell: #!/bin/sh: No such file or directory
PHP: Warning: Cannot modify header information - headers already sent

Before discussing how to detect and remove the BOM in UTF-8 files, let's warm up with an example:

shell> curl -s http://phone.jb51.net/ | head -1 | sed -n l
\357\273\277<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional\
//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r$

As shown above, the first three bytes are 357, 273, and 277: the BOM written in octal.

shell> curl -s http://phone.jb51.net/ | head -1 | hexdump -C
00000000  ef bb bf 3c 21 44 4f 43  54 59 50 45 20 68 74 6d  |...<!DOCTYPE htm|
00000010  6c 20 50 55 42 4c 49 43  20 22 2d 2f 2f 57 33 43  |l PUBLIC "-//W3C|
00000020  2f 2f 44 54 44 20 58 48  54 4d 4c 20 31 2e 30 20  |//DTD XHTML 1.0 |
00000030  54 72 61 6e 73 69 74 69  6f 6e 61 6c 2f 2f 45 4e  |Transitional//EN|
00000040  22 20 22 68 74 74 70 3a  2f 2f 77 77 77 2e 77 33  |" "http://www.w3|
00000050  2e 6f 72 67 2f 54 52 2f  78 68 74 6d 6c 31 2f 44  |.org/TR/xhtml1/D|
00000060  54 44 2f 78 68 74 6d 6c  31 2d 74 72 61 6e 73 69  |TD/xhtml1-transi|
00000070  74 69 6f 6e 61 6c 2e 64  74 64 22 3e 0d 0a        |tional.dtd">..|

As shown above, the first three bytes are EF, BB, and BF: the same BOM written in hexadecimal.

Note: These examples use a page from a third-party website, so there is no guarantee that they will keep working.
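
Because the page above may disappear, the same check can be reproduced offline. The following is a minimal sketch, assuming a POSIX shell with od and tr; the helper name first_bytes_hex and the shortened DOCTYPE line are stand-ins for illustration, not part of the original example.

```shell
# first_bytes_hex N: print the first N bytes of standard input as hex digits.
# (first_bytes_hex is a hypothetical helper name.)
first_bytes_hex() {
    od -An -tx1 -N"$1" | tr -d '[:space:]'
    echo
}

# Stand-in for the curl output: a BOM followed by a DOCTYPE line.
printf '\357\273\277<!DOCTYPE html>\r\n' | first_bytes_hex 3   # prints efbbbf

# The same line without a BOM, for comparison.
printf '<!DOCTYPE html>\r\n' | first_bytes_hex 3               # prints 3c2144
```

The octal escapes \357\273\277 passed to printf are the same three bytes that sed -n l displays, and efbbbf is what hexdump -C shows for them.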

In real project development you may be dealing with hundreds of text files, and if a few of them carry a BOM, they are hard to spot. If you have no UTF-8 text files with a BOM to experiment on, you can create some with vi. The relevant commands are:

# Set the file encoding to UTF-8
:set fileencoding=utf-8
# Add a BOM
:set bomb
# Remove the BOM
:set nobomb
# Check whether a BOM is set
:set bomb?
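
If you would rather create test files without opening an editor, something like the following should work. It is only a sketch: the directory and file names are made up for the example, and it relies on printf accepting octal escapes, as POSIX printf does.

```shell
# Work in a throwaway directory so no real files are touched.
sample_dir=$(mktemp -d)

# \357\273\277 is the UTF-8 BOM (EF BB BF) written as octal escapes.
printf '\357\273\277hello\n' > "$sample_dir/with_bom.txt"
printf 'hello\n' > "$sample_dir/without_bom.txt"

# The two files differ in size by exactly the three BOM bytes.
wc -c < "$sample_dir/with_bom.txt"
wc -c < "$sample_dir/without_bom.txt"
```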

How do you detect a BOM in UTF-8 files?

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path
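
The grep pattern is line-oriented, so it also reports a file in which those three bytes begin a later line. Where that matters, a stricter check can look only at the first three bytes of each file. The sketch below assumes POSIX find, od, and tr; the function name list_bom_files is hypothetical.

```shell
# list_bom_files DIR: print every regular file under DIR whose first three
# bytes are EF BB BF. (list_bom_files is an illustrative name.)
# Limitation: file names containing newlines are not handled.
list_bom_files() {
    find "$1" -type f | while IFS= read -r file; do
        # od -N3 reads only the first three bytes of the file.
        if [ "$(od -An -tx1 -N3 "$file" | tr -d '[:space:]')" = "efbbbf" ]; then
            printf '%s\n' "$file"
        fi
    done
}
```

Usage mirrors the grep command: list_bom_files /path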

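How do you remove the BOM from UTF-8 files?

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | xargs sed -i '1s/^\xEF\xBB\xBF//'

Note: This relies on GNU sed for the -i option and the \x escapes. The 1 address restricts the substitution to the first line of each file, and the rest of the file is left as it was.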
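
Where GNU sed is not available, the BOM can be removed with tail instead. This is a minimal sketch under that assumption, not a replacement for the command above; the function name strip_bom is hypothetical, and it should be tried on copies before being used on real files.

```shell
# strip_bom FILE: remove a leading UTF-8 BOM from FILE, if present.
# Files without a BOM are left untouched. (strip_bom is an illustrative name.)
strip_bom() {
    if [ "$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')" = "efbbbf" ]; then
        tmp=$(mktemp) || return 1
        # tail -c +4 copies everything from the fourth byte onward.
        # Writing back with cat keeps the original file's permissions.
        tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
}
```

It handles one file per call, so it can be combined with the detection command, for example in a loop over the file names that grep -l prints.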

Recommendation: If you use SVN, you can add a check like the following to the pre-commit hook so that files containing a BOM are rejected before they reach the repository:


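#!/bin/bash
# The $'...' quoting used below is a bash feature, so run the hook with bash.

REPOS="$1"   # repository path, passed in by Subversion
TXN="$2"     # name of the transaction about to be committed

SVNLOOK=/usr/bin/svnlook

# Paths of the files updated (U) or added (A) in this transaction
FILES=`$SVNLOOK changed -t "$TXN" "$REPOS" | awk '/^[UA]/ {print $2}'`

for FILE in $FILES; do
    if $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | grep -q $'^\xEF\xBB\xBF'; then
        echo "Byte Order Mark found in $FILE" 1>&2
        exit 1
    fi
done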

This article uses quite a few shell commands. For reasons of space they are not explained in detail; if any of them are unfamiliar, please look them up.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
