Techniques for detecting and removing the BOM in UTF-8 encoded files

Source: Internet
Author: User
Tags: curl
Note: For a detailed description of the related Unicode background, see "UTF-8, UTF-16, UTF-32 & BOM".

In UTF-8, UTF-16, and UTF-32, the number in the name is the size of the encoding's code unit in bits: 8, 16, or 32 bits, that is, 1, 2, or 4 bytes. A code unit that spans several bytes raises the question of byte order. A UTF-8 code unit is a single byte, so UTF-8 has no byte-order issue and does not need a BOM (byte order mark).
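
To make the byte-order point concrete, the following sketch encodes one character, é (U+00E9), in three encodings and dumps the resulting bytes. It assumes an iconv command that accepts the encoding names UTF-16BE and UTF-16LE is installed; dump_as is a hypothetical helper name, not something from the original article.

```shell
# dump_as ENCODING: read UTF-8 text on stdin, convert it to ENCODING,
# and print the resulting bytes as space-separated hex.
# (Hypothetical helper; assumes iconv and od are available.)
dump_as() {
    # Leaving the substitution unquoted collapses od's padding and newlines.
    echo $(iconv -f UTF-8 -t "$1" | od -An -tx1)
}

# \303\251 is the UTF-8 encoding of U+00E9, written as octal escapes.
printf '\303\251' | dump_as UTF-8      # c3 a9: two 1-byte units, order is fixed
printf '\303\251' | dump_as UTF-16BE   # 00 e9: one 2-byte unit, high byte first
printf '\303\251' | dump_as UTF-16LE   # e9 00: the same unit, low byte first
```

Only the 16-bit result changes with the byte order; the UTF-8 bytes are the same on every platform.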

The main advantage of UTF-8 is its compatibility with ASCII, but a BOM at the start of a file forfeits that advantage. The BOM can also cause problems of its own; for example, errors such as the following may be caused by a BOM:

Shell: #!/bin/sh: No such file or directory
PHP: Warning: Cannot modify header information - headers already sent

Before discussing how to detect and remove the BOM in UTF-8 files, let us warm up with an example:

shell> curl -s http://phone.jb51.net/ | head -1 | sed -n l
\357\273\277<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional\
//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r$

As shown above, the first three bytes are 357, 273, and 277, which is the BOM in octal.

shell> curl -s http://phone.jb51.net/ | head -1 | hexdump -C
00000000  ef bb bf 3c 21 44 4f 43  54 59 50 45 20 68 74 6d  |...<!DOCTYPE htm|
00000010  6c 20 50 55 42 4c 49 43  20 22 2d 2f 2f 57 33 43  |l PUBLIC "-//W3C|
00000020  2f 2f 44 54 44 20 58 48  54 4d 4c 20 31 2e 30 20  |//DTD XHTML 1.0 |
00000030  54 72 61 6e 73 69 74 69  6f 6e 61 6c 2f 2f 45 4e  |Transitional//EN|
00000040  22 20 22 68 74 74 70 3a  2f 2f 77 77 77 2e 77 33  |" "http://www.w3|
00000050  2e 6f 72 67 2f 54 52 2f  78 68 74 6d 6c 31 2f 44  |.org/TR/xhtml1/D|
00000060  54 44 2f 78 68 74 6d 6c  31 2d 74 72 61 6e 73 69  |TD/xhtml1-transi|
00000070  74 69 6f 6e 61 6c 2e 64  74 64 22 3e 0d 0a        |tional.dtd">..|

As shown above, the first three bytes are EF, BB, and BF, which is the BOM in hexadecimal.
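
The same inspection can be done on a local file without fetching a page. This is a minimal sketch that assumes head -c and od are available; first_bytes and has_bom are hypothetical helper names, and the sample file is created only for the demonstration.

```shell
# first_bytes FILE: print the first three bytes of FILE as hex,
# for example "ef bb bf" when the file starts with a UTF-8 BOM.
first_bytes() {
    echo $(head -c 3 < "$1" | od -An -tx1)
}

# has_bom FILE: succeed (exit status 0) only if FILE starts with EF BB BF.
has_bom() {
    [ "$(first_bytes "$1")" = "ef bb bf" ]
}

# Demonstration on a throwaway file; \357\273\277 is the BOM in octal.
sample=$(mktemp)
printf '\357\273\277hello\n' > "$sample"
first_bytes "$sample"                      # prints: ef bb bf
has_bom "$sample" && echo "BOM found"
rm -f "$sample"
```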

Note: These examples use a page from a third-party website, so there is no guarantee that they will always work.

In real project development you may face hundreds of text files, and a few files with a BOM mixed in among them are hard to spot. If you have no UTF-8 text files with a BOM to experiment on, you can create a few with vi. The relevant commands are:

# Set the file encoding to UTF-8
:set fileencoding=utf-8
# Add the BOM
:set bomb
# Remove the BOM
:set nobomb
# Query the BOM setting
:set bomb?
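
If an editor is not at hand, sample files can also be produced from the command line. This is a sketch that relies on the octal escapes of a POSIX printf to emit raw bytes; make_samples and the file names are arbitrary examples, not part of the original article.

```shell
# make_samples DIR: create one file with a BOM and one without in DIR.
# (Hypothetical helper; DIR must already exist.)
make_samples() {
    # \357\273\277 are the BOM bytes EF BB BF written as octal escapes.
    printf '\357\273\277<?php echo "with BOM";\n' > "$1/with_bom.php"
    printf '<?php echo "without BOM";\n' > "$1/without_bom.php"
}

# Example: create the two files in a fresh temporary directory.
dir=$(mktemp -d)
make_samples "$dir"
ls "$dir"
rm -rf "$dir"
```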

How do you detect the BOM in UTF-8 files?

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path
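
Note that the ^ in the grep pattern anchors to the start of any line, so a file with the BOM bytes at the start of a later line is reported too. A stricter variant, sketched below, compares only the first three bytes of each file. It assumes find, head -c, and od are available, does not skip binary files, and does not handle file names that contain newlines; list_bom_files is a hypothetical helper name.

```shell
# list_bom_files DIR: print every regular file under DIR whose first
# three bytes are EF BB BF. (Hypothetical helper name.)
list_bom_files() {
    find "$1" -type f | while IFS= read -r file; do
        # Hex of the first three bytes, with od's spacing removed.
        sig=$(head -c 3 < "$file" | od -An -tx1 | tr -d ' \n')
        if [ "$sig" = "efbbbf" ]; then
            printf '%s\n' "$file"
        fi
    done
}
```

It is called with the directory to scan, for example list_bom_files /path.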

How do you remove the BOM from UTF-8 files?

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | xargs sed -i '1s/^\xEF\xBB\xBF//'
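
The sed -i option and \x escapes used above are GNU extensions. Where GNU sed is not available, a leading BOM can be dropped by copying the file from its fourth byte onward. This is a minimal sketch assuming head -c, od, tail -c, and mktemp exist; strip_bom is a hypothetical helper name.

```shell
# strip_bom FILE: remove a leading UTF-8 BOM from FILE, if present.
# (Hypothetical helper; rewrites FILE through a temporary copy.)
strip_bom() {
    sig=$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$sig" = "efbbbf" ]; then
        tmp=$(mktemp) || return 1
        # tail -c +4 copies the input starting at its fourth byte.
        tail -c +4 < "$1" > "$tmp" && cat "$tmp" > "$1"
        rm -f "$tmp"
    fi
}
```

Writing back with cat keeps the original file's permissions, and a file without a BOM is left untouched.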

Recommendation: If you use SVN, you can add a check like the following to the pre-commit hook to keep the BOM out of the repository:

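#!/bin/bash
# Pre-commit hook: reject the commit if an added or updated file
# has a BOM at the start of a line.
# bash is used because not every /bin/sh understands $'...' quoting.

REPOS="$1"   # repository path (first hook argument)
TXN="$2"     # transaction name (second hook argument)

SVNLOOK=/usr/bin/svnlook

# Paths that were added (A) or updated (U) in this transaction.
FILES=`$SVNLOOK changed -t "$TXN" "$REPOS" | awk '/^[UA]/ {print $2}'`

for FILE in $FILES; do
    if $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | grep -q $'^\xEF\xBB\xBF'; then
        echo "Byte Order Mark is found in $FILE" 1>&2
        exit 1
    fi
done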
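
The check inside such a hook can be tried out without a repository by feeding it a stream directly. The sketch below is a stdin variant that looks only at the first three bytes of the stream; stream_has_bom is a hypothetical helper name, and the printf calls merely stand in for the output of svnlook cat.

```shell
# stream_has_bom: read stdin and succeed only if the stream starts
# with the bytes EF BB BF. (Hypothetical helper name.)
stream_has_bom() {
    [ "$(head -c 3 | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Stand-ins for file contents: one stream with a BOM, one without.
printf '\357\273\277<?php\n' | stream_has_bom && echo "BOM found"
printf '<?php\n' | stream_has_bom || echo "no BOM"
```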

This article uses quite a few shell commands. For reasons of length they are not explained in detail; if any of them is unfamiliar, please look it up.
