Detecting and removing the blank line caused by a UTF-8 BOM in a page

Source: Internet
Author: User
Tags: curl, pack, ultraedit

The following figure shows the HTML that Firebug displays when a page has this problem.

Figure 1

The rendered page contains a blank line, but nothing in the source code appears to produce it.


Method 1: Use PHP to strip the BOM. This is the method I use most often.

BOM: the Unicode file signature (Byte Order Mark, U+FEFF)

The BOM identifies the Unicode encoding of a file. However, when you receive a file, parse it, and write its contents into a database, a leftover BOM is a nuisance.
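As a quick check of the definition above, the following shell sketch prints the UTF-8 encoding of U+FEFF in hexadecimal and in octal. It assumes a POSIX printf and od; the helper names bom_bytes, bom_hex, and bom_oct are hypothetical.

```shell
#!/bin/sh
# Emit the three-byte UTF-8 encoding of U+FEFF.
# Octal escapes are used because they are portable in printf.
bom_bytes() {
    printf '\357\273\277'
}

# Dump the bytes in hex and in octal. tr and sed normalise the
# padding that different od implementations add.
bom_hex() { bom_bytes | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'; }
bom_oct() { bom_bytes | od -An -to1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'; }

echo "hex:   $(bom_hex)"
echo "octal: $(bom_oct)"
```

Both dumps describe the same three bytes, which is why the same signature can show up as EF BB BF in one tool and as 357 273 277 in another.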


Two functions found under utf8_encode can be used to test writing and removing a BOM.

Write a file with the BOM prepended to its content:

<?php
function writeUTF8File($filename, $content)
{
    $f = fopen($filename, 'w');
    // Write the three BOM bytes first, then the actual content.
    fwrite($f, pack("CCC", 0xef, 0xbb, 0xbf));
    fwrite($f, $content);
    fclose($f);
}
?>

Remove BOM function

<?php
function removeBOM($str = '')
{
    // Compare (not assign): strip the first three bytes only if they are the BOM.
    if (substr($str, 0, 3) === pack("CCC", 0xef, 0xbb, 0xbf)) {
        $str = substr($str, 3);
    }
    return $str;
}
?>

As shown above, the BOM is pack("CCC", 0xef, 0xbb, 0xbf), so to remove it you can use the removeBOM function above or one of the following:

■ str_replace("\xEF\xBB\xBF", '', $bom_content);
■ preg_replace("/^\xEF\xBB\xBF/", '', $bom_content);

Note that str_replace removes the sequence wherever it occurs, while the anchored preg_replace removes it only at the start of the string.

There is also a function for checking whether a string is UTF-8:

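function isUTF8($string)
{
    // Round trip through ISO-8859-1: characters outside Latin-1 become "?",
    // so valid UTF-8 containing such characters is reported as false.
    // utf8_encode()/utf8_decode() are deprecated as of PHP 8.2;
    // mb_check_encoding($string, 'UTF-8') is an alternative.
    return (utf8_encode(utf8_decode($string)) == $string);
}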
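A similar validity check can be tried from the command line. The following is a minimal sketch, assuming an iconv utility is installed; the function name is_utf8_file and the sample strings are hypothetical. It relies on iconv exiting with a non-zero status when the input contains a byte sequence that is not valid UTF-8.

```shell
#!/bin/sh
# Succeed if the file is valid UTF-8: iconv converts UTF-8 to UTF-8 and
# exits with a non-zero status when it meets an invalid byte sequence.
is_utf8_file() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

f=$(mktemp)
printf 'caf\303\251\n' > "$f"          # "café" encoded as UTF-8
is_utf8_file "$f" && echo "valid UTF-8"
printf 'caf\351\n' > "$f"              # "café" encoded as ISO-8859-1
is_utf8_file "$f" || echo "not valid UTF-8"
rm -f "$f"
```

A leading BOM does not change the result, because U+FEFF is itself a valid UTF-8 sequence.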

Method 2: Use the shell on Linux

Before discussing BOM detection and removal in UTF-8 files in detail, let's warm up with an example:

shell> curl -s http://www.111cn.net/ | head -1 | sed -n l
\357\273\277<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional\
//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r$

As shown above, the first three bytes are 357, 273, and 277, which is the BOM written in octal.

shell> curl -s http://www.111cn.net/ | head -1 | hexdump -C
00000000  ef bb bf 3c 21 44 4f 43  54 59 50 45 20 68 74 6d  |...<!DOCTYPE htm|
00000010  6c 20 50 55 42 4c 49 43  20 22 2d 2f 2f 57 33 43  |l PUBLIC "-//W3C|
00000020  2f 2f 44 54 44 20 58 48  54 4d 4c 20 31 2e 30 20  |//DTD XHTML 1.0 |
00000030  54 72 61 6e 73 69 74 69  6f 6e 61 6c 2f 2f 45 4e  |Transitional//EN|
00000040  22 20 22 68 74 74 70 3a  2f 2f 77 77 77 2e 77 33  |" "http://www.w3|
00000050  2e 6f 72 67 2f 54 52 2f  78 68 74 6d 6c 31 2f 44  |.org/TR/xhtml1/D|
00000060  54 44 2f 78 68 74 6d 6c  31 2d 74 72 61 6e 73 69  |TD/xhtml1-transi|
00000070  74 69 6f 6e 61 6c 2e 64  74 64 22 3e 0d 0a        |tional.dtd">..|

As shown above, the first three bytes are EF, BB, and BF, which is the BOM written in hexadecimal.

Note: the example uses a third-party website, so there is no guarantee it will always be available. In real project development you may face hundreds of thousands of text files, and if a few of them carry a BOM it is hard to notice. If you have no UTF-8 text file with a BOM at hand, you can create a few with vi. The relevant commands are as follows:

Set UTF-8 encoding:

:set fileencoding=utf-8

Add BOM:

:set bomb

Delete BOM:

:set nobomb

Query BOM:

:set bomb?
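If vi is not at hand, test files can also be produced from the shell. The following is a minimal sketch, assuming a POSIX printf; the function name make_bom_file and the file names are hypothetical.

```shell
#!/bin/sh
# Write a UTF-8 file whose first three bytes are the BOM (EF BB BF),
# followed by the given text and a newline.
# $1 = output path, $2 = text content
make_bom_file() {
    printf '\357\273\277%s\n' "$2" > "$1"
}

# Create one file with a BOM and one without, for comparison.
dir=$(mktemp -d)
make_bom_file "$dir/with_bom.txt" "hello"
printf '%s\n' "hello" > "$dir/without_bom.txt"

# The file with a BOM is three bytes longer than the one without.
wc -c < "$dir/with_bom.txt"
wc -c < "$dir/without_bom.txt"
rm -rf "$dir"
```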

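How to check for a BOM in UTF-8 files?

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path

How to delete the BOM from UTF-8 files?

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | xargs sed -i '1s/^\xEF\xBB\xBF//'

The 1 address limits the substitution to the first line of each file. Do not end the sed script with ;q: combined with -i, q stops sed before the rest of the file is copied, which truncates the file to its first line.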
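The grep pattern above matches a BOM at the start of any line, not only at the start of the file. The following sketch is a stricter per-file variant that looks only at the first three bytes. It swaps the sed substitution for tail -c +4, so it does not depend on GNU sed; it assumes head -c is available. The function names has_bom and strip_bom are hypothetical.

```shell
#!/bin/sh
# Succeed if the file's first three bytes are EF BB BF.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Remove a leading BOM in place; files without one are left untouched.
strip_bom() {
    if has_bom "$1"; then
        tmp=$(mktemp) || return 1
        # tail -c +4 copies everything from the fourth byte onward;
        # writing back with cat keeps the original file's permissions.
        tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"
        status=$?
        rm -f "$tmp"
        return $status
    fi
}

# Demonstration on a temporary file.
f=$(mktemp)
printf '\357\273\277<!DOCTYPE html>\n' > "$f"
has_bom "$f" && echo "before: BOM present"
strip_bom "$f"
has_bom "$f" || echo "after: BOM removed"
rm -f "$f"
```

Because each path is passed as a single quoted argument, file names containing spaces are handled, which a plain xargs pipeline does not guarantee.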

Recommendation: if you use SVN, you can add a check like the following to the pre-commit hook to keep files with a BOM out of the repository.

#!/bin/bash

REPOS="$1"
TXN="$2"

SVNLOOK=/usr/bin/svnlook

# Check every added (A) or updated (U) file in the transaction.
for FILE in $($SVNLOOK changed -t "$TXN" "$REPOS" | awk '/^[AU]/ {print $NF}'); do
    if $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | grep -q $'^\xEF\xBB\xBF'; then
        echo "Byte Order Mark found in $FILE" 1>&2
        exit 1
    fi
done
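The check at the heart of the hook can be tried without a repository. The following sketch reads content on standard input, the way the hook receives it from svnlook cat, and rejects it if the stream starts with the BOM. The function name reject_if_bom and the sample file names are hypothetical, and head -c is assumed to be available.

```shell
#!/bin/sh
# Read a file's content on stdin; fail with a message on stderr if it
# starts with the UTF-8 BOM. $1 is a label used only in the message.
reject_if_bom() {
    if [ "$(head -c 3 | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        echo "Byte Order Mark found in $1" 1>&2
        return 1
    fi
    return 0
}

# Try it without a repository: the first call fails, the second passes.
printf '\357\273\277<?php echo 1;\n' | reject_if_bom "with_bom.php" || echo "rejected"
printf '<?php echo 1;\n' | reject_if_bom "clean.php" && echo "accepted"
```

Unlike the grep test in the hook, this variant inspects only the first three bytes of the stream, so a BOM sequence at the start of a later line is not flagged.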

Many shell commands are used in this article.

Method 3: Use the UltraEdit editor to modify the document directly

Simply re-save the document that produces the blank line in a format without a BOM.

The following figure shows the encoding formats offered when UltraEdit saves the document:

Figure 2

Select UTF8-no BOM there to solve the problem.

 
