About character encoding

Source: Internet
Author: User

Open Notepad (notepad.exe) and create a new text file whose only content is the Chinese character "严" ("strict"). Save it four times, using the ANSI, Unicode, Unicode big endian, and UTF-8 encodings in turn.

Then open each file in the hex view of the text editor UltraEdit to see the bytes that are actually stored in the file.

1) ANSI: the file is two bytes, "D1 CF", which is exactly the GB2312 encoding of "严" ("strict"). The high-order byte is stored first, that is, in big-endian order.

2) Unicode: the file is four bytes, "FF FE 25 4E". "FF FE" indicates little-endian storage, and the actual code point is 4E25. (What Notepad calls "Unicode" is UTF-16 little endian.)

3) Unicode big endian: the file is four bytes, "FE FF 4E 25". "FE FF" indicates big-endian storage.

4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8. The last three, "E4 B8 A5", are the UTF-8 encoding of the character, stored in the same order as the encoding sequence.
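The four results above can be reproduced without Notepad. The following is a minimal Python sketch, not part of the original experiment. It assumes that "ANSI" on a Simplified Chinese Windows system gives the same bytes as Python's gb2312 codec for this character; hex_bytes and the dictionary labels are names invented for this illustration.

```python
import codecs

ch = "\u4e25"  # the character 严 (U+4E25)

def hex_bytes(data: bytes) -> str:
    """Format bytes the way a hex editor shows them, e.g. 'D1 CF'."""
    return " ".join(f"{b:02X}" for b in data)

# Assumption: these codec choices mirror Notepad's four "Save As" options.
encodings = {
    "ANSI (GB2312)": ch.encode("gb2312"),
    "Unicode (UTF-16 LE)": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian (UTF-16 BE)": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8 (with BOM)": ch.encode("utf-8-sig"),
}

for name, data in encodings.items():
    print(f"{name}: {hex_bytes(data)}")
```

The "utf-8-sig" codec is Python's name for UTF-8 with a leading BOM. The plain "utf-8" codec writes only the three bytes E4 B8 A5.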

The BOM occupies the first three bytes of a UTF-8 file. If you save a text file as UTF-8 in Notepad, open it in UltraEdit (UE), and switch to hex editing mode, you will see EF BB BF at the beginning. This makes such files easy to identify: software can use the BOM to recognize that a file is UTF-8, and some programs even require the files they read to carry a BOM. However, a lot of software still does not recognize the BOM.

PHP was not designed with the BOM in mind: it does not skip the three BOM bytes at the beginning of a UTF-8 file.

As the Bo-Blog wiki notes, Bo-Blog, which is written in PHP, is also troubled by the BOM. It mentions a further problem: "Because of how cookies are sent, if a BOM is present at the beginning of these files, the cookie cannot be sent (PHP has already sent the file header before sending the cookie), so login and logout stop working. Every feature that depends on cookies and sessions stops working." This is probably also why a blank page appears in the WordPress admin backend: if any executed file contains a BOM, those three bytes are sent as output, which breaks everything that relies on cookies and sessions.

The solution: if the file contains only English characters (that is, only ASCII characters), save it as ASCII. In UE, click File > Convert > UTF-8 to ASCII, or select ASCII encoding when saving. If the file uses DOS-format line endings, you can open it in Notepad, click Save As, and select ASCII encoding. If the file contains Chinese characters, use UE's Save As function and select "UTF-8 without BOM".

How can this problem be handled inside a Python program?

Method 1: change the source file itself and save it as UTF-8 without a BOM. Often, however, the file comes from somewhere else and cannot be modified, so this does not always work.

Method 2: replace the bytes \xef\xbb\xbf with an empty string.
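Method 2 can be sketched as follows. This is one possible way to do it, not necessarily what the original author used. The helper name strip_utf8_bom is invented for this example. It removes the three bytes only at the very start of the data, which is narrower than a global replace and avoids touching the same byte sequence if it occurs later in the content.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = b"\xef\xbb\xbfhello"
print(strip_utf8_bom(raw))      # b'hello'
print(raw.decode("utf-8-sig"))  # hello
```

When the goal is text rather than bytes, decoding with the "utf-8-sig" codec is often simpler: it drops a leading BOM if there is one and otherwise behaves like "utf-8". The same codec name can be passed as the encoding argument of open().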

Switching between hex mode and text mode in Vim

After opening the file with vim, run the following command:

:%!xxd displays the buffer in hexadecimal; :%!xxd -r converts it back to text.

Only changes made to the hexadecimal part take effect. Changes to the printable text on the right are ignored.

Opening a file in binary mode in Vim

vim -b binfile

Saving a file as BOM-free UTF-8 in Vim

Viewing the file format

Usually you first check whether the file format is what you expect and then decide whether to change it (you can, of course, change it directly without knowing the original format). The following commands show the file encoding and whether the file has a BOM.

# View the file encoding
:set fenc?
# Check whether the file has a BOM
:set bomb?

Modifying the file format

# Set the encoding to UTF-8
:set fenc=utf-8
# Remove the BOM
:set nobomb
# Add the BOM
:set bomb

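Deleting the BOM from UTF-8 files with Linux commands

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | xargs sed -i '1s/^\xEF\xBB\xBF//'

or

shell> grep -r -I -l $'^\xEF\xBB\xBF' /path | xargs sed -i 's/^\xEF\xBB\xBF//g'

or

shell> tail -c +4 old_file > new_file

The first command restricts the substitution to line 1 of each file. The last command drops the first three bytes unconditionally, so use it only on a file that is known to start with a BOM.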
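For systems without GNU grep and sed, the same cleanup can be approximated in Python. This is a rough sketch under stated assumptions: the function name remove_bom_in_tree is invented here, each file is read fully into memory, and, unlike grep -I, binary files are not skipped, so point it only at a directory of text files.

```python
import codecs
from pathlib import Path
from typing import List

def remove_bom_in_tree(root: str) -> List[str]:
    """Strip a leading UTF-8 BOM from every regular file under root.

    Returns the paths of the files that were rewritten.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if data.startswith(codecs.BOM_UTF8):
            # Rewrite the file in place without its first three bytes.
            path.write_bytes(data[len(codecs.BOM_UTF8):])
            changed.append(str(path))
    return changed
```

Calling remove_bom_in_tree("/path") rewrites only files that actually begin with EF BB BF and leaves all others untouched.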

If you commit code with SVN, you can add a check like the following to the pre-commit hook to reject files that contain a BOM.

#!/bin/bash
# SVN passes the repository path and the transaction name to the hook.
REPOS="$1"
TXN="$2"
SVNLOOK=/usr/bin/svnlook

# Check every added (A) or updated (U) file in the transaction.
for FILE in $($SVNLOOK changed -t "$TXN" "$REPOS" | awk '/^[AU]/ {print $NF}'); do
    if $SVNLOOK cat -t "$TXN" "$REPOS" "$FILE" | grep -q $'^\xEF\xBB\xBF'; then
        echo "Byte Order Mark found in $FILE" 1>&2
        exit 1
    fi
done

UTF-8 without a BOM is the standard form!
UTF-8 with a BOM is nothing but trouble!

BOM representation in different encodings

Encoding      BOM (hexadecimal)
UTF-8         EF BB BF
UTF-16-BE     FE FF
UTF-16-LE     FF FE
UTF-32-BE     00 00 FE FF
UTF-32-LE     FF FE 00 00
UTF-7         2B 2F 76, followed by one of the bytes 38, 39, 2B, 2F
UTF-1         F7 64 4C
UTF-EBCDIC    DD 73 66 73
SCSU          0E FE FF  (Standard Compression Scheme for Unicode)
BOCU-1        FB EE 28, optionally followed by FF
GB-18030      84 31 95 33
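The table can be turned into a small detector. The sketch below is illustrative only: it covers just the UTF-8, UTF-16, and UTF-32 rows, using the BOM constants from Python's standard codecs module, and detect_bom is a name made up for this example. Longer signatures are checked first because the UTF-32-LE mark (FF FE 00 00) begins with the UTF-16-LE mark (FF FE).

```python
import codecs
from typing import Optional

# Longer signatures first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xff\xfe\x25\x4e"))          # UTF-16-LE
print(detect_bom(b"\xef\xbb\xbf\xe4\xb8\xa5"))  # UTF-8
print(detect_bom(b"plain ascii"))               # None
```

A BOM is only a hint. A file without one may still be UTF-8, and the bytes FF FE 00 00 could in principle also be a UTF-16-LE file whose first character is U+0000, so treat the result as a best guess.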

Converting a file from DOS format to Unix format in Vim

:set fileformat=unix
:w

Showing the file encoding and BOM in the Vim status line

http://vim.wikia.com/wiki/Show_fileencoding_and_bomb_in_the_status_line

With the setting below, the status line shows, for example, [latin1], [iso-8859-15], or [utf-8,B], where ",B" means the file has a BOM.

if has("statusline")
 set statusline=%<%f\ %h%m%r%=%{\"[\".(&fenc==\"\"?&enc:&fenc).((exists(\"+bomb\")\ &&\ &bomb)?\",B\":\"\").\"]\ \"}%k\ %-14.(%l,%c%V%)\ %P
endif

The part that actually does the work is:

%{\"[\".(&fenc==\"\"?&enc:&fenc).((exists(\"+bomb\")\ &&\ &bomb)?\",B\":\"\").\"]\ \"}
