How to troubleshoot garbled characters in UTF-8 files in Vim

Last Update:2014-08-03 Source: Internet

Author: User


1. Basic Knowledge

Vim has four encoding-related options: fileencodings, fileencoding, encoding, and termencoding. In practice, setting any one of them incorrectly can produce garbled characters, so every Vim user should understand what each option means. The following sections describe the meaning and function of each one in detail.

(1) encoding
encoding is the character encoding Vim uses internally. Once it is set, all buffers, registers, and strings in Vim scripts use this encoding. If the text Vim is working with is in a different encoding, Vim first converts it to the internal encoding. Any characters that cannot be represented in the internal encoding are lost in that conversion. The internal encoding must therefore be expressive enough to represent every character you work with, or normal editing will be affected.
Because encoding governs the internal representation of every character in Vim, it should be set only once, when Vim starts. Changing encoding while Vim is running can cause many problems. The user manual recommends changing its value only in .vimrc, and in practice that is the only place where changing it makes sense. Unless you have a special reason, set encoding to utf-8. To avoid garbled menus and system messages on non-UTF-8 systems such as Windows, you can also add these settings:
set encoding=utf-8
set langmenu=zh_CN.UTF-8
language messages zh_CN.UTF-8

(2) termencoding
termencoding is the encoding Vim uses for screen display. When displaying text, Vim converts it from the internal encoding to the screen encoding before output. A character that cannot be converted to the screen encoding is shown as a question mark, but editing is not affected. If termencoding is not set, encoding is used directly and no conversion takes place.
For example, suppose you log in to a Linux workstation via telnet from Windows. The Windows telnet client uses GBK while Linux uses UTF-8, so Vim shows garbled characters in the telnet session. There are two ways to fix this: change Vim's encoding to gbk, or keep encoding as utf-8 and set termencoding to gbk so that Vim transcodes on display. With the first method, any characters in the edited file that GBK cannot represent are lost. With the second method, those characters cannot be displayed because of the terminal's limitations, but they are not lost during editing.
GVim runs in a graphical interface and does not depend on a terminal for display, so termencoding has no meaning for it. In GVim under GTK2, termencoding is always utf-8 and cannot be changed. On Windows, GVim ignores termencoding.
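
The difference between the two approaches can be sketched in Python, with a Python str standing in for Vim's internal text. This only illustrates the conversion behaviour described above and is not how Vim is implemented; the function name render_for_terminal and the sample text are made up for the example.

```python
def render_for_terminal(buffer_text: str, term_encoding: str) -> str:
    """Convert internal text to the terminal encoding for display.

    Characters the terminal encoding cannot represent become '?',
    while the buffer text itself is left untouched.
    """
    raw = buffer_text.encode(term_encoding, errors="replace")
    return raw.decode(term_encoding)


# Hypothetical buffer content: Chinese text plus two Hangul syllables,
# which GBK cannot represent.
buffer_text = "中文 \ud55c\uae00"

shown = render_for_terminal(buffer_text, "gbk")
print(shown)        # the Hangul syllables are displayed as question marks
print(buffer_text)  # the internal text still holds the original characters

# Using GBK as the internal encoding instead would lose those characters:
try:
    buffer_text.encode("gbk")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```

Only the displayed copy is degraded here, which corresponds to the second method: the terminal cannot show the characters, but nothing is lost from the text being edited.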

(3) fileencoding
When Vim reads a file from disk, it detects the file's encoding. If the file's encoding differs from Vim's internal encoding, Vim converts the text, and after the conversion it sets the fileencoding option to the file's encoding. When Vim writes the file to disk, it converts again if encoding and fileencoding differ. Setting fileencoding after opening a file is therefore a way to convert the file from one encoding to another. However, as described above, fileencoding is set automatically by Vim's detection when the file is opened. If the text is already garbled, the detection was wrong, and setting fileencoding again after opening the file cannot correct it.
In short, fileencoding is the character encoding of the file currently being edited in Vim. When saving, Vim writes the file in this encoding (whether or not it is a new file).
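
Why setting fileencoding afterwards cannot repair a wrong detection can be illustrated with a small Python sketch. It assumes a GBK file that was wrongly decoded as latin1 when it was opened; the variable names are invented for the example, and Python's codecs stand in for Vim's conversion.

```python
# Bytes on disk: the word "中文" saved in GBK.
raw = "中文".encode("gbk")

# Wrong detection on open: the bytes are decoded as latin1, which never fails.
buffer_text = raw.decode("latin-1")

# Changing the target encoding and saving only re-encodes the garbled text.
saved = buffer_text.encode("utf-8")
still_garbled = saved.decode("utf-8")

# Re-reading the original bytes with the correct encoding is what fixes it.
reloaded = raw.decode("gbk")

print(still_garbled == "中文")  # the saved copy is still garbled
print(reloaded == "中文")       # the reloaded copy is correct
```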

(4) fileencodings
Automatic encoding detection is controlled by fileencodings (note the plural). fileencodings is a comma-separated list in which each item is an encoding name. When a file is opened, Vim tries to decode it with each encoding in fileencodings in order. If decoding succeeds, Vim uses that encoding and sets fileencoding to it. If decoding fails, Vim tries the next encoding.
Therefore, when setting fileencodings, put strict encodings, which are more likely to fail when the file is not actually in that encoding, at the front, and put loose encodings at the end. For example, latin1 is a very loose encoding: text in any encoding can be decoded as latin1 without failure, although the decoded result is naturally garbled. If latin1 is placed at the top of fileencodings, every Chinese file opens garbled.
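
The effect of ordering can be sketched with a few lines of Python that try candidate encodings in turn, much as described above. This is a simplified model, not Vim's detection code; detect_encoding is a hypothetical helper, and the codec names are Python's (for example latin-1 rather than latin1).

```python
def detect_encoding(raw: bytes, candidates: list[str]) -> tuple[str, str]:
    """Return (encoding, text) for the first candidate that decodes raw."""
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


raw = "中文".encode("gbk")

# Strict encoding first, loose encoding last: utf-8 fails, gbk succeeds.
good_order = detect_encoding(raw, ["utf-8", "gbk", "latin-1"])

# Loose encoding first: latin-1 accepts anything, so detection stops there.
bad_order = detect_encoding(raw, ["latin-1", "utf-8", "gbk"])

print(good_order[0], good_order[1] == "中文")
print(bad_order[0], bad_order[1] == "中文")
```

With the loose encoding first, the GBK file is "successfully" decoded as latin-1 and the text comes out garbled, which is exactly the situation the ordering rule is meant to prevent.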

The following is a fileencodings setting commonly recommended on the Internet:

set fileencodings=ucs-bom,utf-8,cp936,gb18030,big5,euc-jp,euc-kr,latin1
ucs-bom is a very strict encoding. A file that is not in this encoding is almost never mistaken for ucs-bom, so it is placed first.
utf-8 is also quite strict. Apart from very short files (the much-discussed classic case is the GBK encoding of "Unicom" (联通) being misjudged as UTF-8), real-world files are almost never misjudged, so it is placed second.
Next come cp936 and gb18030. These two encodings are relatively loose, and placing them near the front would cause many misjudgments, so they come later. The code space of cp936 is smaller than that of gb18030, so cp936 is placed before gb18030.
big5, euc-jp, and euc-kr are about as loose as cp936. Placing them after cp936 inevitably causes many misjudgments when editing files in these encodings, but this is a problem Vim's built-in detection cannot solve. Since Chinese users rarely need to edit files in these encodings, we put cp936 and gb18030 ahead of them to make sure those two are detected correctly.
Finally, latin1 is an extremely loose encoding, so it has to go last. Unfortunately, a file that really is latin1 usually never gets the chance to fall back to latin1, because it is mistaken for one of the earlier encodings. As mentioned above, however, Chinese users rarely encounter such files.
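
The "Unicom" case mentioned above can be inspected in Python. The sketch below only checks whether the GBK bytes happen to follow the bit pattern of two-byte UTF-8 sequences (110xxxxx 10xxxxxx), which is what misleads a lenient detector. It says nothing about how any particular Vim version behaves, and a strict validator such as Python's own UTF-8 decoder still rejects these bytes because 0xC1 is an overlong lead byte. The helper name looks_like_two_byte_utf8 is made up for the example.

```python
def looks_like_two_byte_utf8(raw: bytes) -> bool:
    """Check whether raw is made only of 110xxxxx 10xxxxxx byte pairs."""
    if len(raw) % 2 != 0:
        return False
    pairs = zip(raw[0::2], raw[1::2])
    return all((lead & 0xE0) == 0xC0 and (trail & 0xC0) == 0x80
               for lead, trail in pairs)


unicom_gbk = "联通".encode("gbk")   # the classic short-file example
other_gbk = "中文".encode("gbk")    # a typical GBK word for comparison

print(unicom_gbk.hex(), looks_like_two_byte_utf8(unicom_gbk))
print(other_gbk.hex(), looks_like_two_byte_utf8(other_gbk))
```

A longer GBK file almost always contains at least one byte pair that breaks the pattern, which is why the misjudgment is mostly limited to very short files.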
If a file is decoded with the wrong encoding, the result is unreadable, and we say the file is garbled. If you know the correct encoding, you can specify it with ++enc=encoding when opening the file, for example:
:e ++enc=utf-8 myfile.txt

2. How Vim works

Now that the options that so easily confuse new users have been explained, let's look at how Vim's multi-encoding support works.
(1) Vim starts and sets the encoding of buffers, menu text, and message text according to the encoding value set in .vimrc.
(2) Vim reads the file to be edited and tries the character encodings listed in fileencodings one by one, then sets fileencoding to the encoding it detected. In practice, Vim's detection accuracy is not high, especially when encoding is not set to utf-8. We therefore strongly recommend setting encoding to utf-8, although this may cause a minor problem if you want Vim to display Chinese menus and messages.
(3) Vim compares fileencoding with encoding. If they differ, it calls iconv to convert the file content to the encoding given by encoding and places the converted content in the buffer opened for this file. The file can now be edited. Note: this step calls the external iconv.dll, so make sure the file exists in $VIMRUNTIME or in another directory listed in the PATH environment variable.
(4) When the file is saved after editing, Vim compares fileencoding with encoding again. If they differ, it calls iconv again to convert the buffer text to the encoding given by fileencoding and writes the result to the specified file. This step also calls iconv.dll.
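
Steps (2) to (4) can be modelled roughly in Python as a read, edit, write round trip. This is a sketch under stated assumptions: a Python str plays the role of the internal encoding, Python's codecs stand in for iconv, and edit_file and the temporary file are hypothetical names used only for the illustration.

```python
import os
import tempfile


def edit_file(path: str, candidates: list[str], edit) -> str:
    """Read, convert to internal text, edit, and write back in the file's encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    # Step (2): try each candidate in order to find the file encoding.
    for file_encoding in candidates:
        try:
            text = raw.decode(file_encoding)  # step (3): convert on read
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("no candidate encoding matched")
    text = edit(text)
    # Step (4): convert back to the file's own encoding on save.
    with open(path, "wb") as f:
        f.write(text.encode(file_encoding))
    return file_encoding


# Hypothetical usage with a temporary GBK file.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write("中文".encode("gbk"))

detected = edit_file(path, ["utf-8", "gbk", "latin-1"], lambda t: t + "编码")
with open(path, "rb") as f:
    result = f.read()
os.remove(path)

print(detected, result == "中文编码".encode("gbk"))
```

The file stays in GBK on disk even though the text was edited in a different internal representation, which is the point of keeping fileencoding separate from encoding.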

3. Solution examples

(1) Method 1: Edit the .vimrc file:
Add these two lines to /home/username/.vimrc or /root/.vimrc:
let &termencoding=&encoding
set fileencodings=utf-8,gbk,ucs-bom,cp936
With these settings, UTF-8 files can be edited.

(2) Method 2: After opening the file, set the following in the vi editor:
:set encoding=utf-8 termencoding=gbk fileencoding=utf-8

(3) Method 3: To create a UTF-8 file, set the following in the vi editor:
:set fenc=utf-8
:set enc=GB2312
You can then enter Chinese in the editor, and the file is saved as UTF-8.

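(4) Method 4: A recommended ~/.vimrc configuration:
" Internal encoding used by Vim
set encoding=utf-8
" Encodings tried in order when opening a file, strict ones first
set fileencodings=ucs-bom,utf-8,cp936,gb18030,latin1
" Encoding used for terminal display
set termencoding=gb18030
" Indentation, line numbers, and syntax highlighting
set expandtab
set ts=4
set shiftwidth=4
set nu
syntax on

if has('mouse')
set mouse-=
endif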

Postscript: This article is compiled from material found on the Internet. Because there are so many sources, they cannot be credited one by one. Please forgive me.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
