Vim file encoding detection and garbled-text handling
Vim has four encoding-related options: fileencodings, fileencoding, encoding, and termencoding. In practice, a wrong value in any one of them produces garbled text, so every Vim user should understand what each option means. The sections below go through their meaning and effect in turn.

1 encoding

encoding is the character encoding Vim uses internally. Once it is set, all buffers, registers, strings in scripts, and so on use this encoding. When Vim handles text in an encoding that differs from its internal one, it first converts that text to the internal encoding. Any character that cannot be represented in the internal encoding is lost in the conversion. The internal encoding should therefore be one expressive enough to represent everything you edit.

Because encoding determines the internal representation of every character in Vim, it should be set only once, at startup; changing it while Vim is running can cause many problems. Unless there is a specific reason not to, always set encoding to utf-8. To avoid garbled menus and system messages on non-UTF-8 systems such as Windows, these settings can be made at the same time:

    set encoding=utf-8
    set langmenu=zh_CN.UTF-8
    language messages zh_CN.UTF-8

2 termencoding

termencoding is the encoding Vim uses for screen display. When displaying text, Vim converts it from the internal encoding to the screen encoding before output. An internal character that cannot be converted to the screen encoding is shown as a question mark, but editing it is not affected. If termencoding is not set, Vim outputs text in encoding directly, without conversion.

For example, if you log in to a Linux workstation via Telnet from Windows, the Windows Telnet client uses GBK while Linux uses UTF-8, so text in Vim appears garbled in the Telnet session.
In this situation there are two ways to eliminate the garbling: change Vim's encoding to gbk, or keep encoding as utf-8 and set termencoding to gbk so that Vim transcodes for display. With the first approach, any character in the edited file that GBK cannot represent is lost. With the second, such characters cannot be displayed because of the terminal's limitation, but they are not lost during editing.

GVim, with its graphical interface, does not depend on a terminal for display, so termencoding is meaningless for it. In GVim under GTK2, termencoding is always utf-8 and cannot be changed; GVim under Windows ignores termencoding.

3 fileencoding

When Vim reads a file from disk, it detects the file's encoding. If that encoding differs from Vim's internal encoding, Vim converts the text, and when the conversion is complete it sets the fileencoding option to the file's encoding. When Vim saves the file, it converts again if encoding and fileencoding differ. So by setting fileencoding after opening a file and then saving, we can convert the file from one encoding to another.

However, as explained above, fileencoding is set automatically from Vim's detection when the file is opened. If the file is displayed garbled, the detection (and hence the decoding) has already gone wrong, so resetting fileencoding afterwards cannot repair it.

4 fileencodings

Automatic encoding detection is controlled by the fileencodings option; note the plural form. fileencodings is a comma-separated list in which each item is an encoding name.
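With all four options now introduced, the following is a minimal sketch of how their current values might be inspected in a running Vim, and how the conversion described in section 3 could be carried out. The target encoding utf-8 is only an illustrative assumption; any encoding Vim can write would do.

```vim
" Show the current value of each encoding-related option.
:set encoding? termencoding? fileencoding? fileencodings?

" Convert the file in the current buffer: change the encoding that will be
" used on write (utf-8 here is an illustrative target), then save.
:setlocal fileencoding=utf-8
:w
```

This only helps when the file was decoded correctly on opening; it changes how the buffer is written and does not repair text that is already garbled.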
When a file is opened, Vim tries to decode it with each encoding in fileencodings in turn. If decoding succeeds, Vim uses that encoding and sets fileencoding to it; if decoding fails, Vim moves on to the next encoding in the list. The list should therefore start with strict encodings, those most likely to fail when the file is actually in some other encoding, and end with loose ones. latin1, for example, is a very loose encoding: text in any encoding can be decoded as latin1 without a decoding failure, although the result is of course garbled. So if latin1 comes first in fileencodings, every Chinese file will naturally open garbled.

The following fileencodings setting is the one recommended by Yunnan Fox:

    set fileencodings=ucs-bom,utf-8,cp936,gb18030,big5,euc-jp,euc-kr,latin1

Here ucs-bom is a very strict encoding; a file in any other encoding is almost never misjudged as ucs-bom, so it goes first. utf-8 is also quite strict: apart from very short files (the classic example, which many people enjoy retelling, is the GBK-encoded word "Unicom" (联通) being misjudged as UTF-8), real-world files are almost never misjudged as utf-8, so it goes second. Next come cp936 and gb18030. These two are relatively loose and would cause many misjudgments if placed earlier, so they go further back. cp936 has a smaller code space than gb18030, so cp936 comes before gb18030. As for big5, euc-jp, and euc-kr, they are about as strict as cp936. Placed after it, files in these encodings will often be misjudged when opened, but this is something Vim's built-in detection mechanism cannot solve.
Since Chinese users rarely need to edit big5, euc-jp, or euc-kr files, cp936 and gb18030 are placed ahead of them so that these two encodings are recognized first. Last comes latin1. It is a very loose encoding, so it has to go at the end. Unfortunately, a file that really is latin1 usually never gets the chance to fall back to latin1, because an earlier encoding in the list tends to claim it first. As noted, though, Chinese users seldom encounter such files.

When the encoding is misjudged, the decoded result is unreadable to humans, and we say the file is garbled. If you know the file's correct encoding, you can specify it when opening the file with ++enc=encoding, for example:

    :e ++enc=utf-8 myfile.txt

5 fencview

As the previous sections show, the recognition rate of Vim's built-in detection mechanism is very low, especially when distinguishing among Simplified Chinese (GBK/GB18030), Traditional Chinese (Big5), Japanese (EUC-JP), and Korean (EUC-KR). Nor is it realistic for an average user to tell a file's encoding by eye. Yunnan Fox therefore strongly recommends the FencView plugin developed by mbbill of the Shuimu ("water-wood") community. The plugin identifies encodings using word-frequency statistics, and its accuracy is very high. It can be downloaded from http://www.vim.org/scripts/script.php?script_id=1708.
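As a rough sketch of how the plugin might be used once installed: the two commands below are assumed from the plugin's own documentation and are not described in this article, so check them against the version you download.

```vim
" Assumption: the FencView plugin (script_id=1708) is installed and defines
" these commands; verify the names in the plugin's documentation.

" Ask the plugin to detect the encoding of the current file and reload it.
:FencAutoDetect

" Open the plugin's encoding list window to choose an encoding by hand.
:FencView
```

If the plugin's guess is also wrong, the manual :e ++enc=... approach shown earlier is still available.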