1 Character encoding basics
Character encoding is one of the most fundamental topics in computing. If you are unfamiliar with it, please study it on your own; only a brief summary is given here.
1.1 Character Encoding Overview
A character encoding is a numeric representation of the characters people have invented for writing. The classic example is ASCII, an encoding designed in the West for English: it covers the 26 English letters, digits, punctuation, special characters, and so on. The problem is that ASCII only uses the range 0-127, so it can encode just 128 characters. As computers spread to other countries, it became clear that other languages needed far more than 128 characters. Countries therefore began defining encodings for their own languages, such as GBK in China and Shift_JIS in Japan.
This solved the shortage of ASCII code space, but it created a more serious problem: the national encodings were incompatible with one another, so text could not be processed in a uniform way. Unicode was created to solve this; its code space is large enough to cover the characters of all the world's languages.
1.2 Distinguishing character set (Charset) and character encoding (Char Encoding)
These two terms are sometimes not differentiated, but understanding their differences is critical to understanding character encodings.
Code point
As described above, each character is assigned a numeric value. For example, in the ASCII character set the character A is assigned the number 65, so the code point of A is 65. The set of all code points defined by an encoding specification is its character set.
Character encoding
A character encoding is the binary storage format for code points. Continuing the example, the code point of A in the ASCII character set is 65. How is this 65 represented as a sequence of binary 0s and 1s? That is the job of the character encoding. In the ASCII encoding, 65 is stored as 01000001, occupying one byte (8 bits).
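The difference between a code point and its stored form can be checked with a small Python sketch. The variable names are only illustrative; ord() and str.encode() are standard Python built-ins.

```python
# Code point: the number assigned to the character.
code_point = ord("A")

# Encoding: how that number is stored as bytes.
stored = "A".encode("ascii")

# Show the single stored byte as 8 binary digits.
bits = format(stored[0], "08b")

print(code_point)   # 65
print(len(stored))  # 1
print(bits)         # 01000001
```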
At this point the distinction may seem pointless. That is mainly because, in this example, the ASCII character set has only one character encoding, namely the ASCII encoding. This is not the case for other character sets, such as Unicode.
The Unicode character set specifies a code point for every character in the world. For example, the code point of the English letter A in Unicode is 65 (yes, this code point is compatible with ASCII). However, there are several ways to store this 65: UTF-8 stores it as 8 bits: 01000001; UTF-16 stores it as 16 bits: 0000000001000001; and UTF-32 stores it as 32 bits: 00000000000000000000000001000001 (the 16- and 32-bit forms are shown in big-endian byte order).
In other words, the Unicode character set corresponds to several different character encodings: UTF-8, UTF-16, UTF-32, and so on.
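As a rough illustration, the same code point can be encoded three ways in Python. The codec names utf-16-be and utf-32-be select the 16-bit and 32-bit big-endian forms without a byte order mark; to_bits is a helper invented for this sketch.

```python
def to_bits(data):
    """Render a byte string as a string of binary digits (illustrative helper)."""
    return "".join(format(b, "08b") for b in data)

ch = "A"  # code point 65

utf8 = ch.encode("utf-8")       # 1 byte
utf16 = ch.encode("utf-16-be")  # 2 bytes, big-endian, no BOM
utf32 = ch.encode("utf-32-be")  # 4 bytes, big-endian, no BOM

for name, data in [("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)]:
    print(name, len(data) * 8, "bits:", to_bits(data))
```

One code point, three different byte sequences: that is the difference between the character set and its encodings.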
The ASCII character set has only one encoding: ASCII character encoding.
The different encodings of the Unicode character set exist to suit different environments. For example, UTF-8 is used for network transmission and file storage, while UTF-16 is often used as an in-memory representation because its more uniform width makes processing faster.
Today, although the Unicode character set has been widely adopted, a large number of legacy character sets are still in use.
In recent years the term character set is seldom used on its own; character encoding is the more common term.
1.3 Character encoding and display
Encoding characters only takes care of storage, processing, and transmission. To actually draw the shape of a character, a matching font and a rendering mechanism are also required.
For GUI programs, the operating system provides APIs that render the specified characters. A terminal has a character encoding property: it parses the incoming byte stream according to that encoding and then calls its rendering engine to display the result. For details, see one of my blog posts: from the call to printf() to the string appearing on the display.
2 Analysis of how Vim reads, displays, and saves text files
2.1 Character encodings involved in Vim
(1) Character encoding of files on disk
Text files on disk are saved in some character encoding, and different files may use different encodings.
In Vim this is the option: fileencoding.
(2) Character encoding of Vim's buffers and interface
While Vim is running, its menus, labels, and all buffers use a single, common character encoding.
In Vim this is the option: encoding.
(3) Character encoding used by the terminal
A terminal uses only one character encoding at a time. It recognizes characters in the byte stream it receives according to that encoding and displays them. The terminal's encoding can be changed dynamically.
In Vim this is the option: termencoding.
2.2 Vim reading, display and storage analysis
(1) Read the file
When Vim opens a file, it does not know the file's character encoding, so it has to detect it. Detection tries candidates in order of priority, and that order is given by the fileencodings option. Vim tries each encoding listed in fileencodings in turn until it finds one that fits, and then sets fileencoding to that encoding.
The file's content is then converted from fileencoding to the encoding specified by encoding, and stored in the file's buffer.
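The probing idea can be modeled in a few lines of Python. This is a simplified sketch, not Vim's actual algorithm: the function detect_and_decode is hypothetical, and it only captures "try each candidate in order and keep the first that decodes without error". It also shows why the order of the list matters.

```python
def detect_and_decode(raw, candidates):
    """Return (name, text) for the first candidate that decodes raw without error."""
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("no candidate encoding fits")

raw = "中文".encode("gbk")  # bytes as they might sit on disk

# utf-8 rejects these bytes, so the probe falls through to gbk.
good_name, good_text = detect_and_decode(raw, ["utf-8", "gbk", "latin1"])

# latin1 accepts any byte sequence, so listing it first hides gbk.
bad_name, bad_text = detect_and_decode(raw, ["latin1", "gbk"])

print(good_name, good_text)
print(bad_name, bad_text)
```

Because a single-byte encoding such as latin1 never fails to decode, it only makes sense at the end of the candidate list, as a last resort.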
(2) Display file
After the file has been read, it is held in the buffer in the encoding given by encoding. To display it, Vim converts the text to the terminal encoding given by termencoding and writes it to the terminal. The terminal then identifies each character according to its own encoding property and calls its rendering engine to draw it on the screen.
(3) Save file
Vim converts the bytes in the buffer from encoding to fileencoding and writes them to disk, which completes the save.
As you can see, Vim performs conversions between three character encodings:
Read: fileencoding --> encoding
Display: encoding --> termencoding
Write: encoding --> fileencoding
As long as all three conversions succeed, Vim works correctly and no garbled text appears.
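The three conversions can be sketched in Python as plain byte-to-byte transformations. The function names read_file, display, and write_file are invented for illustration and only model the data flow, assuming the buffer is kept as UTF-8 bytes; Vim's internals are more involved.

```python
def read_file(raw, fileencoding, encoding="utf-8"):
    # Read: fileencoding --> encoding
    return raw.decode(fileencoding).encode(encoding)

def display(buffer, encoding="utf-8", termencoding="utf-8"):
    # Display: encoding --> termencoding
    return buffer.decode(encoding).encode(termencoding)

def write_file(buffer, fileencoding, encoding="utf-8"):
    # Write: encoding --> fileencoding
    return buffer.decode(encoding).encode(fileencoding)

on_disk = "中文".encode("gbk")    # a GBK file on disk
buf = read_file(on_disk, "gbk")   # buffer now holds UTF-8 bytes
to_terminal = display(buf)        # bytes sent to a UTF-8 terminal
saved = write_file(buf, "gbk")    # bytes written back to disk

print(saved == on_disk)  # True
```

When every step succeeds, the bytes written back are identical to the bytes that were read.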
However, not every conversion between character encodings is lossless. For example, converting GBK to ASCII fails because ASCII cannot represent most of the characters in GBK.
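A minimal Python sketch of such a lossy conversion: text decoded from GBK cannot be encoded as ASCII, and forcing the conversion replaces each Chinese character with a question mark.

```python
text = "中文".encode("gbk").decode("gbk")  # text that originated in a GBK file

try:
    text.encode("ascii")
    lossless = True
except UnicodeEncodeError:
    lossless = False  # ASCII has no code for these characters

# Forcing the conversion discards the original characters.
replaced = text.encode("ascii", errors="replace")

print(lossless)  # False
print(replaced)  # b'??'
```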
3 Analysis of common garbled-text cases
3.1 Vim detects fileencoding incorrectly when reading a file
This one is easy to understand: if a file is stored in GBK but Vim detects fileencoding as ASCII, the result is bound to be wrong.
Solution: the first option is to rely on Vim improving its detection; the second is to set fileencodings appropriately, putting the encodings you are most likely to use at the front. If Vim still cannot detect the encoding, you have to specify it manually, for example by reloading the file with :e ++enc=xxx (note that :set fileencoding=xxx alone does not re-read the file; it only changes the encoding used when the buffer is written).
3.2 fileencoding cannot be converted correctly to encoding
For example, if the file is GBK-encoded and encoding is set to ASCII, a large number of Chinese characters cannot be converted, resulting in garbled text.
Solution: set encoding to UTF-8. UTF-8 can currently represent all characters, so any other encoding can be converted to UTF-8 without loss.
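For comparison with the ASCII case, here is a small Python sketch showing that a round trip through UTF-8 loses nothing: GBK bytes converted to UTF-8 and back come out unchanged.

```python
original = "中文".encode("gbk")  # 4 bytes in GBK

# GBK --> UTF-8 (as when Vim reads the file with encoding=utf-8)
as_utf8 = original.decode("gbk").encode("utf-8")

# UTF-8 --> GBK (as when Vim writes the file back)
back = as_utf8.decode("utf-8").encode("gbk")

print(len(original), len(as_utf8))  # 4 6
print(back == original)             # True
```

The UTF-8 form is longer here, but no information is lost in either direction.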
3.3 encoding cannot be converted correctly to termencoding
This problem is similar to 3.2.
Solution: set termencoding to the same value as encoding. In the default case, termencoding="", the two are the same.
3.4 termencoding does not match the terminal's actual character encoding
For example, suppose the terminal's encoding property is GBK but termencoding is UTF-8. Vim then wrongly assumes the terminal uses UTF-8 and sends it a UTF-8 byte stream, which the terminal interprets as GBK, so the text is of course displayed as garbage.
Solution: make the terminal's actual encoding and Vim's termencoding agree.
3.5 The terminal lacks the ability to display the characters
For example, a traditional character terminal cannot display Chinese characters at all. Even if it can recognize UTF-8 encoded Chinese characters, its rendering engine cannot draw them correctly, so they show up as garbage.
Solution: use terminal emulator software such as PuTTY whenever possible and avoid working directly on a character terminal. If that is unavoidable, avoid characters outside the ASCII character set, and learn English well.
4 Best practices for eliminating garbled characters
Set all encodings to UTF-8. This not only makes it possible to represent every human language, but also avoids the overhead of converting between different encodings.
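To see why agreement matters, the sketch below sends the same UTF-8 bytes to a terminal that decodes them as UTF-8 and to one that decodes them as GBK. The helper terminal_shows is hypothetical and only models the decoding step, not the rendering.

```python
def terminal_shows(byte_stream, terminal_encoding):
    """Model a terminal decoding the bytes it receives (hypothetical helper)."""
    return byte_stream.decode(terminal_encoding, errors="replace")

text = "中文"
sent = text.encode("utf-8")  # what Vim writes when termencoding is utf-8

matched = terminal_shows(sent, "utf-8")   # terminal really is UTF-8
mismatched = terminal_shows(sent, "gbk")  # terminal is actually GBK

print(matched == text)     # True
print(mismatched == text)  # False
```

With every layer on UTF-8, the bytes Vim sends are exactly the bytes the terminal expects.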
4.1 Vim Settings
set encoding=utf-8
set termencoding=utf-8
set fileencodings=utf-8,gbk,latin1
If there are no special requirements and restrictions, the disk files are also stored in UTF-8 mode.
set fileencoding=utf-8
4.2 Terminal Setup
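The settings for a few commonly used terminal programs are as follows.
(1) PuTTY
In the configuration dialog, under Window > Translation, set the remote character set to UTF-8.
(2) Mac Terminal
In the profile settings, under Advanced, set the text encoding to Unicode (UTF-8).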
Let vim completely say goodbye to garbled characters