A brief note on encoding and mojibake (garbled-text) problems.
People who ask this kind of question are usually mixing up several distinct concepts without realizing it:
- The terminal's display encoding (on Windows the terminal is cmd; on Linux it is one of various terminal emulators; over a remote connection it is PuTTY, Xshell, etc.).
- The encoding of the shell environment. For example, Chinese Windows uses GBK (backward compatible with GB2312), while most Linux distributions use UTF-8 (LANG=zh_CN.UTF-8).
- The encoding of the text file. This usually depends on your editor; some editors support multiple encodings and let you declare one at the top of the file, e.g. # -*- coding: utf-8 -*-. When Vim sees this line it treats the script as a UTF-8-compatible encoding by default.
- The application's internal encoding. As data, a string is just a byte array; but as an array of characters, it needs an interpretation. The internal character encoding of both Java and Python is UTF-16, and both support decoding byte arrays with different encodings to obtain character arrays.
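A minimal Python 3 sketch of that last point: the same byte array yields different character arrays depending on which encoding you decode it with (the byte values here are the GBK encoding of "汉字"):

```python
# The same four bytes, interpreted two ways.
raw = b'\xba\xba\xd7\xd6'          # "汉字" encoded in GBK

as_gbk = raw.decode('gbk')         # interpret as GBK -> 2 characters
as_latin1 = raw.decode('latin-1')  # interpret as Latin-1 -> 4 characters

print(as_gbk)     # 汉字
print(as_latin1)  # ºº×Ö (mojibake: each byte became one character)
```

Same data, two different "strings" — the bytes only become characters once you pick an encoding.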
Now to the OP's question.
I ran the same experiment in the default terminal of Ubuntu Kylin (Chinese locale), and got the opposite result from the OP's:
See that?
Neither the OP nor I is lying. Why?
Because in

unicode("汉字","gb2312")

the byte string "汉字" contains whatever bytes the terminal's encoding produced when you typed it, so the very same line of code hands different bytes to unicode() on different systems.
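Python 2's unicode() is gone, but the effect is easy to reproduce in Python 3 with explicit bytes. This sketch assumes the byte values a GBK terminal and a UTF-8 terminal would each send for the visible text "汉字":

```python
# What the interpreter actually receives when you type "汉字":
from_gbk_terminal = b'\xba\xba\xd7\xd6'           # GBK terminal (e.g. cmd on Chinese Windows)
from_utf8_terminal = b'\xe6\xb1\x89\xe5\xad\x97'  # UTF-8 terminal (typical Linux)

# Decoding as gb2312 only works for the bytes the GBK terminal produced:
print(from_gbk_terminal.decode('gb2312'))  # 汉字

try:
    from_utf8_terminal.decode('gb2312')
except UnicodeDecodeError as e:
    print('decode failed:', e)  # same code, opposite result
```

The code is identical in both cases; only the bytes behind the literal differ.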
I think the key is to distinguish the concepts of "byte" and "character", plus a little common sense about fonts.
"character" can be regarded as an abstract concept, such as when the landlord said "kanji", in fact, he meant to express the concept of a two characters.
When characters are represented in a computer, they must be encoded into binary (bytes), which is where the different encoding schemes such as GBK and UTF-8 come in. As Kenneth showed, the characters "汉字" are encoded as 0xbabad7d6 in GBK and as 0xe6b189e5ad97 in UTF-8.
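Those byte values are easy to verify in Python 3:

```python
text = '汉字'

gbk_bytes = text.encode('gbk')     # encode the abstract characters as GBK bytes
utf8_bytes = text.encode('utf-8')  # ...and as UTF-8 bytes

print(gbk_bytes.hex())   # babad7d6
print(utf8_bytes.hex())  # e6b189e5ad97
```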
Finally, at display time, the abstract characters are turned into images according to the font in use.
So the OP's first problem is this: although you see the image "汉字", the bytes encoding it in the script's source file could be anything; under Windows it is typically GBK or GB18030. Python therefore sees a string of GBK/GB18030-encoded bytes, while you are telling Python those bytes are UTF-8, so of course you get an error.
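That error can be shown directly: GBK bytes are not valid UTF-8, so decoding them as UTF-8 fails (Python 3 sketch):

```python
gbk_bytes = '汉字'.encode('gbk')  # what a GBK-saved script actually contains

try:
    gbk_bytes.decode('utf-8')     # claiming the file is UTF-8
except UnicodeDecodeError as e:
    print('UnicodeDecodeError:', e)
```

0xBA is a continuation byte in UTF-8 and can never start a character, so the decoder rejects the data immediately.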
As for the second problem: I'm not familiar with SQL Server, but it looks as though, when the data read from the database (bytes, possibly GBK etc.) was stored into the variable, the program mistakenly treated non-Unicode bytes as a Unicode encoding. So the approach should be to figure out what encoding the data is in when it is read (which may depend on the encoding it was stored with, or on the database configuration), and what conversion the program performed when storing it into the variable.
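The general fix is the usual decode-at-the-boundary pattern. A hedged sketch — the 'gbk' here is an assumption, since the real encoding depends on how the data was stored and on the database/driver configuration:

```python
def from_db(raw: bytes, db_encoding: str = 'gbk') -> str:
    """Decode bytes read from the database into a real character string."""
    return raw.decode(db_encoding)

def to_terminal(text: str, term_encoding: str = 'utf-8') -> bytes:
    """Re-encode the characters for whatever the output side expects."""
    return text.encode(term_encoding)

raw = b'\xba\xba\xd7\xd6'          # e.g. GBK bytes coming from the database
print(to_terminal(from_db(raw)))   # b'\xe6\xb1\x89\xe5\xad\x97'
```

Decode exactly once on input, work with character strings internally, and encode exactly once on output; guessing the encoding mid-pipeline is what produces mojibake.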