Python 2.7 Chinese character encoding: which encoding should you choose when using unicode()?

Source: Internet
Author: User
I will briefly discuss encodings and garbled text.

People who ask this kind of question are usually conflating several different concepts, without realizing that they are doing so.

  1. The encoding of the characters displayed by the terminal (cmd on Windows, the terminal emulator on Linux, PuTTY or Xshell for remote logins).
  2. The encoding of the shell environment. For example, Chinese-language Windows uses GBK (backward compatible with GB2312), while most Linux distributions use UTF-8 (LANG=zh_CN.UTF-8).
  3. The encoding of text files. This usually depends on your editor. Some editors support multiple encodings, and you can tell the editor which one to use with a declaration at the top of the file, for example # -*- coding: utf-8 -*-. Vim treats such a script as UTF-8 compatible by default, and Python reads the same declaration to decide how to decode the source file.
  4. The internal encoding of the application. As data, a string is only a byte array; to treat it as an array of characters, you need a rule for parsing it. Java represents characters internally as UTF-16, and Python 2.7 stores unicode objects as UCS-2 or UCS-4 depending on how the interpreter was built. Both Python and Java can decode a byte array into a character array using many different encodings.
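The layers listed above can be inspected from inside Python. The following is a minimal sketch, not a complete diagnostic: the helper name describe_encodings is made up for illustration, the values it reports differ from machine to machine, and the encoding of a text file (item 3) cannot be queried this way because it is a property of the bytes on disk.

```python
# -*- coding: utf-8 -*-
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for the environment layers."""
    return {
        # What Python believes the terminal uses; may be None when output is piped.
        "terminal": getattr(sys.stdout, "encoding", None),
        # What the shell locale (for example LANG=zh_CN.UTF-8) asks for.
        "locale": locale.getpreferredencoding(),
        # The codec used for implicit conversions between bytes and text.
        "default": sys.getdefaultencoding(),
        # The codec used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for layer, encoding in sorted(describe_encodings().items()):
        print("%-12s %s" % (layer, encoding))
```

If the terminal and locale entries disagree with the encoding your editor saved the script in, that mismatch is the first place to look for garbled text.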

Now to the original poster's question.

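I ran the same experiment in the default terminal of Ubuntu Kylin with a Chinese locale, and the result was exactly the opposite of the original poster's.

Neither the original poster nor I am lying. So why do the results differ? Because in

unicode("汉字", "gb2312")

the string literal holds whatever bytes the terminal's encoding produced: GBK bytes in a Chinese Windows console, UTF-8 bytes in most Linux terminals. The same call therefore behaves differently in the two environments.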
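The two experiments can be reproduced on one machine by producing the bytes explicitly instead of typing them into a terminal. This is a rough sketch under stated assumptions: a Chinese Windows console is assumed to send GBK bytes and a Linux terminal UTF-8 bytes, and decodes_cleanly is a hypothetical helper. In Python 2.7, raw.decode(codec) does the same job as unicode(raw, codec).

```python
# -*- coding: utf-8 -*-
TEXT = u"\u6c49\u5b57"  # the two characters 汉字


def decodes_cleanly(raw, codec):
    """Return True only if the bytes decode to TEXT with the given codec."""
    try:
        return raw.decode(codec) == TEXT
    except UnicodeDecodeError:
        return False


# Bytes the interpreter would receive for the literal "汉字" in each environment.
windows_bytes = TEXT.encode("gbk")   # assumed: Chinese Windows console
linux_bytes = TEXT.encode("utf-8")   # assumed: UTF-8 Linux terminal

for name, raw in (("gbk terminal", windows_bytes), ("utf-8 terminal", linux_bytes)):
    for codec in ("gb2312", "utf-8"):
        print("%s, decoded as %s: %s" % (name, codec, decodes_cleanly(raw, codec)))
```

Only the pairs where the codec matches the terminal's encoding succeed, which is why two people running the same line can honestly report opposite results.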
I think the key is to distinguish between the concepts of "byte" and "character", and to know a little about fonts.

A "character" can be regarded as an abstract concept. For example, when the original poster writes "汉字" (the word for "Chinese characters"), what is meant is the two abstract characters that express that idea.

To represent a character in a computer, it has to be encoded into binary data (bytes), and there are different ways to do so, such as GBK and UTF-8. For example, the two characters "汉字" are encoded as 0xBABAD7D6 in GBK and as 0xE6B189E5AD97 in UTF-8.
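The byte values quoted above can be checked directly. A small sketch (hex_bytes is an illustrative helper, not part of any library) that encodes the same two abstract characters with both encodings:

```python
# -*- coding: utf-8 -*-
import binascii

text = u"\u6c49\u5b57"  # 汉字, written with escapes so the file's own encoding does not matter


def hex_bytes(chars, encoding):
    """Encode abstract characters to bytes and return them as uppercase hex."""
    return binascii.hexlify(chars.encode(encoding)).decode("ascii").upper()


gbk_hex = hex_bytes(text, "gbk")
utf8_hex = hex_bytes(text, "utf-8")
print("GBK:   0x" + gbk_hex)
print("UTF-8: 0x" + utf8_hex)
```

The text is two characters long in both cases; only the number and value of the bytes change with the encoding.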

In the final display, abstract characters are converted into concrete images based on the fonts used.

So the original poster's first problem is this: although you see the glyphs for "汉字" on screen, the bytes in the script's source file could be in any encoding, and on Windows that is most likely GBK or GB18030. Python therefore sees a run of GBK/GB18030-encoded bytes, and when you tell it they are UTF-8, it naturally raises an error.
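The same failure can be provoked without an editor. The sketch below assumes a file saved as GBK (as an editor on Chinese Windows might do) and uses a temporary file as a stand-in for the script; read_as is a hypothetical helper.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

content = u"\u6c49\u5b57\n"  # 汉字 followed by a newline


def read_as(path, encoding):
    """Read a file, interpreting its bytes with the given encoding."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Write the text as GBK bytes, standing in for a file saved by the editor.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as handle:
        handle.write(content.encode("gbk"))

    # Claiming the bytes are UTF-8 fails, because 0xBA cannot start a UTF-8 sequence.
    try:
        read_as(path, "utf-8")
        wrong_guess_failed = False
    except UnicodeDecodeError:
        wrong_guess_failed = True

    # Naming the encoding the file was really saved in recovers the text.
    right_guess = read_as(path, "gbk")
finally:
    os.remove(path)

print("UTF-8 guess failed:", wrong_guess_failed)
print("GBK guess matches:", right_guess == content)
```

The fix in practice is to make the declared encoding and the encoding the file is actually saved in agree, in whichever direction is convenient.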

As for the second problem, I am not familiar with SQL Server, but it looks as though, when the data read from the database (bytes, possibly in a non-Unicode encoding such as GBK) was put into a unicode variable, the program wrongly interpreted those bytes as a Unicode encoding. To troubleshoot, first find out which encoding the data actually arrives in (this may depend on the encoding used when the data was stored, or on the database configuration), then check what conversions the program performs when it stores the data in the unicode variable.
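One way to start that investigation is to look at the raw bytes and see which candidate encodings decode them at all. This is only a sketch: try_decodings is a hypothetical helper, the candidate list is an assumption, and raw_from_db is a stand-in for whatever bytes the database driver actually returns, since no driver is involved here.

```python
# -*- coding: utf-8 -*-
def try_decodings(raw, candidates=("utf-8", "gb18030")):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Stand-in for bytes read from the database; here they happen to be GBK.
raw_from_db = u"\u6c49\u5b57".encode("gbk")

print(repr(raw_from_db))  # inspect the bytes before guessing
report = try_decodings(raw_from_db)
for encoding, decoded in sorted(report.items()):
    print(encoding, "failed" if decoded is None else "decoded")
```

A successful decode does not prove the guess is right, since some byte sequences are valid in several encodings, so the result should still be compared with the text that was expected.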
