Reposted from: http://www.crifan.com/python_already_got_correct_encoding_string_but_seems_print_messy_code/
Background
Character encoding in Python is genuinely a bit complicated.
On top of that, different development environments and tools display text with different logic and different results. Chinese-speaking beginners in particular most often run into:
(1) Working with Chinese characters in IDLE, the IDE that ships with Python, and getting what looks like garbage almost every time, such as: '\xd6\xd0\xce\xc4'
(2) Printing a Chinese string to the Windows cmd command line and seeing garbled text
This post therefore collects these common symptoms, explains their underlying causes, and shows how to solve them.
Background knowledge
Before going through the questions below, it helps to be familiar with the following background:
1. Basic knowledge of character encoding
If you are unfamiliar with character encodings themselves, such as UTF-8 and GBK, first read:
Character encoding
2. The default encoding of the Windows cmd is GBK
If you did not know this, first read:
Windows command-line tools: cmd
in particular the part:
Set character encoding: Simplified Chinese GBK/English
3. About IDLE
What you need to know first is:
In Python, the default character encoding follows the operating system. Most of us use a Chinese-language Windows system, where the default is GBK.
So Chinese characters typed directly into IDLE are actually GBK-encoded.
4. The design of strings in Python
This mainly concerns str and unicode in Python 2.x, bytes and str in Python 3.x, and the logic, conversions, and differences between them.
If you do not know this, first read:
[Compiled] Summary and comparison of character encodings in Python: Python 2.x str and unicode vs Python 3.x bytes and str
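As a rough illustration of point 4, here is a minimal sketch in Python 3 syntax (an assumption: the sessions in this post are Python 2, where str plays the role of bytes and unicode plays the role of str). The variable names are made up for the example.

```python
# Python 3 sketch: str holds Unicode text, bytes holds encoded data.
text = "\u4e2d\u6587"  # the two characters 中文, written as escapes

gbk_bytes = text.encode("gbk")      # bytes, comparable to a Python 2 str
utf8_bytes = text.encode("utf-8")   # a different byte sequence for the same text

print(type(text).__name__, type(gbk_bytes).__name__)  # str bytes
print(gbk_bytes)    # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)   # b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding with the matching codec returns the original text.
assert gbk_bytes.decode("gbk") == text
assert utf8_bytes.decode("utf-8") == text
```

The same two characters produce different bytes under GBK and UTF-8, which is why the encoding of the data and the encoding expected by the display have to agree.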
FAQ: IDLE shows something like '\xd6\xd0\xce\xc4' instead of the Chinese characters I want
The problem beginners run into most easily is this:
A Chinese user types Chinese into Python's bundled IDLE, and the result is displayed as something like:
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
instead of the expected Chinese characters, such as 我是中文.
The explanation for this behavior is:
You have in fact already got a correct Chinese string, "我是中文" ("I am Chinese"), encoded in the default GBK.
It is just that:
IDLE, the IDE that ships with Python and not a very good one, is only showing you the string's internal byte values in hexadecimal.
1. You can verify this with decode:
Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> "我是中文"
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK")
u'\u6211\u662f\u4e2d\u6587'
>>>
Decoding the GBK string gives a unicode string, which is displayed as:
u'\u6211\u662f\u4e2d\u6587'
Here \u6211, \u662f, \u4e2d, and \u6587 correspond to the four Chinese characters 我 ("I"), 是 ("am"), 中 and 文 (together 中文, "Chinese").
2. Some people ask: how do I know these values correspond to those four Chinese characters?
The answer is:
You are not yet familiar with Unicode and do not know how to look things up in a Unicode table.
As mentioned earlier, first read:
Character encoding
and then refer to my:
HTML-related references
to look up the Unicode values. You will find that the Unicode value of 我 ("I") is 0x6211.
The rest can be found the same way:
0x662f = 是 = \u662f
0x4e2d = 中 = \u4e2d
0x6587 = 文 = \u6587
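Instead of a lookup table, the code points can also be checked programmatically. The following is a small sketch in Python 3 syntax (an assumption, since the sessions in this post use Python 2); it prints only ASCII so that the console encoding does not interfere.

```python
import unicodedata

text = "\u6211\u662f\u4e2d\u6587"  # the four characters 我是中文

# ord() gives the code point of a character; hex() formats it like the table above.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)  # ['0x6211', '0x662f', '0x4e2d', '0x6587']

# unicodedata.name() returns the official Unicode name of each character.
for ch in text:
    print("U+%04X" % ord(ch), unicodedata.name(ch))
```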
3. Back to the question above: you can further verify that the earlier string really is GBK:
>>> "我是中文"
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK")
u'\u6211\u662f\u4e2d\u6587'
>>> "我是中文".decode("GBK").encode("GBK")
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
That is:
the hexadecimal values obtained directly from the Chinese characters and the hexadecimal values obtained by decoding as GBK and then re-encoding as GBK are the same,
which shows that the original Chinese string was indeed GBK-encoded.
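The round-trip check can be wrapped in a small helper. This is a sketch in Python 3 syntax (an assumption; the post itself uses Python 2), and `round_trips_as` is a hypothetical name chosen for the example. Note that a successful round trip is only evidence, not proof: a single-byte codec such as latin-1 round-trips any byte string.

```python
def round_trips_as(raw, encoding):
    """Return True if raw decodes under encoding and re-encodes to the same bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # The bytes are not valid in this encoding (or cannot be re-encoded).
        return False

raw = b"\xce\xd2\xca\xc7\xd6\xd0\xce\xc4"  # the byte values shown by IDLE

print(round_trips_as(raw, "gbk"))    # True
print(round_trips_as(raw, "utf-8"))  # False: not a valid UTF-8 sequence
```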
4. You can also see what the UTF-8 encoding of the same string looks like:
>>> "我是中文"
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK")
u'\u6211\u662f\u4e2d\u6587'
>>> "我是中文".decode("GBK").encode("GBK")
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK").encode("UTF-8")
'\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87'
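To compare the two encodings side by side, here is a short sketch in Python 3 syntax (an assumption, as the session above is Python 2). The dictionary name `sizes` is illustrative only.

```python
text = "\u6211\u662f\u4e2d\u6587"  # the four characters 我是中文

# Encode the same text with each codec and record how many bytes it takes.
sizes = {}
for name in ("gbk", "utf-8"):
    data = text.encode(name)
    sizes[name] = len(data)
    print(name, len(data), data)

# GBK uses 2 bytes per Chinese character here, UTF-8 uses 3.
print(sizes)  # {'gbk': 8, 'utf-8': 12}
```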
So, to summarize this question:
Chinese characters entered in IDLE are displayed as values like '\xd6\xd0\xce\xc4' instead of the expected Chinese characters.
The answer is:
The value already is the Chinese string.
It is simply encoded in the current default, GBK, and what is displayed is the internal byte values of that GBK encoding.
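The difference between the escaped display and the data itself can be sketched as follows, in Python 3 syntax (an assumption; in the Python 2 sessions above the interactive prompt echoes the repr of the str in the same way). The escaped form is only a way of writing the bytes down; the characters are recovered by decoding.

```python
raw = b"\xce\xd2\xca\xc7\xd6\xd0\xce\xc4"  # GBK bytes for the four characters

echoed = repr(raw)        # the escaped form an interactive prompt echoes
text = raw.decode("gbk")  # the actual characters, as Unicode text

print(echoed)               # b'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
print(len(raw), len(text))  # 8 4: eight bytes, four characters
print(ascii(text))          # '\u6211\u662f\u4e2d\u6587'
```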
A more thorough solution to this problem is:
Because IDLE is not very pleasant to use, users, and beginners in particular, are advised not to develop Python directly in IDLE.
Instead, it is recommended to use:
Notepad++ plus cmd
For the reasons and a detailed explanation, see:
[Compiled][Many screenshots] How to develop Python on Windows: running a Python script in cmd, how to use the Python Shell (command-line mode and GUI mode), how to use a Python IDE
More thorough still:
This kind of common mistake is one of the detours people easily take when learning Python.
If you follow my tutorials, you can avoid many such detours and also understand much of the underlying logic more easily:
Beginner:
Python Beginner's Tutorial: Getting Started
Intermediate:
Python Intermediate Tutorial: Development Summary
Advanced, by topic:
Python Topic Tutorial: Strings and Character Encodings
Python Topic Tutorial: Crawling Websites, Simulating Logins, Crawling Dynamic Web Pages
FAQ: Chinese characters printed to the command line (cmd on Windows) are displayed as garbled text
A symptom similar to the one above:
Python code prints Chinese characters to the command line, but the result is garbled.
(1) Use the following code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
-------------------------------------------------------------------------------
[Function]
[Compiled] Python has actually already got the correct Unicode or otherwise encoded characters, but they look or print as garbled text
http://www.crifan.com/python_already_got_correct_encoding_string_but_seems_print_messy_code
[Date]
2013-07-19
[Author]
Crifan Li
[Contact]
http://www.crifan.com/about/me/
-------------------------------------------------------------------------------
"""
#---------------------------------import---------------------------------------
#------------------------------------------------------------------------------
def char_ok_but_show_messy():
    """
    Demo: Python already has a correct Chinese string in some encoding,
    but printing it to the Windows cmd shows garbled text.
    """
    # This Python file is saved as UTF-8, so the str literal below holds UTF-8 bytes
    cnUtf8Char = "我是UTF-8的中文字符串"
    # Printing a UTF-8 encoded str to a GBK command line (Windows cmd) shows garbled text
    print "cnUtf8Char=", cnUtf8Char
    #cnUtf8Char= 鎴戞槸UTF-8鐨勪腑鏂囧瓧绗︿覆

    # To display the Chinese characters correctly, there are two options:
    # 1. Decode the str to unicode; when it is printed to the GBK command line,
    #    Python automatically encodes the unicode string as GBK and it displays correctly
    decodedUnicodeChar = cnUtf8Char.decode("UTF-8")
    print "decodedUnicodeChar=", decodedUnicodeChar
    #decodedUnicodeChar= 我是UTF-8的中文字符串

    # 2. Make the string's encoding match the output target (Windows cmd):
    #    re-encode the unicode obtained above as GBK, then print it to the GBK command line
    reEncodedToGbkChar = decodedUnicodeChar.encode("GBK")
    print "reEncodedToGbkChar=", reEncodedToGbkChar
    #reEncodedToGbkChar= 我是UTF-8的中文字符串

###############################################################################
if __name__ == "__main__":
    char_ok_but_show_messy()
Note:
The Python file here is saved in UTF-8.
If you are not sure what that means, see:
[Compiled] The relationship between the file encoding declared with coding and the actual encoding of the file in Python
(2) Download the code (right-click, Save As):
char_ok_but_show_messy.py
(3) Reproducing the problem
The result of running it is:
(4) Explanation
The comments in the code already explain this clearly,
so it is not repeated here.
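The garbling itself can be reproduced without a Windows console. The sketch below is in Python 3 syntax (an assumption; the script above is Python 2) and simulates a GBK console by decoding the UTF-8 bytes as GBK. `errors="replace"` is used only so that the example cannot fail on a byte pair the codec does not map.

```python
# The string from the script above: 我是UTF-8的中文字符串
text = "\u6211\u662fUTF-8\u7684\u4e2d\u6587\u5b57\u7b26\u4e32"

utf8_bytes = text.encode("utf-8")  # what the UTF-8 source file stores

# A GBK console interprets those UTF-8 bytes as GBK: garbled text.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled == text)  # False

# Option 1: decode with the right codec to get proper Unicode text.
fixed = utf8_bytes.decode("utf-8")
print(fixed == text)  # True

# Option 2: re-encode as GBK so the bytes match what the console expects.
gbk_bytes = fixed.encode("gbk")
print(gbk_bytes.decode("gbk") == text)  # True
```

Both fixes come down to the same thing: the bytes sent to the console must be in the encoding the console uses to interpret them.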
Related Posts
Earlier summaries of related Python string-encoding topics:
[Summary] Common character encoding and decoding errors in Python 2.x and their solutions
[Compiled] Tests of various scenarios in Python 3.x: automatically detecting a string's encoding and printing it correctly to cmd
[Compiled] Python has actually already got the correct Unicode or otherwise encoded characters, but they look or print as garbled text