"Grooming" Python has actually got the correct Unicode or some coded characters, but it looks or prints garbled

Source: Internet
Author: User

Reposted from: http://www.crifan.com/python_already_got_correct_encoding_string_but_seems_print_messy_code/

Background

Character encoding in Python is genuinely somewhat complicated.

On top of that, different development environments and tools display strings with different logic and different results. Chinese-speaking beginners in particular most often run into the following:

(1) Entering Chinese characters in IDLE, the IDE that ships with Python, produces what looks like garbage, such as: '\xd6\xd0\xce\xc4'

(2) Printing Chinese characters to the Windows cmd command line shows garbled text

This post therefore collects these common symptoms, explains their underlying causes, and shows how to resolve them.

Background knowledge

Before looking at the problems below, it helps to understand the relevant background knowledge first:

1. Basic knowledge of character encoding

If you are unfamiliar with character encodings themselves, such as UTF-8 and GBK, first read:

Character encoding

2. On Simplified Chinese Windows, the default encoding in cmd is GBK

If you did not know this, first read:

Windows command-line tools: cmd

in particular the section:

Set character encoding: Simplified Chinese GBK / English

3. About IDLE

What you need to know first is:

Python's default character encoding follows the operating system. Most of us use the Chinese edition of Windows, where the default is GBK.

So Chinese characters typed directly into IDLE are actually GBK-encoded.
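To see which encodings your own environment actually uses, rather than assuming GBK, a minimal sketch like the following can help. It is written for Python 3 (an assumption; the sessions in this post are Python 2.7), and the helper name describe_encodings is hypothetical. The values differ from machine to machine, so no expected output is shown.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings that influence how text is read and displayed."""
    return {
        # Encoding the OS locale prefers (typically cp936, i.e. GBK,
        # on Simplified Chinese Windows)
        "locale_preferred": locale.getpreferredencoding(False),
        # Encoding Python uses when printing to the current stdout
        # (None if stdout has been replaced by an object without one)
        "stdout": getattr(sys.stdout, "encoding", None),
        # Python's own default for implicit str <-> bytes conversion
        "python_default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, "=", value)
```

If locale_preferred and stdout disagree with the encoding of the bytes you are printing, garbled output is the likely result.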

4. The design of strings in Python

This mainly concerns str and unicode in Python 2.x versus bytes and str in Python 3.x: the logic behind them, how to convert between them, and how they differ.

If you are unfamiliar with this, first read:

[Compiled] Summary and comparison of character encodings in Python: Python 2.x str and unicode vs Python 3.x bytes and str
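The distinction can be sketched in Python 3 terms as follows. Python 3 is an assumption here; in Python 2 the same roles are played by str (bytes) and unicode (text). The variable names are illustrative only.

```python
# Text: a sequence of Unicode code points
# (str in Python 3, unicode in Python 2)
text = "我是中文"

# Bytes: the same text serialized with one specific encoding
# (bytes in Python 3, str in Python 2)
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

print(type(text).__name__, len(text))              # str 4
print(type(gbk_bytes).__name__, len(gbk_bytes))    # bytes 8
print(type(utf8_bytes).__name__, len(utf8_bytes))  # bytes 12

# Decoding with the matching encoding recovers the same text
assert gbk_bytes.decode("gbk") == utf8_bytes.decode("utf-8") == text
```

The lengths differ because these four characters take 2 bytes each in GBK and 3 bytes each in UTF-8, while the text itself is always 4 characters.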

FAQ: IDLE shows something like '\xd6\xd0\xce\xc4' instead of the Chinese characters I want

The problem beginners run into most easily is:

A Chinese user types Chinese into IDLE, which ships with Python, and the result displayed looks like:

'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'

instead of the Chinese characters they expected to see, such as:

我是中文

The explanation for this behavior is:

You have in fact already obtained a correct Chinese string, "我是中文" ("I am Chinese"), encoded in the default GBK.

It is just that:

IDLE, the IDE bundled with Python and not a very good one, is showing you the string's internal hexadecimal byte values.

1. You can verify this with decode:

Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> "我是中文"
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK")
u'\u6211\u662f\u4e2d\u6587'
>>>

Decoding the GBK string gives a Unicode string, which is displayed as:

u'\u6211\u662f\u4e2d\u6587'

Here:

\u6211, \u662f, \u4e2d and \u6587 correspond to the four Chinese characters 我, 是, 中 and 文 respectively.
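The same check can be written for Python 3, where the byte values have to be given as a bytes literal. This is a sketch under that assumption, not the session from the original post.

```python
# The byte values IDLE displayed, written as a Python 3 bytes literal
gbk_bytes = b"\xce\xd2\xca\xc7\xd6\xd0\xce\xc4"

# Decoding as GBK turns the bytes into text
text = gbk_bytes.decode("gbk")
assert text == "我是中文"

# ascii() shows the code points as escapes, much like the u'...'
# display in Python 2
print(ascii(text))  # '\u6211\u662f\u4e2d\u6587'

# print(text) would display 我是中文 on a console whose encoding
# can represent these characters.
```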

2. Some people ask: how do I know these values correspond to those four Chinese characters?

The answer is:

That is because you are not yet familiar with Unicode and do not know how to look characters up in a Unicode table.

First read the post recommended earlier:

Character encoding

and then refer to my:

HTML-related references

to look up the Unicode values. You will find that the Unicode value of 我 is 0x6211.

In the same way, you can find the rest:

0x662f = 是 = \u662f

0x4e2d = 中 = \u4e2d

0x6587 = 文 = \u6587
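As an alternative to a lookup table, Python can report the code points itself. The following Python 3 sketch (an assumption; the post uses Python 2.7) relies only on the built-ins ord and chr and the standard unicodedata module.

```python
import unicodedata

for ch in "我是中文":
    code_point = ord(ch)                     # integer Unicode code point
    escape = "\\u{:04x}".format(code_point)  # the \uXXXX form Python displays
    name = unicodedata.name(ch)              # official Unicode character name
    print(hex(code_point), escape, name)

# The reverse direction: from a code point back to the character
assert chr(0x6211) == "我"
```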

3. Returning to the question above, you can further verify that the earlier string is indeed GBK:

Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> "我是中文"
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK")
u'\u6211\u662f\u4e2d\u6587'
>>> "我是中文".decode("GBK").encode("GBK")
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'

That is:

The hexadecimal values obtained directly from the Chinese characters are identical to those obtained by decoding as GBK and then re-encoding as GBK.

This shows that the original Chinese string was indeed GBK-encoded.

4. You can also look at what the UTF-8 encoding of the same string is:

Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> "我是中文"
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK")
u'\u6211\u662f\u4e2d\u6587'
>>> "我是中文".decode("GBK").encode("GBK")
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK").encode("UTF-8")
'\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87'
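A decode only succeeds if the bytes are valid in the chosen encoding, which suggests a simple check. The sketch below (Python 3, with a hypothetical helper named can_decode) tries several candidate encodings. Note that a successful decode is evidence, not proof: the same bytes can be valid in more than one encoding.

```python
def can_decode(data, encoding):
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


gbk_bytes = b"\xce\xd2\xca\xc7\xd6\xd0\xce\xc4"

for candidate in ("ascii", "utf-8", "gbk"):
    print(candidate, can_decode(gbk_bytes, candidate))

# Re-encoding the decoded text reproduces the original bytes, while
# encoding it as UTF-8 gives a different, longer byte sequence.
text = gbk_bytes.decode("gbk")
assert text.encode("gbk") == gbk_bytes
assert text.encode("utf-8") == b"\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87"
```

Here the bytes are rejected by the ascii and utf-8 codecs and accepted by gbk, which is consistent with the conclusion reached in the sessions.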

To summarize this problem:

Entering Chinese characters in IDLE displays a value like '\xd6\xd0\xce\xc4' instead of the desired Chinese characters.

The answer is:

The value is in fact already the Chinese string.

IDLE simply shows the internal byte values of the string under the current default encoding, GBK.
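The escaped display is what Python's repr() produces for a byte string, and the interactive prompt echoes the repr() of a bare expression. A minimal Python 3 sketch of that behavior follows (Python 3 is an assumption; in Python 2 the byte string is a plain str and the display has no b prefix).

```python
gbk_bytes = "中文".encode("gbk")

# What an interactive prompt echoes for a bare expression: its repr()
print(repr(gbk_bytes))         # b'\xd6\xd0\xce\xc4'

# The bytes are intact; decoding them gives back the two characters
text = gbk_bytes.decode("gbk")
print(len(text), ascii(text))  # 2 '\u4e2d\u6587'
```

In other words, the escapes are a display form of bytes that are already correct, not a sign that the data was damaged.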

A more thorough solution to this problem is:

IDLE is not very pleasant to use, so users, especially beginners, are not recommended to develop Python directly in IDLE.

Instead, use:

Notepad++ plus cmd

For the specific reasons and explanations, see:

[Compiled] [Many screenshots] How to develop Python under Windows: running a Python script in cmd, using the Python shell (command-line mode and GUI mode), and using a Python IDE

An even more thorough approach:

This kind of common mistake is one of the detours people easily take when learning Python.

If you follow my tutorials, you can avoid many such detours and also understand much of the underlying logic more easily:

Beginner:

Python Beginner's tutorial: Getting Started

Intermediate:

Python Intermediate Tutorial: Development Summary

Advanced, topic-specific:

Python Featured Tutorials: string and character encodings

Python Tutorials: Crawling sites, simulating logins, crawling dynamic Web pages

FAQ: Chinese characters printed to the command line (cmd on Windows) are displayed garbled

A symptom similar to the one above:

Python code prints Chinese characters to the command line, but the result is garbled.

(1) Use the following code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
-------------------------------------------------------------------------------
[Function]
[Compiled] Python has already obtained the correct Unicode or otherwise-encoded string, but it looks or prints as garbled text
http://www.crifan.com/python_already_got_correct_encoding_string_but_seems_print_messy_code

[Date]
2013-07-19

[Author]
Crifan Li

[Contact]
http://www.crifan.com/about/me/
-------------------------------------------------------------------------------
"""

#---------------------------------import---------------------------------------

#------------------------------------------------------------------------------
def char_ok_but_show_messy():
    """
        Demo Python already got normal chinese char, with some encoding, but print to windows cmd show messy code
    """
    # This Python file is saved as UTF-8, so the string literal below is UTF-8 encoded
    cnUtf8Char = "我是UTF-8的中文字符串"
    # Printing the UTF-8 encoded string to a GBK command line (Windows cmd) therefore shows garbled text
    print "cnUtf8Char=",cnUtf8Char #cnUtf8Char= 鎴戞槸UTF-8鐨勪腑鏂囧瓧绗︿覆

    # To display the Chinese characters correctly, there are two options:
    # 1. Convert the string to Unicode. When it is printed to the GBK command line,
    #    Python automatically encodes the Unicode string as GBK and the characters display correctly
    decodedUnicodeChar = cnUtf8Char.decode("UTF-8")
    print "decodedUnicodeChar=",decodedUnicodeChar #decodedUnicodeChar= 我是UTF-8的中文字符串
    # 2. Make the string's encoding match that of the output target (Windows cmd):
    #    re-encode the Unicode string obtained above as GBK, which then displays correctly on the GBK command line
    reEncodedToGbkChar = decodedUnicodeChar.encode("GBK")
    print "reEncodedToGbkChar=",reEncodedToGbkChar #reEncodedToGbkChar= 我是UTF-8的中文字符串

###############################################################################
if __name__=="__main__":
    char_ok_but_show_messy()

Note:

The Python file here is saved in UTF-8.

If this is unclear, see:

[Compiled] The relationship between the declared file encoding and the file's actual encoding in Python

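(2) Download the code:

char_ok_but_show_messy.py

(3) Reproducing the problem

Running the script in cmd gives the output recorded in the comments of the code above: the first print is garbled, and the other two display the Chinese correctly.

(4) Explanation

The comments in the code already explain this clearly, so it is not repeated here.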
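The script above is Python 2. The same mechanism can be reproduced in Python 3 without a GBK console by decoding UTF-8 bytes with the wrong codec, as in this sketch. The function name show_as_gbk_console is hypothetical, and errors="replace" is used only so that the demonstration never raises.

```python
def show_as_gbk_console(raw_bytes):
    """Interpret raw bytes the way a GBK console would when displaying them."""
    return raw_bytes.decode("gbk", errors="replace")


text = "我是UTF-8的中文字符串"

# The failing case: UTF-8 bytes reach a GBK console and are misread
garbled = show_as_gbk_console(text.encode("utf-8"))

# Option 2 from the script: bytes encoded to match the console
correct = show_as_gbk_console(text.encode("gbk"))

print(ascii(garbled[:3]))  # code points of the first three garbled characters
print(correct == text)     # True
```

In Python 3, print() on a str corresponds to option 1: the interpreter encodes the text with the encoding of stdout, so explicit re-encoding is normally unnecessary.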

Related Posts

Earlier posts summarize more on Python string encoding:

[Summary] Common character encoding and decoding errors in Python 2.x and their solutions

[Compiled] Tests of various scenarios in Python 3.x: automatically identifying string encodings and printing correctly in cmd

[Compiled] Python has already obtained the correct Unicode or otherwise-encoded string, but it looks or prints as garbled text
