[Organized] Python has actually got the correct Unicode or otherwise-encoded string, but it looks or prints garbled

Last Update:2016-05-24 Source: Internet

Author: User


Transferred from: http://www.crifan.com/python_already_got_correct_encoding_string_but_seems_print_messy_code/

Background

Character encoding in Python is somewhat complicated.

On top of that, different development environments and tools display strings with different logic and different results. Chinese-speaking beginners in particular most often run into:

(1) Typing Chinese characters into IDLE, Python's bundled IDE, and getting back what looks like garbage, such as: '\xd6\xd0\xce\xc4'

(2) Printing a Chinese string to the Windows cmd command line and seeing garbled text

This article collects these common symptoms, explains their underlying causes, and shows how to solve them.

Background knowledge

Before working through the questions below, it helps to understand some background first:

1. Basic knowledge of character encoding

If you are unfamiliar with character encodings themselves, such as UTF-8 and GBK, read this first:

Character encoding

2. The default encoding of the Windows cmd is GBK

If you did not know this, read this first:

Windows command-line tools: cmd

in particular the section:

Set character encoding: Simplified Chinese GBK/English

3. About IDLE

The first thing to know is:

In Python 2.x, the encoding of text typed into the interpreter follows the operating system. Most of us use a Chinese edition of Windows, where that default is GBK.

So Chinese characters typed directly into IDLE are actually GBK-encoded.

4. The design of strings in Python

Mainly: the logic of, conversion between, and differences between str and unicode in Python 2.x, and bytes and str in Python 3.x.

If you are not familiar with this, read this first:

[Organized] Summary and comparison of character encodings in Python: Python 2.x str and unicode vs Python 3.x bytes and str
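The defaults described in points 2 and 3 can be inspected directly. The following is only a minimal sketch, written for Python 3 so that it runs on current interpreters (the article itself uses Python 2.x); the helper name report_encodings is made up for illustration, and the values it returns depend on your operating system and locale.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings the interpreter is currently using (values vary by system)."""
    return {
        # Python's own default string encoding ('ascii' on Python 2, 'utf-8' on Python 3)
        "default": sys.getdefaultencoding(),
        # Encoding preferred by the OS locale (typically cp936, i.e. GBK,
        # on a Simplified Chinese edition of Windows)
        "locale": locale.getpreferredencoding(False),
        # Encoding of the stream that print() writes to; may be None if replaced
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(name, "=", value)
```

If "stdout" reports cp936 or gbk while your source file is saved as UTF-8, you are in exactly the situation covered by the second FAQ in this article.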

FAQ: IDLE shows something like '\xd6\xd0\xce\xc4' instead of the Chinese characters I want

The problem beginners run into most easily is:

A Chinese user types Chinese text into Python's bundled IDLE, and the result displayed looks like:

'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'

instead of the Chinese characters they expected to see, such as:

我是中文

The explanation for this behavior is:

You have in fact already got the correct Chinese string, in the default GBK encoding: "我是中文" ("I am Chinese")

It's just that:

IDLE, the IDE bundled with Python (and not a very good one), is merely showing you the string's internal hexadecimal byte values.
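The difference between "the hexadecimal view" and "the characters" can be sketched as follows. This is an illustration in Python 3, where a Python 2.x str corresponds to bytes; it is an assumption-based equivalent, not what IDLE runs internally.

```python
# The GBK bytes that IDLE displayed for the text 我是中文
gbk_bytes = b"\xce\xd2\xca\xc7\xd6\xd0\xce\xc4"

# repr() is what an interactive shell echoes back: an escaped, hexadecimal view
escaped_view = repr(gbk_bytes)

# Decoding with the matching codec recovers the readable characters
readable_view = gbk_bytes.decode("gbk")

print(escaped_view)                             # b'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
print(readable_view == "\u6211\u662f\u4e2d\u6587")  # True
```

The data is identical in both views; only the way it is displayed differs.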

1. You can verify this with decode:

Decoding the GBK string yields a Unicode string, which is displayed as:

u'\u6211\u662f\u4e2d\u6587'

Here:

\u6211, \u662f, \u4e2d and \u6587 correspond to the four Chinese characters 我 ("I"), 是 ("am"), 中 and 文 (together 中文, "Chinese")

2. Some people will ask: how do I know that these values correspond to those four Chinese characters?

The answer is:

You are not yet familiar with Unicode and do not know how to look things up in a Unicode table.

Read this first:

Character encoding

And then refer to my:

HTML-related references

to look up the Unicode values. You will find that the Unicode value of 我 is 0x6211.

In the same way, you can find the rest:

0x662f = 是 = \u662f

0x4e2d = 中 = \u4e2d

0x6587 = 文 = \u6587
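Instead of a lookup table, the same values can also be checked from Python itself. A small sketch in Python 3, using only the standard ord() function and the unicodedata module; the helper name describe is made up for illustration.

```python
import unicodedata

text = "\u6211\u662f\u4e2d\u6587"  # the four characters 我是中文


def describe(char):
    """Return (code point in hex, official Unicode name) for one character."""
    return "0x%04x" % ord(char), unicodedata.name(char)


for ch in text:
    code_point, name = describe(ch)
    print(code_point, name)
# 0x6211 CJK UNIFIED IDEOGRAPH-6211
# 0x662f CJK UNIFIED IDEOGRAPH-662F
# 0x4e2d CJK UNIFIED IDEOGRAPH-4E2D
# 0x6587 CJK UNIFIED IDEOGRAPH-6587
```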

3. Returning to the question above, you can further verify that the earlier string really is GBK:

Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> "我是中文"
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
>>> "我是中文".decode("GBK")
u'\u6211\u662f\u4e2d\u6587'
>>> "我是中文".decode("GBK").encode("GBK")
'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'

That is:

The hexadecimal values obtained directly from the Chinese characters, and the hexadecimal values obtained by decoding as GBK and then encoding back to GBK, are the same.

This shows that the Chinese string above is indeed GBK-encoded.
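For readers on Python 3, the same round-trip check can be sketched like this. The helper decodes_as is a made-up name for illustration. Note that a successful decode does not strictly prove the encoding, because many byte sequences are valid under several codecs; a failed decode, however, does rule an encoding out.

```python
original = b"\xce\xd2\xca\xc7\xd6\xd0\xce\xc4"

# Round trip: GBK bytes -> Unicode text -> GBK bytes
round_trip = original.decode("gbk").encode("gbk")
print(round_trip == original)  # True


def decodes_as(data, encoding):
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(original, "gbk"))    # True
print(decodes_as(original, "utf-8"))  # False: these bytes are not valid UTF-8
```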

4. Alternatively, you can also look at what the same string looks like when encoded as UTF-8:
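As a sketch in Python 3 (the original demonstration used Python 2.x in IDLE), encoding the same four characters both ways gives completely different bytes:

```python
text = "\u6211\u662f\u4e2d\u6587"  # 我是中文

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

print(utf8_bytes)  # b'\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes)   # b'\xce\xd2\xca\xc7\xd6\xd0\xce\xc4'
print(len(utf8_bytes), len(gbk_bytes))  # 12 8
```

UTF-8 uses three bytes for each of these characters and GBK uses two, so the UTF-8 output never matches the values that IDLE displayed.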

To summarize this issue:

Entering Chinese characters in IDLE displays values like '\xd6\xd0\xce\xc4' instead of the desired Chinese characters.

The answer is:

The string already holds the Chinese characters.

Because the current default encoding is GBK, IDLE simply shows the internal byte values of the GBK encoding.

In fact, a more thorough solution to this problem is:

IDLE is not very pleasant to use, so users, especially beginners, are not advised to develop Python directly in IDLE.

Instead, it is recommended that you use:

Notepad++ plus cmd

For the specific reasons and explanations, see:

[Organized] [Many pictures] How to develop Python under Windows: running a Python script under cmd, how to use the Python Shell (command-line mode and GUI mode), how to use a Python IDE

An even more thorough approach is:

This kind of common mistake is one of the detours people easily take when learning Python.

If you follow my tutorials, you can not only avoid many detours but also understand much of the underlying logic more easily:

Beginner:

Python Beginner's tutorial: Getting Started

Intermediate:

Python Intermediate Tutorial: Development Summary

Advanced, topic-specific:

Python Featured Tutorials: string and character encodings

Python Tutorials: Crawling sites, simulating logins, crawling dynamic Web pages

FAQ: Chinese characters printed to the command line (cmd on Windows) are displayed garbled

A symptom similar to the one above is:

Python code prints a Chinese string to the command line, but the result is garbled.

(1) Use the following code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
-------------------------------------------------------------------------------
[Function]
[Organized] Python has actually got the correct Unicode or otherwise-encoded string, but it looks or prints garbled
http://www.crifan.com/python_already_got_correct_encoding_string_but_seems_print_messy_code

[Date]
2013-07-19

[Author]
Crifan Li

[Contact]
http://www.crifan.com/about/me/
-------------------------------------------------------------------------------
"""

#---------------------------------import---------------------------------------

#------------------------------------------------------------------------------
def char_ok_but_show_messy():
    """
        Demo Python already got normal chinese char, with some encoding, but print to windows cmd show messy code
    """
    # This Python file is saved as UTF-8, so the string below is UTF-8 encoded
    cnUtf8Char = "我是UTF-8的中文字符串"
    # Printing a UTF-8 encoded string to a GBK command line (Windows cmd) therefore shows garbled text
    print "cnUtf8Char=",cnUtf8Char #cnUtf8Char= 鎴戞槸UTF-8鐨勪腑鏂囧瓧绗︿覆
    # To display the Chinese characters correctly, without garbling, there are two options:
    # 1. Convert the string to Unicode. When it is output to the GBK command line,
    #    Python automatically encodes the Unicode string as GBK and the characters display correctly
    decodedUnicodeChar = cnUtf8Char.decode("UTF-8")
    print "decodedUnicodeChar=",decodedUnicodeChar #decodedUnicodeChar= 我是UTF-8的中文字符串
    # 2. Make the string's encoding match that of the output target (Windows cmd):
    #    re-encode the Unicode string obtained above as GBK, so that it displays
    #    correctly when output to the GBK command line
    reEncodedToGbkChar = decodedUnicodeChar.encode("GBK")
    print "reEncodedToGbkChar=",reEncodedToGbkChar #reEncodedToGbkChar= 我是UTF-8的中文字符串

###############################################################################
if __name__=="__main__":
    char_ok_but_show_messy()

Note:

The Python file itself is saved in UTF-8 here.

If you are not familiar with this, see:

[Organized] The relationship between the file encoding declared with coding and the actual encoding of the file in Python
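Why the encoding the file is saved in matters can be sketched as follows: the same line of source text produces different bytes on disk depending on the encoding chosen when saving. This is an illustration in Python 3 using a temporary file; the helper name saved_bytes and the sample line are made up for the example.

```python
import os
import tempfile

# One line of source code containing the Chinese text 中文
source_line = 'cn = "\u4e2d\u6587"\n'


def saved_bytes(source_text, file_encoding):
    """Save `source_text` to a temporary file in `file_encoding`; return the raw bytes on disk."""
    fd, path = tempfile.mkstemp(suffix=".py")
    os.close(fd)
    try:
        with open(path, "w", encoding=file_encoding, newline="\n") as f:
            f.write(source_text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(saved_bytes(source_line, "utf-8"))  # b'cn = "\xe4\xb8\xad\xe6\x96\x87"\n'
print(saved_bytes(source_line, "gbk"))    # b'cn = "\xd6\xd0\xce\xc4"\n'
```

In Python 2.x a plain string literal holds exactly these on-disk bytes, which is why the string in the demo script is UTF-8 encoded.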

(2) Download the code (right-click, Save As):

char_ok_but_show_messy.py

(3) Reproduce the phenomenon

The result of running it is:

(4) Explanation

The comments in the code already explain everything clearly.

There is no need to repeat it here.
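The script above is Python 2.x. For readers on Python 3, the mechanism behind the garbling can be sketched without a Windows console. This is only a simulation under an assumption: the made-up helper show_on_gbk_console stands in for a GBK cmd window by interpreting raw bytes as GBK.

```python
text = "我是UTF-8的中文字符串"


def show_on_gbk_console(data):
    """Simulate a GBK console: interpret raw bytes as GBK, replacing invalid sequences."""
    return data.decode("gbk", errors="replace")


# UTF-8 bytes interpreted as GBK: the garbled case
garbled = show_on_gbk_console(text.encode("utf-8"))

# Bytes encoded to match the console: the correct case (option 2 in the script)
correct = show_on_gbk_console(text.encode("gbk"))

if __name__ == "__main__":
    print("garbled =", garbled)
    print("correct =", correct)
```

The characters were never damaged. The garbling appears only because the bytes were interpreted with an encoding other than the one they were written in.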

More on Python string encoding has been summarized in earlier posts:

[Summary] Common character encoding and decoding errors in Python 2.x and their solutions

[Organized] Tests of various scenarios in Python 3.x showing that it automatically identifies string encodings and outputs correctly in cmd

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
