Fixing garbled Chinese output from Python programs in the Windows terminal

Last Update:2015-05-14 Source: Internet

Author: User

Problem

Recently, I moved a Python project to Windows. Chinese characters that print fine on Linux came out garbled there.

Haha, I have no love for Windows ....

Cause

Garbled output from a Python program in the Windows terminal (cmd) is a string encoding problem.

Python file encoding

By default, Python (2.x) treats script files as ASCII encoded. When a file contains characters outside the ASCII range, an encoding declaration is needed to tell Python how to read them. In a module, if the .py file contains Chinese characters (strictly speaking, any non-ASCII characters), you must put the encoding declaration on the first or second line:

# -*- coding=utf-8 -*-

#coding=utf-8

Other encodings, such as gbk and gb2312, can also be declared. Without a declaration, Python raises an error like this:

SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

In fact, Python only looks for the #, the word coding, and the encoding name; the other characters are there for looks. Python supports many encodings, each with several aliases, and the names are not case sensitive; for example, UTF-8 can be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.
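To see how little of the declaration line actually matters, here is a small sketch that applies the regular expression given in PEP 263 to a few comment lines and resolves the result with codecs.lookup. It is written for Python 3, and the helper name declared_encoding is made up for this example.

```python
import codecs
import re

# Regular expression given in PEP 263 for the encoding declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the normalized codec name declared on a comment line, or None."""
    match = CODING_RE.match(line)
    if match is None:
        return None
    # codecs.lookup resolves aliases such as u8 and ignores case.
    return codecs.lookup(match.group(1)).name


for line in ("# -*- coding=utf-8 -*-", "#coding=u8", "# coding: GBK", "# plain comment"):
    print(repr(line), "->", declared_encoding(line))
```

The decorative -*- markers are skipped by the pattern, and the alias u8 and the upper-case GBK both resolve to their canonical codec names.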

Also note that the declared encoding must match the encoding the file is actually saved in; otherwise parsing the source fails. Most IDEs handle this automatically: after you change the declaration, they save the file in the declared encoding. If you use a plain text editor, you have to take care of it yourself :)

Default encoding

You can query the default encoding with:

import sys
print sys.getdefaultencoding()

The default encoding can also be set:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

reload(sys) is required because Python removes sys.setdefaultencoding during startup (site.py deletes it after loading). Reloading the sys module brings the function back, so the call to sys.setdefaultencoding('utf-8') can take effect.

That is the foundation. Next, let's look at how strings are encoded.
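Several different encodings are in play when a program prints text, so it helps to look at all of them at once. The sketch below is written for Python 3 (where sys.getdefaultencoding() reports utf-8 rather than ascii); the function name describe_encodings is only an illustration.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings that influence how text gets printed."""
    return {
        # Encoding used by the interpreter for implicit conversions.
        "default": sys.getdefaultencoding(),
        # Encoding of the terminal or pipe; may be None for some streams.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding preferred by the operating system locale
        # (typically cp936 on a Chinese-language Windows).
        "locale": locale.getpreferredencoding(False),
    }


for name, value in describe_encodings().items():
    print(name, "=", value)
```

When the source file encoding and the stdout encoding differ, printing raw bytes from the file is what produces the garbled output.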

str and unicode

Both str and unicode are subclasses of basestring. Strictly speaking, str is a byte string: the sequence of bytes you get after encoding unicode text.
Each element of a str is an (at least) 8-bit byte, while each element of a unicode object is one Unicode character.
Therefore, len(u'中国') is 2, and len('AB') is also 2.

The str documentation says: "The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file." In other words, when you read the content of a file or data from the network, what you get is a str. To convert a str from one encoding to another, first decode it to unicode, then encode the unicode to the target encoding, such as UTF-8 or gb2312.

Calling len() on the UTF-8 encoded str '汉' returns 3, because the UTF-8 encoding of '汉' is '\xE6\xB1\x89'.
unicode is the true string type. It is what you get after decoding a byte string with the correct encoding, and len(u'汉') == 1.
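The same byte-versus-character distinction can be checked quickly in Python 3, where bytes plays the role of the Python 2 str and str plays the role of unicode. This is only a sketch of the length difference, not code from the original project.

```python
text = "汉"                    # one character of text
data = text.encode("utf-8")   # the same character as UTF-8 bytes

print(len(text))  # 1
print(len(data))  # 3
print(data)       # b'\xe6\xb1\x89'

# Two Chinese characters are two characters of text but six UTF-8 bytes.
print(len("中国"), len("中国".encode("utf-8")))  # 2 6
```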

For example:

# coding=utf-8
import sys

if __name__ == "__main__":
    print sys.getdefaultencoding()  # the default encoding is ascii
    s1 = '中文'
    print type(s1)  # str (UTF-8 encoded bytes)
    print len(s1)   # 6
    print s1        # garbled
    s2 = u'中文'
    print type(s2)  # unicode
    print len(s2)   # 2
    print s2        # exception, UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
    # s2 is unicode, the current file is encoded as UTF-8, and Python's built-in default encoding is ascii

Next, let's look at the two basestring methods encode() and decode(). Once the difference between str and unicode is clear, these two methods are no longer confusing:

Encode and decode

Some details about Python character encoding

First, there are many encodings to choose from. For Chinese there are gb2312, GBK, and gb18030 (which covers the most Chinese characters).
At present, the most widely used encoding in the world is UTF-8. (Search Baidu for garbled Chinese in Python and the answer you find is: declare coding as UTF-8 at the top of the .py file.)

# -*- coding: UTF-8 -*-

xxx.decode(encoding) decodes the byte string xxx, using the encoding given in the brackets, into unicode.
xxx.encode(encoding) encodes the unicode string xxx using the encoding given in the brackets.
(If xxx is not unicode, Python first decodes xxx with the default encoding and then performs the encoding.)

Here are some conversion examples.
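The decode-then-encode rule can be wrapped in a tiny helper. The sketch below is Python 3, and the function name transcode is an invention for this example. Note that Python 3 dropped the implicit step described above: bytes objects have no encode method at all, so the decode can never happen behind your back.

```python
def transcode(data, source, target):
    """Re-encode bytes from one encoding to another, going through text."""
    text = data.decode(source)   # bytes -> text
    return text.encode(target)   # text -> bytes


utf8_bytes = "中国".encode("utf-8")
gb_bytes = transcode(utf8_bytes, "utf-8", "gb2312")

print(len(utf8_bytes))  # 6: three bytes per character in UTF-8
print(len(gb_bytes))    # 4: two bytes per character in gb2312
```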

Unicode to gb2312, UTF-8, etc.

# coding=UTF-8

if __name__ == "__main__":
    # convert unicode to gb2312
    s = u'中国'
    s_gb = s.encode('gb2312')
    print s_gb

    # UTF-8 or GBK bytes are converted to unicode with unicode(s, encoding) or s.decode(encoding)
    s = u'中国'
    # convert the unicode s to UTF-8
    s_utf8 = s.encode('utf-8')
    assert (s_utf8.decode('utf-8') == s)
    print s_utf8.decode('utf-8')

Convert normal str to unicode

# coding=utf-8
import sys

# convert normal str to unicode
if __name__ == '__main__':
    # an exception occurs if you comment out the following lines
    print sys.getdefaultencoding()
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    print sys.getdefaultencoding()

    s = '中国'
    su = u'中国'
    print s   # garbled
    print su  # not garbled

    # decode s to unicode first
    # s lives in this .py file (# -*- coding=UTF-8 -*-), so its encoding is UTF-8
    # sys.setdefaultencoding("UTF-8") sets the default encoding to UTF-8
    s_unicode = s.decode('utf-8')
    assert (s_unicode == su)
    print s_unicode

    # to convert s to gb2312, decode it to unicode first, then encode it as gb2312
    print s.decode('utf-8').encode('gb2312')

    # what happens if you call s.encode('gb2312') directly?
    print s.encode('gb2312')

Comparison

# coding=utf-8
import sys

if __name__ == '__main__':
    s = '中国'
    print s
    # what happens if you call s.encode('gb2312') directly?
    print s.encode('gb2312')

An exception occurs here:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

Python automatically decodes s to unicode and then encodes it as gb2312. Because the decoding happens automatically and we did not specify an encoding, Python uses the default encoding reported by sys.getdefaultencoding(). In most cases that is ASCII, and if s is not ASCII, an error occurs.
In the example above, my default encoding is ascii while s is encoded the same way as the file, which is utf8, so the UnicodeDecodeError shown above is raised.
There are two ways to correct the error:

1. State the encoding of s explicitly.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
s = '中文'
s.decode('utf-8').encode('gb2312')

2. Change the default encoding to the file's encoding.

# coding=utf-8
import sys

if __name__ == '__main__':
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    s = '中国'
    print s
    # s.encode('gb2312') now works, because s is implicitly decoded as UTF-8
    print s.encode('gb2312')
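The hidden step behind that error can be reproduced by doing the ASCII decode explicitly. The sketch below is Python 3, where the implicit decode no longer exists, so it has to be spelled out; it then applies the first fix from the list above.

```python
data = "中国".encode("utf-8")  # what the Python 2 str '中国' holds in a UTF-8 file

message = None
try:
    # This is the step Python 2 performs silently before s.encode('gb2312').
    data.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Fix 1: name the real encoding of the bytes before re-encoding them.
gb_bytes = data.decode("utf-8").encode("gb2312")
print(len(gb_bytes))  # 4
```

The message printed is the same one Python 2 reports, which shows that the failure comes from the decode step, not from the gb2312 encode that was actually requested.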

Python internal encoding

First of all, we need to be clear that Python represents strings internally as unicode. Therefore, in an encoding conversion, unicode serves as the intermediate form: first decode the bytes in the other encoding into unicode, then encode the unicode into the target encoding.

In some IDEs, string output is always garbled or even wrong. The cause is usually that the IDE's output console cannot display the string's encoding, not that the program itself is at fault.

# coding=utf-8
import sys

if __name__ == "__main__":
    # The file's encoding must match the encoding passed to s.decode('utf8'),
    # otherwise a decoding exception is thrown.
    # To tolerate bad bytes, use s.decode("gbk", "ignore") or s.decode("gbk", "replace").
    print sys.getdefaultencoding()  # the default encoding is ascii
    s = '中国'

    # the file is saved as UTF-8, so no exception
    s.decode('utf8')  # decodes the UTF-8 bytes to unicode
    print s

    # exception UnicodeDecodeError: 'gbk' codec can't decode bytes in position 2-3: illegal multibyte sequence
    # s.decode('gbk')
    # print s

    # exception UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
    # s.encode('gbk')
    # print s

    print s.decode('gbk', "ignore")   # decode the UTF-8 bytes as GBK, skipping invalid sequences and showing only the valid ones
    print s.decode('gbk', 'replace')  # replace invalid sequences; the affected characters may display incorrectly

The encoding of this file must be the same as the encoding passed to s.decode('utf8'). Otherwise, an exception is thrown.
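To compare the three error handlers side by side, here is a Python 3 sketch that decodes the UTF-8 bytes of '中国' with the wrong codec (GBK) on purpose. The variable names are only for illustration, and the exact characters that survive depend on the codec tables, so the sketch inspects the results rather than printing them.

```python
data = "中国".encode("utf-8")  # UTF-8 bytes, as read from a UTF-8 file

try:
    data.decode("gbk")  # 'strict' is the default error handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("gbk", "ignore")    # drop the bytes that cannot be decoded
replaced = data.decode("gbk", "replace")  # mark them with U+FFFD instead

print(strict_failed)          # True
print("\ufffd" in ignored)    # False
print("\ufffd" in replaced)   # True
```

Neither handler recovers the original text. They only keep the program running, so they are a fallback and not a substitute for decoding with the right encoding.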

File encoding and print Functions

Create a file named test.txt in ANSI format with the following content:

Abc Chinese

Use the following python code to read

#coding=gbkprint open("Test.txt").read()

Result output

Abc Chinese

The file format into UTF-8:
Result:

Abc Juan

Obviously, decoding is required here.

# coding=gbkimport codecsprint open("Test.txt").read().decode("utf-8")

Result:

Abc Chinese

I used Editplus to edit test.txt, but when I used the notepad that came with Windows to edit and coexist in UTF-8 format,
Running error:

Traceback (most recent call last):  File "ChineseTest.py", line 3, in <module>    print open("Test.txt").read().decode("utf-8")UnicodeEncodeError: 'gbk' codec can't encode character u'/ufeff' in position 0: illegal multibyte sequence

Originally, some software, such as notepad, will insert three invisible characters (0xEF 0xBB 0xBF, BOM) at the beginning of the file when saving a file encoded in UTF-8 ).
Therefore, we need to remove these characters during reading. The codecs module in python defines this constant:

# coding=gbkimport codecsdata = open("Test.txt").read()if data[:3] == codecs.BOM_UTF8: data = data[3:]print data.decode("utf-8")

Result:

Abc Chinese

Summary

Ways to output Chinese in the Windows terminal

State the encoding of s explicitly when outputting Chinese characters

# coding=utf-8
import sys

if __name__ == "__main__":
    # reload(sys)
    # sys.setdefaultencoding("UTF-8")

    # the following two statements are equivalent
    print "中文".decode("UTF-8")
    print u"中文"
    assert ("中文".decode("UTF-8") == u"中文")

    # the following two statements are equivalent
    print "中文".decode("UTF-8").encode("GBK")  # decode the UTF-8 bytes to unicode, then encode as GBK
    print u"中文".encode("GBK")  # u"XXX" defines a unicode string, which is encoded directly as GBK and then output

    # same as above
    print "中文".decode("UTF-8").encode('gb2312')
    print u"中文".encode('gb2312')
    print "中文".decode("UTF-8").encode('cp936')
    print u"中文".encode('cp936')
    print u'中文'.encode('utf-8').decode('utf-8')

    # calling s.encode('gb2312') directly raises an exception
    print "中国".encode('gb2312')

Change the default encoding to the file's encoding when outputting Chinese characters

The output methods above also work with this approach.

# coding=utf-8
import sys

if __name__ == "__main__":
    # the default encoding is not UTF-8 at first
    print sys.getdefaultencoding()
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    print sys.getdefaultencoding()

    # s.encode('gb2312') can now be called directly
    print "中国".encode('gb2312')

Summary

Solutions:

1. Write literals as u'中国' so that they are unicode from the start. They are decoded using the encoding declared by # coding at the top of the file. Always write the # coding declaration and make it match the encoding the file is saved in, because the operating system encoding and the source file encoding are often different. This method is recommended.

2. State the decoding explicitly at output time: print '中文'.decode("utf8"). The encoding passed to decode must match the encoding the file is saved in, whatever the # coding declaration says.

3. If you change both # coding and the saved file encoding to match the operating system's encoding, you can print '中文' directly and get normal output. This is not recommended: you need to know the operating system encoding in advance, and errors occur when the script is copied to a computer whose encoding is different.
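The recommended approach (keep text as unicode inside the program and encode it only at the output boundary) could be sketched like this in Python 3. The helper name encode_for_console and the UTF-8 fallback are assumptions made for this example, not something from the original project.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text for a console, replacing characters it cannot display."""
    if encoding is None:
        # Assumption: fall back to UTF-8 when the stream reports no encoding.
        encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="replace")


print(encode_for_console("中国", "gbk") == "中国".encode("gbk"))  # True
print(encode_for_console("中国", "ascii"))                        # b'??'
```

With errors="replace", a console that cannot show a character gets a question mark instead of the program stopping with a UnicodeEncodeError.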

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
