When handling Chinese text in Python (string processing, reading and writing files, printing), most people react to garbled output by sprinkling encode/decode calls around until the symptom goes away, without ever asking why the text was garbled in the first place. This post looks at how to deal with encoding problems systematically.
Note: the discussion below applies to Python 2.x; it has not been tested under Python 3.
The errors seen most frequently while debugging
Error 1
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
```
Error 2
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
```
First of all

You need a rough mental model of character sets and character encodings:

ASCII | Unicode | UTF-8 | etc.

For background, see any notes on character encodings covering ASCII, Unicode and UTF-8.
str and unicode

Both str and unicode are subclasses of basestring, so there is a simple way to test whether a value is a string of either kind:

```python
def is_str(s):
    return isinstance(s, basestring)
```
Converting between str and unicode

str --decode('the_coding_of_str')--> unicode
unicode --encode('the_coding_you_want')--> str
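As a sketch of this round trip, the example below uses a bytes literal (the UTF-8 bytes of u'\u4e2d\u6587') so it behaves the same under Python 2 and Python 3, where bytes plays the role of Python 2's str:

```python
raw = b'\xe4\xb8\xad\xe6\x96\x87'  # a byte string: the UTF-8 encoding of u'\u4e2d\u6587'

u = raw.decode('utf-8')            # str/bytes -> unicode, using the bytes' own encoding
back = u.encode('utf-8')           # unicode -> str/bytes, using the encoding you want

assert len(u) == 2                 # two characters...
assert len(back) == 6              # ...but six bytes
assert back == raw
```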
Difference

str is a byte string: the sequence of bytes you get by encoding a unicode string.
How to declare one:

```python
>>> s = '中文'  # equivalent to s = u'中文'.encode('utf-8') on a UTF-8 console
>>> type('中文')
<type 'str'>
```

Length (returns the number of bytes):

```python
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(u'中文'.encode('utf-8'))
6
```
unicode is the real text string, made up of characters.

How to declare one:

```python
>>> s = u'中文'
>>> s = '中文'.decode('utf-8')
>>> s = unicode('中文', 'utf-8')
>>> type(u'中文')
<type 'unicode'>
```

Length (returns the number of characters, which is usually what your logic actually wants):

```python
>>> u'中文'
u'\u4e2d\u6587'
>>> len(u'中文')
2
```
Conclusion

Work out whether you are dealing with a str or a unicode, and use the matching method (str.decode / unicode.encode).

How to tell unicode from str:

```python
>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False
>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False
```
Simple rule: never call encode on a str and never call decode on a unicode. (Strictly speaking str can be encoded, see the end of this post, but to keep things simple it is not recommended.)

```python
>>> '中文'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u'中文'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
```
To convert between two encodings, use unicode as the intermediate representation:

```python
# s is a str encoded as code_A
s.decode('code_A').encode('code_B')
```
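For instance, converting the same text from GBK bytes to UTF-8 bytes (a sketch using escaped literals, so it runs unchanged on both Python 2 and Python 3):

```python
u = u'\u4e2d\u6587'        # the text itself, as unicode
s_gbk = u.encode('gbk')    # the GBK byte string '\xd6\xd0\xce\xc4'

# GBK -> unicode -> UTF-8
s_utf8 = s_gbk.decode('gbk').encode('utf-8')

assert s_utf8 == b'\xe4\xb8\xad\xe6\x96\x87'
```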
Files, the IDE and the console

Think of the program as a pool of water with an entrance and an exit. At the entrance, everything is decoded to unicode; inside the pool, all processing uses unicode; at the exit, the text is encoded into the target encoding (with the occasional exception where the processing logic genuinely needs a specific byte encoding).

Read a file -> external input in some encoding -> decode to unicode (one uniform internal representation) -> process -> encode to the desired target encoding -> write to the target output (a file or the console).
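One way to sketch that pipeline is with io.open, which decodes on read and encodes on write under both Python 2 and Python 3 (the file name and the choice of GBK here are arbitrary, for illustration only):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # a throwaway example file

# exit: the unicode text is encoded to the target encoding (GBK) as it is written
with io.open(path, 'w', encoding='gbk') as f:
    f.write(u'\u4e2d\u6587')

# entrance: the external GBK bytes are decoded back to unicode as they are read
with io.open(path, 'r', encoding='gbk') as f:
    text = f.read()

assert text == u'\u4e2d\u6587'  # inside the program, everything stays unicode
```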
Garbled output in an IDE or console happens when the encoding of the output does not match the encoding the IDE or console expects. Convert the output to a matching encoding and it displays correctly. For example, on a UTF-8 console:

```python
>>> print u'中文'.encode('gbk')
????
>>> print u'中文'.encode('utf-8')
中文
```
Suggestions

Unify your encodings

Use one encoding throughout, so that garbled characters cannot creep in at any single link of the chain: the environment encoding, the IDE/text editor, the file encodings, and the database and table encodings.
Mind the encoding of your source code files

This is important. A .py file's default encoding is ASCII. If a source file uses non-ASCII characters, its encoding must be declared in a comment on the first or second line of the file; entering non-ASCII characters without the declaration raises an error:

```
File "xxx.py", line 3
SyntaxError: Non-ASCII character '\xd6' in file c.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
```
How to declare it:

```python
# -*- coding: utf-8 -*-
```

or `# coding=utf-8`.

If the header declares coding=utf-8, then a = '中文' is encoded as UTF-8; if it declares coding=gb2312, then a = '中文' is encoded as GBK. So the headers of all source files in one project should declare the same encoding, and the declared encoding must match the actual encoding of the source file (which is determined by your editor).
Write hard-coded strings in the source as unicode

This isolates the strings' type from the encoding of the source file itself, so every place in the program can use them without worrying about encodings:

```python
if s == u'中文':  # not s == '中文'
    pass
# note: by the time s reaches this comparison, make sure it has been converted to unicode
```
After the steps above, you only need to care about two encodings: unicode and the target encoding you settled on (typically UTF-8).
Processing order

1. Decode early
2. Unicode everywhere
3. Encode later
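The three steps can be sketched as a small function skeleton (names are made up for the example; bytes stands in for Python 2's str so the snippet runs on either version):

```python
def process(raw, src_encoding, dst_encoding):
    text = raw.decode(src_encoding)   # 1. decode early, at the entrance
    text = text.strip()               # 2. unicode everywhere in the middle
    return text.encode(dst_encoding)  # 3. encode later, at the exit

# UTF-8 bytes with padding in, GBK bytes out
result = process(b'  \xe4\xb8\xad\xe6\x96\x87  ', 'utf-8', 'gbk')
assert result == b'\xd6\xd0\xce\xc4'
```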
Related modules and methods

Getting and setting the system default encoding

```python
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)   # reload is needed because site.py removes setdefaultencoding at startup
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
```

str.encode('other_coding')

In Python 2, calling encode directly on a str that already carries some encoding converts it to another encoding's str, via an implicit decode:

```python
# str_a is UTF-8 encoded; this call:
str_a.encode('gbk')
# actually performs:
str_a.decode(sys_codec).encode('gbk')
# where sys_codec is the sys.getdefaultencoding() from the section above
```

That is how "getting and setting the system default encoding" relates to str.encode. I rarely rely on it, mainly because the complexity is hard to control; an explicit decode on input and an explicit encode on output are easier to get right.
chardet

A third-party module that detects the encoding of a byte string; it needs to be downloaded and installed separately.

```python
>>> import chardet
>>> f = open('test.txt', 'r')
>>> result = chardet.detect(f.read())
>>> result
{'confidence': 0.99, 'encoding': 'utf-8'}
```
Turning a literal \u string into the corresponding unicode string

```python
>>> u'中'
u'\u4e2d'
>>> s = '\u4e2d'   # in Python 2 this str holds six characters, not the character itself
>>> print s.decode('unicode_escape')
中
>>> a = '\\u4fee\\u6539\\u8282\\u70b9\\u72b6\\u6001\\u6210\\u529f'
>>> a.decode('unicode_escape')
u'\u4fee\u6539\u8282\u70b9\u72b6\u6001\u6210\u529f'
```
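The same trick can be written with codecs.decode, which accepts a string argument under both Python 2 and Python 3 (a version-agnostic sketch of the unicode_escape codec):

```python
import codecs

s = '\\u4e2d'                            # six literal characters: \ u 4 e 2 d
u = codecs.decode(s, 'unicode_escape')   # -> the single character U+4E2D

assert u == u'\u4e2d'
assert len(u) == 1
```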