A good article on str and Unicode
This post sorts out how Python handles strings and encodings.
Note: the discussion below applies to Python 2.x; Python 3 has yet to be tried.
Let's begin.
When handling Chinese text in Python (reading files, messages, HTTP parameters, and so on),
you run the code and the output comes out garbled, whether in string processing, file reading/writing, or printing.
At that point, most people's reaction is to sprinkle encode/decode calls around to debug, without thinking through why the text got garbled in the first place.
As a result, these are the most common errors that come up while debugging:
Error 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
Error 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
First of all
You need a general picture of character sets and character encodings:
ASCII | Unicode | UTF-8 | etc.
Recommended reading:
Character encoding notes: ASCII, Unicode and UTF-8
Taobao Search Technology Blog: Chinese character encoding
str and unicode
Both str and unicode are subclasses of basestring.
So a check for "is this a string?" can be written as:
def is_str(s):
    return isinstance(s, basestring)
Conversion between str and unicode
decode documentation
encode documentation
str -> decode('the_coding_of_str') -> unicode
unicode -> encode('the_coding_you_want') -> str
Difference
str is a byte string, made up of the bytes produced by encoding a unicode string.
How to declare one
s = '中文'
s = u'中文'.encode('utf-8')

>>> type('中文')
<type 'str'>
Length (returns the number of bytes)
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(u'中文'.encode('utf-8'))
6
unicode is the real string type, made up of characters.
How to declare one
s = u'中文'
s = '中文'.decode('utf-8')
s = unicode('中文', 'utf-8')

>>> type(u'中文')
<type 'unicode'>
Length (returns the number of characters), which is the intuitive one
>>> u'中文'
u'\u4e2d\u6587'
>>> len(u'中文')
2
Conclusion
Figure out whether you are dealing with a str or a unicode, and use the right method on it (str.decode / unicode.encode).
Here is how to tell whether something is a unicode or a str:
>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False
>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False
A simple rule: do not call encode on a str, and do not call decode on a unicode (strictly speaking, str does have an encode method, covered further down, but to keep things simple it is better avoided).
>>> '中文'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

>>> u'中文'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
To convert between two different encodings, use unicode as the intermediate form:
# s is a str encoded in code_A
s.decode('code_A').encode('code_B')
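For a concrete illustration, here is a minimal sketch (the sample string and the GBK/UTF-8 pairing are just placeholders) that converts a GBK str to a UTF-8 str via unicode:

# -*- coding: utf-8 -*-
# Minimal sketch: convert a GBK-encoded str to a UTF-8 str,
# going through unicode as the intermediate representation.
gbk_bytes = u'中文'.encode('gbk')        # pretend these bytes came from a GBK source
utf8_bytes = gbk_bytes.decode('gbk').encode('utf-8')
print repr(utf8_bytes)                   # '\xe4\xb8\xad\xe6\x96\x87'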
File handling, the IDE and the console
For the processing flow, it helps to picture Python as a pool of water with an entrance and an exit.
At the entrance, everything is converted to unicode; inside the pool, everything is handled as unicode; at the exit, it is converted to the target encoding (there are exceptions, of course, where the processing logic has to work with a specific encoding).
Read the file or external input in whatever encoding it arrives in, decode it to unicode for processing (the internal encoding is uniformly unicode), then encode it to the target output (a file or the console); see the sketch below.
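Here is a small sketch of that entrance/pool/exit flow, assuming a UTF-8 input file named in.txt and a GBK output file named out.txt (the file names and both encodings are only placeholders):

# -*- coding: utf-8 -*-
# Sketch: decode at the entrance, work in unicode inside, encode at the exit.
import codecs

# Entrance: read bytes and decode them to unicode as early as possible.
with codecs.open('in.txt', 'r', encoding='utf-8') as f:
    text = f.read()                     # already a unicode object

# Pool: all processing happens on unicode.
text = text.strip()

# Exit: encode to the target encoding only when writing out.
with codecs.open('out.txt', 'w', encoding='gbk') as f:
    f.write(text)                       # codecs handles the encode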
Errors in the IDE or the console happen when the encoding you print with does not match the encoding the IDE or console itself uses.
Convert to an encoding that matches the console and the output displays normally:
>>> print u'中文'.encode('gbk')
????
>>> print u'中文'.encode('utf-8')
中文
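If you do not know which encoding the console uses, one possible approach (a sketch, not the only way) is to ask sys.stdout and fall back to UTF-8 when it reports nothing, for example when output is piped:

# -*- coding: utf-8 -*-
# Sketch: pick the console's own encoding when printing unicode.
import sys

def console_print(u):
    enc = sys.stdout.encoding or 'utf-8'    # sys.stdout.encoding can be None when piped
    print u.encode(enc, 'replace')          # 'replace' avoids crashing on unmappable characters

console_print(u'中文')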
Suggestions
Standardize the encodings
Use one consistent encoding throughout, so that no single link in the chain produces garbled text.
That includes the environment encoding, the IDE/text editor encoding, the file encoding, and the database table encoding.
Declare the source file encoding
This is important.
A .py file defaults to ASCII encoding, so if the source file contains non-ASCII characters, the encoding must be declared at the top of the file.
The declaration has to sit on the first or second line of the file; if it is missing and the file contains non-ASCII characters, you get an error like this:
File "xxx.py", line 3 syntaxerror:non-ascii character ' \xd6 ' in file c.py on line 3, but no encoding declared; Http://www.python.org/peps/pep-0263.html for details
How to declare it
# -*- coding: utf-8 -*-    or    # coding=utf-8
If the header declares coding=utf-8, then for a = '中文' the str a is encoded in UTF-8.
If the header declares coding=gb2312, then for a = '中文' the str a is encoded in GBK.
So, use one and the same encoding declaration across all source files in a project, and make sure the declared encoding matches the encoding the file is actually saved in (that part depends on your editor); a small sketch of the effect follows.
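As an illustration, here is a sketch of a file saved as UTF-8 with the matching declaration (the GBK byte values in the comment are shown only for comparison):

# -*- coding: utf-8 -*-
a = '中文'
print repr(a)    # '\xe4\xb8\xad\xe6\x96\x87' -- the UTF-8 bytes, so len(a) == 6
# If the header (and the file on disk) were GBK instead, the same literal
# would hold the 4 GBK bytes '\xd6\xd0\xce\xc4', and len(a) would be 4.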
For string literals hard-coded in the source and used in processing, use unicode uniformly.
That decouples their type from the encoding of the source file itself, so they stay independent and are easy to handle at every point in the program.
if s == u'中文':  # rather than s == '中文'
    pass
# note: make sure s has already been converted to unicode by the time it reaches here
With the steps above in place, you only need to keep track of two things: unicode, and the encoding you chose for input/output (usually UTF-8).
Processing order
1. Decode early
2. Unicode everywhere
3. Encode late
Related modules and some methods
Get and set system default encoding
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
str.encode('other_coding')
In Python 2 you can call encode directly on a str that is already in some encoding, converting it to a str in another encoding.
# str_a is a UTF-8 str
str_a.encode('gbk')

# what actually happens is:
str_a.decode(sys_codec).encode('gbk')
# where sys_codec is the encoding returned by sys.getdefaultencoding() in the previous section
So the "get and set the system default encoding" section above is related to this use of str.encode. I rarely use it myself, because it feels complicated and hard to control; an explicit decode on input and an explicit encode on output is easier (personal view). A sketch follows.
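To show the connection, here is a sketch (for illustration only; as noted above, changing the default encoding this way is not a recommended habit):

# -*- coding: utf-8 -*-
# Sketch: str.encode performs an implicit decode using the system default encoding.
import sys

utf8_str = u'中文'.encode('utf-8')

try:
    utf8_str.encode('gbk')              # implicit .decode('ascii') fails on byte 0xe4
except UnicodeDecodeError as e:
    print 'implicit ascii decode failed:', e

reload(sys)
sys.setdefaultencoding('utf-8')         # now the implicit decode uses utf-8
print repr(utf8_str.encode('gbk'))      # '\xd6\xd0\xce\xc4'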
chardet
Detects the encoding of a file (or any byte string); it is a third-party module that needs to be downloaded separately.
>>> import chardet
>>> f = open('test.txt', 'r')
>>> result = chardet.detect(f.read())
>>> result
{'confidence': 0.99, 'encoding': 'utf-8'}
Converting a literal '\u' escape string to the corresponding unicode string
>>> u'中'
u'\u4e2d'
>>> s = '\u4e2d'
>>> print s.decode('unicode_escape')
中
>>> a = '\\u4fee\\u6539\\u8282\\u70b9\\u72b6\\u6001\\u6210\\u529f'
>>> a.decode('unicode_escape')
u'\u4fee\u6539\u8282\u70b9\u72b6\u6001\u6210\u529f'
That covers how Python handles strings and encodings; related material will be added in follow-ups. Thanks for your support of this site!