In this article, the character 哈 ("ha") is used as the running example to explain all the problems. Its various encodings are:
1. Unicode code point: U+54C8 (UTF-16LE bytes: C8 54);
2. UTF-8: E5 93 88;
3. GBK: B9 FE.
1. str and unicode in Python
Chinese text encoding has long been a big problem in Python, frequently throwing encoding-conversion exceptions. So what exactly are str and unicode in Python?
In Python, "unicode" generally refers to a unicode object; for example, the unicode object for 哈哈 ("haha") is
u'\u54c8\u54c8'
A str, on the other hand, is a byte array: it represents the stored form of a unicode object after encoding (which can be UTF-8, GBK, cp936, GB2312, and so on). By itself it is just a byte stream with no other meaning; if you want that byte stream to display meaningful content, you must decode it with the correct codec.
For example:
Encode the unicode object 哈哈 into a UTF-8 str s_utf8. s_utf8 is a byte array storing '\xe5\x93\x88\xe5\x93\x88', but it is only a byte array; if you expect the print statement to output 哈哈 from it, you will be disappointed. Why?
Because print hands its content to the operating system, and the operating system renders the incoming byte stream according to its own encoding. That explains why printing the UTF-8 string 哈哈 displays 鍝堝搱: the bytes '\xe5\x93\x88\xe5\x93\x88' interpreted as GB2312/GBK display as 鍝堝搱. To repeat: a str records a byte array in some encoded storage format; what comes out when it is written to a file or printed depends entirely on which codec the consumer uses to decode it.
One more note about print: when you pass a unicode object to print, it is converted internally using the local default encoding (this is just a guess).
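The byte-level facts above can be checked directly. The article targets Python 2, where these byte strings have type str; the sketch below uses Python 3, where the same values have type bytes, but the encodings themselves are identical:

```python
# Python 3 sketch (in Python 2 these byte strings are type `str`;
# in Python 3 they are `bytes` -- the byte values are the same).

s_utf8 = '\u54c8\u54c8'.encode('utf-8')  # UTF-8 bytes of 哈哈
print(repr(s_utf8))                      # b'\xe5\x93\x88\xe5\x93\x88'

# Rendering those UTF-8 bytes with the wrong codec (GBK) gives mojibake:
print(s_utf8.decode('gbk'))              # 鍝堝搱
```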
2. Converting between str and unicode
str and unicode objects are converted into each other through encode and decode. Concretely: convert the GBK str 哈哈 to unicode, then convert it to UTF-8.
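This round-trip can be sketched as follows. The article's original demo is Python 2; in Python 3 terms, Python 2's s.decode('gbk') on a str corresponds to bytes.decode, and unicode.encode to str.encode:

```python
# Python 3 sketch of the GBK -> unicode -> UTF-8 round-trip.
s_gbk = b'\xb9\xfe\xb9\xfe'        # 哈哈 encoded as GBK
u = s_gbk.decode('gbk')            # the unicode object u'\u54c8\u54c8'
s_utf8 = u.encode('utf-8')         # b'\xe5\x93\x88\xe5\x93\x88'
print(repr(u))
print(repr(s_utf8))
```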
3. setdefaultencoding
In the conversion above, encoding s (a GBK str) directly into UTF-8 throws an exception, but after invoking the following code:
import sys
reload(sys)
sys.setdefaultencoding('gbk')
the conversion succeeds. Why? When Python encodes and decodes between str and unicode, encoding a str directly into another encoding first decodes the str to unicode using the default encoding, which is normally ascii. That is why the first conversion in the example above raises an error; once the current default encoding is set to 'gbk', there is no error.
As for reload(sys): Python 2.5 deletes the sys.setdefaultencoding method after initialization, so we need to reload it.
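Python 3 removed setdefaultencoding entirely, but the failure mode can still be reproduced. The explicit call below is a sketch imitating the implicit step Python 2 performed, not the article's original demo:

```python
# Sketch: what Python 2 did implicitly when you called encode() on a
# GBK str -- first decode with the ascii default codec, which fails.
s_gbk = b'\xb9\xfe\xb9\xfe'        # 哈哈 encoded as GBK
try:
    s_gbk.decode('ascii')          # the hidden step behind s_gbk.encode('utf-8')
except UnicodeDecodeError as e:
    print(e)                       # 0xb9 is outside the ASCII range
```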
4. Operating on files in different encoding formats
Create a file test.txt in ANSI format with the content:
abc中文
Read it with Python:
# coding=gbk
print open("test.txt").read()
Result: abc中文
Change the file format to UTF-8:
Result: abc followed by mojibake (the UTF-8 bytes of 中文 rendered as GBK)
Obviously, this needs decoding:
# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")
Result: abc中文
I edited the test.txt above with EditPlus, but when I instead save it in UTF-8 format with Windows Notepad, running the script errors:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it in UTF-8.
So we need to strip those bytes when reading; Python's codecs module defines this constant:
# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: abc中文
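The same BOM stripping in Python 3 looks like the sketch below. The file is simulated with an in-memory byte string so the example is self-contained:

```python
import codecs

# Simulate what Notepad writes: a UTF-8 BOM followed by the UTF-8 text.
data = codecs.BOM_UTF8 + 'abc\u4e2d\u6587'.encode('utf-8')

# Strip the BOM before decoding, as in the article's snippet.
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print(data.decode('utf-8'))        # abc中文
```

Alternatively, in Python 3, open(path, encoding='utf-8-sig') decodes the file and drops a leading BOM automatically.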
5. The source file's encoding format and the role of the coding declaration
What effect does the source file's encoding format have on the declaration of string literals? This problem bothered me for a long time, and now it is finally a bit clearer. The encoding format of the file determines the encoding of the str literals declared in that source file. For example:
str = '哈哈'
print repr(str)
a. If the file format is UTF-8, the value of str is '\xe5\x93\x88\xe5\x93\x88' (哈哈 in UTF-8);
b. If the file format is GBK, the value of str is '\xb9\xfe\xb9\xfe' (哈哈 in GBK).
As stated in the first section, a str in Python is just a byte array, so when the str from case (a) is output to a GBK-encoded console, it displays as the mojibake 鍝堝搱; and when the str from case (b) is output to a UTF-8-encoded console, it is garbled too, only there is nothing to see: '\xb9\xfe\xb9\xfe' rendered with UTF-8 decoding probably comes out blank. >_<
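The two byte values can be checked directly. In this Python 3 sketch, encoding a (unicode) string literal reproduces the byte arrays a Python 2 source file would store in each case:

```python
# The byte values of 哈哈 under each source-file encoding.
print(repr('\u54c8\u54c8'.encode('utf-8')))  # case a: b'\xe5\x93\x88\xe5\x93\x88'
print(repr('\u54c8\u54c8'.encode('gbk')))    # case b: b'\xb9\xfe\xb9\xfe'
```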
Having covered file formats, let's now talk about the role of the coding declaration. At the very top of each file, a statement like # coding=gbk is used to declare the encoding, but what is this statement actually for? As far as I can tell, it serves three purposes:
- It declares that non-ASCII characters, usually Chinese, will appear in the source file;
- In advanced IDEs, it makes the IDE save your file in the encoding format you specify;
- It determines the codec used to decode literals such as u'哈' in the source into unicode, which is also the most confusing point. Look at this example:
#coding: gbk
ss = u'哈哈'
print repr(ss)
print 'ss:%s' % ss
Save this code as UTF-8 text and run it. What do you think it will output? Everyone's first instinct is surely:
u'\u54c8\u54c8'
ss:哈哈
But the output is actually:
u'\u935d\u581d\u6431'
ss:鍝堝搱
Why is that? This is the coding declaration making mischief. When ss = u'哈哈' runs, the whole process can be divided into the following steps:
1. Get the bytes of 哈哈: these are determined by the file's encoding format, namely '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 encoded form of 哈哈).
2. Convert to unicode: in this conversion, '\xe5\x93\x88\xe5\x93\x88' is decoded not with UTF-8 but with the codec specified by the coding declaration, GBK. Decoding '\xe5\x93\x88\xe5\x93\x88' as GBK yields 鍝堝搱, and the Unicode code points of those three characters are u'\u935d\u581d\u6431', which explains why print repr(ss) outputs u'\u935d\u581d\u6431'.
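The two steps above can be replayed directly. In this Python 3 sketch, the explicit decode('gbk') call stands in for what the #coding: gbk declaration makes the interpreter do:

```python
# Step 1: the literal's bytes are fixed by the file's encoding (UTF-8 here).
literal_bytes = '\u54c8\u54c8'.encode('utf-8')   # b'\xe5\x93\x88\xe5\x93\x88'

# Step 2: the interpreter decodes them with the *declared* codec, GBK.
ss = literal_bytes.decode('gbk')
print(repr(ss))   # the three characters 鍝堝搱 (U+935D U+581D U+6431)
```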
Okay, that was a bit of a detour. Now let's analyze the next example:
# -*- coding: utf-8 -*-
ss = u'哈哈'
print repr(ss)
print 'ss:%s' % ss
This example is saved in GBK encoding, and running it surprisingly gives:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: unexpected code byte
Why a UTF-8 decode error here? Thinking back to the previous example makes it clear: in the first conversion step, because the file encoding is GBK, the bytes obtained for 哈 are its GBK encoding '\xb9\xfe\xb9\xfe'. In the second step, converting to unicode, UTF-8 is used to decode '\xb9\xfe\xb9\xfe'; checking the UTF-8 coding table (for an explanation of UTF-8, see the character-encoding notes on ASCII, UTF-8, and Unicode) shows that this byte sequence does not exist in it, so the error above is reported.
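This failing case can likewise be replayed. A Python 3 sketch of the two steps for a GBK-saved file carrying a UTF-8 coding declaration:

```python
# Step 1: saved as GBK, the literal's bytes are the GBK encoding of 哈哈.
literal_bytes = '\u54c8\u54c8'.encode('gbk')     # b'\xb9\xfe\xb9\xfe'

# Step 2: the declared codec, UTF-8, cannot decode those bytes.
try:
    literal_bytes.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)   # can't decode byte 0xb9 in position 0
```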