Throughout this article, the Chinese character "哈" ("ha") is used as the running example to explain all of the problems. Its various encodings are as follows:
1. Unicode code point: U+54C8 (UTF-16LE bytes: C8 54);
2. UTF-8: E5 93 88;
3. GBK: B9 FE.
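These byte values can be checked directly. A small Python 3 sketch (the article's own examples use Python 2 syntax):

```python
# Verify the encodings of "哈" listed above (Python 3).
ha = "哈"
print(f"U+{ord(ha):04X}")            # U+54C8
print(ha.encode("utf-16-le").hex())  # c854
print(ha.encode("utf-8").hex())      # e59388
print(ha.encode("gbk").hex())        # b9fe
```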
I. str and unicode in Python
For a long time, handling Chinese text in Python has been a major source of trouble; code-conversion exceptions get thrown constantly. So what exactly are str and unicode in Python?
unicode in Python usually refers to a unicode object.
For example, for '哈哈' the unicode object is:
- u'\u54c8\u54c8'
str is a byte array; it represents the stored form of a unicode object after encoding (which can be UTF-8, GBK, cp936, GB2312, and so on).
By itself it is just a byte stream with no further meaning; to make this byte stream display meaningful content, you must decode it with the correct encoding format.
For example:
- su = u'哈哈'
- s_utf8 = su.encode('utf-8')
- s_gbk = su.encode('gbk')
- print s_utf8
- print s_gbk
Here su is a unicode object;
s_utf8 is a byte array holding the UTF-8-encoded bytes of the unicode object: '\xe5\x93\x88\xe5\x93\x88';
similarly, s_gbk holds the GBK-encoded bytes of the unicode object.
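Since the Python 2 snippet above is hard to run today, here is a Python 3 sketch of the same example, where str plays the role of Python 2's unicode and bytes plays the role of Python 2's str:

```python
# Python 3 analogue of the su / s_utf8 / s_gbk example:
# str here is the Unicode text type (Python 2's unicode),
# and bytes is the byte array (Python 2's str).
su = "哈哈"
s_utf8 = su.encode("utf-8")  # b'\xe5\x93\x88\xe5\x93\x88'
s_gbk = su.encode("gbk")     # b'\xb9\xfe\xb9\xfe'
print(s_utf8)
print(s_gbk)
```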
In the print statements above, why does print s_utf8 produce mojibake while print s_gbk displays the Chinese correctly?
Because print simply hands the bytes to the operating system, and the OS renders the byte stream according to the system encoding. This explains why the UTF-8 format string '哈哈' is output as '鍝堝搱': the bytes '\xe5\x93\x88\xe5\x93\x88' are being interpreted as GB2312, and under that interpretation they display as '鍝堝搱'.
To repeat: str records a byte array, the stored form under some particular encoding; what it looks like when printed or written to a file depends entirely on which encoding is used to decode it.
One more note on print: when a unicode object is passed to print, it is internally converted to the local system's default encoding first (this is just a guess).
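The mis-decoding described above (UTF-8 bytes rendered by a GBK console) can be reproduced directly; a Python 3 sketch:

```python
# Reproduce the mojibake: the UTF-8 bytes of "哈哈",
# interpreted with the GBK codec (as a Chinese Windows console would do),
# display as three unrelated characters.
utf8_bytes = "哈哈".encode("utf-8")  # b'\xe5\x93\x88\xe5\x93\x88'
print(utf8_bytes.decode("gbk"))      # 鍝堝搱
```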
II. Converting between str and unicode objects
Conversion between str and unicode objects is done with the encode and decode methods: decode turns a str (a byte array) into a unicode object, and encode turns a unicode object back into a str. Their use is illustrated below:
for example, converting the GBK-encoded '哈哈' to unicode and then to UTF-8.
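The original snippet for this round trip is not present in this copy; a Python 3 sketch of the same conversion (bytes standing in for Python 2's str):

```python
# GBK bytes -> unicode text -> UTF-8 bytes.
s = "哈哈".encode("gbk")      # b'\xb9\xfe\xb9\xfe', e.g. as read from a GBK file
u = s.decode("gbk")           # the unicode object, code points U+54C8 U+54C8
s_utf8 = u.encode("utf-8")    # b'\xe5\x93\x88\xe5\x93\x88'
print(repr(u))
print(s_utf8)
```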
III. Setting the default encoding: setdefaultencoding
As the demo code shows, when a GBK-encoded str s is encoded directly to UTF-8, an exception is thrown; but after invoking the following code:
- import sys
- reload(sys)
- sys.setdefaultencoding('gbk')
the conversion succeeds. Why?
In Python, during encoding and decoding, if one str is encoded directly into another encoding, the str is first decoded to unicode using the default encoding, which is normally ASCII. That is why the first conversion in the example code fails; once the default encoding is set to 'gbk', there is no more error.
As for reload(sys): it is needed because Python (since 2.5) removes sys.setdefaultencoding at the end of interpreter initialization, so sys must be reloaded to get the method back.
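Python 3 removed both the implicit decode and sys.setdefaultencoding, but the underlying failure can still be reproduced explicitly; decoding GBK bytes with the ASCII codec is effectively what Python 2's implicit step did:

```python
# In Python 2, s.encode('utf-8') on a GBK str first performed an implicit
# s.decode(sys.getdefaultencoding()), i.e. an ASCII decode, which failed.
# Python 3 forces that decode to be explicit:
s_gbk = "哈哈".encode("gbk")  # b'\xb9\xfe\xb9\xfe'
try:
    s_gbk.decode("ascii")     # the implicit step that blew up in Python 2
except UnicodeDecodeError as e:
    print(e)
# Decoding with the correct codec first, then encoding, always works:
print(s_gbk.decode("gbk").encode("utf-8"))
```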
IV. Working with files in different encoding formats
Create a file test.txt in ANSI format (on a Chinese Windows system, ANSI means GBK), with the content:
- abc中文
Read it with Python:
- # coding=gbk
- print open("test.txt").read()
Result:
- abc中文
Change the file format to UTF-8:
Result:
- abc涓枃 (the UTF-8 bytes of 中文 misread as GBK)
Clearly, this needs to be decoded:
- # coding=gbk
- import codecs
- print open("test.txt").read().decode("utf-8")
Result:
- abc中文
I edited the test.txt above with EditPlus, but when I edited it with the Notepad that ships with Windows and saved it in UTF-8 format, running the script raised an error:
- Traceback (most recent call last):
-   File "chinesetest.py", line 3, in <module>
-     print open("test.txt").read().decode("utf-8")
- UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the UTF-8 BOM) at the beginning of the file when saving it as UTF-8.
So we need to strip these bytes when reading; the codecs module in Python defines a constant for them:
- # coding=gbk
- import codecs
- data = open("test.txt").read()
- if data[:3] == codecs.BOM_UTF8:
-     data = data[3:]
- print data.decode("utf-8")
Result:
- abc中文
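A Python 3 sketch of the same BOM handling; note that Python also ships a 'utf-8-sig' codec that strips the BOM automatically:

```python
import codecs

# Simulate a file saved by Notepad: a UTF-8 BOM followed by the content.
data = codecs.BOM_UTF8 + "abc中文".encode("utf-8")

# Manual BOM stripping, as in the snippet above.
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print(data.decode("utf-8"))  # abc中文

# Equivalent: the 'utf-8-sig' codec strips the BOM for us.
data = codecs.BOM_UTF8 + "abc中文".encode("utf-8")
print(data.decode("utf-8-sig"))  # abc中文
```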
V. The source file's encoding format and the role of the coding declaration
What effect does the encoding format of a source file have on the strings declared in it?
This question bothered me for a long time, and now it is finally a little clearer: the encoding format of the file determines the encoding of the str literals declared in that source file. For example:
- str = '哈哈'
- print repr(str)
a. If the file format is UTF-8, the value of str is '\xe5\x93\x88\xe5\x93\x88' ('哈哈' in UTF-8);
b. If the file format is GBK, the value of str is '\xb9\xfe\xb9\xfe' ('哈哈' in GBK).
As mentioned in the first section, a string in Python is just a byte array, so when the str from case (a) is output to a GBK-encoded console, it is displayed as the mojibake 鍝堝搱; and when the str from case (b) is output to a UTF-8 encoded console, it also comes out garbled, probably as nothing at all, since '\xb9\xfe\xb9\xfe' decoded as UTF-8 simply fails to display anything. >_<
That covers the file format; now for the role of the coding declaration. At the top of every file, a statement like # coding=gbk is used to declare the encoding, but what is this statement actually for? As far as I can tell, it serves three purposes:
a. it declares that non-ASCII characters (usually Chinese) will appear in the source file;
b. in more advanced IDEs, the IDE will save your file in the encoding format you specify;
c. it determines the encoding used to decode the '哈' in declarations like u'哈' into unicode; this is also a rather confusing point.
See this example:
- #coding: gbk
- ss = u'哈哈'
- print repr(ss)
- print 'ss:%s' % ss
Save this code as UTF-8 text and run it. What do you think it will output? Everyone's first instinct is surely:
- u'\u54c8\u54c8'
- ss:哈哈
But the actual output is:
- u'\u935d\u581d\u6431'
- ss:鍝堝搱
Why? This time it is the coding declaration making mischief. When ss = u'哈哈' runs, the whole process breaks down into the following steps:
1) Get the byte encoding of '哈哈': this is determined by the file's encoding format, so it is '\xe5\x93\x88\xe5\x93\x88' ('哈哈' in UTF-8 form).
2) Convert to unicode: in this conversion step, '\xe5\x93\x88\xe5\x93\x88' is decoded not with UTF-8 but with the encoding specified in the declaration, GBK. Decoding '\xe5\x93\x88\xe5\x93\x88' as GBK yields '鍝堝搱',
and the Unicode code points of those three characters are u'\u935d\u581d\u6431', which explains why print repr(ss) outputs u'\u935d\u581d\u6431'.
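The two steps above can be reproduced explicitly in Python 3, where the byte/text split is visible:

```python
# Step 1: the literal's bytes, as stored in the UTF-8 source file.
raw = "哈哈".encode("utf-8")            # b'\xe5\x93\x88\xe5\x93\x88'
# Step 2: the interpreter decodes them with the declared coding, gbk.
ss = raw.decode("gbk")
print([f"U+{ord(c):04X}" for c in ss])  # ['U+935D', 'U+581D', 'U+6431']
```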
OK, that was a bit of a detour. Let's analyze the next example:
- # -*- coding: utf-8 -*-
- ss = u'哈哈'
- print repr(ss)
- print 'ss:%s' % ss
This example is saved with GBK encoding. Running it gives, unexpectedly:
- UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: unexpected code byte
Why a UTF-8 decoding error here? Think back to the previous example and it makes sense:
in the first conversion step, because the file encoding is GBK, the bytes obtained for '哈哈' are its GBK encoding, '\xb9\xfe\xb9\xfe';
in the second step, when converting to unicode, UTF-8 is used to decode '\xb9\xfe\xb9\xfe', and a look at the UTF-8 code table (for an explanation of UTF-8, see the character-encoding notes on ASCII, UTF-8 and Unicode) shows that this byte sequence simply does not exist in valid UTF-8, so the error above is raised.
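This failure can likewise be reproduced explicitly in Python 3:

```python
# The GBK bytes of "哈哈", as stored in a GBK-encoded source file...
raw = "哈哈".encode("gbk")  # b'\xb9\xfe\xb9\xfe'
try:
    # ...decoded with the declared coding, utf-8:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 0xb9 is not a valid UTF-8 start byte
```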
In-depth analysis of the Python Chinese mojibake problem