In this article, we will walk through all of these problems using the character '哈' (ha) as the running example. The various encodings of '哈' are as follows:
1. Unicode code point: U+54C8 (c8 54 as little-endian UTF-16 bytes);
2. UTF-8: e5 93 88;
3. GBK: b9 fe.
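These values are easy to verify. The article below uses Python 2, but the byte values are identical in Python 3, whose syntax this quick check uses:

```python
# Check the three encodings of the character 哈 (U+54C8).
ch = '\u54c8'                        # 哈
print(ch.encode('utf-16-le').hex())  # c854 (little-endian UTF-16 bytes)
print(ch.encode('utf-8').hex())      # e59388
print(ch.encode('gbk').hex())        # b9fe
```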
1. str and unicode in Python
For a long time, handling Chinese text in Python (Python 2, in this article) has been a major headache, and encoding-conversion exceptions get thrown all the time. So what exactly are str and unicode in Python?
The "Unicode" mentioned in Python discussions generally refers to unicode objects. For example, the unicode object for '哈哈' is:
u'\u54c8\u54c8'
str is a byte array holding the bytes produced by encoding a unicode object (with UTF-8, GBK, cp936, gb2312, and so on). It is only a byte stream with no further meaning; if you want the content of that byte stream to display as something meaningful, you must decode it with the correct encoding.
For example:
Encode the unicode object u'哈哈' into a UTF-8 str, s_utf8. s_utf8 is a byte array whose content is '\xe5\x93\x88\xe5\x93\x88', but it is nothing more than bytes. If you expect the print statement to output 哈哈, you will be disappointed. Why?
Because the print statement hands its output to the operating system, and the operating system renders the incoming byte stream according to the system encoding. That is why printing the UTF-8 str of 哈哈 shows "鍝堝搱": '\xe5\x93\x88\xe5\x93\x88' interpreted with the console's GBK (gb2312-family) encoding displays as 鍝堝搱. Let me stress this again: a str holds a byte array in some encoded storage format; what appears when it is written to a file or printed depends entirely on how the consumer decodes it.
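The mojibake described above can be reproduced directly. A minimal sketch in Python 3 syntax, where the Python 2 str/unicode pair corresponds to bytes/str:

```python
# The UTF-8 bytes of 哈哈, mis-decoded with the GBK codec, give mojibake.
utf8_bytes = '哈哈'.encode('utf-8')   # b'\xe5\x93\x88\xe5\x93\x88'
print(utf8_bytes.decode('gbk'))       # 鍝堝搱
```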
A note on print here: when a unicode object is passed to print, it is converted internally first, presumably with the default encoding (this is only my guess).
2. Converting between str and unicode objects
Conversion between str and unicode objects is done with encode and decode. The usage is as follows:
Convert a GBK-encoded '哈哈' to unicode, and then to UTF-8:
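The original demo code is not reproduced here; the round trip it described looks like this (a sketch in Python 3 syntax, where decode goes bytes → text and encode goes text → bytes):

```python
gbk_bytes = b'\xb9\xfe\xb9\xfe'    # the GBK encoding of 哈哈
text = gbk_bytes.decode('gbk')     # bytes -> unicode text: u'哈哈'
utf8_bytes = text.encode('utf-8')  # unicode text -> UTF-8 bytes
print(utf8_bytes)                  # b'\xe5\x93\x88\xe5\x93\x88'
```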
3. setdefaultencoding
As the demo code shows, encoding s (a GBK str) directly to UTF-8 throws an exception; but after the following code is run:
import sys
reload(sys)
sys.setdefaultencoding('gbk')
the conversion succeeds. Why? In Python 2, when a str is encoded directly into another encoding, it is first decoded to unicode using the default encoding, which is normally ascii. That is why the first conversion in the sample code raises an error: ascii cannot decode the GBK bytes. Once the default encoding is set to 'gbk', the error disappears.
As for reload(sys): the method sys.setdefaultencoding is deleted from the sys module after Python finishes starting up (as of Python 2.5), so we have to reload the module to get it back.
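The implicit step Python 2 performs can be made explicit. When you call encode on a GBK str, Python 2 effectively runs the hidden decode below first, and it fails for the same reason (sketch in Python 3 syntax):

```python
gbk_bytes = '哈哈'.encode('gbk')  # the bytes held by the Python 2 str s
try:
    gbk_bytes.decode('ascii')     # the hidden first step: decode with the default encoding
except UnicodeDecodeError as e:
    print('implicit ascii decode fails:', e)

# Decoding explicitly with the right codec avoids the problem entirely,
# which is why explicit decode/encode beats tweaking setdefaultencoding.
print(gbk_bytes.decode('gbk').encode('utf-8'))  # b'\xe5\x93\x88\xe5\x93\x88'
```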
4. Operating on files in different encoding formats
Create a file named test.txt in ANSI format (GBK on a Chinese Windows system) with the following content:
ABC中文
Read it with Python:
# coding=gbk
print open("test.txt").read()
Result: ABC中文
Now change the file format to UTF-8:
Result: ABC followed by mojibake (the UTF-8 bytes of 中文 mis-read as GBK, e.g. 涓枃)
Obviously, decoding is required here:
# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")
Result: ABC中文
I had been editing test.txt with EditPlus, but when I edited it with the Notepad that ships with Windows and saved it as UTF-8, running the script produced an error:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it as UTF-8.
So we need to strip those bytes when reading. Python's codecs module defines a constant for them:
# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: ABC中文
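In current Python the same BOM stripping still works, and the 'utf-8-sig' codec does it automatically. A sketch with the file contents simulated as a byte string, so it runs without a test.txt on disk:

```python
import codecs

# Simulate what Notepad writes: a UTF-8 BOM followed by the UTF-8 text.
data = codecs.BOM_UTF8 + 'ABC中文'.encode('utf-8')

if data[:3] == codecs.BOM_UTF8:   # strip the BOM by hand, as above
    data = data[3:]
print(data.decode('utf-8'))       # ABC中文

# Or let the codec handle it: utf-8-sig strips a leading BOM itself.
data2 = codecs.BOM_UTF8 + 'ABC中文'.encode('utf-8')
print(data2.decode('utf-8-sig'))  # ABC中文
```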
5. The effect of the file encoding format and the coding declaration
What effect do the source file's encoding format and the encoding declaration have on string literals? This question bothered me for a long time, and now things are finally becoming clear. The file's encoding format determines the encoding of the str literals declared in that source file. For example:
str = '哈哈'
print repr(str)
a. If the file is saved as UTF-8, the value of str is '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 encoding of 哈哈).
b. If the file is saved as GBK, the value of str is '\xb9\xfe\xb9\xfe' (the GBK encoding of 哈哈).
As mentioned in the first section, a str in Python is just a byte array, so when the str from case (a) is printed on a GBK-encoded console, it comes out as mojibake: 鍝堝搱. And when the str from case (b) is printed on a UTF-8-encoded console, it is also garbled, because '\xb9\xfe\xb9\xfe' is not valid UTF-8; it may show replacement characters or nothing at all. >_<
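Case (b) can be checked as well: the GBK bytes are simply not valid UTF-8, so a strict decode fails and a lenient one yields replacement characters (Python 3 sketch):

```python
gbk_bytes = '哈哈'.encode('gbk')                     # b'\xb9\xfe\xb9\xfe'
text = gbk_bytes.decode('utf-8', errors='replace')  # no byte here is valid UTF-8
print(text)                                         # replacement characters (U+FFFD)
```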
With file formats covered, let's talk about what the encoding declaration does. Each file shown above has a line like # coding=gbk at the top to declare its encoding, but what is this declaration for? As far as I can tell, it serves three purposes:
- It declares that the source file contains non-ASCII characters, usually Chinese;
- In the more capable IDEs, the IDE saves your file in the encoding you declared;
- It determines which codec is used to decode unicode literals such as u'哈' in the source code. This last point is easy to get confused about. For example:
# coding: gbk
ss = u'哈哈'
print repr(ss)
print 'ss: %s' % ss
Save this code as UTF-8 text and run it. What do you think it will output? The obvious guess is:
u'\u54c8\u54c8'
ss: 哈哈
But the actual output is:
u'\u935d\u581d\u6431'
ss: 鍝堝搱
Why? This is where the encoding declaration kicks in. When ss = u'哈哈' runs, the whole process can be broken into the following steps:
1) Get the bytes of '哈哈': these are determined by the file's encoding format, here '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 encoding of 哈哈);
2) When converting to unicode, '\xe5\x93\x88\xe5\x93\x88' is decoded not with UTF-8 but with the GBK codec specified in the encoding declaration. Decoding '\xe5\x93\x88\xe5\x93\x88' as GBK yields 鍝堝搱, three characters whose Unicode code points are u'\u935d\u581d\u6431'. That explains why print repr(ss) outputs u'\u935d\u581d\u6431'.
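Those two steps can be simulated directly (Python 3 sketch): the file format fixes the bytes, and the declaration picks the codec used to decode them.

```python
raw = '哈哈'.encode('utf-8')      # step 1: the bytes stored in the UTF-8 file
ss = raw.decode('gbk')            # step 2: decoded per the # coding: gbk declaration
print(repr(ss))                   # '鍝堝搱', i.e. '\u935d\u581d\u6431'
print(repr(raw.decode('utf-8')))  # the intended result: '哈哈'
```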
OK, that was a bit of a detour. Let's analyze the next example:
# -*- coding: utf-8 -*-
ss = u'哈哈'
print repr(ss)
print 'ss: %s' % ss
Save this example as GBK and run it. The result is:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: unexpected code byte
Why a UTF-8 decoding error here? Think back to the previous example and it becomes clear: in the first step, because the file is saved as GBK, the literal '哈哈' becomes the GBK bytes '\xb9\xfe\xb9\xfe'. In the second step, when converting to unicode, those bytes are decoded with utf8 as the declaration demands. But check a UTF-8 table (for an introduction to UTF-8, see the character-encoding notes on ASCII, UTF-8, and Unicode): '\xb9' cannot begin a valid UTF-8 sequence at all, so the error above is raised.
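The failing case can be simulated the same way (Python 3 sketch):

```python
raw = '哈哈'.encode('gbk')   # step 1: the bytes stored in the GBK file
try:
    raw.decode('utf-8')      # step 2: forced by the utf-8 coding declaration
except UnicodeDecodeError as e:
    print(e)                 # 0xb9 cannot start a valid UTF-8 sequence
```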