In this article, the character 哈 ("ha") is used as the running example to explain all the problems. Its various encodings are:
1. Unicode code point: U+54C8 (UTF-16LE bytes: C8 54);
2. UTF-8: E5 93 88;
3. GBK: B9 FE.
1. str and unicode in Python
Chinese text encoding has long been a big problem in Python, frequently throwing encoding-conversion exceptions. So what exactly are str and unicode in Python?
In Python, "unicode" generally refers to a unicode object; for example, the unicode object for 哈哈 ("haha") is
u'\u54c8\u54c8'
A str, on the other hand, is a byte array: it represents the stored form of a unicode object after encoding (which can be UTF-8, GBK, cp936, GB2312, and so on). By itself it is just a byte stream with no other meaning; if you want that byte stream to display meaningful content, you must decode it with the correct codec.
For example:
Encode the unicode object 哈哈 into a UTF-8 str s_utf8. s_utf8 is a byte array storing '\xe5\x93\x88\xe5\x93\x88', but it is only a byte array; if you expect the print statement to output 哈哈 from it, you will be disappointed. Why?
Because print hands its content to the operating system, and the operating system renders the incoming byte stream according to its own encoding. That explains why printing the UTF-8 string 哈哈 displays 鍝堝搱: the bytes '\xe5\x93\x88\xe5\x93\x88' interpreted as GB2312/GBK display as 鍝堝搱. To repeat: a str records a byte array in some encoded storage format; what comes out when it is written to a file or printed depends entirely on which codec the consumer uses to decode it.
One more note about print: when you pass a unicode object to print, it is converted internally using the local default encoding (this is just a guess).
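The byte-level facts above can be checked directly. The article targets Python 2, where these byte strings have type str; the sketch below uses Python 3, where the same values have type bytes, but the encodings themselves are identical:

```python
# Python 3 sketch (in Python 2 these byte strings are type `str`;
# in Python 3 they are `bytes` -- the byte values are the same).

s_utf8 = '\u54c8\u54c8'.encode('utf-8')  # UTF-8 bytes of 哈哈
print(repr(s_utf8))                      # b'\xe5\x93\x88\xe5\x93\x88'

# Rendering those UTF-8 bytes with the wrong codec (GBK) gives mojibake:
print(s_utf8.decode('gbk'))              # 鍝堝搱
```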
2. Converting between str and unicode
str and unicode objects are converted into each other through encode and decode. Concretely: convert the GBK str 哈哈 to unicode, then convert it to UTF-8.
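This round-trip can be sketched as follows. The article's original demo is Python 2; in Python 3 terms, Python 2's s.decode('gbk') on a str corresponds to bytes.decode, and unicode.encode to str.encode:

```python
# Python 3 sketch of the GBK -> unicode -> UTF-8 round-trip.
s_gbk = b'\xb9\xfe\xb9\xfe'        # 哈哈 encoded as GBK
u = s_gbk.decode('gbk')            # the unicode object u'\u54c8\u54c8'
s_utf8 = u.encode('utf-8')         # b'\xe5\x93\x88\xe5\x93\x88'
print(repr(u))
print(repr(s_utf8))
```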
3. setdefaultencoding
In the conversion above, encoding s (a GBK str) directly into UTF-8 throws an exception, but after invoking the following code:
import sys
reload(sys)
sys.setdefaultencoding('gbk')
the conversion succeeds. Why? When Python encodes and decodes between str and unicode, encoding a str directly into another encoding first decodes the str to unicode using the default encoding, which is normally ascii. That is why the first conversion in the example above raises an error; once the current default encoding is set to 'gbk', there is no error.
As for reload(sys): Python 2.5 deletes the sys.setdefaultencoding method after initialization, so we need to reload it.
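Python 3 removed setdefaultencoding entirely, but the failure mode can still be reproduced. The explicit call below is a sketch imitating the implicit step Python 2 performed, not the article's original demo:

```python
# Sketch: what Python 2 did implicitly when you called encode() on a
# GBK str -- first decode with the ascii default codec, which fails.
s_gbk = b'\xb9\xfe\xb9\xfe'        # 哈哈 encoded as GBK
try:
    s_gbk.decode('ascii')          # the hidden step behind s_gbk.encode('utf-8')
except UnicodeDecodeError as e:
    print(e)                       # 0xb9 is outside the ASCII range
```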
4. Operating on files in different encoding formats
Create a file test.txt in ANSI format with the content:
abc中文
Read it with Python:
# coding=gbk
print open("test.txt").read()
Result: abc中文
Change the file format to UTF-8:
Result: abc followed by mojibake (the UTF-8 bytes of 中文 rendered as GBK)
Obviously, this needs decoding:
# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")
Result: abc中文
I edited the test.txt above with EditPlus, but when I instead save it in UTF-8 format with Windows Notepad, running the script errors:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it in UTF-8.
So we need to strip those bytes when reading; Python's codecs module defines this constant:
# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: abc中文
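The same BOM stripping in Python 3 looks like the sketch below. The file is simulated with an in-memory byte string so the example is self-contained:

```python
import codecs

# Simulate what Notepad writes: a UTF-8 BOM followed by the UTF-8 text.
data = codecs.BOM_UTF8 + 'abc\u4e2d\u6587'.encode('utf-8')

# Strip the BOM before decoding, as in the article's snippet.
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print(data.decode('utf-8'))        # abc中文
```

Alternatively, in Python 3, open(path, encoding='utf-8-sig') decodes the file and drops a leading BOM automatically.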
5. The source file's encoding format and the role of the coding declaration
What effect does the source file's encoding format have on the declaration of string literals? This problem bothered me for a long time, and now it is finally a bit clearer. The encoding format of the file determines the encoding of the str literals declared in that source file. For example:
str = '哈哈'
print repr(str)
a. If the file format is UTF-8, the value of str is '\xe5\x93\x88\xe5\x93\x88' (哈哈 in UTF-8);
b. If the file format is GBK, the value of str is '\xb9\xfe\xb9\xfe' (哈哈 in GBK).
As stated in the first section, a str in Python is just a byte array, so when the str from case (a) is output to a GBK-encoded console, it displays as the mojibake 鍝堝搱; and when the str from case (b) is output to a UTF-8-encoded console, it is garbled too, only there is nothing to see: '\xb9\xfe\xb9\xfe' rendered with UTF-8 decoding probably comes out blank. >_<
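The two byte values can be checked directly. In this Python 3 sketch, encoding a (unicode) string literal reproduces the byte arrays a Python 2 source file would store in each case:

```python
# The byte values of 哈哈 under each source-file encoding.
print(repr('\u54c8\u54c8'.encode('utf-8')))  # case a: b'\xe5\x93\x88\xe5\x93\x88'
print(repr('\u54c8\u54c8'.encode('gbk')))    # case b: b'\xb9\xfe\xb9\xfe'
```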
Having covered file formats, let's now talk about the role of the coding declaration. At the very top of each file, a statement like # coding=gbk is used to declare the encoding, but what is this statement actually for? As far as I can tell, it serves three purposes:
- It declares that non-ASCII characters, usually Chinese, will appear in the source file;
- In advanced IDEs, it makes the IDE save your file in the encoding format you specify;
- It determines the codec used to decode literals such as u'哈' in the source into unicode, which is also the most confusing point. Look at this example:
#coding: gbk
ss = u'哈哈'
print repr(ss)
print 'ss:%s' % ss
Save this code as UTF-8 text and run it. What do you think it will output? Everyone's first instinct is surely:
u'\u54c8\u54c8'
ss:哈哈
But the output is actually:
u'\u935d\u581d\u6431'
ss:鍝堝搱
Why is that? This is the coding declaration making mischief. When ss = u'哈哈' runs, the whole process can be divided into the following steps:
1. Get the bytes of 哈哈: these are determined by the file's encoding format, namely '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 encoded form of 哈哈).
2. Convert to unicode: in this conversion, '\xe5\x93\x88\xe5\x93\x88' is decoded not with UTF-8 but with the codec specified by the coding declaration, GBK. Decoding '\xe5\x93\x88\xe5\x93\x88' as GBK yields 鍝堝搱, and the Unicode code points of those three characters are u'\u935d\u581d\u6431', which explains why print repr(ss) outputs u'\u935d\u581d\u6431'.
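The two steps above can be replayed directly. In this Python 3 sketch, the explicit decode('gbk') call stands in for what the #coding: gbk declaration makes the interpreter do:

```python
# Step 1: the literal's bytes are fixed by the file's encoding (UTF-8 here).
literal_bytes = '\u54c8\u54c8'.encode('utf-8')   # b'\xe5\x93\x88\xe5\x93\x88'

# Step 2: the interpreter decodes them with the *declared* codec, GBK.
ss = literal_bytes.decode('gbk')
print(repr(ss))   # the three characters 鍝堝搱 (U+935D U+581D U+6431)
```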
Okay, that was a bit of a detour. Now let's analyze the next example:
# -*- coding: utf-8 -*-
ss = u'哈哈'
print repr(ss)
print 'ss:%s' % ss
This example is saved in GBK encoding, and running it surprisingly gives:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: unexpected code byte
Why a UTF-8 decode error here? Thinking back to the previous example makes it clear: in the first conversion step, because the file encoding is GBK, the bytes obtained for 哈 are its GBK encoding '\xb9\xfe\xb9\xfe'. In the second step, converting to unicode, UTF-8 is used to decode '\xb9\xfe\xb9\xfe'; checking the UTF-8 coding table (for an explanation of UTF-8, see the character-encoding notes on ASCII, UTF-8, and Unicode) shows that this byte sequence does not exist in it, so the error above is reported.
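This failing case can likewise be replayed. A Python 3 sketch of the two steps for a GBK-saved file carrying a UTF-8 coding declaration:

```python
# Step 1: saved as GBK, the literal's bytes are the GBK encoding of 哈哈.
literal_bytes = '\u54c8\u54c8'.encode('gbk')     # b'\xb9\xfe\xb9\xfe'

# Step 2: the declared codec, UTF-8, cannot decode those bytes.
try:
    literal_bytes.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)   # can't decode byte 0xb9 in position 0
```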