Python Chinese questions

Source: Internet
Author: User

This article uses the character 哈 ("ha") as the running example for every problem. Its encodings are as follows:
1. Unicode (UTF-16): code point U+54C8, stored as the bytes C8 54 in little-endian order;
2. UTF-8: E5 93 88;
3. GBK: B9 FE.
First, str and unicode in Python
For a long time, Chinese encoding has been a big problem in Python, and code often throws encoding conversion exceptions. So what exactly are str and unicode in Python?
When Unicode is mentioned in Python, it generally means a unicode object. For example, the unicode object for '哈哈' ("haha") is
u'\u54c8\u54c8'
A str, by contrast, is a byte array. It holds the stored form of a unicode object after encoding (as UTF-8, GBK, cp936, GB2312, and so on). It is only a stream of bytes with no other meaning; to display it as meaningful text, you must decode it with the correct encoding.
For example:

Encoding the unicode object '哈哈' as UTF-8 gives a str, s_utf8. s_utf8 is a byte array holding '\xe5\x93\x88\xe5\x93\x88', and nothing more. If you expect a print statement to display it as 哈哈, you may be disappointed. Why?

Because print simply hands the content to the operating system, and the operating system decodes the incoming byte stream using the system encoding. This explains why the UTF-8 string '哈哈' is displayed as '鍝堝搱' on a GBK system: that is what '\xe5\x93\x88\xe5\x93\x88' looks like when interpreted as GB2312. To repeat: a str is a byte array, the stored form of some encoding. What it looks like when written to a file or printed depends entirely on which encoding is used to decode it.
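The point above can be sketched in a few lines. This is only an illustration, and it assumes Python 3, where `bytes` plays the role that `str` plays in the Python 2 code this article describes; the variable names are chosen for the example.

```python
# Python 3 sketch: `bytes` here stands in for Python 2's `str`.
text = "\u54c8\u54c8"            # the unicode object for 哈哈
s_utf8 = text.encode("utf-8")    # stored form: six raw bytes

# The right codec recovers the text; the wrong one yields garbled text.
as_utf8 = s_utf8.decode("utf-8")
as_gbk = s_utf8.decode("gbk")    # roughly what a GBK console would show

print(repr(s_utf8))
print(ascii(as_utf8))
print(ascii(as_gbk))
```

The same six bytes produce two different texts; only the choice of decoder differs.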

A little more on print: when a unicode object is passed to print, it is converted internally to the local default encoding before being output (this is only my guess).

Second, converting between str and unicode objects

Conversion between str and unicode objects is done with encode and decode, used as follows:

Convert the GBK '哈哈' to unicode, and then to UTF-8:
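A minimal sketch of that conversion, assuming Python 3, where `bytes.decode` and `str.encode` mirror the Python 2 `str.decode` and `unicode.encode` calls discussed here. The variable names are illustrative.

```python
# Python 3 sketch of the GBK -> Unicode -> UTF-8 round trip.
s_gbk = b"\xb9\xfe\xb9\xfe"      # GBK bytes of 哈哈

u = s_gbk.decode("gbk")          # bytes -> Unicode text
s_utf8 = u.encode("utf-8")       # Unicode text -> UTF-8 bytes

print(ascii(u))
print(s_utf8)
```

The Unicode object sits in the middle of every conversion: one encoding is never turned into another directly.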

Third, setdefaultencoding

As shown in the demo code:

When s (a GBK-encoded str) is encoded directly to UTF-8, an exception is thrown. But after calling the following code:

import sys
reload(sys)
sys.setdefaultencoding('gbk')

the conversion succeeds. Why? In Python, when a str is encoded directly into another encoding, the str is first decoded to unicode, and that decoding step uses the default encoding. The default encoding is normally ASCII, so the first conversion in the example fails. Once the default encoding is set to 'gbk', the error goes away.

As for reload(sys): since Python 2.5, sys.setdefaultencoding is removed from the sys module once initialization finishes, so the module has to be reloaded to get the function back.
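The hidden decode step can be made explicit. The sketch below assumes Python 3, which has no implicit default-encoding conversion, so `py2_style_encode` is a hypothetical helper written only to imitate what Python 2 does when `encode` is called on a str.

```python
def py2_style_encode(raw, target, default_encoding="ascii"):
    """Hypothetical helper imitating Python 2's str.encode(target)."""
    # Step 1: the implicit decode, using the default encoding.
    text = raw.decode(default_encoding)
    # Step 2: the encode that was actually requested.
    return text.encode(target)

s = b"\xb9\xfe\xb9\xfe"  # GBK bytes of 哈哈

try:
    py2_style_encode(s, "utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True  # ASCII cannot decode byte 0xb9

converted = py2_style_encode(s, "utf-8", default_encoding="gbk")
print(failed, converted)
```

Seen this way, setting the default encoding only changes step 1. Decoding explicitly with the known encoding achieves the same result without touching a global setting.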

Fourth, working with files in different encodings

Create a file test.txt, saved in ANSI format, with the content:

abc中文

Read it with Python:

# coding=gbk
print open("test.txt").read()

Result: abc中文

Change the file format to UTF-8:

Result: abc涓枃

Clearly, the content needs to be decoded:

# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")

Result: abc中文
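The read-then-decode step can be sketched as follows. This assumes Python 3 and uses a temporary file in place of test.txt, so the path and variable names are illustrative only.

```python
import os
import tempfile

# Write a UTF-8 file standing in for test.txt (a temp file is used here).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("abc\u4e2d\u6587".encode("utf-8"))

# Read the raw bytes, then decode with the encoding the file was saved in.
with open(path, "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")
os.remove(path)

print(ascii(text))
```

Opening the file in binary mode keeps the bytes untouched, so the decoding choice stays visible in the code.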

I edited the test.txt above with EditPlus. But when I edited it with the Notepad that ships with Windows and saved it as UTF-8, running the script reported an error:

Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of the file when saving it as UTF-8.

So we need to strip these bytes ourselves when reading the file. The codecs module in Python defines a constant for them:

# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")

Result: abc中文
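For comparison, here is a Python 3 sketch of the same BOM handling. `strip_utf8_bom` is a hypothetical helper named for this example, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data):
    """Hypothetical helper: drop a leading UTF-8 BOM from raw bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Bytes as Notepad would save them: BOM first, then the UTF-8 text.
notepad_bytes = codecs.BOM_UTF8 + "abc\u4e2d\u6587".encode("utf-8")

manual = strip_utf8_bom(notepad_bytes).decode("utf-8")
# The standard "utf-8-sig" codec skips the BOM if one is present.
builtin = notepad_bytes.decode("utf-8-sig")

print(ascii(manual), ascii(builtin))
```

Both routes give the same text. The `utf-8-sig` codec is convenient when a file may or may not start with a BOM.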

Fifth, the roles of the file encoding format and the encoding declaration

What effect does the encoding format of the source file have on a string declared in it? This puzzled me for a long time, and I finally have some idea: the file's encoding format determines the encoding of the strings declared in that source file. For example:

str = '哈哈'
print repr(str)

a. If the file format is UTF-8, the value of str is '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 encoding of 哈哈).

b. If the file format is GBK, the value of str is '\xb9\xfe\xb9\xfe' (the GBK encoding of 哈哈).

As mentioned in the first section, a string in Python is just a byte array. So when the str from case a is output to a GBK console, it is displayed as garbled text: 鍝堝搱. When the str from case b is output to a UTF-8 console, it is garbled too, though nothing is displayed at all; perhaps '\xb9\xfe\xb9\xfe' decoded as UTF-8 simply shows up as blanks. >_<

With the file format covered, let us turn to the role of the encoding declaration. Every file has a statement like # coding=gbk at the top to declare its encoding, but what does this declaration actually do? As far as I can tell so far, it does three things:

1. It declares that non-ASCII characters, usually Chinese, will appear in the source file;

2. In more advanced IDEs, the IDE saves your file in the encoding you declared;

3. It determines the encoding used to decode a literal such as u'哈' in the source code into unicode. This is also the confusing part, as the following example shows:

#coding:gbk
ss = u'哈哈'
print repr(ss)
print 'ss:%s' % ss

Save this code as UTF-8 text and run it. What do you think it will output? At first everyone feels sure the output must be:

u'\u54c8\u54c8'
ss:哈哈

But the actual output is:

u'\u935d\u581d\u6431'
ss:鍝堝搱

Why? This time the encoding declaration is to blame. Running ss = u'哈哈' can be broken into the following steps:

1) Get the encoding of '哈哈': this is determined by the file's encoding format, giving '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 encoded form of 哈哈).

2) Convert to unicode: in this step '\xe5\x93\x88\xe5\x93\x88' is decoded not as UTF-8 but with the encoding named in the declaration, GBK. Decoding '\xe5\x93\x88\xe5\x93\x88' as GBK yields '鍝堝搱', and the unicode encoding of those three characters is u'\u935d\u581d\u6431'. That explains why print repr(ss) outputs u'\u935d\u581d\u6431'.
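The two steps above can be modeled with a small function. This is a sketch in Python 3, not how the interpreter is actually implemented; `unicode_literal` and its parameter names are hypothetical and exist only to mirror the description.

```python
def unicode_literal(text, file_encoding, declared_encoding):
    """Hypothetical model of how a u'...' literal gets its value."""
    # Step 1: the bytes on disk depend on how the file was saved.
    raw = text.encode(file_encoding)
    # Step 2: those bytes are decoded with the *declared* encoding.
    return raw.decode(declared_encoding)

haha = "\u54c8\u54c8"  # 哈哈

matched = unicode_literal(haha, "utf-8", "utf-8")
mismatched = unicode_literal(haha, "utf-8", "gbk")

print(ascii(matched))
print(ascii(mismatched))
```

When the file encoding and the declaration agree, the literal survives intact; when they disagree, the literal silently turns into different characters.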

OK, that was a bit roundabout. Let's analyze the next example:

# -*- coding: utf-8 -*-
ss = u'哈哈'
print repr(ss)
print 'ss:%s' % ss

Save this example as GBK and run it. The result, unexpectedly, is:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: unexpected code byte

Why is there a utf8 decoding error here? With the previous example in mind it is easy to understand. In the first step, because the file is encoded as GBK, the bytes of '哈哈' are the GBK encoding '\xb9\xfe\xb9\xfe'. In the second step, the conversion to unicode decodes '\xb9\xfe\xb9\xfe' as utf8, and that byte sequence does not exist in the UTF-8 encoding table at all (for an explanation of UTF-8, see Character Encoding Notes: ASCII, UTF-8, UNICODE), so the error above is reported.
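The failing second step can be reproduced directly. The sketch below assumes Python 3, whose error message wording differs from the Python 2 message quoted above, so it inspects the exception's attributes instead of its text.

```python
gbk_bytes = "\u54c8\u54c8".encode("gbk")   # what a GBK-saved file holds

try:
    gbk_bytes.decode("utf-8")              # what the declaration asks for
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # The exception records which byte failed and at which position.
    print(hex(error.object[error.start]), error.start)
```

The offending byte and position match the ones in the traceback: byte 0xb9 at position 0.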
