Python Unicode and Chinese Processing

Last Update:2018-12-03 Source: Internet

Author: User


From: http://hi.baidu.com/jackleehit/blog/item/ea93618e1051131cb31bbaac.html

Unicode handling in Python is confusing and hard to understand. This article tries to clear up these problems.

1. The relationship between Unicode, GBK, GB2312, and UTF-8

The article at http://www.pythonclub.org/python-basic/encode-detail explains this well. In short, UTF-8 is one way of implementing (encoding) Unicode, whereas Unicode, GBK, and GB2312 are coded character sets.
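As a small illustration of that distinction, here is a sketch written for Python 3, where text (`str`) and `bytes` are separate types. It is not part of the original article. The two characters have fixed Unicode code points, but the bytes that represent them depend on the encoding chosen.

```python
# Python 3 sketch: one piece of text, several byte encodings.
text = "中文"

# Code points belong to the characters and do not depend on any encoding.
code_points = [hex(ord(ch)) for ch in text]

# Each encoding maps the same code points to different bytes.
utf8_bytes = text.encode("utf-8")      # three bytes per character here
gbk_bytes = text.encode("gbk")         # two bytes per character
gb2312_bytes = text.encode("gb2312")   # same bytes as GBK for these characters

print(code_points)
print(utf8_bytes, gbk_bytes, gb2312_bytes)
```

GBK is a superset of GB2312, which is why the last two results agree for common characters such as these.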

2. Chinese encoding problems in Python

2.1 Encoding in .py files

By default, Python treats script files as ASCII encoded. When a file contains characters outside the ASCII range, an "encoding declaration" is needed to handle them correctly. In a module, if the .py file contains Chinese characters (strictly speaking, any non-ASCII characters), you must put an encoding declaration on the first or second line:

# -*- coding=utf-8 -*- or # coding=utf-8 (other encodings such as GBK and GB2312 also work). Otherwise you get an exception such as: SyntaxError: Non-ASCII character '\xe4' in file chinesetest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

2.2 Encoding and decoding in Python

First, the string types in Python. Python has two string types, str and unicode, both derived from basestring. A str is a sequence of (at least) 8-bit bytes, whereas each unit of a unicode object is one Unicode character. So:

len(u'中国') is 2, and len('ab') is also 2.

The documentation for str notes: "The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file." In other words, when you read the content of a file or read data from the network, the object you get is a str. To convert a str to a specific encoding, first convert the str to unicode, then convert the unicode to the target encoding, such as UTF-8 or GB2312.
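The length difference and the two-step conversion can be sketched in Python 3. The sketch assumes that `bytes` plays the role of Python 2's str and `str` the role of unicode; the variable names are illustrative only.

```python
# Python 3 sketch: bytes stands in for Python 2's str, str for unicode.
text = "中国"                       # two characters
raw_utf8 = text.encode("utf-8")     # the bytes a UTF-8 file would hold

print(len(text))       # counts characters
print(len(raw_utf8))   # counts bytes

# Re-encoding always goes through text: bytes -> text -> bytes.
raw_gb2312 = raw_utf8.decode("utf-8").encode("gb2312")
print(len(raw_gb2312))
```

The byte count changes with the encoding while the character count stays the same, which is why the length of a byte string read from a file says little about how many characters it contains.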

Conversion functions provided by Python:

Converting unicode to GB2312, UTF-8, etc.:

# -*- coding=utf-8 -*-
if __name__ == '__main__':
    s = u'中国'
    s_gb = s.encode('gb2312')

To convert UTF-8, GBK, and so on to unicode, use the unicode(s, encoding) function or the s.decode(encoding) method:

# -*- coding=utf-8 -*-
if __name__ == '__main__':
    s = u'中国'
    # First encode the unicode object s as UTF-8
    s_utf8 = s.encode('utf-8')
    assert(s_utf8.decode('utf-8') == s)

Converting an ordinary str to unicode:

# -*- coding=utf-8 -*-
if __name__ == '__main__':
    s = '中国'
    su = u'中国'
    # Decode s to unicode using UTF-8, because this .py file
    # (# -*- coding=utf-8 -*-) is encoded as UTF-8
    s_unicode = s.decode('utf-8')
    assert(s_unicode == su)
    # To convert s to gb2312, decode it to unicode first, then encode it as gb2312
    s.decode('utf-8').encode('gb2312')
    # What happens if s.encode('gb2312') is executed directly?
    s.encode('gb2312')

# -*- coding=utf-8 -*-
if __name__ == '__main__':
    s = '中国'
    # What happens if s.encode('gb2312') is executed directly?
    s.encode('gb2312')

An exception occurs here:

Python automatically decodes s to unicode first and then encodes it as gb2312. Because this decoding is implicit and we did not specify a codec, Python uses the default encoding, which sys.getdefaultencoding() reports. In many cases that default is ASCII, so if s is not ASCII, an error occurs.

In the example above, my default encoding is ASCII, while s is encoded the same way as the file, i.e. UTF-8, so the error is: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
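Python 3 no longer performs this hidden step, but it can be imitated to show where the error comes from. In the sketch below, `implicit_encode` is a hypothetical helper (not a Python API) that decodes with a default codec before encoding, roughly as Python 2 does when encode is called on a byte string.

```python
# Python 3 sketch of the hidden decode that Python 2 performs in
# s.encode('gb2312') when s is a byte string.
raw = "中国".encode("utf-8")   # bytes as they sit in a UTF-8 source file

def implicit_encode(data, target, default="ascii"):
    """Hypothetical helper: decode with a default codec, then encode."""
    return data.decode(default).encode(target)

try:
    implicit_encode(raw, "gb2312")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)   # the ASCII codec rejects the first byte, 0xe4

# Naming the real source encoding avoids the failing ASCII step.
converted = implicit_encode(raw, "gb2312", default="utf-8")
print(converted)
```

The failure is a decode error even though the call was to encode, because the decode happens first and silently.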

There are two ways to correct this error.

First, state the encoding of s explicitly:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb2312')

Second, change the default encoding to the file's encoding:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
# Since Python 2.5, sys.setdefaultencoding is deleted after initialization,
# so sys must be reloaded to get it back
reload(sys)
sys.setdefaultencoding('utf-8')
s = '中文'
s.encode('gb2312')

File encoding and print Functions

Create a file named test.txt in ANSI format with the following content:

ABC Chinese

Read data using Python

# Coding = GBK

Print open ("test.txt"). Read ()

Result: ABC (Chinese)

The file format into UTF-8:

Result: ABC Juan

Obviously, decoding is required here:

# Coding = GBK

Import codecs

Print open ("test.txt"). Read (). Decode ("UTF-8 ")

Result: ABC (Chinese)

I edited test.txt with EditPlus above. But when I edited it with the Notepad that ships with Windows and saved it in UTF-8 format, running the script raised an error:

Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

Originally, some software, such as Notepad, will insert three invisible characters (0xef 0xbb 0xbf, BOM) at the beginning of the file when saving a file encoded in UTF-8 ).

Therefore, we need to remove these characters during reading. The codecs module in Python defines this constant:

# Coding = GBK

Import codecs

Data = open ("test.txt"). Read ()

If data [: 3] = codecs. bom_utf8:

Data = data [3:]

Print data. Decode ("UTF-8 ")

Result: ABC (Chinese)

(4) A few remaining questions

In the second part, we used the unicode function and the decode method to convert str to unicode. Why do these two functions take "gbk" as their argument?

The first guess is that it is because our encoding declaration uses GBK (# coding=gbk), but is that really so?

Modify the source file:

# coding=utf-8
s = "中文"
print unicode(s, "utf-8")

Run it, and it fails:

Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

Clearly, if the earlier example worked because both sides used GBK, then keeping UTF-8 on both sides here should work as well and should not raise an error.

As a further example, suppose the conversion here still uses GBK:

# coding=utf-8
s = "中文"
print unicode(s, "gbk")

Result: 中文
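One possible reading of these two results is that the bytes in s were GBK bytes regardless of the coding line, for instance because the editor saved the source file as GBK. This is an assumption, since the article does not say how the file was saved. Under that assumption, the Python 3 sketch below reproduces both outcomes: decoding as UTF-8 fails, and decoding as GBK succeeds.

```python
# Python 3 sketch, assuming the literal's bytes on disk were GBK.
gbk_bytes = "中文".encode("gbk")

# Decoding GBK bytes as UTF-8 fails, like the traceback above.
try:
    gbk_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec the bytes were written in recovers the text.
as_gbk = gbk_bytes.decode("gbk")
as_cp936 = gbk_bytes.decode("cp936")

print(utf8_ok, as_gbk, as_cp936)
```

In other words, the codec passed to unicode() or decode() has to match the bytes actually held in the string, not merely the encoding named in the declaration.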

I read an English document that explains how print works in Python:

When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

To put it simply, print in Python passes the string straight to the operating system, so you need to convert the str using the same encoding as the operating system. Windows uses CP936 (almost the same as GBK), so GBK works here.
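A minimal Python 3 sketch of "knowing the encoding the display program expects": `encode_for_display` is a hypothetical helper, and falling back to UTF-8 when the stream reports no encoding is an assumption of this sketch, not a rule from the article.

```python
import sys

def encode_for_display(text, encoding=None):
    """Hypothetical helper: encode text for the program that will display it."""
    # sys.stdout.encoding can be None (e.g. some redirected streams),
    # so fall back to UTF-8 in that case.
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # errors="replace" substitutes "?" for characters the target cannot represent.
    return text.encode(target, errors="replace")

print(encode_for_display("中文", "cp936"))   # bytes a CP936 console expects
print(encode_for_display("中文", "ascii"))   # ASCII cannot represent these characters
```

Choosing errors="replace" trades fidelity for robustness: the output may contain question marks, but printing no longer raises an encoding error.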

One last test:

# coding=utf-8
s = "中文"
print unicode(s, "cp936")

Result: 中文
