Python encoding details

Last Update: 2014-07-18 Source: Internet

Author: User


The previous article, on character sets and encodings, summarized the common character encodings. This article analyzes and summarizes common encoding problems in Python. Because Python 3.x and Python 2.x handle character encoding very differently, this article uses Python 2.7.5 and covers only the 2.x behavior.

1. Python encoding basics

1.1 str and unicode

Python 2 has two string types, str and unicode, both of which derive from basestring. For example, s = "中文" is a str and u = u"中文" is a unicode string. Decoding a str produces unicode, and encoding unicode produces a str. That is:

str --> decode --> unicode
unicode --> encode --> str

Strictly speaking, str is a byte string: it holds the bytes produced by encoding unicode. Calling len() on the UTF-8 encoded str "中文" returns 6, because what is actually stored is "\xe4\xb8\xad\xe6\x96\x87". Calling len() on the unicode string u"中文" (actually u"\u4e2d\u6587") returns 2.
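The difference between counting bytes and counting characters can be checked with a minimal sketch. It uses explicit b"" and u"" literals and \u escapes so that it stays pure ASCII and runs on both Python 2.7 and 3.x; the variable names text and data are illustrative only.

```python
# u"\u4e2d\u6587" is the two-character string discussed above.
text = u"\u4e2d\u6587"           # unicode: a sequence of code points
data = text.encode("utf-8")      # byte string: the UTF-8 encoding of text

print(len(text))   # counts characters
print(len(data))   # counts bytes
print(repr(data))

# Decoding the bytes with the same encoding restores the original text.
round_trip = data.decode("utf-8")
print(round_trip == text)
```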

1.2 Source encoding declaration

If a Python source file contains non-ASCII characters, such as Chinese characters, you must declare the source encoding at the top of the file. The format is as follows:

# -*- coding: utf-8 -*-

This format looks complicated, but Python only looks for a comment on the first or second line that contains coding: or coding= followed by an encoding name. The declaration can therefore be shortened to # coding: utf-8 or even # coding: u8.
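As a rough illustration of why the short forms work, the sketch below applies the regular expression documented in PEP 263 to a few declaration lines and then resolves the captured name with codecs.lookup. The helper name declared_encoding is hypothetical, and the real tokenizer performs additional checks that are not reproduced here.

```python
import codecs
import re

# Pattern documented in PEP 263 for the declaration comment.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(first_lines):
    """Return the encoding named in the first two lines, or None."""
    for line in first_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

for header in ("# -*- coding: utf-8 -*-", "# coding: UTF-8", "# coding: u8"):
    name = declared_encoding([header])
    # codecs.lookup resolves aliases such as "u8" to the canonical codec name.
    print("%s -> %s -> %s" % (header, name, codecs.lookup(name).name))
```

All three headers resolve to the same codec, which is why the abbreviated forms are accepted.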

2. Common encoding problems in Python 2.x

2.1 Encoding declaration and file encoding

The encoding declaration determines how Python interprets str literals when it parses the source. For example, if the file declares UTF-8, Python parses s = "中文" as UTF-8, and repr(s) is "\xe4\xb8\xad\xe6\x96\x87". If the file declares gbk, Python parses s as gbk, and the result is "\xd6\xd0\xce\xc4".
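The two byte sequences quoted above can be reproduced without editing any file header by encoding the same unicode text explicitly. This is only a sketch of the relationship between the text and its bytes, not of the parser itself; the variable names are illustrative.

```python
text = u"\u4e2d\u6587"   # the same two characters, written with escapes

as_utf8 = text.encode("utf-8")   # bytes a UTF-8 source file would contain
as_gbk = text.encode("gbk")      # bytes a gbk source file would contain

print(repr(as_utf8))
print(repr(as_gbk))
```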

Note that the actual encoding of the file must match the encoding declared at the top, otherwise problems arise. On Linux you can check a file's encoding in vim with :set fenc. If the file is saved as gbk but declares UTF-8, any Chinese text in the source can cause trouble: the bytes of the Chinese str are stored as gbk, but Python treats them as UTF-8 when parsing and reports SyntaxError: (unicode error) 'utf8' codec can't decode byte.
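One way to catch such a mismatch before running a script is to try decoding the file's raw bytes with the declared encoding. The sketch below does this for an in-memory byte string; the helper name matches_encoding and the sample content are assumptions made for illustration.

```python
def matches_encoding(raw_bytes, encoding):
    """Return True if raw_bytes can be decoded with the given encoding."""
    try:
        raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

# Simulated file content: one assignment line saved as gbk.
gbk_source = u's = "\u4e2d\u6587"'.encode("gbk")

print(matches_encoding(gbk_source, "gbk"))     # the encoding it was saved in
print(matches_encoding(gbk_source, "utf-8"))   # the (wrong) declared encoding
```

A check like this can only prove a mismatch. A successful decode does not guarantee that the encoding is the one the author intended.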

2.2 Default encoding problems

The following describes the problems caused by python default encoding:

# coding: utf-8
u = u"中文"
print repr(u)  # u'\u4e2d\u6587'
s = "中文"
print repr(s)  # '\xe4\xb8\xad\xe6\x96\x87'
u2 = s.decode("utf-8")
print repr(u2)  # u'\u4e2d\u6587'
# s2 = u.decode("utf-8")  # encoding error
# u2 = s.encode("utf-8")  # decoding error

Note the two commented-out lines in the example. It is best not to call decode directly on unicode, or encode directly on str. Calling u.decode("utf-8") is equivalent to u.encode(default_encoding).decode("utf-8"), where default_encoding is the default encoding used by Python's unicode implementation, that is, sys.getdefaultencoding(). Unless you have changed it, the default encoding is ascii, so an error is raised if the unicode string contains characters outside the ASCII range. Similarly, calling s.encode("utf-8") on a str first decodes it with the default encoding, that is, s.decode(default_encoding).encode("utf-8"). If the str contains Chinese and default_encoding is ascii, the decoding fails. The two lines raise UnicodeEncodeError: 'ascii' codec can't encode characters in position... and UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position... respectively.

If the two commented-out lines in the example above are executed, an error is raised. If both the str and the unicode string stay within the ASCII range, there is no problem. For example, s = "abc"; s.encode("utf-8") works, and the statement returns a new str whose id differs from that of s.
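The hidden conversion step can be made visible by writing it out by hand. The sketch below imitates what Python 2 does implicitly, with the default encoding assumed to be ascii; DEFAULT_ENCODING, decode_text, and encode_bytes are hypothetical names, and the code is written so that it also runs on Python 3, where these implicit conversions no longer exist.

```python
DEFAULT_ENCODING = "ascii"   # assumed value of sys.getdefaultencoding() on 2.x

text = u"\u4e2d\u6587"
data = text.encode("utf-8")

def decode_text(value, encoding):
    """Imitate unicode.decode on 2.x: implicit encode first, then decode."""
    return value.encode(DEFAULT_ENCODING).decode(encoding)

def encode_bytes(value, encoding):
    """Imitate str.encode on 2.x: implicit decode first, then encode."""
    return value.decode(DEFAULT_ENCODING).encode(encoding)

errors = []
for func, value in ((decode_text, text), (encode_bytes, data)):
    try:
        func(value, "utf-8")
    except UnicodeError as exc:
        errors.append(type(exc).__name__)

print(errors)

# ASCII-only input survives both implicit steps.
ascii_result = encode_bytes(b"abc", "utf-8")
print(ascii_result == b"abc")
```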

To solve the problem in the example above, you can use either of the following methods. The first is to convert explicitly, always calling encode on unicode and decode on str:

# coding: utf-8
u = u"中文"
print repr(u)  # u'\u4e2d\u6587'
s = "中文"
print repr(s)  # '\xe4\xb8\xad\xe6\x96\x87'
u2 = s.decode("utf-8")
print repr(u2)  # u'\u4e2d\u6587'
s2 = u.encode("utf-8").decode("utf-8")  # OK
u2 = s.decode("utf8").encode("utf-8")  # OK
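The explicit approach can be wrapped in two small helpers so that callers never rely on the default encoding. This is only a sketch: to_text and to_bytes are hypothetical names, UTF-8 is an assumed default, and type(u"") is used so that the same code runs on Python 2.7 and 3.x.

```python
text_type = type(u"")   # unicode on 2.x, str on 3.x

def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is a byte string."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)

def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding only when it is text."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value

sample = u"\u4e2d\u6587"
print(to_text(to_bytes(sample)) == sample)
print(repr(to_bytes(sample, "gbk")))
```

Because each helper checks the type first, it is safe to call on a value that is already in the target form, which is exactly the case where a direct encode or decode call fails on 2.x.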

The second method is to change Python's default encoding to match the file encoding, as shown below. The sys module has to be reloaded first because Python deletes the setdefaultencoding method after initialization:

# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # change the default encoding to utf-8
u = u"中文"
print repr(u)  # u'\u4e2d\u6587'
s = "中文"
print repr(s)  # '\xe4\xb8\xad\xe6\x96\x87'
u2 = s.decode("utf-8")
print repr(u2)  # u'\u4e2d\u6587'
s2 = u.decode("utf-8")
u2 = s.encode("utf-8")

2.3 read/write file encoding

When a file is opened with Python's built-in open(), read() returns a str whose encoding is that of the file itself. When calling write(), a unicode argument should first be encoded with the encoding you want the file to have. If write() is given unicode without an explicit encode, Python encodes it with the default encoding before writing.

# coding: utf-8
f = open("testfile")
s = f.read()
f.close()
print type(s)  # <type 'str'>
u = s.decode("utf-8")  # testfile is UTF-8 encoded
f = open("testfile", "w")
f.write(u.encode("gbk"))  # write the data as gbk; testfile is now gbk encoded
f.close()
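The read, decode, encode, write sequence can be packaged as a small function. The sketch below opens the file in binary mode so that the bytes are handled the same way on Python 2.7 and 3.x, and demonstrates it on a temporary file; transcode_file and the sample content are assumptions made for illustration.

```python
import os
import tempfile

def transcode_file(path, src_encoding, dst_encoding):
    """Rewrite the file at path from src_encoding to dst_encoding."""
    with open(path, "rb") as f:
        raw = f.read()                      # bytes exactly as stored on disk
    text = raw.decode(src_encoding)         # bytes -> unicode
    with open(path, "wb") as f:
        f.write(text.encode(dst_encoding))  # unicode -> bytes, explicitly

# Demonstration on a temporary file holding two UTF-8 encoded characters.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(u"\u4e2d\u6587".encode("utf-8"))

transcode_file(demo_path, "utf-8", "gbk")
with open(demo_path, "rb") as f:
    result = f.read()
os.remove(demo_path)

print(repr(result))
```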

In addition, the codecs module provides an open() function that takes the file's encoding. Reading from a file opened this way returns unicode. When writing, a unicode argument is encoded with the encoding given when the file was opened. A str argument is first decoded to unicode with the default encoding and then encoded with the file's encoding. (Note that if the str contains Chinese and the default encoding, sys.getdefaultencoding(), is ascii, a decoding error is raised.)

# coding: gbk
import codecs
f = codecs.open('testfile', encoding='utf-8')
u = f.read()
f.close()
print type(u)  # <type 'unicode'>
f = codecs.open('testfile', 'a', encoding='utf-8')
f.write(u)  # write unicode
# Writing a gbk-encoded str triggers an automatic decode and encode
s = '汉'
print repr(s)  # '\xba\xba'
# Here the gbk-encoded str is first decoded to unicode and then encoded as UTF-8 for writing
# f.write(s)  # if the default encoding is ascii, a decoding error is raised
f.close()
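To avoid the implicit decode altogether, a byte string in another encoding can be decoded explicitly before it is passed to write(). The following sketch shows this with codecs.open on a temporary file; the file path, the sample character u"\u6c49", and the variable names are illustrative assumptions.

```python
import codecs
import os
import tempfile

fd, demo_path = tempfile.mkstemp()
os.close(fd)

# Writing: unicode goes in, the file object encodes it as UTF-8.
with codecs.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"\u4e2d\u6587")

# Reading: the UTF-8 bytes on disk come back as unicode.
with codecs.open(demo_path, "r", encoding="utf-8") as f:
    text = f.read()

# A gbk byte string is decoded explicitly before being appended.
gbk_bytes = u"\u6c49".encode("gbk")
with codecs.open(demo_path, "a", encoding="utf-8") as f:
    f.write(gbk_bytes.decode("gbk"))

with open(demo_path, "rb") as f:
    raw = f.read()
os.remove(demo_path)

print(len(text))
print(repr(raw))
```

On Python 3 the built-in open() accepts an encoding argument directly, so codecs.open is mainly of interest for 2.x code.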


An error is reported when reading file content in Python; it appears to be an encoding problem.

We recommend using codecs.open instead of open. If the LogPath file is saved as UTF-8:
LogLine = open(LogPath) ==> LogLine = codecs.open(LogPath, 'r', 'utf-8')

Character encoding in python

What you are asking is whether the string:
\u3232\u6674
itself can be converted to the corresponding unicode characters.

You can use:
slashUStr = "\u3232\u6674"
decodedUniChars = slashUStr.decode("unicode-escape")
print "decodedUniChars =", decodedUniChars  # decodedUniChars = ㈲晴

Note: ㈲ is a special character. Printing it in cmd (gbk by default) raises an error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\u3232' in position 0: illegal multibyte sequence

However, the unicode string is indeed converted.
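The conversion, together with one possible way around the console error, can be sketched as follows. The byte literal holds the literal backslash sequences, and the "replace" error handler substitutes a placeholder for characters the console encoding cannot represent. The name console_encoding and its value "gbk" are assumptions taken from the cmd example above.

```python
# Literal backslash-u sequences: 12 ASCII bytes, not two characters.
escaped = b"\\u3232\\u6674"

# unicode-escape interprets the \uXXXX sequences and yields unicode.
decoded = escaped.decode("unicode-escape")

# For a console that cannot represent every character, encode with an
# explicit error handler instead of letting print raise.
console_encoding = "gbk"   # assumed console encoding
safe_bytes = decoded.encode(console_encoding, "replace")

print("%d bytes -> %d characters" % (len(escaped), len(decoded)))
```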

For details, refer to:
[Arrangement] In Python, how to convert a backslash-u (\uXXXX) string to the corresponding unicode characters

(The post address cannot be posted here. Search Google for the title to find it.)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
