A trap in Python Unicode string formatting

Last Update:2017-05-14 Source: Internet

Author: User

Today I helped my colleagues investigate a puzzling UnicodeDecodeError and found a small trap in Python string formatting, so I am recording it here. The original code is too complex and contains too much that is irrelevant to the problem, so I reproduced the problem with a simple experiment in ipython. The process is as follows:

In [4]: a = '你好世界'
In [5]: print 'Say this: %s' % a
Say this: 你好世界
In [6]: print 'Say this: %s and say that: %s' % (a, 'hello world')
Say this: 你好世界 and say that: hello world
In [7]: print 'Say this: %s and say that: %s' % (a, u'hello world')
UnicodeDecodeError Traceback (most recent call last)
/home/jerry/in ()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 10: ordinal not in range(128)
In [8]: a
Out[8]: '\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'

Did you see the weird UnicodeDecodeError after In [7]? The only difference from the previous statement is that 'hello world' is a unicode object instead of a str object. But 'hello world' is plain English text that contains no non-ASCII characters, so why would decoding it fail? Take a closer look at the message that comes with the exception: it mentions 0xe4. That byte is obviously not in 'hello world', so suspicion falls on the Chinese string. In [8] prints its byte sequence, and indeed its first byte is 0xe4.
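The exception message can be checked against the bytes directly. The following is a minimal sketch, not the original code: it assumes the 10-character template prefix 'Say this: ' sits in front of the UTF-8 bytes shown in In [8], decodes the combination as ASCII, and inspects the resulting exception. The variable names are made up for illustration, and the snippet is written to run on both Python 2.7 and Python 3.

```python
# Illustration only: performs explicitly the decode that fails in In [7].
utf8_bytes = b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'  # value of `a` from In [8]
prefix = b'Say this: '  # 10 ASCII bytes that come before `a` in the formatted text

failure = None
try:
    (prefix + utf8_bytes).decode('ascii')
except UnicodeDecodeError as exc:
    failure = exc

# The exception records where decoding stopped and which byte caused it.
bad_position = failure.start
bad_byte = failure.object[failure.start:failure.end]

# Decoding the same bytes with their real encoding succeeds.
decoded = utf8_bytes.decode('utf-8')

print(bad_position)
print(repr(bad_byte))
print(len(decoded))
```

The reported position is the length of the prefix, and the offending byte is the first byte of the Chinese string, which matches the message in the traceback.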

It seems that Python tries to decode a into a unicode object during string formatting, and that the decode uses the default ASCII codec instead of the actual UTF-8 encoding. So what is going on? Let's continue the experiment:

In [9]: 'Say this: %s' % 'hello'
Out[9]: 'Say this: hello'
In [10]: 'Say this: %s' % u'hello'
Out[10]: u'Say this: hello'

Take a closer look: the 'hello' in In [9] is an ordinary string and the result is also a string (a str object), while the 'hello' in In [10] is a unicode object and the formatted result is also unicode (note the u at the beginning of the result).

So the truth is that Python does something behind the scenes when formatting strings: if any of the arguments matched to %s is a unicode object, the final result is unicode. In that case, the template string and every str value among the %s arguments are converted to unicode by decoding. This decode is implicit, so the user cannot specify the charset, and Python can only use the default, ASCII. If any of those strings contains non-ASCII bytes, the formatting fails.

Let's take a look at what the Python documentation says:

If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object.
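One way to avoid the implicit ASCII decode is to decode every byte string explicitly, with a known charset, before formatting. The sketch below is only an illustration under that assumption; `to_text` and `format_text` are hypothetical helper names, not part of Python or of the original code, and UTF-8 is assumed as the input encoding. It is written to run on both Python 2.7 and Python 3.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings with an explicit charset."""
    if isinstance(value, bytes):  # on Python 2, `bytes` is an alias of `str`
        return value.decode(encoding)
    return value


def format_text(template, args, encoding='utf-8'):
    """Apply %-formatting only after every piece has been decoded explicitly."""
    decoded_args = tuple(to_text(arg, encoding) for arg in args)
    return to_text(template, encoding) % decoded_args


a = b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'  # UTF-8 bytes from In [8]
result = format_text(b'Say this: %s and say that: %s', (a, u'hello world'))
print(repr(result))
```

Because both the template and the arguments are already text by the time `%` runs, no hidden decode is needed and the choice of charset stays in the caller's hands.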

If str and unicode are mixed in the code, this problem is very likely to appear. In my colleagues' code, the Chinese string was a str object from user input, correctly encoded as UTF-8. The troublesome unicode object, although its content was pure ASCII, came from a query against a sqlite3 database, and the strings returned by the sqlite API are unicode objects. Together they produced this weird result.
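The sqlite3 behaviour described above can be observed with an in-memory database. This is a small sketch rather than the code from the original project; the table and column names are invented for illustration. It checks that a TEXT value comes back as a text object (unicode on Python 2, str on Python 3) even when its content is pure ASCII.

```python
import sqlite3

# Hypothetical table used only for this demonstration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE greetings (message TEXT)')
conn.execute('INSERT INTO greetings VALUES (?)', ('hello world',))

row = conn.execute('SELECT message FROM greetings').fetchone()
value = row[0]
conn.close()

# type(u'') is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u'')
is_text = isinstance(value, text_type)

print(repr(value))
print(is_text)
```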

The str and unicode types of Python 2 are really troublesome, and I have been bitten by them several times. Python 3 has improved in this respect, and I look forward to its full adoption!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
