A trap in Python Unicode string formatting

Source: Internet
Author: User
Today I helped my colleagues track down a baffling UnicodeDecodeError and discovered a small trap in Python string formatting, so I am recording it here. The original code is too complex and full of things irrelevant to the problem, so I reproduced the problem with a simple experiment in IPython. The process is as follows:

    In [4]: a = '你好世界'

    In [5]: print 'Say this: %s' % a
    Say this: 你好世界

    In [6]: print 'Say this: %s and say that: %s' % (a, 'hello world')
    Say this: 你好世界 and say that: hello world

    In [7]: print 'Say this: %s and say that: %s' % (a, u'hello world')
    UnicodeDecodeError                        Traceback (most recent call last)
    /home/jerry/in ()
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 10: ordinal not in range(128)

    In [8]: a
    Out[8]: '\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'

Did you see the weird UnicodeDecodeError after In [7]? The only difference from the previous statement is that 'hello world' is a unicode object instead of a str object. But 'hello world' is a plain English string containing nothing outside ASCII, so why would it fail to decode? Take a closer look at the message that comes with the exception: it mentions 0xe4, which is obviously not in 'hello world', so the only suspect is the Chinese string. In [8] prints out its byte sequence, and sure enough the first byte is 0xe4.

It seems that during string formatting Python tries to decode a into a unicode object, and that the decode uses the default ASCII encoding rather than the actual UTF-8 encoding. So what is going on? Let's continue the experiment:
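To check that guess outside of string formatting, here is a minimal sketch that decodes the same bytes by hand. The helper name first_bad_position is my own invention for illustration, and the Chinese text is written with \u escapes so the snippet stays pure ASCII and runs on Python 2 as well as Python 3. It only shows that an ASCII decode of these bytes fails where the traceback says it does; it does not claim to reproduce Python's internal formatting code.

```python
# The four characters from the session, written with escapes; encoding them
# as UTF-8 yields the byte sequence shown in Out[8].
raw = u'\u4f60\u597d\u4e16\u754c'.encode('utf-8')


def first_bad_position(data, encoding):
    """Return the offset of the first undecodable byte, or None if decoding works.

    Hypothetical helper, for illustration only.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The literal text that precedes the first %s in the template (10 bytes long).
prefix = b'Say this: '

pos_alone = first_bad_position(raw, 'ascii')               # fails on the very first byte
pos_with_prefix = first_bad_position(prefix + raw, 'ascii')  # fails right after the prefix
pos_utf8 = first_bad_position(raw, 'utf-8')                # decodes cleanly

print(pos_alone)        # 0
print(pos_with_prefix)  # 10
print(pos_utf8)         # None
```

The offset of 10 matches the "position 10" in the traceback, which is consistent with the ASCII decode being applied to the text 'Say this: ' followed by the bytes of a.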

    In [9]: 'Say this: %s' % 'hello'
    Out[9]: 'Say this: hello'

    In [10]: 'Say this: %s' % u'hello'
    Out[10]: u'Say this: hello'

Look closely: the 'hello' in In [9] is an ordinary string and the result is also a string (a str object), while the 'hello' in In [10] is a unicode object and the formatted result is also unicode (note the u at the beginning of the result).

So the truth is that Python does something behind the scenes when formatting strings: if any of the arguments substituted for %s is a unicode object, the final result is unicode. In that case the template string and all the str values among the %s arguments are decoded to unicode. This decode is implicit, though, so the user cannot specify the charset, and Python falls back on its default encoding, ASCII. If any of those strings contains non-ASCII bytes, the decode fails.
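One way to stay out of the trap is to do the decode yourself, with an explicit charset, before formatting. The sketch below is only one possible approach: to_text and format_text are hypothetical helpers of mine, not anything from the standard library, and it assumes every byte string involved really is UTF-8.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings with an explicit encoding; pass everything else through.

    On Python 2, `bytes` is an alias for `str`, so ordinary str values are decoded too.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_text(template, args, encoding='utf-8'):
    """%-format after decoding the template and every byte-string argument."""
    return to_text(template, encoding) % tuple(to_text(arg, encoding) for arg in args)


# UTF-8 bytes, standing in for `a` from the session above.
a = u'\u4f60\u597d\u4e16\u754c'.encode('utf-8')

# Mixing a byte string with a unicode object no longer triggers an implicit ASCII decode,
# because nothing is left for Python to decode on its own.
result = format_text(b'Say this: %s and say that: %s', (a, u'hello world'))
print(repr(result))
```

The result is always a unicode (text) object, so any encoding back to bytes for output is also left as an explicit, visible step.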

Let's take a look at what the Python documentation says:

If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object.

If str and unicode are mixed in the code, this problem is very likely to occur. In my colleagues' code, the Chinese string was a str object that came from user input and was correctly encoded as UTF-8. The troublesome unicode object, although its content was pure ASCII, came from a query against a sqlite3 database, and the strings returned by the sqlite API are unicode objects, which produced this weird result.
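The situation can be imitated with an in-memory database. This is a rough sketch and not my colleagues' actual code: the table name greetings and column body are made up, and the user input is simulated with a hard-coded UTF-8 byte string. It shows the sqlite3 module handing back a text (unicode) object for a TEXT column, and the user's bytes being decoded once, at the boundary, before anything is formatted.

```python
import sqlite3

# Hypothetical schema, purely for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE greetings (body TEXT)')
conn.execute('INSERT INTO greetings VALUES (?)', (u'hello world',))
from_db = conn.execute('SELECT body FROM greetings').fetchone()[0]
conn.close()

# With the default text_factory, TEXT values come back as unicode objects
# (unicode on Python 2, str on Python 3), even when the content is pure ASCII.
print(type(from_db))

# Stand-in for UTF-8 bytes typed by a user.
user_input = u'\u4f60\u597d\u4e16\u754c'.encode('utf-8')

# Decode once, where the bytes enter the program, with the charset named explicitly.
user_text = user_input.decode('utf-8')

# Template and both arguments are now unicode, so no implicit decode takes place.
message = u'Say this: %s and say that: %s' % (user_text, from_db)
print(repr(message))
```

Decoding at the point where data enters the program keeps everything downstream in unicode, so the types of the formatting arguments stop mattering.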

Python 2's str and unicode are really a headache, and I have been bitten by them several times. Python 3 improves on this, and I look forward to it becoming widespread!
