001_python2: Handling Chinese character encoding in Python 2

Source: Internet
Author: User

Recently I needed to write some Python scripts for work. Although the scripts interact only through the command line and log output, I decided to write the log messages in Chinese to make the interface friendlier.

Soon, I encountered an exception:

Python code
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

To solve the problem, I took some time to study how Python handles character encodings. There are already many articles online about this, but after reading through them I feel I can explain it more clearly.

The following covers the basics of Python strings; skip it if you already know this material.

Python 2 has two string types, str and unicode, which correspond roughly to char and wchar_t in C++:

Python code
    # -*- coding: utf-8 -*-
    # file: example1.py
    import string

    # This is a str string
    s = '关关雎鸠'
    # This is a unicode string
    u = u'关关雎鸠'

    print isinstance(s, str)      # True
    print isinstance(u, unicode)  # True
    print s.__class__             # <type 'str'>
    print u.__class__             # <type 'unicode'>

The declaration at the top, # -*- coding: utf-8 -*-, states that the Python source above is encoded in UTF-8.

To make sure the output is not garbled on a Linux terminal, set the Linux environment variable: export LANG=en_US.UTF-8

If you use SecureCRT as I do, also set Session Options / Terminal / Appearance / Character Encoding to UTF-8 so that the output of the Linux terminal is decoded correctly.

The encode/decode methods convert between the two Python string types:

Python code
    # Convert from str to unicode
    print s.decode('utf-8')  # 关关雎鸠
    # Convert from unicode to str
    print u.encode('utf-8')  # 关关雎鸠

Why is the conversion from unicode to str called encode, and the reverse called decode?

Because Python treats Unicode as the one internal code for characters (stored as 16-bit or 32-bit units, depending on how Python was built), while the commonly used character sets such as GB2312, GB18030/GBK, UTF-8 and ASCII are binary (byte) encodings of characters. Converting a character from Unicode to a binary encoding is therefore naturally called encode.

Conversely, a str in Python is a byte string encoded in some character set. Python itself does not know which encoding a given str uses, so the developer has to name the correct character set when calling decode.

(In fact, Python could know the encoding of a str literal. We declared # -*- coding: utf-8 -*- at the top of the file, which means the str literals in the code are UTF-8 encoded. I don't know why Python does not make use of this.)
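The direction of the two calls can be checked with a small round trip. This is only an illustrative sketch: the variable names are mine, and the text 关关雎鸠 is written with \u escapes and bytes literals so that the snippet behaves the same on Python 2.7 and Python 3.

```python
# Text (unicode) is the abstract form; bytes are one concrete encoding of it.
text = u'\u5173\u5173\u96ce\u9e20'  # four Chinese characters

# encode: unicode -> bytes, one result per character set
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# The same text yields different byte sequences in different encodings.
assert utf8_bytes == b'\xe5\x85\xb3\xe5\x85\xb3\xe9\x9b\x8e\xe9\xb8\xa0'
assert gbk_bytes == b'\xb9\xd8\xb9\xd8\xf6\xc2\xf0\xaf'
assert (len(text), len(utf8_bytes), len(gbk_bytes)) == (4, 12, 8)

# decode: bytes -> unicode, and the caller must name the encoding
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text
```

Nothing in the bytes themselves says which character set produced them, which is why decode always needs the encoding as an argument.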

What happens if you encode or decode with the wrong character set?

Python code
    # Encode a unicode string containing Chinese with ASCII
    u.encode('ascii')  # fails, because Chinese cannot be represented in the ASCII character set
    # UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

    # Encode a unicode string containing Chinese with GBK
    u.encode('gbk')  # OK, because '关关雎鸠' can be represented in the GBK character set
    # '\xb9\xd8\xb9\xd8\xf6\xc2\xf0\xaf'
    # Printing this str directly shows garbage; set the environment variable to zh_CN.GBK to see the right result.

    # Decode a UTF-8 string with ASCII
    s.decode('ascii')  # fails, UTF-8 encoded Chinese cannot be decoded as ASCII
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

    # Decode a UTF-8 string with GBK
    s.decode('gbk')  # no error, but decoding a UTF-8 byte stream as GBK obviously produces garbage
    # u'\u934f\u51b2\u53e7\u95c6\u5ea8\u7b2d'

This is the exception I posted at the beginning of this article: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
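The three outcomes (encode error, decode error, silent garbage) can be collected in one place. The helper try_convert below is a hypothetical name of my own, and the final lines show the standard codec error handlers 'replace' and 'backslashreplace' as one possible way to keep a log line from raising; the original scripts do not use them.

```python
text = u'\u5173\u5173\u96ce\u9e20'
utf8_bytes = text.encode('utf-8')


def try_convert(func, *args):
    """Run a conversion; return ('ok', result) or ('error', exception name)."""
    try:
        return ('ok', func(*args))
    except UnicodeError as exc:
        return ('error', type(exc).__name__)


# Wrong target character set: encoding fails loudly.
encode_ascii = try_convert(text.encode, 'ascii')
# Wrong source character set that cannot represent the bytes: decoding fails loudly.
decode_ascii = try_convert(utf8_bytes.decode, 'ascii')
# Wrong source character set that happens to accept the bytes: silent garbage.
decode_gbk = try_convert(utf8_bytes.decode, 'gbk')

# Error handlers trade the exception for a lossy but printable result.
replaced = text.encode('ascii', 'replace')
escaped = text.encode('ascii', 'backslashreplace')
```

The GBK case is the dangerous one: no exception is raised, so the corruption is only noticed when somebody reads the output.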

Now we know that this is a string encoding exception. The next question is: why is Python so prone to string encoding and decoding exceptions?

This brings us to two traps that are easy to fall into when handling encodings in Python. The first concerns string concatenation:

Python code
    # -*- coding: utf-8 -*-
    # file: example2.py

    # This is a str string
    s = '关关雎鸠'
    # This is a unicode string
    u = u'关关雎鸠'

    s + u  # fails, UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

Can a simple string concatenation really raise a decoding error?

Trap one: when an operation involves both str and unicode, Python first converts the str to unicode, so the result of the operation is unicode.

Since Python does not know the encoding of the str in advance, it can only decode it with the encoding returned by sys.getdefaultencoding(). In my experience that value is always 'ascii', so converting a str that contains Chinese obviously fails.

Besides concatenation, the % operator behaves the same way:

Python code
    # OK: all strings are str, no decoding needed
    "中文：%s" % s  # 中文：关关雎鸠
    # Fails, equivalent to running: "中文：%s".decode('ascii') % u
    # (the first failing byte here is 0xe4, the start of '中' in the format string)
    "中文：%s" % u  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
    # OK: all strings are unicode, no decoding needed
    u"中文：%s" % u  # 中文：关关雎鸠
    # Fails, equivalent to running: u"中文：%s" % s.decode('ascii')
    u"中文：%s" % s  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

I don't understand why sys.getdefaultencoding() has nothing to do with the $LANG environment variable. If Python derived the value of sys.getdefaultencoding() from $LANG, developers would run into UnicodeDecodeError perhaps half as often.

And, as I said earlier, I also wonder why Python does not consult # -*- coding: utf-8 -*- here. Python always parses your source before running it, so it could be sure that every str literal defined in the code is UTF-8.

My only advice on this problem is to put a u in front of every Chinese string literal in your code. In Python 3 the str/unicode split is gone: str is always Unicode text, and raw bytes have their own bytes type. That was probably the right decision.
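When a str arrives from somewhere you do not control and cannot simply be given a u prefix, one option is to decode it explicitly before mixing it with unicode, instead of letting Python fall back to the default encoding. The sketch below assumes the incoming bytes are UTF-8; to_text is a hypothetical helper name, not something from the original scripts.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings explicitly.

    On Python 2, `bytes` is an alias for `str`, so this catches str values
    there and bytes values on Python 3.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


label = u'\u4e2d\u6587\uff1a'                             # unicode prefix
raw = b'\xe5\x85\xb3\xe5\x85\xb3\xe9\x9b\x8e\xe9\xb8\xa0'  # UTF-8 bytes

# Both operands are unicode by the time they meet, so no implicit
# ASCII decoding takes place.
combined = label + to_text(raw)
formatted = u'%s%s' % (label, to_text(raw))
```

The wrong guess for encoding still produces an error or garbage, but at least the guess is written down in one place rather than hidden in an implicit conversion.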

In fact, the value of sys.getdefaultencoding() can be changed through a "backdoor". I do not particularly recommend this solution, but I include it here because it comes in useful later:

Python code
    # -*- coding: utf-8 -*-
    # file: example3.py
    import sys

    # This is a str string
    s = '关关雎鸠'
    # This is a unicode string
    u = u'关关雎鸠'

    # Make sys.getdefaultencoding() return 'utf-8'
    reload(sys)                      # reload so that setdefaultencoding can be called
    sys.setdefaultencoding('utf-8')  # set it to 'utf-8'

    # No problem
    s + u  # u'\u5173\u5173\u96ce\u9e20\u5173\u5173\u96ce\u9e20'
    # No problem either
    "中文：%s" % u  # u'\u4e2d\u6587\uff1a\u5173\u5173\u96ce\u9e20'
    # Still no problem
    u"中文：%s" % s  # u'\u4e2d\u6587\uff1a\u5173\u5173\u96ce\u9e20'

As you can see, the problem is magically solved. But watch out! The effect of sys.setdefaultencoding() is global. If your code consists of several Python files in different encodings, this method only fixes one problem while causing another, and makes things more complicated.

The other trap concerns standard output.

I have kept saying that you should set the Linux $LANG environment variable correctly. So what happens if you set the wrong $LANG, such as zh_CN.GBK? (To rule out terminal effects, set SecureCRT to the same character set.)

Obviously the output will be garbled, but not all of it.

Python code
    # -*- coding: utf-8 -*-
    # file: example4.py
    import string

    # This is a str string
    s = '关关雎鸠'
    # This is a unicode string
    u = u'关关雎鸠'

    # Print the str string; it is displayed as garbage
    print s  # 鍏冲叧闆庨笭
    # Print the unicode string; it is displayed correctly
    print u  # 关关雎鸠

Why is the unicode string displayed correctly while the str is not? First we need to understand print. As in every language, this Python statement writes characters to the standard output stream, sys.stdout. Python performs a bit of magic here: it encodes unicode according to sys.stdout.encoding, but writes a str out directly as it is and leaves the rest to the operating system.
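The difference between the two code paths can be simulated without a real terminal. The two functions below are hypothetical stand-ins of my own: one approximates what print hands to the operating system, the other what a terminal configured for a given character set would show. It is a model of the behavior described above, not Python's actual implementation.

```python
def bytes_sent_to_terminal(value, stdout_encoding):
    """Approximate what Python 2's print passes on for a single value."""
    if isinstance(value, bytes):
        return value                       # str: bytes pass through untouched
    return value.encode(stdout_encoding)   # unicode: encoded for the stream


def shown_by_terminal(raw, terminal_encoding):
    """Approximate a terminal decoding the bytes it receives."""
    return raw.decode(terminal_encoding, 'replace')


u = u'\u5173\u5173\u96ce\u9e20'
s = u.encode('utf-8')  # what a str literal holds in a UTF-8 source file

# $LANG and the terminal are both set to GBK, as in example4.py.
from_unicode = shown_by_terminal(bytes_sent_to_terminal(u, 'gbk'), 'gbk')
from_str = shown_by_terminal(bytes_sent_to_terminal(s, 'gbk'), 'gbk')

assert from_unicode == u   # re-encoded to GBK, so it displays correctly
assert from_str != u       # UTF-8 bytes read as GBK: garbled
```

The unicode string survives because it is re-encoded to match the terminal; the str keeps its UTF-8 bytes regardless of what the terminal expects.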

This is also why the Linux $LANG environment variable must match the SecureCRT setting. Otherwise SecureCRT converts these characters once more before handing them to the Windows desktop, which displays them in CP936 (that is, GBK).

Normally, the value of sys.stdout.encoding matches the Linux $LANG environment variable:

Python code
    # -*- coding: utf-8 -*-
    # file: example5.py
    import sys

    # Check the encoding of the standard output stream
    print sys.stdout.encoding  # with $LANG = zh_CN.GBK, prints GBK
                               # with $LANG = en_US.UTF-8, prints UTF-8

    # This is a unicode string
    u = u'关关雎鸠'
    # Print the unicode string; it is displayed correctly
    print u  # 关关雎鸠

But here comes trap two: once your Python code runs in a pipeline or as a subprocess, sys.stdout.encoding is no longer set, and you run into UnicodeEncodeError again.

For example, run the example5.py code above in a pipeline:

Python code
    python -u example5.py | more
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
    None

As you can see: first, the value of sys.stdout.encoding becomes None; second, Python then tries to encode the unicode string as ASCII when printing it.

Because the ASCII character set cannot represent Chinese characters, the encoding of course fails.
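The fallback can be modeled in a few lines. Everything here is a stand-in I made up for illustration: FakeTerminal and FakePipe imitate only the encoding attribute of sys.stdout, and effective_print_encoding approximates the rule described above (use the stream's encoding if it has one, otherwise the default encoding, which is assumed to be 'ascii').

```python
class FakeTerminal(object):
    """Stand-in for sys.stdout attached to a UTF-8 terminal."""
    encoding = 'UTF-8'


class FakePipe(object):
    """Stand-in for sys.stdout when output is piped: no encoding known."""
    encoding = None


def effective_print_encoding(stream, default_encoding='ascii'):
    """Approximate which encoding Python 2's print uses for unicode."""
    return getattr(stream, 'encoding', None) or default_encoding


def can_print(text, stream):
    """True if text can be encoded for the given stream without error."""
    try:
        text.encode(effective_print_encoding(stream))
        return True
    except UnicodeEncodeError:
        return False


text = u'\u5173\u5173\u96ce\u9e20'
on_terminal = can_print(text, FakeTerminal())  # True
in_pipeline = can_print(text, FakePipe())      # False
ascii_in_pipeline = can_print(u'plain ascii', FakePipe())  # True
```

Pure ASCII output still works in a pipeline, which is why this kind of bug often stays hidden until the first non-ASCII log line appears.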

How can this be solved? I do not know how other people handle it; I ended up using a rather ugly approach:

Python code
    # -*- coding: utf-8 -*-
    # file: example6.py
    import os
    import sys
    import codecs

    # Whatever happens, output using the current character set of the Linux system
    # (this assumes $LANG is set and has the form language_REGION.ENCODING):
    if sys.stdout.encoding is None:
        enc = os.environ['LANG'].split('.')[1]
        sys.stdout = codecs.getwriter(enc)(sys.stdout)  # replace sys.stdout

    # This is a unicode string
    u = u'关关雎鸠'
    # Print the unicode string; it is displayed correctly
    print u  # 关关雎鸠
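The lookup os.environ['LANG'].split('.')[1] raises an exception when $LANG is unset or has no encoding part (for example LANG=C). A slightly more defensive variant might look like the sketch below. The function name encoding_from_lang and the UTF-8 fallback are my own assumptions, and an in-memory io.BytesIO stands in for sys.stdout so the effect of codecs.getwriter can be inspected.

```python
import codecs
import io


def encoding_from_lang(lang, default='utf-8'):
    """Extract the charset from a $LANG value such as 'zh_CN.GBK'.

    Falls back to `default` when lang is missing, has no '.charset' part,
    or names a charset that Python's codecs module does not know.
    """
    if not lang or '.' not in lang:
        return default
    # Drop an optional modifier such as '@euro' after the charset.
    charset = lang.split('.', 1)[1].split('@', 1)[0]
    try:
        codecs.lookup(charset)
    except LookupError:
        return default
    return charset


# Wrap a byte stream the same way example6.py wraps sys.stdout.
sink = io.BytesIO()
writer = codecs.getwriter(encoding_from_lang('zh_CN.GBK'))(sink)
writer.write(u'\u5173\u5173\u96ce\u9e20\n')

# The unicode text reached the underlying stream as GBK bytes.
assert sink.getvalue() == b'\xb9\xd8\xb9\xd8\xf6\xc2\xf0\xaf\n'
```

In a real script the argument would be os.environ.get('LANG'). Python 2.6 and later also read the PYTHONIOENCODING environment variable to set the encoding of the standard streams, which may be simpler when you control how the script is launched.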

This method still has a side effect: printing a Chinese str directly now fails, because the writer from the codecs module, unlike the original sys.stdout, first converts every str to unicode using the sys.getdefaultencoding() character set before writing it out:

Python code
    # This is a str string
    s = '关关雎鸠'
    # Print the str string; raises an exception
    print s  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

Obviously, the value of sys.getdefaultencoding() is 'ascii', so the decoding fails.

The solution, as described for example3.py, is either to add the u prefix so that the string is unicode, or to change sys.getdefaultencoding() through the "backdoor":

Python code
    import sys

    # Make sys.getdefaultencoding() return 'utf-8'
    reload(sys)                      # reload so that setdefaultencoding can be called
    sys.setdefaultencoding('utf-8')  # set it to 'utf-8'

    # This is a str string
    s = '关关雎鸠'
    # Print the str string; OK
    print s  # 关关雎鸠

In summary, Chinese input and output under Python 2 is risky business, especially if you mix str and unicode in your code.

Some modules, such as json, return unicode strings directly, which makes your % operations fail with a decoding error. Others return str directly, and you need to know its actual encoding, especially when you print it.
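The json case can be seen in a short example. The JSON document and its key name are invented for illustration; the point is only that json.loads returns text (unicode on Python 2, str on Python 3), so the template it is formatted into should be unicode as well.

```python
import json

# A made-up JSON document; \\u escapes keep this source file pure ASCII.
document = u'{"poem": "\\u5173\\u5173\\u96ce\\u9e20"}'
data = json.loads(document)
value = data['poem']

# json hands back text, not bytes.
assert isinstance(value, type(u''))

# A unicode template combines with it safely. On Python 2, a UTF-8 str
# template with non-ASCII characters would trigger the implicit ASCII
# decode described in trap one.
template = u'\u4e2d\u6587\uff1a%s'
line = template % value

# Encode once, explicitly, at the point where bytes are needed.
line_bytes = line.encode('utf-8')
```

Keeping everything unicode internally and encoding only at the output boundary avoids having to remember which library returned which type.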

To avoid these traps, the best approach is to always use the u prefix when defining Chinese strings in Python code. In addition, if your code needs to run in a pipeline or as a subprocess, you need the technique from example6.py.

Finish
