Suggestions for handling Chinese encoding problems in Python

Source: Internet
Author: User

Strings are the most commonly used data type in Python. As soon as you use characters outside the standard ASCII character set, the code is likely to raise an exception such as UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128). This exception is easy to run into, especially in Python 2.x, and it often puzzles beginners. However, once you understand how Python handles Unicode and follow a few principles when writing code, these encoding problems are relatively easy to understand and solve.

Python uses Unicode as its internal representation for text, so Unicode usually serves as the intermediate form in encoding conversions: a string in some other encoding is first decoded (decode) into Unicode and then encoded (encode) from Unicode into the target encoding. However, the default encoding in Python 2.x is ASCII, which means that, unless an encoding is declared, every character in the source file is assumed to be ASCII, and implicit conversions between byte strings and Unicode strings also use ASCII. This is the root cause of the UnicodeDecodeError and UnicodeEncodeError exceptions that are so common in Python 2.x.
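The decode-then-encode round trip described above can be sketched as follows. This is a minimal illustration written in Python 3 syntax (an assumption, made so that it runs on current interpreters), where bytes plays the role of Python 2's str and str plays the role of Python 2's unicode. The helper name transcode is hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert encoded bytes from one encoding to another via Unicode."""
    text = data.decode(source_encoding)     # bytes -> Unicode text
    return text.encode(target_encoding)     # Unicode text -> bytes


# '\u4e2d\u6587' is the two-character string 中文.
gb_bytes = '\u4e2d\u6587'.encode('gb2312')
utf8_bytes = transcode(gb_bytes, 'gb2312', 'utf-8')

print(gb_bytes)    # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'
```

The same two characters occupy four bytes in GB2312 and six bytes in UTF-8; only the Unicode text in the middle is independent of the encoding.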

About Unicode

Unicode is a character set that assigns a unique number (a code point) to every character in every modern or historical writing system, but it does not specify how that number should be stored as bytes. In other words, the Unicode code points are fixed, while there are several ways to implement them depending on need; the common ones are UTF-8, UTF-16, and UTF-32. For more information, see Wikipedia: Unicode
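To make the distinction between a character's number and its stored bytes concrete, the following sketch (Python 3 syntax, chosen here as an assumption so that it runs on current interpreters) shows one character's code point and its byte form under three encodings. The variable names are illustrative only, and the big-endian variants are used so that no byte order mark is added.

```python
ch = '\u4e2d'  # the character 中

code_point = ord(ch)
print(hex(code_point))  # 0x4e2d

# The same code point, stored three different ways (shown as hex digits).
forms = {name: ch.encode(name).hex()
         for name in ('utf-8', 'utf-16-be', 'utf-32-be')}

print(forms['utf-8'])      # e4b8ad
print(forms['utf-16-be'])  # 4e2d
print(forms['utf-32-be'])  # 00004e2d
```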

To handle Unicode data, and to support some of Python's internal modules, Python 2.x provides a unicode data type. Strings in other encodings and Unicode strings can be converted to each other with the decode and encode methods, but these conversions are also where the UnicodeDecodeError and UnicodeEncodeError exceptions come from.

Several common encoding exceptions

The most common encoding exceptions in Python are SyntaxError: Non-ASCII character, UnicodeDecodeError, and UnicodeEncodeError. Each is illustrated below:

1. SyntaxError: Non-ASCII character

This exception is the least likely to occur and the easiest to handle. It is raised when the Python source file contains non-ASCII characters but does not declare a source encoding, for example:

s = '中文'   # raises SyntaxError: the file has no encoding declaration
print s

2. UnicodeDecodeError

This exception can occur when the decode method is called. Python tries to convert a string in some other encoding into Unicode, but the actual encoding of the string does not match the encoding passed to decode, for example:

s = '中文'
s.decode('gb2312')  # UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 2-3: illegal multibyte sequence
print s

In the code above, the string s is encoded as UTF-8 (the encoding of the source file), but 'gb2312' is passed to decode when converting it to Unicode, so a UnicodeDecodeError is raised during the conversion. The same exception can also occur when calling encode, because Python 2.x first implicitly decodes the byte string with the default ASCII codec before encoding it:

s = '中文'
s.encode('gb2312')  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
print s

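3. UnicodeEncodeError

This exception occurs during encoding, and it can be triggered through either the encode or the decode method. For example, calling decode on a string that is already Unicode makes Python 2.x first implicitly encode it with the default ASCII codec: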

s = u'中文'
s.decode('utf-8')  # UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
print s

Of course, many other situations can trigger these exceptions; they are not listed here one by one.
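The two Unicode exceptions above can be reproduced in a form that runs on current interpreters. The sketch below is written in Python 3 syntax (an assumption), so the ASCII conversions that Python 2.x performs implicitly are written out explicitly here. The helper name try_call is hypothetical.

```python
text = '\u4e2d\u6587'              # the two characters 中文
utf8_bytes = text.encode('utf-8')  # what a UTF-8 source file stores for them


def try_call(func):
    """Return the name of the Unicode exception raised by func(), or None."""
    try:
        func()
    except UnicodeError as exc:
        return type(exc).__name__
    return None


# UTF-8 bytes decoded with the wrong codec.
wrong_codec = try_call(lambda: utf8_bytes.decode('gb2312'))
# Non-ASCII bytes decoded with the ASCII codec (implicit in Python 2.x).
ascii_decode = try_call(lambda: utf8_bytes.decode('ascii'))
# Non-ASCII text encoded with the ASCII codec (implicit in Python 2.x).
ascii_encode = try_call(lambda: text.encode('ascii'))

print(wrong_codec)   # UnicodeDecodeError
print(ascii_decode)  # UnicodeDecodeError
print(ascii_encode)  # UnicodeEncodeError
```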

Solutions

The exceptions above can be handled with the following methods and principles.

1. Follow PEP 0263 and declare the source encoding

PEP 0263 (Defining Python Source Code Encodings) provides the most basic solution to Python encoding problems: declare the encoding in the Python source file. The declaration goes on the first or second line of the file, and its most common form is:

# -*- coding: <encoding name> -*-

Here <encoding name> is the encoding the source file is written in. It can be any encoding supported by Python; UTF-8 is the usual choice.
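As an illustration, a minimal source file with such a declaration might look like the sketch below, assuming the file itself is saved as UTF-8. Python 3 already defaults to UTF-8 for source files, so the declaration is redundant there; on Python 2.x it is what makes the non-ASCII literal legal.

```python
# -*- coding: utf-8 -*-
# The declaration above must be on the first or second line of the file.

s = '中文'     # non-ASCII literal, accepted because the encoding is declared
print(len(s))  # 2 on Python 3 (6 on Python 2.x, where s is a byte string)
```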

2. Use u'中文' instead of '中文'

str1 = '中文编码'
str2 = u'中文编码'

In Python 2.x there are two ways to write a string literal, and the main difference is the encoding. str1 is a byte string in the encoding declared for the source file, while str2 is a Unicode string. If a string literal contains non-ASCII characters, it is best to use the str2 form, so that you can work with the string without calling decode first and avoid some of the exceptions.
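The practical difference between the two kinds of string shows up as soon as you measure or slice them. The sketch below uses Python 3 syntax (an assumption), where a bytes object stands in for the Python 2.x byte string and a str stands in for the Python 2.x u'...' string. The variable names are illustrative.

```python
text_string = '\u4e2d\u6587'                # Unicode text: 中文
byte_string = text_string.encode('utf-8')   # the same text as UTF-8 bytes

print(len(text_string))  # 2  (characters)
print(len(byte_string))  # 6  (bytes)

# Slicing text keeps whole characters; slicing bytes can cut one in half.
first_char = text_string[:1]
first_byte = byte_string[:1]
print(first_char == '\u4e2d')  # True
print(first_byte)              # b'\xe4'
```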

3. Reset the default encoding

The root cause of so many encoding problems in Python is that the default encoding in Python 2.x is ASCII, so you can also change the default encoding as follows:

import sys
reload(sys)  # needed in Python 2.x: setdefaultencoding is removed from sys at startup
sys.setdefaultencoding('utf-8')

This approach solves some encoding problems, but it also introduces a number of other problems, so it is not recommended.

4. The ultimate principle: decode early, Unicode everywhere, encode late

Finally, here is the ultimate principle: decode early, Unicode everywhere, encode late. When a string is read in or declared, use decode to convert it to Unicode as early as possible. Inside the program, work with Unicode strings throughout, for operations such as concatenation, substitution, and taking the length of a string. Only when the string is output (to the console, a web page, or a file) use encode to convert it to the encoding you want, such as UTF-8.

Handling Python strings according to this principle solves practically all encoding problems (as long as your code and your Python environment are otherwise in order).
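A minimal sketch of this principle follows, in Python 3 syntax (an assumption) and with a hypothetical helper named normalize_line. The input is assumed to arrive as GB2312 bytes and the output is assumed to be wanted as UTF-8; both are illustrative choices.

```python
def normalize_line(raw, input_encoding='gb2312', output_encoding='utf-8'):
    """Decode early, work on Unicode text, encode late."""
    text = raw.decode(input_encoding)       # 1. decode early
    text = text.strip()                     # 2. Unicode everywhere:
    text = text + ': ' + str(len(text)) + ' characters'  # length, concatenation
    return text.encode(output_encoding)     # 3. encode late


raw_line = b'  \xd6\xd0\xce\xc4 Python\n'   # GB2312 bytes for "  中文 Python\n"
result = normalize_line(raw_line)
print(result)  # b'\xe4\xb8\xad\xe6\x96\x87 Python: 9 characters'
```

Because the length is taken on the decoded text, it counts characters (9) rather than the bytes of either encoding.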

5. Upgrade from Python 2.x to 3.x

The last method is to upgrade from Python 2.x to Python 3.x. This suggestion is partly a complaint about the design of string encoding in Python 2.x, but upgrading to Python 3.x really does eliminate most encoding exceptions, because string handling was improved considerably in Python 3.x, as described in the next section.

Unicode in Python 3.x

From Python 3.0 onward, all strings are sequences of Unicode characters, and several other improvements were made:

1. The default encoding is changed to Unicode (UTF-8)

2. All built-in Python modules support Unicode

3. The u'中文' literal syntax is no longer supported (Python 3.3 and later accept the u prefix again for compatibility, with no effect)

So in Python 3.x encoding is no longer a major problem, and the exceptions above are rarely encountered. For more on the differences between str and unicode in Python 2.x and str and bytes in Python 3.x, see the summary and comparison of character encodings in Python.
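The following short sketch illustrates these points on a Python 3 interpreter (version 3.3 or later is assumed because of the u prefix). The variable names are illustrative.

```python
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)  # utf-8

text = '\u4e2d\u6587'          # str: Unicode text (中文)
data = text.encode('utf-8')    # bytes: encoded data

# Text and bytes are never mixed implicitly; Python 3 raises a TypeError
# instead of silently converting through ASCII.
try:
    text + data
    mixing_error = None
except TypeError:
    mixing_error = 'TypeError'
print(mixing_error)  # TypeError

# The u prefix is accepted again since Python 3.3 and changes nothing.
print(u'\u4e2d\u6587' == text)  # True
```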
