Python character encoding and decoding: unicode, str, and Chinese (UnicodeDecodeError: 'ascii' codec can't decode)

Last Update:2018-07-24 Source: Internet

Author: User

Abstract: When writing Python scripts that process web page data or Chinese characters, this error message often appears: SyntaxError: Non-ASCII character '\xe6' in file ./filename.py on line 3, but no encoding declared. This article covers the encoding of Unicode, Chinese, and other special characters in Python, and the rules that character encoding and decoding should follow.

Preface:

In cryptography, going from plaintext to ciphertext is encryption, and going from ciphertext to plaintext is decryption. Python 2 is analogous: encoding converts unicode to str, and decoding converts str to unicode. Just as encryption and decryption depend on an algorithm, encoding and decoding always depend on a specific encoding scheme, and unicode plays the role of plaintext. The encoding function is encode() and the decoding function is decode(). One point to note: calling encode() on a str involves an implicit type conversion, because the str is first converted to unicode and only then encoded, which is easy to overlook. So str.encode(enc) is actually equivalent to str.decode(sys.getdefaultencoding()).encode(enc). The default encoding is generally ASCII, which cannot represent Chinese characters.
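The two directions can be sketched as follows. This is a minimal illustration written for Python 3 rather than the Python 2 used in this article, under the usual mapping between the two versions: Python 3 `str` corresponds to Python 2 `unicode`, and Python 3 `bytes` corresponds to Python 2 `str`. The sample text is chosen only for the example.

```python
# Python 3 sketch: str plays the role of Python 2's unicode,
# and bytes plays the role of Python 2's str.
text = "我的"                      # Unicode text (the "plaintext")

gbk_bytes = text.encode("gbk")     # encode: text -> bytes under one scheme
utf8_bytes = text.encode("utf-8")  # same text, different scheme, different bytes

# decode: bytes -> text; the scheme must match the one used to encode
assert gbk_bytes.decode("gbk") == text
assert utf8_bytes.decode("utf-8") == text

print(len(gbk_bytes), len(utf8_bytes))  # 4 6
```

The same text yields different byte sequences under different schemes, which is why the scheme always has to be named.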

Before reading this article, if you are not very familiar with character encoding, it helps to review the basics first. Refer to: Introduction to character encoding [1].

1. A Chinese character encoding problem

A Python script is as follows:

  #!/usr/bin/python
  
  string = '我的'  # "mine" in Chinese
  print string

Running the script produces the following message:

SyntaxError: Non-ASCII character '\xe6' in file ./filename.py on line 3, but no encoding declared

Cause of the error: Python 2 assumes that a source file is ASCII-encoded unless an encoding is declared. Chinese characters fall outside the range that ASCII can represent, so the interpreter cannot read the bytes of the Chinese string literal as a str.

Solution: declare an encoding that covers Chinese characters on the first or second line of the script, as follows:

#!/usr/bin/python
#coding=gbk
string = '我的'
print string
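To see how such a declaration is recognized, here is a small sketch using Python 3's standard `tokenize.detect_encoding`. The helper name `declared_encoding` and the sample sources are made up for illustration. Note that Python 3 falls back to UTF-8 when nothing is declared, whereas the Python 2 interpreter discussed here falls back to ASCII.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the source encoding that Python 3's tokenizer detects."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_declaration = b"#!/usr/bin/python\n#coding=gbk\nprint('ok')\n"
without_declaration = b"#!/usr/bin/python\nprint('ok')\n"

print(declared_encoding(with_declaration))     # gbk
print(declared_encoding(without_declaration))  # utf-8 (Python 3 default)
```

Only the first two lines of a file are inspected, which is why the declaration must appear there.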

Here, utf-8 can also be used as the declared encoding for Chinese characters.

2. Python encoding and decoding of characters

Character encoding/decoding functions:

1) unicode: a Python built-in; it is the constructor of the unicode class.

unicode(string[, encoding[, errors]]) -> object

This function decodes a string into a unicode object using the given encoding.

If the encoding parameter is omitted, the string is decoded with Python's default encoding, ASCII.

2) decode: documented in the unicode class.

decode(...)
| S.decode([encoding[, errors]]) -> string or unicode
|
| Decodes S using the codec registered for encoding.

#!/usr/bin/python
#coding=gbk
string = '我的'
print string
s1 = unicode(string, "gbk")
s2 = string.decode("gbk")
print s1
print s2

The output of this code is as follows:
我的
鎴戠殑
鎴戠殑

Clearly, the output does not match our expectations. Why are s1 and s2 garbled while string prints normally? string is a str, and print sends its bytes to the screen unchanged, so what appears depends on the character encoding the terminal uses. The garbled characters 鎴戠殑 are what the UTF-8 bytes of 我的 look like when decoded as GBK, which suggests the file was actually saved as UTF-8 (matching the terminal) even though it declares GBK.
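The effect of decoding with the wrong codec can be reproduced in a few lines. This is a Python 3 sketch (where `bytes` stands in for Python 2's `str`), and it assumes, as the output above suggests, that the raw bytes are UTF-8.

```python
text = "我的"
raw = text.encode("utf-8")     # the bytes a UTF-8 editor would save

garbled = raw.decode("gbk")    # wrong codec: the same bytes read as GBK
correct = raw.decode("utf-8")  # matching codec

print(repr(garbled), repr(correct))

# No information is lost: re-encoding the garbled text as GBK
# gives back the original bytes.
assert garbled.encode("gbk") == raw
```

Because the bytes themselves are intact, garbled output of this kind is a labeling problem rather than data corruption.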

You may also wonder why calling encode() on a str object can fail with a decoding error.

Answer: str.encode(enc) is actually equivalent to str.decode(sys.getdefaultencoding()).encode(enc). The default encoding is generally ASCII, which cannot represent Chinese characters.

3) decode and encode can be called on both regular and Unicode strings

But:

str.decode() and unicode.encode() are the direct, intended uses.

unicode.decode() first converts the unicode to str and then executes decode().

This involves an implicit type conversion.

3. What is a codec

Codec is short for coder/decoder; it defines how text is converted to and from binary. Unlike ASCII, which maps each character to a number stored in a single byte, Unicode text may need multiple bytes per character, and Unicode can be stored in several different encodings. Four familiar codecs are ASCII, ISO 8859-1/Latin-1, UTF-8, and UTF-16.

The best known is UTF-8, which also encodes ASCII characters in a single byte. This makes life easier for programmers who must handle both ASCII and Unicode text, because ASCII characters have exactly the same bytes in UTF-8 as in ASCII.

UTF-8 uses 1 to 4 bytes per character, with characters of other languages taking more than one byte. This is troublesome for programmers who work directly with the encoded bytes, because characters cannot be read at a fixed length. Fortunately, we rarely need to do that: Python handles the details, so we do not have to deal with the complexities of multibyte characters ourselves.

UTF-16 is also a variable-length encoding, but it is less commonly used.
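The variable lengths are easy to check. The sketch below is Python 3, and the four sample characters are arbitrary picks meant to cover each UTF-8 length.

```python
samples = ["A", "é", "我", "😀"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # explicit byte order, so no BOM is added
    print(repr(ch), len(utf8), len(utf16))

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]       # [1, 2, 3, 4]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]  # [2, 2, 2, 4]

# ASCII characters have identical bytes in ASCII and UTF-8
assert "A".encode("ascii") == "A".encode("utf-8")
```

UTF-16 uses two bytes for most characters and four for those outside the Basic Multilingual Plane, which is why it is variable-length as well.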
4. Encoding and decoding

Unicode supports a variety of encoding formats, which puts an extra burden on programmers: whenever you write a string to a file, you must choose an encoding to convert the Unicode content into. Python solves this with the encode() method of Unicode strings, which takes the name of an encoding as its argument and returns the string's contents in that encoding.

So every time we write a Unicode string to disk, we have to encode it with the chosen codec; correspondingly, when we read the data back from the file, we have to decode it to get a Unicode string object again.
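A minimal round trip through a file might look like the following. It is a Python 3 sketch, the temporary file exists only for the demonstration, and GBK is an arbitrary choice of codec.

```python
import os
import tempfile

text = "我的"

# Scratch file created only for this demonstration
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Writing: encode the text explicitly, then write the raw bytes
    with open(path, "wb") as f:
        f.write(text.encode("gbk"))

    # Reading: read the raw bytes, then decode with the same codec
    with open(path, "rb") as f:
        raw = f.read()
    restored = raw.decode("gbk")
finally:
    os.remove(path)

print(restored == text)  # True
```

The file itself stores only bytes; nothing in it records which codec was used, so the reader has to know it.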
5. Python support for Unicode

Built-in unicode() function: converts a str into a unicode object.

decode/encode methods: convert a str object into a unicode object, and vice versa.

Take a look at the following example:

#!/usr/bin/python
#coding=gbk
string = '我的'
print 'string is:', type(string)
print string

ustring = u"我的"
print "ustring is:", type(ustring)
print ustring

gbkstring = ustring.encode("gbk")
print "gbkstring is:", type(gbkstring)
print gbkstring

anotherstring = gbkstring.decode("gbk")
print "anotherstring is:", type(anotherstring)
print anotherstring

The output results are as follows:

string is: <type 'str'>
我的
ustring is: <type 'unicode'>
鎴戠殑
gbkstring is: <type 'str'>
我的
anotherstring is: <type 'unicode'>
鎴戠殑

To convert between any two character encodings, you must go through Unicode as a bridge: first decode to a unicode object, then encode to the target encoding. Printing a unicode object directly often produces garbled output; encode it into a str object first. Also note that unicode objects, GBK encoding, ASCII encoding, and str objects are four different concepts. Keep clear which are string types and which are encodings.
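The bridge idea can be written as a small helper. This is a Python 3 sketch, and `transcode` is a hypothetical name introduced here, not a standard library function.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert encoded bytes from one encoding to another via Unicode text."""
    text = data.decode(source_encoding)  # bytes -> Unicode (the bridge)
    return text.encode(target_encoding)  # Unicode -> bytes in the target encoding

gbk_bytes = "我的".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

print(len(gbk_bytes), len(utf8_bytes))  # 4 6
```

There is no direct GBK-to-UTF-8 step anywhere in the helper; both conversions pass through the Unicode text.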

6. Points to note

About the principle of character encoding, you can refer here:

When using Unicode in Python, keep the following in mind:

1) When a string literal appears in a program, be sure to add the prefix u.

2) Do not use the str() function; use unicode() instead.

3) Do not use the obsolete string module. If you pass it non-ASCII characters, it will corrupt them.

4) Do not encode or decode Unicode strings in your program until you have to. Call encode() and decode() only when writing to or reading from a file, database, or network.

5) Decode with the same character encoding that was used to encode.

The built-in str() and chr() functions cannot handle Unicode; they only handle regular ASCII-encoded strings. If a Unicode string is passed to str(), it is first converted to an ASCII string before str() processes it, which fails if the string contains non-ASCII characters.

7. About the Linux terminal character encoding

The terminal's encoding is determined by settings such as the default language in /etc/environment. On Linux, if the terminal uses UTF-8 and we output GBK-encoded text, the screen will most likely show garbled characters.

With the locale command, you can view language-related environment variables:

hyk@hyk-linux:~/program/python/chapter6
$ locale
LANG=zh_CN.UTF-8
LANGUAGE=zh_CN:en_US:en
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC=zh_CN.UTF-8
LC_TIME=zh_CN.UTF-8
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY=zh_CN.UTF-8
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER=zh_CN.UTF-8
LC_NAME=zh_CN.UTF-8
LC_ADDRESS=zh_CN.UTF-8
LC_TELEPHONE=zh_CN.UTF-8
LC_MEASUREMENT=zh_CN.UTF-8
LC_IDENTIFICATION=zh_CN.UTF-8
LC_ALL=

Python's print statement automatically converts text to the character encoding given by these environment variables, so print can produce garbled output or raise an error in cases where write does not.
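What that conversion amounts to can be sketched as follows, in Python 3. The helper `bytes_for_terminal` is a hypothetical stand-in for the conversion print performs, and the encodings passed to it are examples rather than values read from a real terminal.

```python
import sys

def bytes_for_terminal(text, terminal_encoding, errors="strict"):
    """Encode text the way an output stream with this encoding would."""
    return text.encode(terminal_encoding, errors)

text = "我的"

# The encoding Python associates with standard output (may be None)
print(getattr(sys.stdout, "encoding", None))

utf8_out = bytes_for_terminal(text, "utf-8")  # fine on a UTF-8 terminal

try:
    bytes_for_terminal(text, "ascii")         # an ASCII-only environment
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A lossy fallback: unencodable characters become '?'
fallback = bytes_for_terminal(text, "ascii", errors="replace")
print(ascii_failed, fallback)  # True b'??'
```

Writing already-encoded bytes skips this conversion, which is consistent with write not raising the error.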

8. Errors in Python character processing and their solutions

Problem 1:

strencode = string.encode("utf-8")
print "strencode is:", type(strencode)
print strencode

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 4: ordinal not in range(128)

Explanation: a str cannot be encoded directly. To encode it, Python first converts it to unicode, using the default ASCII codec for that conversion, which fails on non-ASCII bytes.
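Python 3 no longer performs this implicit step, but it can be imitated to show where the error comes from. In the sketch below, `py2_style_encode` is a hypothetical helper that mimics Python 2's behavior; it is not part of any library, and the sample text is arbitrary.

```python
def py2_style_encode(raw, target_encoding, default_encoding="ascii"):
    """Mimic Python 2's str.encode: implicit decode first, then encode."""
    return raw.decode(default_encoding).encode(target_encoding)

raw = "我的".encode("utf-8")  # non-ASCII bytes, like a Python 2 str

try:
    py2_style_encode(raw, "utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
# 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

# Naming the real encoding for the decode step avoids the error
fixed = py2_style_encode(raw, "gb18030", default_encoding="utf-8")
```

The failure happens in the hidden decode step, which explains why an encode call reports a decode error.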

Solution:

1) Specify the encoding used to convert the str to unicode:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = '中文'  # "Chinese", stored as UTF-8 bytes
s.decode('utf-8').encode('gb18030')

2) Reset the default encoding, sys.defaultencoding:

import sys
reload(sys)  # since Python 2.5, sys.setdefaultencoding is deleted after initialization, so sys must be reloaded
sys.setdefaultencoding('utf-8')

s = '中文'
s.encode('gb18030')

References:

[1] Introduction to character encoding: http://blog.csdn.net/trochiluses/article/details/8782019

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
