Abstract: When writing Python scripts that process web page data or work with Chinese characters, the following error often appears: SyntaxError: Non-ASCII character '\xe6' in file ./filename.py on line 3, but no encoding declared. This article covers Unicode, Chinese, and special-character encoding in Python, and the rules to follow when encoding and decoding characters.
Key points:
Think of it by analogy with cryptography: going from plaintext to ciphertext is encryption, and going from ciphertext to plaintext is decryption. In Python 2, encoding is unicode --> str and decoding is str --> unicode. As in cryptography, encoding and decoding depend on a scheme (the codec, which corresponds to the encryption or decryption algorithm), and unicode plays the role of plaintext. The encoding method is encode() and the decoding method is decode(). One point to note: calling str.encode() involves an implicit type conversion. The str is first decoded to unicode and only then encoded, which is easy to miss. So str.encode() is actually equivalent to str.decode(sys.getdefaultencoding()).encode(). The default encoding is generally ASCII, which cannot represent Chinese characters.
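The two directions can be sketched as follows. This is a minimal illustration rather than code from the article; it is written so that it runs on both Python 2 and Python 3 (where the article's unicode type is named str and its str type is named bytes), and the variable names are made up for the example.

```python
# Text (a unicode object) plays the role of plaintext.
text = u"\u6211\u7684"  # two Chinese characters, written as escapes

# Encoding: unicode --> byte string, using a named codec (here GBK).
encoded = text.encode("gbk")

# Decoding: byte string --> unicode, using the same codec.
decoded = encoded.decode("gbk")

# GBK stores each of these Chinese characters in two bytes,
# so the byte string is longer than the text.
byte_count = len(encoded)
char_count = len(decoded)
```

The codec name passed to encode() and decode() must match, just as the same algorithm must be used to encrypt and decrypt.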
Before reading this article, if you are not very familiar with character encoding, it is worth learning the basics first. See: Introduction to character encoding [1].
1. A Chinese character encoding problem
A Python script is as follows:
#!/usr/bin/python
string = '我的'
print string
Running the script produces the following message:
SyntaxError: Non-ASCII character '\xe6' in file ./filename.py on line 3, but no encoding declared
Cause of the error: Python 2 assumes that source files are ASCII-encoded. Chinese characters fall outside the range ASCII can represent, so the interpreter cannot read the Chinese string literal as a str.
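To see where the '\xe6' in the message could come from, the sketch below assumes the script file was saved as UTF-8 (the article does not state this) and that the literal was the two Chinese characters for "my". It is an illustration only, written to run on both Python 2 and Python 3.

```python
text = u"\u6211\u7684"              # the Chinese literal, written as escapes
utf8_bytes = text.encode("utf-8")   # the bytes as stored in a UTF-8 file

# First byte of the UTF-8 form; ASCII only covers values 0..127.
first_byte = bytearray(utf8_bytes)[0]
outside_ascii = first_byte > 127

# Trying to read the same bytes as ASCII fails.
try:
    utf8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Under that assumption, the first byte of the literal is 0xe6, which is the byte the interpreter reports.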
Solution: declare the source encoding on the second line of the script, as follows:
#!/usr/bin/python
# coding=gbk
string = '我的'
print string
Here, the coding declaration can also be utf-8 to handle Chinese characters.
2. Python encoding and decoding of characters
Character encoding/decoding functions:
1) unicode: this is a Python built-in, the constructor of the unicode class.
unicode(string [, encoding[, errors]]) -> object
This function decodes a str into a unicode object, interpreting its bytes according to encoding.
If the encoding parameter is omitted, Python's default ASCII codec is used for decoding.
2) decode: shown here as documented in the unicode class (str has a decode method as well).
decode(...)
 |  S.decode([encoding[,errors]]) -> string or unicode
 |
 |  Decodes S using the codec registered for encoding.
#!/usr/bin/python
# coding=gbk
string = '我的'
print string
s1 = unicode(string, "gbk")
s2 = string.decode("gbk")
print s1
print s2
The output of this code is as follows:
我的
鎴戠殑
鎴戠殑
Obviously, the output does not match our expectations. Why are s1 and s2 garbled? string is a str, and what print shows on screen depends on the character encoding used by the terminal. Why string prints normally while s1 and s2 are garbled is analyzed below.
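One way to reproduce garbling of this kind is sketched below. It rests on an assumption the article does not confirm: that the file was actually saved as UTF-8 even though it declares GBK, and that the terminal also uses UTF-8. In that case print string sends the raw UTF-8 bytes straight to the terminal, which displays them correctly, while s1 and s2 are unicode objects built by reading those bytes as GBK. The code runs on both Python 2 and Python 3.

```python
text = u"\u6211\u7684"

# Bytes as they would sit in a file saved as UTF-8.
file_bytes = text.encode("utf-8")

# Decoding those bytes with the wrong codec (GBK) does not fail; it pairs
# the bytes up differently and yields three unrelated characters.
garbled = file_bytes.decode("gbk")

# Decoding with the codec that matches the bytes recovers the text.
correct = file_bytes.decode("utf-8")

# The damage is reversible here because no byte was lost.
recovered = garbled.encode("gbk").decode("utf-8")
```

Six UTF-8 bytes read as GBK become three two-byte characters, which is consistent with the three garbled characters in the output above.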
Also, do you wonder why a str object can be encoded at all, when encoding is supposed to go from unicode to str?
Answer: str.encode() is actually equivalent to str.decode(sys.getdefaultencoding()).encode(). The default encoding is generally ASCII, which cannot represent Chinese characters.
3) decode and encode can be called on both regular strings (str) and unicode strings
However:
str.decode() and unicode.encode() are the direct, proper usages.
unicode.decode() first converts the unicode to str and then executes decode().
This involves an implicit type conversion.
3. What is a codec?
Codec is short for coder/decoder; it defines how text is converted to and from binary. Unlike ASCII, which maps each character to a number stored in one byte, Unicode uses multiple bytes, which is why Unicode supports several different encodings. For example, four familiar codecs are ASCII, ISO 8859-1/Latin-1, UTF-8, and UTF-16.
The most notable is UTF-8, which also encodes ASCII characters in one byte. This makes life easier for programmers who must handle both ASCII and Unicode text, because ASCII characters are encoded identically in UTF-8 and in ASCII.
UTF-8 uses 1 to 4 bytes to represent characters of other languages. This is troublesome for programmers who need to work directly with the encoded data, because characters cannot be read at a fixed length. Fortunately we do not need to read the encoded data directly: Python handles these details for us, so we do not have to worry about the complexities of multibyte characters.
UTF-16 is also a variable-length encoding (2 or 4 bytes per character), but it is less commonly used for files and network data.
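The byte counts described above can be checked with a short sketch. The sample characters and the names samples and sizes are chosen for illustration; the code runs on both Python 2 and Python 3.

```python
samples = {
    "ascii_letter": u"A",
    "chinese_char": u"\u6211",
    "emoji": u"\U0001F600",  # a character outside the Basic Multilingual Plane
}

# Number of bytes each sample needs under UTF-8 and UTF-16.
# "utf-16-le" is used so that no byte order mark is counted.
sizes = {}
for name, ch in samples.items():
    sizes[name] = (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))

# ASCII characters are byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = u"A".encode("ascii") == u"A".encode("utf-8")
```

Each entry in sizes is a (UTF-8 bytes, UTF-16 bytes) pair, which makes the variable length of both encodings visible.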
4. Encoding and decoding
Unicode supports a variety of encoding formats, which puts an extra burden on programmers: whenever you write a string to a file, you must specify an encoding to convert the Unicode content into that format. Python solves this with the encode() method of unicode strings, which takes the name of the encoding as its argument and returns the string's characters encoded in that format.
So every time we write a unicode string to disk we have to "encode" it with the specified codec and, correspondingly, when we read the data back from the file we have to "decode" it to get a unicode string object again.
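A minimal sketch of this write/read cycle follows. It uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3 and performs the encode and decode steps for us; the temporary file and the variable names are only there for the demonstration.

```python
import io
import os
import tempfile

text = u"\u6211\u7684"

# Create a temporary file path for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Writing: io.open encodes the unicode text with the named codec.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # On disk there are only bytes.
    with io.open(path, "rb") as f:
        raw = f.read()

    # Reading: the same codec turns the bytes back into unicode.
    with io.open(path, "r", encoding="utf-8") as f:
        read_back = f.read()
finally:
    os.remove(path)
```

Calling text.encode("utf-8") before writing to a file opened in binary mode would achieve the same result by hand.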
5. Python support for Unicode
Built-in unicode() function: converts a str into a unicode object.
decode/encode methods: convert a str object into a unicode object, or vice versa.
Take a look at the following example:
#!/usr/bin/python
# coding=gbk
string = '我的'
print 'string is:', type(string)
print string
ustring = u"我的"
print "ustring is:", type(ustring)
print ustring
gbkstring = ustring.encode("gbk")
print "gbkstring is:", type(gbkstring)
print gbkstring
anotherstring = gbkstring.decode("gbk")
print "anotherstring is:", type(anotherstring)
print anotherstring
The output results are as follows:
string is: <type 'str'>
我的
ustring is: <type 'unicode'>
鎴戠殑
gbkstring is: <type 'str'>
我的
anotherstring is: <type 'unicode'>
鎴戠殑
To convert between any two character encodings, you must go through Unicode as the bridge: first decode to a unicode object, then encode from it. Outputting a unicode object directly will often appear garbled; it needs to be encoded into a str object first. Also note that unicode objects, the GBK encoding, the ASCII encoding, and str objects are four different concepts. Keep clear what is a string type and what is an encoding.
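The bridge can be sketched as a small helper. The function name transcode is hypothetical, not part of Python or of the article; the code runs on both Python 2 and Python 3.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by passing through unicode (hypothetical helper)."""
    text = data.decode(source_encoding)   # byte string --> unicode
    return text.encode(target_encoding)   # unicode --> byte string

gbk_bytes = u"\u6211\u7684".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
back_to_gbk = transcode(utf8_bytes, "utf-8", "gbk")
```

There is no direct GBK-to-UTF-8 step here: every conversion is a decode followed by an encode.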
6. Points to note
For the principles of character encoding, see [1].
When using Unicode in Python, note the following:
1) Whenever a string literal appears in a program, add the prefix u.
2) Do not use the str() function; use unicode() instead.
3) Do not use the obsolete string module. If you pass it non-ASCII data, it will mangle it.
4) Do not encode or decode Unicode strings in your program until you have to; call encode() and decode() only when writing to or reading from a file, database, or network.
5) Whatever encoding the data was encoded with, the same encoding must be used to decode it.
The built-in str() and chr() functions cannot handle Unicode; they only handle regular ASCII-encoded strings. If a unicode string is passed to str(), it is first converted to an ASCII string and then handed to str().
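The ASCII step behind str() can be made visible with a sketch. Because Python 3 no longer converts this way, the example spells out the conversion as an explicit encode("ascii") call, which is an approximation of what Python 2's str() does with a unicode argument; it runs on both versions.

```python
text = u"\u6211\u7684"

# Approximately what Python 2's str(text) attempts: an ASCII encode.
try:
    text.encode("ascii")
    implicit_ok = True
except UnicodeEncodeError:
    implicit_ok = False

# Pure-ASCII text survives the same step, which is why the problem
# often goes unnoticed until non-ASCII data shows up.
ascii_only = u"abc".encode("ascii")

# Naming the codec explicitly works for any text the codec can represent.
explicit = text.encode("utf-8")
```

This is the reason for rules 1) and 2) above: keeping text as unicode and naming the codec explicitly avoids the hidden ASCII conversion.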
7. Linux terminal character encoding
The terminal's character encoding comes from settings such as the default language in /etc/environment. Under Linux, if the terminal uses UTF-8 and we output GBK-encoded text, the text will very likely appear garbled on screen.
With the locale command, you can view language-related environment variables:
hyk@hyk-linux:~/program/python/chapter6
$ locale
LANG=zh_CN.UTF-8
LANGUAGE=zh_CN:en_US:en
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC=zh_CN.UTF-8
LC_TIME=zh_CN.UTF-8
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY=zh_CN.UTF-8
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER=zh_CN.UTF-8
LC_NAME=zh_CN.UTF-8
LC_ADDRESS=zh_CN.UTF-8
LC_TELEPHONE=zh_CN.UTF-8
LC_MEASUREMENT=zh_CN.UTF-8
LC_IDENTIFICATION=zh_CN.UTF-8
LC_ALL=
Python's print statement automatically converts text to the character encoding given by these environment variables, so print may produce garbled output or an error in cases where write does not.
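One defensive option is to encode text for the terminal yourself before writing it out. The sketch below is an assumption-laden illustration, not the article's method: encode_for_terminal is a hypothetical helper, and falling back to locale.getpreferredencoding() when sys.stdout has no encoding is a design choice made for this example.

```python
import locale
import sys

def encode_for_terminal(text, encoding=None):
    """Encode unicode text for output (hypothetical helper).

    Characters the target encoding cannot represent become '?'
    instead of raising UnicodeEncodeError.
    """
    if encoding is None:
        # sys.stdout.encoding can be None, for example when output is piped.
        encoding = getattr(sys.stdout, "encoding", None) or locale.getpreferredencoding()
    return text.encode(encoding, "replace")

as_utf8 = encode_for_terminal(u"\u6211\u7684", "utf-8")
as_ascii = encode_for_terminal(u"\u6211\u7684", "ascii")
```

With a UTF-8 terminal the text passes through intact; with an ASCII-only target each Chinese character degrades to a question mark rather than stopping the program.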
8. Character-processing errors in Python and their solutions
Problem 1:
strencode = string.encode("utf-8")
print "strencode is:", type(strencode)
print strencode
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 4: ordinal not in range(128)
Explanation: a str cannot be encoded directly. To encode it, Python first converts it to unicode, and that conversion uses the default ASCII codec, hence the error.
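The failing step can be isolated with a short sketch. It performs by hand the ASCII decode that Python 2 does implicitly inside str.encode(); the sample bytes and variable names are chosen for illustration, and the code runs on both Python 2 and Python 3.

```python
# A UTF-8 byte string holding two Chinese characters, like a Python 2 str.
raw = u"\u4e2d\u6587".encode("utf-8")

failed_at = None
codec_name = None

# Step 1 of the implicit conversion: decode with the default codec (ASCII).
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    failed_at = exc.start       # index of the first undecodable byte
    codec_name = exc.encoding   # the codec that gave up
```

The encode step is never reached: the error is raised while decoding, which is why an encode() call reports a UnicodeDecodeError.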
Solutions:
1) Explicitly specify the encoding used to convert the str to unicode:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb18030')
2) Reset the default encoding with sys.setdefaultencoding
import sys
reload(sys)  # Since Python 2.5, sys.setdefaultencoding is deleted after initialization, so sys must be reloaded
sys.setdefaultencoding('utf-8')
s = '中文'
s.encode('gb18030')
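The first solution can be wrapped in a helper so that no global default has to be changed. This is a sketch under assumptions: to_gb18030 is a hypothetical name, and treating byte-string input as UTF-8 unless told otherwise is a choice made for this example.

```python
def to_gb18030(value, source_encoding="utf-8"):
    """Return value encoded as GB18030 (hypothetical helper).

    Byte strings are decoded explicitly with source_encoding first;
    unicode text is encoded directly. No global default is changed.
    """
    if isinstance(value, bytes):
        value = value.decode(source_encoding)
    return value.encode("gb18030")

utf8_input = u"\u4e2d\u6587".encode("utf-8")
from_bytes = to_gb18030(utf8_input)
from_text = to_gb18030(u"\u4e2d\u6587")
```

Both calls produce the same GB18030 bytes, and the decode step names its codec instead of relying on the interpreter-wide default.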
References:
[1] Introduction to character encoding: http://blog.csdn.net/trochiluses/article/details/8782019