1. Introduction to character encoding
1.1. ASCII
ASCII (American Standard Code for Information Interchange) is a single-byte encoding. The computer world began with English, and a single byte can represent 256 different values, which is enough for all English characters and many control codes. However, ASCII uses only the lower half of them (below \x80), and that unused upper half is the foundation on which MBCS encodings were built.
1.2. MBCS
However, other languages soon entered the computer world, and single-byte ASCII was no longer enough. Each language therefore developed an encoding of its own. Because a single byte offers too few values to represent all the characters, and compatibility with ASCII still had to be kept, these encodings use multiple bytes per character, e.g. the GBxxx and Big5 families. Their rule is: if a byte is below \x80, it still represents an ASCII character; if it is at or above \x80, it and the following byte (two bytes in total) together represent one character; decoding then skips past that second byte and continues.
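As a minimal illustration (not from the original article), here is a toy Python 2 scanner that walks a byte string using exactly the rule above; '\xba\xba' is the GBK encoding of 汉, as shown later in section 2.4:

# A toy sketch of the MBCS byte-walking rule described above.
s = 'A\xba\xbaB'  # 'A' + the two GBK bytes of one CJK character + 'B'
i = 0
while i < len(s):
    if ord(s[i]) < 0x80:
        # below \x80: a plain ASCII character, one byte
        print 'ASCII byte: %r' % s[i]
        i += 1
    else:
        # at or above \x80: this byte and the next form one character
        print 'double-byte character: %r' % s[i:i + 2]
        i += 2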
At this point IBM invented the concept of the code page, which gathers these encodings together and assigns each a page number; GBK is page 936, i.e. CP936. For this reason, CP936 can also be used to refer to GBK.
MBCS (Multi-Byte Character Set) is the generic term for these encodings. So far everyone has used at most two bytes, so it is sometimes also called DBCS (Double-Byte Character Set). It is important to be clear that MBCS is not one specific encoding: in Windows, MBCS refers to different encodings depending on the region you set, and on Linux the encoding named MBCS cannot be used at all. You will not actually see the name MBCS in Windows, because Microsoft uses the fancier-sounding name ANSI instead; the "ANSI" encoding in Notepad's Save As dialog is MBCS. In the default locale of simplified-Chinese Windows, it refers to GBK.
1.3. Unicode
Later, people began to feel that so many encodings made the world too complicated and too headache-inducing, so they sat down together and came up with an idea: represent the characters of all languages with one and the same character set. That character set is Unicode.
The original Unicode standard, UCS-2, uses two bytes to represent one character, which is why you often hear the claim that Unicode uses two bytes per character. Soon, however, 256*256 = 65536 code points proved to be too few, so the UCS-4 standard was created, which uses four bytes per character; still, the one we use most remains UCS-2.
The UCS (Unicode Character Set) is only a table mapping characters to code points; for example, the code point of the character 汉 is 6C49. How characters are actually transmitted and stored is the responsibility of UTF (UCS Transformation Format).
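In Python 2 this mapping can be inspected directly (a small illustrative snippet, not from the original article):

# coding: UTF-8
print hex(ord(u'汉'))       # 0x6c49 -- the code point of 汉
print repr(unichr(0x6c49))  # u'\u6c49' -- and back again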
At first, storage was done in the simplest way, using the UCS code point directly: that is UTF-16. For example, 汉 is stored directly as \x6c\x49 (utf-16-be), or with the bytes reversed as \x49\x6c (utf-16-le). But the Americans felt they were losing out: English letters used to need only one byte each, and now they all take two, doubling the space consumption... and so UTF-8 was born.
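A quick check of the two byte orders (illustrative; note that \x6c and \x49 happen to be the printable letters 'l' and 'I', which is how repr() displays them):

# coding: UTF-8
print repr(u'汉'.encode('utf-16-be'))  # 'lI' == '\x6c\x49'
print repr(u'汉'.encode('utf-16-le'))  # 'Il' == '\x49\x6c'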
UTF-8 is an awkward, variable-length encoding that is compatible with ASCII: ASCII characters take one byte. However, the space saved there must be squeezed out somewhere else: you must have heard that Chinese characters in UTF-8 take three bytes each, and the characters that need four bytes are even more in tears... (For exactly how UCS-2 becomes UTF-8, please search for yourself.)
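The variable lengths are easy to verify (an illustrative snippet; the four-byte case uses an emoji outside the Basic Multilingual Plane):

# coding: UTF-8
print len(u'A'.encode('UTF-8'))           # 1 -- ASCII stays one byte
print len(u'汉'.encode('UTF-8'))          # 3 -- a CJK character costs three
print len(u'\U0001F600'.encode('UTF-8'))  # 4 -- outside the BMP, four bytes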
Also worth mentioning is the BOM (Byte Order Mark). When we save a file, the encoding used is not saved with it; when opening it we have to remember the encoding used at save time and open the file with that same encoding, which creates a lot of trouble. (You may want to object that Notepad does not ask for an encoding when opening a file. Try launching Notepad first and then using File -> Open to see.) UTF therefore introduces a BOM to announce its own encoding: if the first bytes read are one of the following, then the encoding of the text that follows is the corresponding one:
BOM_UTF8     '\xEF\xBB\xBF'
BOM_UTF16_LE '\xFF\xFE'
BOM_UTF16_BE '\xFE\xFF'
Not every editor writes a BOM, but even without one, Unicode text can still be read; just as with MBCS encodings, you then need to specify the encoding explicitly, otherwise decoding will fail.
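As a rough sketch (not the author's code), the BOM constants in the standard codecs module can be used to guess an encoding from the first bytes of a file; the helper name sniff_bom is made up for illustration:

import codecs

def sniff_bom(path):
    # Guess an encoding from a leading BOM, if any (UTF-32 BOMs ignored here).
    head = open(path, 'rb').read(3)
    if head.startswith(codecs.BOM_UTF8):      # '\xef\xbb\xbf'
        return 'utf-8-sig'
    if head.startswith(codecs.BOM_UTF16_BE):  # '\xfe\xff'
        return 'utf-16-be'
    if head.startswith(codecs.BOM_UTF16_LE):  # '\xff\xfe'
        return 'utf-16-le'
    return None  # no BOM: the encoding must be known some other way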
You may have heard that UTF-8 does not need a BOM. That is not true; it is just that most editors use UTF-8 as the default encoding when no BOM is present. Even Notepad, which defaults to ANSI (MBCS) on save, first tries UTF-8 when reading a file, and uses UTF-8 if decoding succeeds. This awkward practice of Notepad causes a bug: if you create a new text file, type "姹塧", save it as ANSI (MBCS), and reopen it, it will have become "汉a". You might as well try it :)
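A sketch of what Notepad runs into, assuming the byte values work out as the bug description implies (the GBK bytes of 姹塧 happen to form valid UTF-8 for 汉a):

# coding: UTF-8
s = u'姹塧'.encode('GBK')  # the bytes an ANSI save writes to disk
print s.decode('UTF-8')    # 汉a -- the same bytes read back as UTF-8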
2. Encoding issues in Python 2.x
2.1. str and unicode
Both str and unicode are subclasses of basestring. Strictly speaking, str is actually a byte string: a sequence of bytes produced by encoding unicode. Calling the len() function on the UTF-8 encoded str '汉' gives 3, because in UTF-8, '汉' == '\xe6\xb1\x89'.
unicode is the real string. It is obtained by decoding the byte string str with the correct character encoding, and len(u'汉') == 1.
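A minimal demonstration of that length difference (illustrative):

# coding: UTF-8
s = '汉'   # a str: three UTF-8 bytes in this file
u = u'汉'  # a unicode object: one character
print len(s), repr(s)  # 3 '\xe6\xb1\x89'
print len(u), repr(u)  # 1 u'\u6c49'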
Take a look at the two basestring instance methods encode() and decode(). Once the difference between str and unicode is understood, these two methods are no longer confusing:
# coding: UTF-8

u = u'汉'
print repr(u)   # u'\u6c49'

s = u.encode('UTF-8')
print repr(s)   # '\xe6\xb1\x89'

u2 = s.decode('UTF-8')
print repr(u2)  # u'\u6c49'

# Decoding a unicode object is an error
# s2 = u.decode('UTF-8')
# Likewise, encoding a str is an error
# u2 = s.encode('UTF-8')
It should be noted that although calling the encode() method on a str is wrong, Python does not necessarily throw an exception; instead it may return another str with the same content but a different id, and the same goes for calling decode() on a unicode object. I do not understand why encode() and decode() were put on basestring instead of on unicode and str respectively, but since that is how it is, we should be careful to avoid making this mistake.
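A quick illustration of this quirk (the hidden step in CPython 2 is an implicit round-trip through the default ASCII codec, which is why it stays silent only for ASCII content):

s = 'abc'
s2 = s.encode('UTF-8')  # no exception: implicitly decoded as ASCII, then encoded
print s == s2, s is s2  # True False -- same content, different object
# With non-ASCII bytes the hidden ASCII decode does blow up:
# '\xe6\xb1\x89'.encode('UTF-8')  # raises UnicodeDecodeError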
If you run this code in the Windows console, the program executes correctly, but what is printed on the screen is not the character 汉. This is caused by an inconsistency between the encoding of the Python code and that of the console: the console under Windows uses GBK, while the code uses UTF-8, so the UTF-8 encoded str naturally cannot print the correct character on a GBK-encoded console.
type('汉') gives <type 'str'>, while type(u'汉') gives <type 'unicode'>. That is, prefixing a literal with u marks it as a unicode object, which lives in memory in Unicode form; without the u, it is merely a byte string in some encoding, and which encoding depends on how Python identified the source-file encoding, here UTF-8. When printing, Python automatically converts a unicode object according to the encoding of the output environment, but if the output is not a unicode object but an ordinary str, the bytes are printed directly in the string's own encoding, which causes the phenomenon above.
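For completeness, the checks just described (illustrative):

# coding: UTF-8
print type('汉')   # <type 'str'>
print type(u'汉')  # <type 'unicode'>
print u'汉'        # converted to the console encoding automatically
print '汉'         # raw UTF-8 bytes: garbled on a GBK console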
2.2. Character encoding declaration
If non-ASCII characters are used in a source code file, a character encoding declaration must be placed at the head of the file, like this:

# -*- coding: utf-8 -*-

This tells Python that the text in this file was saved with UTF-8, so that Python interprets the characters as UTF-8 and converts them into Unicode for internal processing.
In fact Python checks only for #, coding, and the encoding name; all the other characters are there for aesthetics. Also, many character encodings are available in Python, and there are many case-insensitive aliases, e.g. UTF-8 can be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.
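So any of the following first lines works (a few equivalent illustrations):

# coding: utf-8
# -*- coding: utf-8 -*-   (the common Emacs-style form; the -*- parts are decoration)
# coding: u8              (u8 is a case-insensitive alias of UTF-8)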
It is also important that the declared encoding be the same as the encoding the file is actually saved in, otherwise there is a good chance of a parsing exception in the source code. Modern IDEs usually handle this automatically, re-saving in the new encoding when you change the declaration, but text-editor die-hards need to be careful :)
2.3. Reading and writing files
When a file is opened with the built-in open() method, read() returns a str, which must then be decode()d with the correct encoding. For write(), if the argument is unicode, you need to encode() it with the encoding you wish to write; if it is a str in some other encoding, you must first decode() it with that str's own encoding to get unicode, then encode() with the target encoding. If you pass a unicode object directly to write(), Python first encodes it using the character encoding declared in the source file, and then writes it.
# coding: UTF-8

f = open('test.txt')
s = f.read()
f.close()
print type(s)  # <type 'str'>

# The file is known to be GBK-encoded; decode it to unicode
u = s.decode('GBK')

f = open('test.txt', 'w')
# Encode into a UTF-8 str
s = u.encode('UTF-8')
f.write(s)
f.close()
In addition, the codecs module provides an open() method that lets you specify an encoding when opening a file; reading a file opened this way returns unicode. When writing, if the argument is unicode, it is encoded with the encoding specified in open() and then written; if it is a str, it is first decoded into unicode using the character encoding declared in the source file, and then the operation above is performed. Compared with the built-in open(), this method is less prone to encoding problems.
# coding: GBK

import codecs

f = codecs.open('test.txt', encoding='UTF-8')
u = f.read()
f.close()
print type(u)  # <type 'unicode'>

f = codecs.open('test.txt', 'a', encoding='UTF-8')
# Writing unicode
f.write(u)

# Writing a str triggers an automatic decode + encode.
# This str is GBK-encoded:
s = '汉'
print repr(s)  # '\xba\xba'
# Here the GBK str is first decoded to unicode, then encoded to UTF-8 and written
f.write(s)
f.close()
2.4. Encoding-related methods
The sys and locale modules provide some methods for getting the default encodings in the current environment.
# coding: GBK

import sys
import locale

def p(f):
    print '%s.%s(): %s' % (f.__module__, f.__name__, f())

# Returns the default character encoding used by the current system
p(sys.getdefaultencoding)

# Returns the encoding used to convert Unicode file names into system file names
p(sys.getfilesystemencoding)

# Gets the default locale and returns a tuple of (language, encoding)
p(locale.getdefaultlocale)

# Returns the encoding the user has set for text data
# (the documentation notes that this function only returns a guess)
p(locale.getpreferredencoding)

# \xba\xba is the GBK encoding of '汉'
# mbcs is a deprecated encoding; it is tested here only to show why it should not be used
print r"'\xba\xba'.decode('mbcs'):", repr('\xba\xba'.decode('mbcs'))

# Results on the author's Windows machine (locale set to Chinese (Simplified, PRC)):
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp936')
# locale.getpreferredencoding(): cp936
# '\xba\xba'.decode('mbcs'): u'\u6c49'
3. Some recommendations
3.1. Use a character encoding declaration, and have all source code files in the same project use the same character encoding declaration.
This is something that must be done.
3.2. Abandon str; use unicode for everything.
Pressing u before the opening quotation mark really does feel unnatural at first, and you will often forget and have to run back to add it, but doing so can eliminate 90% of encoding problems. If encoding problems are not troubling you badly, you can skip this recommendation.
3.3. Use codecs.open() instead of the built-in open().
If encoding problems are not troubling you badly, you can skip this recommendation.
3.4. Character encodings that absolutely must be avoided: 'mbcs'/'dbcs' and 'utf-16'.
Avoiding MBCS here does not mean GBK and the like must not be used; it means the Python encoding named 'mbcs' should not be used, unless the program will never be ported.
In Python the encoding 'mbcs' is synonymous with 'dbcs'; it refers to whatever encoding MBCS denotes in the current Windows environment. Python on Linux has no such encoding, so porting to Linux will always raise an exception! Moreover, which encoding 'mbcs' refers to differs with the Windows region setting. Here are the results of running the code from section 2.4 under different region settings:
# Chinese (Simplified, PRC)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp936')
# locale.getpreferredencoding(): cp936
# '\xba\xba'.decode('mbcs'): u'\u6c49'

# English (United States)
# sys.getdefaultencoding(): UTF-8
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp1252')
# locale.getpreferredencoding(): cp1252
# '\xba\xba'.decode('mbcs'): u'\xba\xba'

# German (Germany)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp1252')
# locale.getpreferredencoding(): cp1252
# '\xba\xba'.decode('mbcs'): u'\xba\xba'

# Japanese (Japan)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp932')
# locale.getpreferredencoding(): cp932
# '\xba\xba'.decode('mbcs'): u'\uff7a\uff7a'
As can be seen, after the region is changed, decoding with 'mbcs' gives incorrect results. So when we mean GBK, we should write 'GBK' directly instead of 'mbcs'.
'utf-16' is similar: although on the vast majority of operating systems 'utf-16' is synonymous with 'utf-16-le', writing 'utf-16-le' costs only three more characters, whereas if some operating system makes 'utf-16' synonymous with 'utf-16-be', the result will be wrong. In practice UTF-16 is used fairly rarely, but it still needs attention when it is used.
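One more detail worth knowing (a fact about CPython's codec, not from the original article): Python's own 'utf-16' codec not only picks the native byte order but also prepends a BOM when encoding, so being explicit avoids surprises:

# coding: UTF-8
print repr(u'汉'.encode('utf-16'))     # '\xff\xfeIl' on a little-endian machine: BOM + LE bytes
print repr(u'汉'.encode('utf-16-le'))  # 'Il' -- no BOM, byte order spelled out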
--end--