[Repost] Python Character Encoding Explained
Link: http://www.cnblogs.com/huxi/archive/2010/12/05/1897271.html
1. Character Encoding Overview
1.1. ASCII
ASCII (American Standard Code for Information Interchange) is a single-byte encoding. In the early days of computing only English was used, and a single byte can represent 256 different characters, enough for all English letters plus many control symbols. ASCII, however, uses only half of that range (code points below \x80), and this unused upper half is also the foundation on which MBCS encodings are built.
1.2. MBCS
Soon, however, other languages appeared in the computer world, and single-byte ASCII could no longer meet the need, so each language developed an encoding of its own. Because a single byte can represent too few characters, and the new encodings also had to stay compatible with ASCII, they use multiple bytes to represent a character; the GBxxx and BIGxxx families are examples. Their rule is: if the first byte is below \x80, it still represents an ASCII character; otherwise, it represents a character together with the next byte (two bytes in total), after which the next byte is skipped and scanning continues.
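To make the rule concrete, here is a minimal sketch of how such a decoder walks a byte string. This is only an illustration of the idea; real GBK has additional rules about which trail bytes are legal:

```python
# Simplified sketch of an MBCS (e.g. GBK-style) character count:
# lead byte below \x80 means one ASCII character, otherwise two bytes
# together form one character.
def count_mbcs_chars(data):
    count, i = 0, 0
    while i < len(data):
        if data[i] < '\x80':  # plain ASCII character
            i += 1
        else:                 # this byte plus the next form one character
            i += 2
        count += 1
    return count

print count_mbcs_chars('abc')          # 3
print count_mbcs_chars('\xba\xba ok')  # 4 -- '\xba\xba' is one GBK character
```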
At this point IBM invented the concept of the Code Page, which collected these encodings and assigned each one a page number. GBK is page 936, i.e. CP936, so CP936 can also be used as a name for GBK.
MBCS (Multi-Byte Character Set) is the collective name for these encodings. Since only two bytes have been used so far, it is also sometimes called DBCS (Double-Byte Character Set). It must be made clear that MBCS is not any one specific encoding: in Windows, MBCS refers to different encodings depending on the region you set, and in Linux there is no MBCS encoding at all. You will not see the name MBCS anywhere in Windows, because Microsoft, to sound more sophisticated, uses the term ANSI instead: in Notepad's Save As dialog, the encoding labelled ANSI is MBCS. Under the default region settings of Simplified Chinese Windows, it means GBK.
1.3. Unicode
Later, some people concluded that too many encodings were making the world too complicated and too painful, so they sat down together, put their heads together, and came up with a method: represent the characters of every language with one and the same character set. This is Unicode.
The original Unicode standard, UCS-2, uses two bytes to represent one character, which is why you often hear that Unicode uses two bytes per character. Soon, however, some people felt that 256*256 was too small, still not enough, so the UCS-4 standard appeared, which uses 4 bytes per character. Even so, UCS-2 remains the most widely used.
The Unicode Character Set (UCS) is only a table mapping characters to code points; for example, the code point of the character '汉' is 6C49. It is UTF (UCS Transformation Format) that is responsible for how characters are actually transmitted and stored.
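In Python 2 this mapping can be checked directly; the code point is simply what ord() returns for a unicode character:

```python
# The code point of '汉' is 6C49, regardless of how it is stored on disk
u = u'\u6c49'
print hex(ord(u))  # 0x6c49
```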
At first this was very simple: just store the UCS code point directly. That is UTF-16; for example, '汉' is stored directly as \x6C\x49 (UTF-16-BE) or as \x49\x6C (UTF-16-LE). But once it came into use, Americans felt they had taken a big loss: English letters used to need only one byte each, and now they took two, doubling the space consumed... So UTF-8 was born.
UTF-8 is an awkward encoding: it is variable-length and compatible with ASCII, with ASCII characters represented in 1 byte. But what is saved here has to be taken from somewhere else: you have surely heard that Chinese characters take 3 bytes in UTF-8, right? Characters that need 4 bytes shed even more tears... (For exactly how UCS-2 is transformed into UTF-8, please search the web.)
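The trade-off is easy to see by comparing the bytes each format produces (a quick sketch using the codec names Python 2 accepts):

```python
u = u'\u6c49'  # the character '汉'
print u.encode('utf-16-be').encode('hex')  # 6c49   -- code point stored big-endian
print u.encode('utf-16-le').encode('hex')  # 496c   -- same bytes, little-endian
print u.encode('utf-8').encode('hex')      # e6b189 -- 3 bytes for a Chinese character
print u'a'.encode('utf-8').encode('hex')   # 61     -- ASCII characters stay 1 byte
```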
Another thing worth mentioning is the BOM (Byte Order Mark). When we store a file, the encoding used is not saved along with it; when opening it, we need to remember the encoding that was used for saving and open it with that encoding, which causes a lot of trouble. (You may want to say: doesn't Notepad let you choose an encoding when opening a file? Try opening Notepad first and then using File -> Open, and see.) UTF therefore introduces a BOM to announce its own encoding: if the first few bytes read are one of the following, then the text that follows is in the corresponding encoding:
BOM_UTF8 '\xef\xbb\xbf'
BOM_UTF16_LE '\xff\xfe'
BOM_UTF16_BE '\xfe\xff'
Not all editors write a BOM, but even without one, Unicode can still be read; it is just that, as with MBCS encodings, the specific encoding must be supplied separately, otherwise decoding fails.
You may have heard that UTF-8 does not need a BOM. That is not true; it is just that most editors use UTF-8 as the default encoding when no BOM is present. Even Notepad, which defaults to ANSI (MBCS) when saving, first tries the UTF-8 codec when reading a file, and if the content decodes successfully, it decodes it as UTF-8. This awkward habit of Notepad produces a BUG: if you create a text file, type "姹塧", save it as ANSI (MBCS), and then open it again, it becomes "汉a". Give it a try :)
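Python's codecs module exposes exactly these BOM constants, so a small sniffing helper is easy to write. The sniff_bom() function below is a hypothetical helper for illustration, not part of any standard library:

```python
import codecs

# Return the encoding indicated by a leading BOM, or None if there is none.
def sniff_bom(data):
    for bom, enc in ((codecs.BOM_UTF8, 'utf-8'),
                     (codecs.BOM_UTF16_LE, 'utf-16-le'),
                     (codecs.BOM_UTF16_BE, 'utf-16-be')):
        if data.startswith(bom):
            return enc
    return None

print sniff_bom('\xef\xbb\xbfhello')  # utf-8
print sniff_bom('hello')              # None -- no BOM, encoding must be known
```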
2. Encoding Issues in Python 2.x
2.1. str and unicode
Both str and unicode are subclasses of basestring. Strictly speaking, str is actually a byte string: a sequence of bytes produced by encoding unicode. Calling len() on the UTF-8-encoded str '汉' returns 3, because in fact the UTF-8 encoding of '汉' is '\xe6\xb1\x89'.
unicode is the true string; it is what you get after decoding the byte string str with the correct character encoding, and len(u'汉') == 1.
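The difference shows up immediately when both forms of the same character are measured:

```python
s = '\xe6\xb1\x89'  # the UTF-8 byte string for '汉'
u = u'\u6c49'       # the corresponding unicode string
print len(s)  # 3 -- len() counts bytes in a str
print len(u)  # 1 -- len() counts characters in a unicode
```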
Now let's look at the two basestring instance methods, encode() and decode(). Once the difference between str and unicode is understood, these two methods will no longer be confused:
```python
# coding: UTF-8

u = u'汉'
print repr(u)   # u'\u6c49'

s = u.encode('UTF-8')
print repr(s)   # '\xe6\xb1\x89'

u2 = s.decode('UTF-8')
print repr(u2)  # u'\u6c49'

# Decoding unicode is incorrect
# s2 = u.decode('UTF-8')
# Likewise, encoding str is incorrect
# u2 = s.encode('UTF-8')
```
It should be noted that although calling encode() on a str is incorrect, Python does not necessarily throw an exception: it first performs a hidden conversion using the ASCII default encoding, so for pure-ASCII content it silently returns another str with the same content but a different id, while non-ASCII content raises a UnicodeDecodeError. The same applies to calling decode() on unicode. I do not understand why encode() and decode() were not placed on unicode and str respectively instead of both on basestring, but since that is how it is, we must be careful not to misuse them.
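A quick demonstration of this trap on a stock Python 2 (where the default encoding is ASCII):

```python
s = 'abc'
s2 = s.encode('UTF-8')  # conceptually wrong, but silent for pure-ASCII content
print s == s2           # True -- same content, though typically a distinct object

# With non-ASCII content, the hidden ASCII decode fails loudly:
try:
    '\xe6\xb1\x89'.encode('UTF-8')
except UnicodeDecodeError as e:
    print 'implicit ASCII decode failed:', e
```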
2.2. Character Encoding Declaration
If non-ASCII characters are used in a source code file, a character encoding declaration must be placed at the top of the file, as follows:
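The conventional form, per PEP 263, is a comment in the first or second line of the file:

```python
#-*- coding: utf-8 -*-
```

A plain "# coding: utf-8" (the form used in the examples in this article) works just as well, since Python only pattern-matches the declaration, as explained next.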
In fact, Python checks only for the #, the word coding, and the encoding string; the other characters are there purely for looks. Also, Python has many encodings available, and many aliases, all case-insensitive; UTF-8, for example, can be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.
Also note that the declared encoding must match the encoding actually used when the file is saved, otherwise the source cannot be parsed and an exception is raised. Most IDEs now handle this automatically, re-saving the file in the declared encoding after you change the declaration, but users of plain text editors should take care :)
2.3. Reading and Writing Files
When a file opened with the built-in open() is read, read() returns a str, which must then be decode()d with the correct encoding. When write()ing, if the parameter is unicode, you must first encode() it with the encoding you want to write; if it is a str in some other encoding, you must first decode() it with that str's own encoding into unicode, and then encode() it with the target encoding. If you pass unicode directly to write(), Python implicitly encodes it first and then writes; this implicit conversion uses the default encoding, which is ASCII on a stock Python 2 (note that some installations change it via site configuration, as the outputs in section 2.4 show), so non-ASCII characters will raise an exception. The character encoding declaration only affects how the source file itself is parsed, not these runtime conversions.
```python
# coding: UTF-8

f = open('test.txt')
s = f.read()
f.close()
print type(s)  # <type 'str'>
# The file is known to be GBK-encoded, so decode to unicode
u = s.decode('GBK')

f = open('test.txt', 'w')
# Encode into a UTF-8 str before writing
s = u.encode('UTF-8')
f.write(s)
f.close()
```
In addition, the codecs module provides an open() method that lets you specify an encoding when opening a file; read() on a file opened this way returns unicode. When writing, if the parameter is unicode, it is encoded with the encoding specified in open() and then written; if it is a str, it is first implicitly decoded to unicode (using the default encoding, normally ASCII, so a non-ASCII str will fail here on a stock Python 2) and then handled as above. Compared with the built-in open(), this method is much less prone to encoding problems.
```python
# coding: GBK

import codecs

f = codecs.open('test.txt', encoding='UTF-8')
u = f.read()
f.close()
print type(u)  # <type 'unicode'>

f = codecs.open('test.txt', 'a', encoding='UTF-8')
# Writing unicode
f.write(u)

# When writing a str, decoding and encoding happen implicitly.
# '汉' here is a GBK-encoded str:
s = '汉'
print repr(s)  # '\xba\xba'
# The GBK str is decoded to unicode, then encoded to UTF-8 and written.
# (This relies on the implicit decode succeeding; with the stock ASCII
# default encoding, a non-ASCII str would raise UnicodeDecodeError here.)
f.write(s)
f.close()
```
2.4. Encoding-Related Methods
The sys and locale modules provide methods for obtaining the default encodings of the current environment.
```python
# coding: GBK

import sys
import locale

def p(f):
    print '%s.%s(): %s' % (f.__module__, f.__name__, f())

# Return the default character encoding currently used by the system
# (the stock default is ascii; this machine's site configuration changed it)
p(sys.getdefaultencoding)

# Return the encoding used to convert Unicode file names into system file names
p(sys.getfilesystemencoding)

# Get the default locale; returns a tuple (language, encoding)
p(locale.getdefaultlocale)

# Return the encoding used for the user's text data
# This function only ever returns a guess
p(locale.getpreferredencoding)

# '\xba\xba' is the GBK encoding of '汉'
# mbcs is not a recommended encoding; it is tested here only to show
# why it should be avoided
print r"'\xba\xba'.decode('mbcs'):", repr('\xba\xba'.decode('mbcs'))

# Results on my Windows machine (region set to Chinese (Simplified, China)):
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp936')
#locale.getpreferredencoding(): cp936
#'\xba\xba'.decode('mbcs'): u'\u6c49'
```
3. Some Suggestions
3.1. Use a character encoding declaration, and use the same declaration in all source code files of the same project.
This must be done.
3.2. Abandon str; use unicode throughout.
Pressing u before hitting the quote key is genuinely hard to get used to at first, and you will often forget and have to go back and add it; but if you do this, you can eliminate 90% of encoding problems. If encoding issues do not trouble you much, feel free to skip this suggestion.
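On Python 2.6 and later, one way to avoid the forgotten prefix altogether is the unicode_literals future import, which turns every plain string literal in the file into unicode:

```python
# coding: UTF-8
from __future__ import unicode_literals  # available since Python 2.6

s = '汉'       # a unicode literal even without the u prefix
print type(s)  # <type 'unicode'>
```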
3.3. Use codecs.open() instead of the built-in open().
If encoding issues do not trouble you much, feel free to skip this suggestion.
3.4. Encodings that absolutely must be avoided: MBCS/DBCS and UTF-16.
By MBCS here I do not mean that GBK and friends cannot be used; I mean that you should not use the encoding actually named 'mbcs' in Python, unless the program will never be ported at all.
In Python, the encodings 'mbcs' and 'dbcs' are synonyms that refer to the MBCS encoding of the current Windows environment. There is no such encoding in a Linux Python, so porting there is guaranteed to raise an exception! Moreover, the encoding MBCS refers to changes with the configured Windows region. Here are the results of running the code from section 2.4 under different region settings:
```python
# Chinese (Simplified, China)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp936')
#locale.getpreferredencoding(): cp936
#'\xba\xba'.decode('mbcs'): u'\u6c49'

# English (USA)
#sys.getdefaultencoding(): UTF-8
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp1252')
#locale.getpreferredencoding(): cp1252
#'\xba\xba'.decode('mbcs'): u'\xba\xba'

# German (Germany)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp1252')
#locale.getpreferredencoding(): cp1252
#'\xba\xba'.decode('mbcs'): u'\xba\xba'

# Japanese (Japan)
#sys.getdefaultencoding(): gbk
#sys.getfilesystemencoding(): mbcs
#locale.getdefaultlocale(): ('zh_CN', 'cp932')
#locale.getpreferredencoding(): cp932
#'\xba\xba'.decode('mbcs'): u'\uff7a\uff7a'
```
As you can see, once the region changes, decoding with mbcs gives wrong results. So whenever we mean GBK, we should write 'gbk' directly instead of 'mbcs'.
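The portability hazard can also be checked directly: the codec simply does not exist outside Windows. A minimal check:

```python
import codecs

codecs.lookup('gbk')     # succeeds on any platform
try:
    codecs.lookup('mbcs')
except LookupError as e:
    print e              # "unknown encoding: mbcs" on non-Windows platforms
```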
UTF-16 is similar: although 'utf-16' is a synonym for 'utf-16-le' on most operating systems, writing 'utf-16-le' directly costs only three extra characters, and it guards against the possibility that on some operating system 'utf-16' is a synonym for 'utf-16-be' instead, in which case errors would occur. In practice UTF-16 is used fairly rarely, but when it is used, this still deserves attention.
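In Python 2 the difference between the generic and the explicit codec names is visible in the output bytes: 'utf-16' writes a BOM and uses the machine's byte order, while the -le/-be variants write no BOM (the first line's output shown here assumes a little-endian machine):

```python
u = u'\u6c49'
print u.encode('utf-16').encode('hex')     # fffe496c -- BOM plus little-endian data
print u.encode('utf-16-le').encode('hex')  # 496c -- explicit order, no BOM
print u.encode('utf-16-be').encode('hex')  # 6c49
```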
-- END --