Deep analysis of Python character encoding

Source: Internet
Author: User
Python's string encoding rules have always been a headache for me. After spending some time studying them, I found they are not that complicated. This article covers the characteristics of the commonly used character encodings and describes how to deal with encoding problems in Python 2.x. The Python content here applies only to 2.x; str and unicode changed dramatically in 3.x, so please refer to other material for that.

1. Introduction to character encoding

1.1. ASCII

ASCII (American Standard Code for Information Interchange) is a single-byte encoding. The computer world began with English only, and a single byte can represent 256 different values, enough for all English characters and many control symbols. ASCII, however, uses only the lower half of that range (the values below \x80), and this is what makes MBCS possible.

1.2. MBCS

Before long, however, other languages arrived in the computer world, and single-byte ASCII was no longer enough. Each language then developed an encoding of its own. Because a single byte can represent too few characters, and compatibility with ASCII had to be kept, these encodings use multiple bytes per character: GBxxx, BIGxxx, and so on. Their rule is: if the first byte is below \x80, it still represents an ASCII character; if it is \x80 or above, it and the following byte together (two bytes in total) represent one character, after which the following byte is skipped and scanning continues.
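The scanning rule described above can be sketched in a few lines. This is only an illustration of the simplified rule as stated in this article (lead byte below \x80 means one byte, otherwise two); the function name split_dbcs is made up for the example, and real encodings such as GBK restrict the valid lead and trail byte ranges further.

```python
def split_dbcs(data):
    """Split a double-byte-encoded byte string into per-character chunks.

    Simplified rule from the text: a byte below 0x80 stands alone (ASCII),
    any other byte is a lead byte that consumes the following byte too.
    """
    data = bytearray(data)  # indexing a bytearray yields ints on Python 2 and 3
    chunks = []
    i = 0
    while i < len(data):
        width = 1 if data[i] < 0x80 else 2
        chunks.append(bytes(data[i:i + width]))
        i += width
    return chunks


# 'a', the character U+6C49 and 'b', encoded as GBK
sample = u'a\u6c49b'.encode('gbk')
chunks = split_dbcs(sample)
print(chunks)
```

The middle chunk is the two-byte GBK sequence \xba\xba that appears again later in this article.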

IBM invented the concept of the code page here: it gathered these encodings together and assigned each a page number. GBK is page 936, that is, CP936, so CP936 can also be used to refer to GBK.

MBCS (Multi-Byte Character Set) is the generic term for these encodings. So far they have all used two bytes, so they are sometimes also called DBCS (Double-Byte Character Set). It is important to understand that MBCS is not one specific encoding: on Windows, MBCS refers to a different encoding depending on the region you have set, and on Linux MBCS cannot be used as an encoding at all. You will not see the term MBCS in Windows itself, because Microsoft prefers the fancier-sounding name ANSI; the ANSI encoding offered in Notepad's Save As dialog is MBCS. In the default region setting of Simplified Chinese Windows, it refers to GBK.

1.3. Unicode

Later, people began to feel that too many encodings made the world too complicated and gave everyone a headache, so they sat down together and came up with a solution: represent the characters of all languages in a single character set. That is Unicode.

The original Unicode standard, UCS-2, uses two bytes to represent one character, which is why you often hear that Unicode uses two bytes per character. Before long, though, some people felt that 256*256 was too few, so the UCS-4 standard appeared, which uses 4 bytes per character. UCS-2 is still the one we use most.

UCS (Unicode Character Set) is only a table mapping characters to code points; for example, the code point of the character "汉" is 6c49. How characters are actually transmitted and stored is the job of UTF (UCS Transformation Format).

At first things were simple: the UCS code point was stored directly, which is UTF-16. For example, "汉" is stored as \x6c\x49 (UTF-16-BE), or the other way round as \x49\x6c (UTF-16-LE). But the Americans, once they started using it, felt badly short-changed: a letter of the English alphabet used to need only one byte, and now every character takes two, so space consumption doubles... And so UTF-8 was born.
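The two byte orders can be checked directly with Python's built-in codecs. A minimal sketch (the variable names are only for illustration; the character is written as an escape so the snippet does not depend on the source file encoding):

```python
han = u'\u6c49'  # the character whose code point is 6c49

big_endian = han.encode('utf-16-be')     # high byte first
little_endian = han.encode('utf-16-le')  # low byte first

print(repr(big_endian))
print(repr(little_endian))
```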

UTF-8 is a rather awkward encoding: it is variable-length and compatible with ASCII, with ASCII characters taking 1 byte each. But what is saved there has to be paid for somewhere else: you have surely heard that Chinese characters take 3 bytes in UTF-8? Characters that need 4 bytes are even worse off... (For how exactly UCS-2 is turned into UTF-8, please search for it yourself.)
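For readers who do not want to search, here is a rough sketch of the three-byte case only. It assumes the standard UTF-8 bit layout 1110xxxx 10xxxxxx 10xxxxxx for code points from 0x0800 to 0xFFFF; the helper name utf8_encode_3byte is invented for this example and it deliberately ignores the 1-, 2- and 4-byte forms and the surrogate range.

```python
def utf8_encode_3byte(code_point):
    """Encode one code point in the range 0x0800..0xFFFF as 3 UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError('this sketch handles only the 3-byte range')
    b1 = 0xE0 | (code_point >> 12)          # 1110xxxx: top 4 bits
    b2 = 0x80 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    b3 = 0x80 | (code_point & 0x3F)         # 10xxxxxx: low 6 bits
    return bytes(bytearray([b1, b2, b3]))


encoded = utf8_encode_3byte(0x6C49)
print(repr(encoded))
```

The result can be compared against the built-in codec with u'\u6c49'.encode('utf-8').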

Also worth mentioning is the BOM (Byte Order Mark). When we save a file, the encoding used is not stored with it, so when opening the file we have to remember which encoding we saved it with and open it with that encoding, which causes a lot of trouble. (You may object that Notepad does not ask for an encoding when opening a file. Try starting Notepad first and then using File -> Open.) UTF introduced the BOM to let a file announce its own encoding: if the first few bytes read are one of the following, the text that follows uses the corresponding encoding:

BOM_UTF8      '\xef\xbb\xbf'
BOM_UTF16_LE  '\xff\xfe'
BOM_UTF16_BE  '\xfe\xff'

Not every editor writes a BOM, but Unicode text can still be read without one; just as with MBCS encodings, you then have to specify the encoding explicitly, otherwise decoding will fail.
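The BOM check can be sketched with the constants in the standard codecs module. This is an illustrative helper, not a complete detector: the names detect_bom and decode_with_bom and the UTF-8 fallback are assumptions of this example, and it looks only for the three marks listed above (UTF-32 marks, for instance, are not handled).

```python
import codecs

# (mark, encoding to use for the bytes that follow the mark)
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, fallback='utf-8'):
    """Decode bytes using the BOM if there is one, else the fallback encoding."""
    name, skip = detect_bom(data)
    return data[skip:].decode(name or fallback)


with_bom = codecs.BOM_UTF16_LE + u'\u6c49'.encode('utf-16-le')
print(detect_bom(with_bom))
print(repr(decode_with_bom(with_bom)))
```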

You may have heard that UTF-8 does not need a BOM. That is not quite true; it is just that most editors fall back to UTF-8 as the default when there is no BOM. Even Notepad, which saves as ANSI (MBCS) by default, first tries UTF-8 when reading a file and uses it if decoding succeeds. This awkward behaviour of Notepad causes a bug: create a new text file, type "姹塧", save it as ANSI (MBCS), and when you open it again it reads "汉a". Try it :)
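The Notepad effect can be reproduced without Notepad. The sketch below assumes that ANSI means GBK (as on Simplified Chinese Windows) and models Notepad's behaviour simply as "try UTF-8 first"; it starts from the bytes rather than from typed text so that it does not depend on the source file encoding.

```python
# The four bytes that are both valid GBK (two characters)
# and valid UTF-8 (U+6C49 followed by 'a').
ambiguous = b'\xe6\xb1\x89a'

typed = ambiguous.decode('gbk')    # what the user typed, read as GBK
saved = typed.encode('gbk')        # "Save as ANSI" on a GBK system
reopened = saved.decode('utf-8')   # the editor tries UTF-8 first and succeeds

print(len(typed), repr(reopened))
```

Two characters go in, and a different two characters come out, although not a single byte on disk has changed.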

2. Encoding issues in Python 2.x

2.1. str and unicode

Both str and unicode are subclasses of basestring. Strictly speaking, str is really a byte string: the sequence of bytes produced by encoding unicode. Applying len() to the UTF-8 encoded str '汉' gives 3, because the UTF-8 encoded '汉' == '\xe6\xb1\x89'.

unicode is the real string, obtained by decoding the byte string str with the correct character encoding, and len(u'汉') == 1.
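The difference in length is easy to verify. The snippet below is a small sketch that runs on Python 2.7 and also on Python 3, where the two types are named bytes and str instead of str and unicode; the character is written as an escape so that no encoding declaration is needed.

```python
text = u'\u6c49'             # one character (unicode in 2.x, str in 3.x)
data = text.encode('utf-8')  # its UTF-8 bytes (str in 2.x, bytes in 3.x)

print(len(text))   # number of characters
print(len(data))   # number of bytes
print(repr(data))
```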

Now look at the two basestring instance methods encode() and decode(). Once the difference between str and unicode is understood, these two methods will no longer be confusing:

    # coding: UTF-8

    u = u'汉'
    print repr(u)  # u'\u6c49'
    s = u.encode('UTF-8')
    print repr(s)  # '\xe6\xb1\x89'
    u2 = s.decode('UTF-8')
    print repr(u2)  # u'\u6c49'

    # Decoding unicode is wrong
    # s2 = u.decode('UTF-8')
    # Likewise, encoding str is wrong
    # u2 = s.encode('UTF-8')

Note that although calling encode() on a str is a mistake, Python does not always raise an exception: for ASCII-only content it returns another str with the same content but a different id, and the same holds for calling decode() on unicode. With non-ASCII content, Python 2 first performs an implicit conversion using the default encoding (normally ASCII), which typically fails with UnicodeDecodeError or UnicodeEncodeError. I do not understand why encode() and decode() were put on basestring rather than on unicode and str respectively, but since that is how it is, we have to be careful not to make this mistake.

2.2. Character encoding declaration

If a source file uses non-ASCII characters, a character encoding declaration is needed at the top of the file, as follows:

    # -*- coding: UTF-8 -*-

In fact Python only looks for the #, the word coding, and the encoding name; all the other characters are there for looks. Python supports many character encodings, and many of them have case-insensitive aliases; UTF-8, for example, can also be written as u8. See http://docs.python.org/library/codecs.html#standard-encodings.
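How little of the declaration matters can be sketched with a regular expression. This is a simplified illustration rather than Python's actual parser: the pattern and the two-line limit follow PEP 263 as I understand it, and the function name declared_encoding is made up for the example. codecs.lookup() is used to resolve aliases such as u8 to their canonical name.

```python
import codecs
import re

# Simplified form of the PEP 263 pattern: "coding", then ":" or "=", then a name.
CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')


def declared_encoding(first_lines):
    """Return the canonical encoding name declared in the first two lines, or None."""
    for line in first_lines[:2]:
        if line.lstrip().startswith('#'):
            match = CODING_RE.search(line)
            if match:
                return codecs.lookup(match.group(1)).name
    return None


print(declared_encoding(['# -*- coding: UTF-8 -*-']))
print(declared_encoding(['#!/usr/bin/env python', '# coding=u8']))
print(declared_encoding(['import os']))
```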

Note also that the declared encoding must match the encoding actually used when the file is saved, otherwise there is a good chance of errors when the code is parsed. IDEs nowadays generally handle this automatically, re-saving the file in the new encoding when the declaration is changed, but with a plain text editor you need to be careful :)

2.3. Read and Write files

When a file is opened with the built-in open(), read() returns str, which then has to be decode()d with the correct encoding. For write(), if the argument is unicode, you need to encode() it with the encoding you want to write; if it is a str in some other encoding, first decode() it with that str's encoding to get unicode and then encode() it with the target encoding. If you pass unicode directly to write(), Python encodes it implicitly before writing, using the default encoding reported by sys.getdefaultencoding() (ASCII on a stock Python 2, so non-ASCII text raises UnicodeEncodeError) rather than an encoding of your choice.

    # coding: UTF-8

    f = open('test.txt')
    s = f.read()
    f.close()
    print type(s)  # <type 'str'>
    # Known to be GBK encoded; decode to unicode
    u = s.decode('GBK')

    f = open('test.txt', 'w')
    # Encode into a UTF-8 encoded str
    s = u.encode('UTF-8')
    f.write(s)
    f.close()
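The same decode-then-encode round trip can be wrapped in a small helper. The version below is a sketch that opens the file in binary mode so that it behaves the same on Python 2.7 and Python 3; the function name convert_file and the temporary file are assumptions made for the example.

```python
import os
import tempfile


def convert_file(path, source_encoding, target_encoding):
    """Re-encode a text file in place: bytes -> text -> bytes."""
    with open(path, 'rb') as f:
        raw = f.read()
    text = raw.decode(source_encoding)       # bytes in the old encoding -> text
    with open(path, 'wb') as f:
        f.write(text.encode(target_encoding))  # text -> bytes in the new encoding


# Create a GBK-encoded sample file, then convert it to UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'wb') as f:
    f.write(u'\u6c49'.encode('gbk'))

convert_file(path, 'gbk', 'utf-8')

with open(path, 'rb') as f:
    result = f.read()
print(repr(result))
```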
 
  

In addition, the codecs module provides an open() function that lets you specify an encoding when opening a file; reading from a file opened this way returns unicode. When writing, if the argument is unicode, it is encoded with the encoding given to open() and then written; if it is str, it is first decoded to unicode with the default encoding (sys.getdefaultencoding(): GBK on the author's system, as section 2.4 shows, but ASCII on a stock Python 2) and then handled as above. Compared with the built-in open(), this approach is less prone to encoding problems.

    # coding: GBK

    import codecs

    f = codecs.open('test.txt', encoding='UTF-8')
    u = f.read()
    f.close()
    print type(u)  # <type 'unicode'>

    f = codecs.open('test.txt', 'a', encoding='UTF-8')
    # Write unicode
    f.write(u)

    # Write str: decoding and encoding are done automatically
    # GBK encoded str
    s = '汉'
    print repr(s)  # '\xba\xba'
    # The GBK encoded str is first decoded to unicode, then encoded to UTF-8 and written
    f.write(s)
    f.close()
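As an alternative to codecs.open(), the io module (available since Python 2.6) offers io.open() with an encoding argument; note that this is a substitute chosen for this sketch, not what the article's code uses. It accepts only text when writing, so a GBK byte string has to be decoded explicitly instead of relying on an implicit conversion. The temporary path is an assumption of the example.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Write text; io.open encodes it as UTF-8 on the way out.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u6c49')

# Append a GBK byte string: decode it explicitly first.
gbk_bytes = b'\xba\xba'
with io.open(path, 'a', encoding='utf-8') as f:
    f.write(gbk_bytes.decode('gbk'))

# Reading returns text, already decoded.
with io.open(path, encoding='utf-8') as f:
    content = f.read()

# The bytes actually stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(repr(content), repr(raw))
```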
 
  

2.4. Encoding-related functions

The sys and locale modules provide some functions for finding out the default encodings of the current environment.

    # coding: gbk

    import sys
    import locale

    def p(f):
        print '%s.%s(): %s' % (f.__module__, f.__name__, f())

    # Returns the default character encoding used by the current system
    p(sys.getdefaultencoding)

    # Returns the encoding used to convert unicode file names to system file names
    p(sys.getfilesystemencoding)

    # Gets the default locale and returns a tuple (language, encoding)
    p(locale.getdefaultlocale)

    # Returns the user's preferred encoding for text data
    # The documentation notes that this function only returns a guess
    p(locale.getpreferredencoding)

    # \xba\xba is the GBK encoding of '汉'
    # mbcs is an encoding that should not be used; it is tested here only to show why
    print r"'\xba\xba'.decode('mbcs'):", repr('\xba\xba'.decode('mbcs'))

    # Results on the author's Windows (region set to Chinese (Simplified, China))
    #sys.getdefaultencoding(): gbk
    #sys.getfilesystemencoding(): mbcs
    #locale.getdefaultlocale(): ('zh_CN', 'cp936')
    #locale.getpreferredencoding(): cp936
    #'\xba\xba'.decode('mbcs'): u'\u6c49'
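A portable variant of this probe, as a rough sketch: it collects the same kind of information into a dictionary and leaves out locale.getdefaultlocale() (deprecated in recent Python 3 releases) and the Windows-only 'mbcs' test. The function name environment_encodings is invented here, and the values printed will differ from machine to machine.

```python
import locale
import sys


def environment_encodings():
    """Collect the encodings the interpreter reports for this environment."""
    return {
        'sys.getdefaultencoding': sys.getdefaultencoding(),
        'sys.getfilesystemencoding': sys.getfilesystemencoding(),
        # False: report the preference without changing the process locale
        'locale.getpreferredencoding': locale.getpreferredencoding(False),
    }


encodings = environment_encodings()
for name, value in sorted(encodings.items()):
    print('%s(): %s' % (name, value))
```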

3. Some recommendations

3.1. Use the character encoding declaration, and all source code files in the same project use the same character encoding declaration.

This is something that must be done.

3.2. Drop str and use unicode everywhere.

Typing the u in front of the quotation mark takes some getting used to at first, and you will often forget it and have to go back and add it, but doing so eliminates 90% of encoding problems. If your encoding problems are not serious, you can skip this recommendation.

3.3. Replace the built-in open() with codecs.open().

If your encoding problems are not serious, you can skip this recommendation.

3.4. Character encodings that must absolutely be avoided: MBCS/DBCS and UTF-16.

MBCS here does not mean that GBK and the like must not be used; it means do not use the Python encoding named 'MBCS', unless the program will never be ported at all.

In Python the encoding 'MBCS' is synonymous with 'DBCS' and refers to whichever encoding MBCS denotes in the current Windows environment. The Linux implementation of Python has no such encoding, so porting to Linux is guaranteed to raise an exception! In addition, the encoding that MBCS refers to differs with the Windows region setting. Here are the results of running the code from section 2.4 under different region settings:

    #Chinese (Simplified, China)
    #sys.getdefaultencoding(): gbk
    #sys.getfilesystemencoding(): mbcs
    #locale.getdefaultlocale(): ('zh_CN', 'cp936')
    #locale.getpreferredencoding(): cp936
    #'\xba\xba'.decode('mbcs'): u'\u6c49'

    #English (United States)
    #sys.getdefaultencoding(): UTF-8
    #sys.getfilesystemencoding(): mbcs
    #locale.getdefaultlocale(): ('zh_CN', 'cp1252')
    #locale.getpreferredencoding(): cp1252
    #'\xba\xba'.decode('mbcs'): u'\xba\xba'

    #German (Germany)
    #sys.getdefaultencoding(): gbk
    #sys.getfilesystemencoding(): mbcs
    #locale.getdefaultlocale(): ('zh_CN', 'cp1252')
    #locale.getpreferredencoding(): cp1252
    #'\xba\xba'.decode('mbcs'): u'\xba\xba'

    #Japanese (Japan)
    #sys.getdefaultencoding(): gbk
    #sys.getfilesystemencoding(): mbcs
    #locale.getdefaultlocale(): ('zh_CN', 'cp932')
    #locale.getpreferredencoding(): cp932
    #'\xba\xba'.decode('mbcs'): u'\uff7a\uff7a'

As you can see, once the region is changed, decoding with 'MBCS' gives the wrong result. So when we need 'GBK', we should write 'GBK' directly and not 'MBCS'.
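The effect can be reproduced on any platform by naming the code pages explicitly instead of going through 'MBCS'. A small sketch, assuming the three code pages that appear in the results above:

```python
raw = b'\xba\xba'

# Decode the same two bytes under the code pages that 'MBCS' stood for
# in the Chinese, English/German and Japanese region settings.
results = dict((code_page, raw.decode(code_page))
               for code_page in ('cp936', 'cp1252', 'cp932'))

for code_page in sorted(results):
    print(code_page, repr(results[code_page]))
```

Only cp936 (GBK) gives the intended character; the other two silently produce different text without raising any error.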

The same goes for UTF-16. Although on the vast majority of operating systems 'UTF-16' is synonymous with 'UTF-16-LE', writing 'UTF-16-LE' costs only 3 more characters, and if on some operating system 'UTF-16' turns out to be synonymous with 'UTF-16-BE', the result will be wrong. In practice UTF-16 is used quite rarely, but it needs care when it is used.
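The difference between the generic and the explicit codec name can be sketched as follows. As far as I know, Python's 'utf-16' codec writes a BOM followed by the text in the machine's native byte order when encoding, and honours the BOM when decoding, while 'utf-16-le' always produces the same bytes with no BOM; the variable names are only for illustration.

```python
import codecs

han = u'\u6c49'

explicit = han.encode('utf-16-le')  # fixed byte order, no BOM
generic = han.encode('utf-16')      # BOM + native byte order of this machine

print(repr(explicit))
print(repr(generic))

# Decoding with the generic name works because the BOM says which order was used.
round_trip = generic.decode('utf-16')
```

Data written with the generic name and later read with an explicit name keeps the BOM as a stray character at the start of the text, which is one more reason to pick one name and use it consistently.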
