Character encoding in Python: an introduction, methods, and recommendations

Source: Internet
Author: User
Tags: locale

1. Introduction to character encoding

1.1. ASCII

ASCII (American Standard Code for Information Interchange) is a single-byte encoding. In the early days the computing world was English-only, and a single byte can represent 256 different values, more than enough for all the English letters plus many control characters. ASCII, however, uses only the lower half (bytes below \x80), and that unused upper half is precisely what made MBCS encodings possible.
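A quick sketch of this property, in Python 3 syntax so it can be tried on a modern interpreter (the rest of this article uses Python 2, but the byte-level facts are the same):

```python
# ASCII assigns code points 0x00-0x7f only; any byte >= 0x80 is never ASCII,
# which is exactly the property MBCS encodings rely on.
data = 'Hello, world!'.encode('ascii')
assert all(b < 0x80 for b in data)

try:
    b'\xba\xba'.decode('ascii')   # high bytes are not valid ASCII
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)  # False
```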

1.2. MBCS

However, the computing world soon had to accommodate other languages, and single-byte ASCII was no longer enough. Each language therefore developed its own encoding. Because a single byte can represent too few characters, and compatibility with ASCII had to be preserved, these encodings use multiple bytes to represent one character (the GB* family, Big5, and so on). Their common rule is: if a byte is below \x80, it still represents an ASCII character; if it is \x80 or above, it and the next byte together (two bytes in total) represent one character, after which decoding continues from the byte that follows.

Here IBM invented the concept of the code page: each such encoding is collected and assigned a page number. GBK is page number 936, i.e. CP936, so CP936 can also be used to refer to GBK.
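The lead-byte rule can be sketched with GBK (Python 3 syntax for illustration):

```python
# Bytes below 0x80 decode as ASCII; a byte >= 0x80 starts a two-byte character.
data = b'A\xba\xbaB'          # ASCII 'A', the GBK pair \xba\xba, ASCII 'B'
text = data.decode('gbk')
assert text == 'A\u6c49B'     # \xba\xba is the GBK encoding of U+6C49 ('汉')

# 'cp936' is an alias for the same codec
assert data.decode('cp936') == text
```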

MBCS (Multi-Byte Character Set) is the collective name for these encodings. Since all of them so far use at most two bytes, MBCS is sometimes also called DBCS (Double-Byte Character Set). It is important to be clear that MBCS is not one particular encoding: on Windows it refers to different encodings depending on the locale you set, and on Linux the name MBCS cannot be used as an encoding at all. On Windows you never actually see the name MBCS, because Microsoft, confusingly, calls it ANSI instead: the "ANSI" entry in Notepad's Save As dialog is MBCS. On Simplified Chinese Windows with the default locale, it refers to GBK.

1.3. Unicode

Later, some people began to feel that this profusion of encodings made the world too complicated and gave everyone headaches, so they sat down together and came up with a plan: represent the characters of all languages with one and the same character set. That character set is Unicode.

The original Unicode standard, UCS-2, uses two bytes per character, which is why you often hear the claim that Unicode represents each character with two bytes. After a while, though, 256*256 = 65536 characters proved too few, so the UCS-4 standard appeared, using four bytes per character; but what we use most is still UCS-2.

UCS (Unicode Character Set) is only a table mapping characters to code points; for example, the code point of the character '汉' is 6C49. How characters are actually transmitted and stored is the job of UTF (UCS Transformation Format).
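The code point is just a number, independent of any byte encoding, as this Python 3 snippet shows:

```python
# ord() returns the code point of a character; no encoding is involved.
cp = ord('\u6c49')            # the character '汉'
assert cp == 0x6c49
assert hex(cp) == '0x6c49'
```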

At first the simplest approach was to store the UCS code directly, which is UTF-16: for example, '汉' is stored directly as \x6c\x49 (utf-16-be), or byte-swapped as \x49\x6c (utf-16-le). But then the Americans felt they were taking a big loss: English letters used to need only one byte each, and now each costs two, doubling the space. And so UTF-8 was born.
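The two byte orders for '汉' can be checked directly (Python 3 syntax):

```python
ch = '\u6c49'                       # '汉', code point 0x6c49
be = ch.encode('utf-16-be')         # big-endian: code point bytes in order
le = ch.encode('utf-16-le')         # little-endian: the two bytes swapped
assert be == b'\x6c\x49'
assert le == b'\x49\x6c'
```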

UTF-8 is an awkward-looking variable-length encoding that is compatible with ASCII: ASCII characters take only 1 byte. But the savings have to come from somewhere else. You have probably heard that Chinese characters need 3 bytes in UTF-8; characters that need 4 bytes shed even more tears. (For exactly how UCS-2 is transformed into UTF-8, please search for yourself.)
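The variable lengths are easy to verify (Python 3 syntax):

```python
assert len('A'.encode('utf-8')) == 1                  # ASCII: 1 byte
assert '\u6c49'.encode('utf-8') == b'\xe6\xb1\x89'    # common CJK: 3 bytes
assert len('\U0001F600'.encode('utf-8')) == 4         # outside the BMP: 4 bytes
```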

Also worth mentioning is the BOM (Byte Order Mark). When we save a file, the encoding it uses is not saved with it; when we open it again we must remember the original encoding and open it with that encoding, which creates a lot of trouble. (You may be thinking that Notepad has no option to choose an encoding when opening a file. Open Notepad first, then use File -> Open, and look again.) UTF therefore introduced a BOM to announce the encoding: if the first bytes read are one of the following, the text that follows is in the corresponding encoding:

BOM_UTF8:     '\xef\xbb\xbf'
BOM_UTF16_LE: '\xff\xfe'
BOM_UTF16_BE: '\xfe\xff'
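These constants are available in the codecs module, and BOM sniffing can be sketched as follows (Python 3 syntax; sniff_encoding is a hypothetical helper, not a standard function):

```python
import codecs

assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'
assert codecs.BOM_UTF16_LE == b'\xff\xfe'
assert codecs.BOM_UTF16_BE == b'\xfe\xff'

def sniff_encoding(data):
    """Guess an encoding from a leading BOM; None means no BOM was found."""
    for bom, name in ((codecs.BOM_UTF8, 'utf-8-sig'),
                      (codecs.BOM_UTF16_LE, 'utf-16-le'),
                      (codecs.BOM_UTF16_BE, 'utf-16-be')):
        if data.startswith(bom):
            return name
    return None

assert sniff_encoding(codecs.BOM_UTF8 + b'hello') == 'utf-8-sig'
assert sniff_encoding(b'hello') is None
```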

Not every editor writes a BOM, and even without one, Unicode-encoded files are still readable; but then, just as with MBCS encodings, the specific encoding must be supplied, or decoding will fail.

You may have heard that UTF-8 needs no BOM. That is not quite true; it is just that most editors treat UTF-8 as the default encoding when no BOM is present. Even Notepad, which defaults to ANSI (MBCS) on save, first tries decoding a file as UTF-8 when reading it, and uses UTF-8 if that decoding succeeds. This awkward habit of Notepad causes a bug: if you create a new text file, type the two characters "姹塧", and save with ANSI (MBCS), reopening the file shows "汉a" instead (the GBK bytes happen to be valid UTF-8). You may wish to try it.

2. Encoding problems in Python 2.x

2.1. str and unicode

Both str and unicode are subclasses of basestring. Strictly speaking, str is really a byte string: a sequence of bytes that is the encoded form of some unicode text. Calling len() on the UTF-8 encoded str '汉' gives 3, because the UTF-8 encoding of '汉' is '\xe6\xb1\x89'.

unicode is the true string, obtained by decoding the byte string str with the correct character encoding, and len(u'汉') == 1.

Look at the two basestring instance methods encode() and decode(); once the difference between str and unicode is understood, the two methods are no longer confusing:


# coding: utf-8
u = u'汉'
print repr(u)   # u'\u6c49'
s = u.encode('utf-8')
print repr(s)   # '\xe6\xb1\x89'
u2 = s.decode('utf-8')
print repr(u2)  # u'\u6c49'
# decoding a unicode object is an error
# s2 = u.decode('utf-8')
# likewise, encoding a str is wrong
# u2 = s.encode('utf-8')

It is important to note that although calling encode() on a str is conceptually wrong, Python does not always raise an exception: it first implicitly decodes the str with the default (ASCII) codec, then encodes. For pure-ASCII content this silently returns a new str, equal in content but with a different id; for non-ASCII content it raises a confusing UnicodeDecodeError from what looks like an encoding call. The same applies to calling decode() on a unicode object. I do not understand why encode() and decode() were put on basestring instead of keeping encode() on unicode and decode() on str, but since that is how it is, we must be careful to avoid mistakes.
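For contrast, Python 3 removed this ambiguity outright: bytes objects have no encode() and str objects have no decode(), so the mistakes described above fail loudly at the call site.

```python
# In Python 3 the wrong-direction methods simply do not exist.
s = '\u6c49'.encode('utf-8')       # bytes
assert not hasattr(s, 'encode')    # bytes cannot be encoded again
assert not hasattr('\u6c49', 'decode')  # str cannot be decoded
```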

2.2. Character encoding declaration

If non-ASCII characters are used in a source file, a character encoding declaration must be placed at the top of the file, like this:


# -*- coding: utf-8 -*-

In fact, Python only looks for the #, the word coding, and the encoding name that follows; the other characters are added purely for looks. Many character encodings are available in Python, with plenty of aliases, all case-insensitive; UTF-8, for example, can be written as u8. See http://docs.python.org/library/codecs.html.
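The aliasing and case-insensitivity can be checked through codecs.lookup(), which normalizes any spelling to the codec's canonical name (Python 3 syntax; the same lookup exists in Python 2):

```python
import codecs

# Codec names are case-insensitive and have many aliases.
assert codecs.lookup('UTF-8').name == 'utf-8'
assert codecs.lookup('u8').name == 'utf-8'
assert codecs.lookup('utf_8').name == 'utf-8'
assert codecs.lookup('cp936').name == 'gbk'   # code page 936 is GBK
```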

It is also important that the declared encoding match the encoding the file is actually saved in, otherwise parsing exceptions are very likely. IDEs now generally handle this automatically, re-saving the file in the declared encoding when the declaration changes, but with a plain text editor you need to take care yourself.

2.3. Reading and writing files

When a file is opened with the built-in open(), read() returns a str, which must be decode()d with the correct encoding. For write(), if the argument is unicode, it must first be encode()d with the desired output encoding; if it is a str in some other encoding, it must first be decode()d with that str's encoding into unicode, then encode()d with the output encoding. If a unicode object is passed directly to write(), Python implicitly encodes it with the default encoding (sys.getdefaultencoding(), normally ascii) before writing, which fails for non-ASCII text.


# coding: utf-8
f = open('test.txt')
s = f.read()
f.close()
print type(s)  # <type 'str'>
# the file is known to be GBK-encoded; decode it to unicode
u = s.decode('gbk')

f = open('test.txt', 'w')
# encode to UTF-8 before writing
s = u.encode('utf-8')
f.write(s)
f.close()

In addition, the codecs module provides an open() method that takes an encoding argument; files opened this way return unicode from read(). On write(), if the argument is unicode it is encoded with the encoding given to open(); if it is a str, it is first implicitly decoded to unicode (using the default encoding) and then encoded as above. Compared with the built-in open(), this method is much less prone to encoding problems.


# coding: gbk
import codecs
f = codecs.open('test.txt', encoding='utf-8')
u = f.read()
f.close()
print type(u)  # <type 'unicode'>

f = codecs.open('test.txt', 'a', encoding='utf-8')
# writing unicode
f.write(u)
# writing a str triggers an automatic decode, then encode
# a GBK-encoded str
s = '汉'
print repr(s)  # '\xba\xba'
# the GBK str is first decoded to unicode, then encoded to UTF-8 and written
f.write(s)
f.close()

2.4. Encoding-related methods
The sys and locale modules provide methods for obtaining the default encodings of the current environment.


# coding: gbk
import sys
import locale

def p(f):
    print '%s.%s(): %s' % (f.__module__, f.__name__, f())

# the default character encoding used by the interpreter
p(sys.getdefaultencoding)
# the encoding used to convert unicode file names to system file names
p(sys.getfilesystemencoding)
# the default locale, returned as a tuple (language, encoding)
p(locale.getdefaultlocale)
# the user's preferred text encoding
# (the documentation says this function only returns a guess)
p(locale.getpreferredencoding)
# \xba\xba is the GBK encoding of '汉'
# mbcs is not recommended in code; it is used here only to show why
print r"'\xba\xba'.decode('mbcs'):", repr('\xba\xba'.decode('mbcs'))

# Results on the author's Windows (locale set to Chinese (Simplified, PRC)):
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp936')
# locale.getpreferredencoding(): cp936
# '\xba\xba'.decode('mbcs'): u'\u6c49'

3. A number of recommendations

3.1. Use a character encoding declaration, and have all source files in the same project use the same one.
This is something that must be done.

3.2. Abandon str; use unicode everywhere.
Pressing u before typing the opening quote is genuinely hard to get used to at first, and you will often forget and go back to add it, but doing so eliminates 90% of encoding problems. If encoding problems are not a serious issue for you, you can ignore this article.
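One way to lighten the u-prefix burden in Python 2.6+ is the unicode_literals future import, which makes every unprefixed string literal in the file a unicode object (it is a harmless no-op on Python 3, so the sketch below runs there too):

```python
from __future__ import unicode_literals

s = '\u6c49'        # a unicode string even without the u prefix
assert s == u'\u6c49'
assert len(s) == 1  # one character, not three UTF-8 bytes
```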

3.3. Use codecs.open() instead of the built-in open().
If encoding problems are not a serious issue for you, you can ignore this article.

3.4. Character encodings that absolutely must be avoided: mbcs/dbcs and utf-16.
The mbcs here does not mean GBK or anything like it, but Python's codec actually named 'mbcs'; avoid it unless the program will never be ported at all.

In Python, the encoding 'mbcs' (synonym: 'dbcs') refers to whatever MBCS encoding the current Windows environment uses. Python on Linux has no such codec, so porting there is guaranteed to blow up. Moreover, whenever the Windows locale differs, so does the encoding 'mbcs' refers to. Here are the results of running the code from section 2.4 with the locale set to different regions:


# Chinese (Simplified, PRC)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp936')
# locale.getpreferredencoding(): cp936
# '\xba\xba'.decode('mbcs'): u'\u6c49'

# English (United States)
# sys.getdefaultencoding(): utf-8
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp1252')
# locale.getpreferredencoding(): cp1252
# '\xba\xba'.decode('mbcs'): u'\xba\xba'

# German (Germany)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp1252')
# locale.getpreferredencoding(): cp1252
# '\xba\xba'.decode('mbcs'): u'\xba\xba'

# Japanese (Japan)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp932')
# locale.getpreferredencoding(): cp932
# '\xba\xba'.decode('mbcs'): u'\uff7a\uff7a'

As you can see, once the locale changes, decoding with mbcs gives incorrect results. So when we mean 'gbk', we should write 'gbk', not 'mbcs'.

utf-16 is similar: although on most operating systems 'utf-16' is a synonym for 'utf-16-le', writing 'utf-16-le' costs only three extra characters, whereas on an operating system where 'utf-16' means 'utf-16-be' you would get wrong results. In practice UTF-16 is used rather little, but when it is used, this needs attention.
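The difference is easy to see in Python, where the plain 'utf-16' codec additionally prepends a BOM while the explicit variants do not (Python 3 syntax):

```python
ch = '\u6c49'                             # '汉'
# explicit endianness, no BOM:
assert ch.encode('utf-16-le') == b'\x49\x6c'
assert ch.encode('utf-16-be') == b'\x6c\x49'
# plain 'utf-16' writes a BOM first, in the platform's chosen byte order:
data = ch.encode('utf-16')
assert data in (b'\xff\xfe\x49\x6c', b'\xfe\xff\x6c\x49')
```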
