Python Encoding Conversion
This article introduces Python's encoding mechanism and the conversion between Unicode, UTF-8, UTF-16, GBK, GB2312, ISO-8859-1, and other encodings.
Common encoding conversions fall into the following scenarios:
Automatically detecting string encodings
You can use the chardet module to automatically detect the character encoding of a string.
See: How to use chardet
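As a rough illustration of detection, the sketch below passes raw bytes to chardet.detect(), which returns a dictionary containing the guessed encoding and a confidence value. It assumes Python 3 and that the third-party chardet package is installed. The helper name guess_encoding is hypothetical and only used here. The guess is statistical, so short inputs may be misidentified.

```python
# Sketch: guess the encoding of raw bytes with the third-party chardet package.
try:
    import chardet  # third-party; assumed to be installed (pip install chardet)
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's guess for `data` (bytes), or None if chardet is missing.

    Hypothetical helper for illustration only.
    """
    if chardet is None:
        return None
    # detect() returns a dict with keys such as 'encoding' and 'confidence'.
    return chardet.detect(data)


# UTF-8 bytes for a short Chinese phrase, repeated to give the detector more data.
sample = u"\u4e2d\u6587\u7f16\u7801\u8f6c\u6362".encode("utf-8") * 10
result = guess_encoding(sample)
print(result)
```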
Converting Unicode to other encodings (GBK, GB2312, etc.)
For example, if a is a Unicode string that should be converted to GB2312, use a.encode('gb2312').
    # -*- coding: gb2312 -*-
    a = u"中文"
    # Encode the Unicode string into GB2312 bytes
    a_gb2312 = a.encode('gb2312')
    print a_gb2312
The difference between GBK and GB2312
GB code, whose full name is GB 2312-80 "Code of Chinese Graphic Character Set for Information Interchange, Primary Set", was published in 1980. It is the national standard for Chinese text processing and the mandatory Chinese encoding in mainland China and in overseas regions that use Simplified Chinese (such as Singapore). P-Windows 3.2 and Apple OS used GB2312 as their basic Chinese character encoding. Windows 95/98 used GBK as the basic Chinese character encoding while remaining compatible with GB2312. GB code contains 6,763 Simplified Chinese characters and 682 symbols. The Chinese characters are split into 3,755 level-1 characters, ordered by pinyin, and 3,008 level-2 characters, ordered by radical. The formulation and application of this standard played a very important role in the development of Chinese information technology.
GBK is an extended Chinese encoding specification developed in mainland China and described as equivalent to UCS. The GBK Working Group drew up the GBK specification between October and December 1995. The encoding is compatible with GB2312 and contains 21,003 Chinese characters and 883 symbols. It also provides 1,894 code points for user-defined characters, and it combines Simplified and Traditional characters in a single character set.
GBK includes every character encoded in GB2312. Characters that GB2312 does not cover must be encoded with GBK.
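The relationship between the two character sets can be checked directly with Python's built-in codecs. The following is a minimal sketch, assuming Python 3. The helper can_encode is a hypothetical name introduced for illustration. It shows that a Simplified Chinese string produces identical bytes under both codecs, while a Traditional character can only be encoded with GBK.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = u"\u4e2d\u6587"   # the Simplified Chinese word for "Chinese"
traditional = u"\u9f8d"         # Traditional form of "dragon", not part of GB2312

# GB2312 text keeps the same byte values when encoded as GBK.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk")
print(same_bytes)

# The Traditional character is only available in the larger GBK set.
print(can_encode(traditional, "gb2312"))
print(can_encode(traditional, "gbk"))
```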
Reposted from: The differences between GBK, GB2312, Big5, Unicode, UTF-8, and UTF-16
Converting other encodings (UTF-8, GBK) to Unicode
For example, if a is GB2312-encoded and should be converted to Unicode, use unicode(a, 'gb2312') or a.decode('gb2312').
    # -*- coding: gb2312 -*-
    a = u"中文"
    a_gb2312 = a.encode('gb2312')
    print a_gb2312
    # Decode the GB2312 bytes back into a Unicode string
    a_unicode = a_gb2312.decode('gb2312')
    assert a_unicode == a
Converting between non-Unicode encodings
Encoding 1 (GBK, GB2312) to encoding 2 (UTF-8, UTF-16, ISO-8859-1)
Convert to Unicode first, and then convert to encoding 2.
For example, GB2312 to UTF-8:
    # -*- coding: gb2312 -*-
    a = u"中文"
    a_gb2312 = a.encode('gb2312')
    print a_gb2312
    # Step 1: decode the GB2312 bytes to Unicode
    a_unicode = a_gb2312.decode('gb2312')
    assert a_unicode == a
    # Step 2: encode the Unicode string as UTF-8
    a_utf_8 = a_unicode.encode('utf-8')
    print a_utf_8
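The two-step route through Unicode can be wrapped in a small function. This is a sketch under the assumption of Python 3, where encoded data is bytes and Unicode text is str. The name transcode is hypothetical and not part of any library.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode `data` (bytes) from one encoding to another via Unicode.

    Hypothetical helper for illustration only.
    """
    text = data.decode(source_encoding)   # step 1: bytes -> Unicode text
    return text.encode(target_encoding)   # step 2: Unicode text -> bytes


gb_bytes = u"\u4e2d\u6587".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

# GB2312 uses 2 bytes per Chinese character; UTF-8 uses 3 for these characters.
print(len(gb_bytes), len(utf8_bytes))
```

If the target encoding cannot represent a character (for example, Chinese text sent to ISO-8859-1), the encode step raises UnicodeEncodeError.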
Determining the encoding of a string
isinstance(s, str) checks whether s is an ordinary (byte) string.
isinstance(s, unicode) checks whether s is a Unicode string.
If a string is already Unicode, converting it to Unicode again can raise an error (though not in every case).
The following code converts any string to Unicode:
    def u(s, encoding):
        # Already Unicode: return it unchanged to avoid a second conversion
        if isinstance(s, unicode):
            return s
        # Otherwise decode the byte string using the given encoding
        else:
            return unicode(s, encoding)
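The same guard can be written for Python 3, where the unicode type no longer exists: str is always Unicode text and bytes holds encoded data. The sketch below is an assumption about how the idea carries over, with to_text as a hypothetical name.

```python
def to_text(s, encoding):
    """Return `s` as Unicode text, decoding only when it is still bytes.

    Hypothetical Python 3 counterpart of the idea above.
    """
    if isinstance(s, bytes):
        return s.decode(encoding)
    # Already text: decoding again is neither needed nor possible.
    return s


print(to_text(b"\xd6\xd0\xce\xc4", "gb2312") == u"\u4e2d\u6587")
print(to_text(u"\u4e2d\u6587", "gb2312") == u"\u4e2d\u6587")
```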
The difference between Unicode and other encodings
Why are files not all stored as Unicode, and why are GBK, UTF-8, and other encodings still used?
Unicode can be thought of as an abstract encoding: it is an internal representation and cannot be stored directly.
When text is saved to disk, it has to be converted to a concrete encoding such as UTF-8 or UTF-16.
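To make the distinction concrete, the sketch below (assuming Python 3) prints the abstract code points of a two-character string and then the number of bytes each concrete encoding needs to store it. The variable names are illustrative only.

```python
text = u"\u4e2d\u6587"

# The abstract view: one code point per character, independent of storage.
print([hex(ord(c)) for c in text])

# The concrete view: the same text becomes different byte sequences.
sizes = {}
for encoding in ("utf-8", "utf-16-le", "gbk"):
    encoded = text.encode(encoding)
    sizes[encoding] = len(encoded)
    # Decoding with the matching codec always recovers the original text.
    assert encoded.decode(encoding) == text

print(sizes)
```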
Other methods
In addition to the methods above, you can use the open function of the codecs module to convert encodings while reading and writing files.
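A minimal sketch of this approach, assuming Python 3: codecs.open() takes an encoding argument, so text is encoded on write and decoded on read without explicit encode() or decode() calls. The file names and the temporary directory are made up for the example.

```python
import codecs
import os
import tempfile

text = u"\u4e2d\u6587"
tmp_dir = tempfile.mkdtemp()
gbk_path = os.path.join(tmp_dir, "sample_gbk.txt")     # hypothetical file name
utf8_path = os.path.join(tmp_dir, "sample_utf8.txt")   # hypothetical file name

# Write Unicode text; codecs.open encodes it as GBK on the way out.
with codecs.open(gbk_path, "w", encoding="gbk") as f:
    f.write(text)

# Read it back as Unicode, then save the same text as UTF-8.
with codecs.open(gbk_path, "r", encoding="gbk") as src:
    content = src.read()
with codecs.open(utf8_path, "w", encoding="utf-8") as dst:
    dst.write(content)

# Inspect the raw bytes of the converted file.
with open(utf8_path, "rb") as f:
    raw = f.read()
print(raw == text.encode("utf-8"))
```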
Detecting and setting the command-line default encoding
You can detect the command line's default encoding, and set it, with the locale module that ships with Python.
    import locale
    # Returns a (language code, encoding) tuple
    locale.getdefaultlocale()  # ('zh_CN', 'cp936')
    # Set the locale
    locale.setlocale(...)
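For a quick check of what the current environment reports, the sketch below (assuming Python 3) reads the locale's preferred encoding and the encoding attached to standard output. The values depend entirely on the machine, so none are shown here.

```python
import locale
import sys

# Encoding the locale settings prefer for text data.
# Passing False avoids changing the process locale as a side effect.
preferred = locale.getpreferredencoding(False)

# Encoding used when printing to the console; may be None when redirected.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print(preferred)
print(stdout_encoding)
```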
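Converting Chinese characters to Unicode code points
    # pd_name starts as a UTF-8 byte string; decode it to Unicode first
    pd_name = pd_name.decode('utf-8')
    print pd_name
    nname = u''
    for c in pd_name:
        # ord() gives the Unicode code point of each character
        print ord(c)
        nname += c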
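The same idea in a self-contained form, assuming Python 3: ord() yields the code point of each character, and the built-in unicode_escape codec produces the escaped notation. The function name code_points is hypothetical.

```python
def code_points(text):
    """Return the code point of each character in U+XXXX notation.

    Hypothetical helper for illustration only.
    """
    return ["U+%04X" % ord(c) for c in text]


name = b"\xe4\xb8\xad\xe6\x96\x87".decode("utf-8")   # UTF-8 bytes -> Unicode text

points = code_points(name)
print(points)

# The unicode_escape codec writes non-ASCII characters as backslash escapes.
escaped = name.encode("unicode_escape").decode("ascii")
print(escaped)
```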