This article summarizes Python's handling of string encodings in detail, for your reference.
"So-called Unicode"
Unicode is an abstract code, similar to a character set: it specifies only the code point of each symbol, not how that code point should be stored as bytes. In other words, it is just an internal representation and cannot be saved directly; to store text you must choose a concrete storage form such as UTF-8 or UTF-16. In principle, Unicode is an encoding scheme that can accommodate all of the world's languages. (Other encoding formats are not discussed further here.)
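As a quick illustration (a minimal Python 2 sketch; the character 中 is used only as an example), the same code point turns into different byte sequences depending on the storage encoding:

ch = u'\u4e2d'                    # the character 中, identified only by its code point
print repr(ch.encode('utf-8'))    # '\xe4\xb8\xad'       (3 bytes in UTF-8)
print repr(ch.encode('utf-16'))   # '\xff\xfe-N'         (2-byte BOM + 2 bytes in UTF-16-LE)
print repr(ch.encode('gbk'))      # '\xd6\xd0'           (2 bytes in GBK)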
"The so-called GB code"
GB stands for 国标 (Guo Biao), i.e., the national standard of the People's Republic of China. The GB codes are Chinese-oriented encodings and include GB2312 (GB2312-80), GBK and GB18030; their character repertoires grow progressively larger and they are essentially backward compatible. In addition, you will often encounter an encoding called CP936, which can in practice be treated as GBK.
"Judgment Code"
1. Use isinstance(s, str) to determine whether a string is an ordinary byte string (str is the byte-string type; ASCII text as well as text encoded as UTF-8, UTF-16, GB2312, GBK, etc., is stored as str);
use isinstance(s, unicode) to determine whether a string is a unicode string (i.e., of type unicode). A short sketch of these checks follows.
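A minimal Python 2 sketch of both checks (the variable names are illustrative):

# -*- coding: utf-8 -*-
byte_str = '中文'           # a byte string in whatever encoding this file uses
uni_str = u'中文'           # a unicode string

print isinstance(byte_str, str)       # True  - ordinary (byte) string
print isinstance(byte_str, unicode)   # False
print isinstance(uni_str, unicode)    # True  - unicode string
print isinstance(uni_str, str)        # False
# basestring is the common parent of both types in Python 2
print isinstance(byte_str, basestring), isinstance(uni_str, basestring)   # True True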
2. Use type() or .__class__
Assuming the encoding is handled correctly:
For example, with stra = "中", type(stra) returns <type 'str'>, which indicates an ordinary (byte) string;
with strb = u"中", type(strb) returns <type 'unicode'>, which indicates a unicode string.
tmp_str = 'tmp_str'
print tmp_str.__class__         # <type 'str'>
print type(tmp_str)             # <type 'str'>
print type(tmp_str).__name__    # str

tmp_str = u'tmp_str'
print tmp_str.__class__         # <type 'unicode'>
print type(tmp_str)             # <type 'unicode'>
print type(tmp_str).__name__    # unicode
3. The best approach is to use chardet, especially for web-related work such as crawling HTML pages: the page's charset tag only declares an encoding, is sometimes wrong, and some Chinese characters in the page content may fall outside the range of the declared encoding. In these cases, detection with chardet is the most convenient and accurate option.
(1) Installation: after downloading chardet, place the extracted chardet folder into the \Lib\site-packages directory under the Python installation directory, then use import chardet in your program.
(2) Usage 1: detect the encoding from the entire content
import urllib2
import chardet

res = urllib2.urlopen('http://www.php.cn')
res_cont = res.read()
res.close()
print chardet.detect(res_cont)   # {'confidence': 0.99, 'encoding': 'utf-8'}
The detect function returns a dictionary with two key-value pairs: the detection confidence and the detected encoding.
(3) Usage 2: detect only part of the content to determine the encoding, which is faster
import urllib2
from chardet.universaldetector import UniversalDetector

res = urllib2.urlopen('http://www.php.cn')
detector = UniversalDetector()
for line in res.readlines():
    # feed lines until the detector reaches its confidence threshold
    detector.feed(line)
    if detector.done:
        break
detector.close()
res.close()
print detector.result   # {'confidence': 0.99, 'encoding': 'utf-8'}
"Conversion Encoding"
1. To convert from a specific encoding (ISO-8859-1 [ASCII-compatible], UTF-8, UTF-16, GBK, GB2312, etc.) to unicode, use unicode(s, charset) or s.decode(charset) directly, where charset is the encoding of s (note that calling decode() on a string that is already unicode raises an error);
# convert an arbitrary string to unicode
def to_unicode(s, encoding):
    if isinstance(s, unicode):
        return s
    else:
        return unicode(s, encoding)
Note: in decode(), if an illegal character is encountered (for example the non-standard full-width space \xa3\xa0, or \xa4\x57; the real full-width space is \xa1\xa1), an error is raised.
Solution: use the 'ignore' mode, i.e., stra.decode('...', 'ignore').encode('utf-8').
Explanation: the prototype of decode is decode([encoding], [errors='strict']); the second parameter controls the error-handling policy.
The default is 'strict', which throws an exception when an illegal character is encountered; 'ignore' skips illegal characters; 'replace' substitutes a replacement character; 'xmlcharrefreplace' substitutes an XML character reference (note that this handler is used with encode()). A short sketch of these modes follows.
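A small Python 2 sketch of the error-handling modes, using a deliberately truncated GBK byte sequence (the byte values are illustrative):

# '\xd6\xd0' is the valid GBK encoding of 中; the trailing '\xd6' is an
# incomplete (illegal) multibyte sequence
raw = '\xd6\xd0\xd6'

try:
    raw.decode('gbk')                     # 'strict' (the default) raises
except UnicodeDecodeError as e:
    print 'strict:', e

print repr(raw.decode('gbk', 'ignore'))   # u'\u4e2d'        - illegal bytes dropped
print repr(raw.decode('gbk', 'replace'))  # u'\u4e2d\ufffd'  - replaced with U+FFFD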
2. To convert from unicode to a specific encoding, use s.encode(charset) directly, where s is a unicode string and charset is the target encoding (note that calling encode() on a non-unicode string can also raise an error);
3. Naturally, to convert from one specific encoding to another, first decode to unicode and then encode to the target encoding, as in the sketch below.
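For example, a minimal sketch of converting a GBK byte string to UTF-8 by way of unicode (the byte values correspond to 中文 and are illustrative):

gbk_str = '\xd6\xd0\xce\xc4'           # '中文' encoded as GBK
uni_str = gbk_str.decode('gbk')        # step 1: decode to unicode
utf8_str = uni_str.encode('utf-8')     # step 2: encode to the target encoding
print repr(utf8_str)                   # '\xe4\xb8\xad\xe6\x96\x87'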
"Python Command line Encoding (System encoding)"
Use Python's built-in locale module to detect the default encoding of the command line (i.e., the system encoding) and to set the command-line encoding:
import locale

# get the current default encoding
print locale.getdefaultlocale()   # ('zh_CN', 'cp936')

# set the encoding
locale.setlocale(locale.LC_ALL, locale='zh_CN.GB2312')
print locale.getlocale()          # ('zh_CN', 'gb2312')
This shows that the current system's internal encoding is cp936, which is roughly GBK. In fact, the internal code page of Chinese XP and Win7 systems is cp936 (GBK).
"Encoding in Python code"
1. In Python code, when a string literal's encoding is not explicitly specified, its encoding is the same as that of the code file itself. For example, with str = '中文': if the statement is in a UTF-8 encoded code file, the string is UTF-8 encoded; if it is in a GB2312 encoded file, the string is GB2312 encoded. So how do you know the encoding of the code file itself?
(1) Specify the code file's encoding: add "# -*- coding: utf-8 -*-" at the top of the code file to declare that the file is UTF-8 encoded. String literals whose encoding is not otherwise specified then become UTF-8, as in the sketch below.
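A minimal sketch of such a file (saved as UTF-8; the file name is hypothetical):

# -*- coding: utf-8 -*-
# demo_utf8.py - because of the declaration above, the byte-string literal
# below holds UTF-8 encoded bytes.
s = '中文'
print repr(s)                   # '\xe4\xb8\xad\xe6\x96\x87'
print repr(s.decode('utf-8'))   # u'\u4e2d\u6587'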
(2) When the code file's encoding is not specified, the file is handled with Python's default encoding (typically ASCII; note that on Windows the file itself is usually saved as cp936 (GBK)). The default encoding can be read and set with sys.getdefaultencoding() and sys.setdefaultencoding('...').
import sys
reload(sys)                       # restores setdefaultencoding, which site.py removes
print sys.getdefaultencoding()    # ascii
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()    # utf-8
Combining (1) and (2) in an experiment: when the code file is declared as UTF-8, notepad++ shows it as "UTF-8 without BOM"; when the code file's encoding is not specified, notepad++ shows it as ANSI (its default save encoding).
(3) How can Python's default encoding be permanently set to UTF-8? There are two ways:
First method (not recommended): edit site.py and modify the setencoding() function to force the encoding to UTF-8;
Second method (recommended): add a file named sitecustomize.py to the \Lib\site-packages directory under the installation directory.
sitecustomize.py is imported and executed from site.py. Because sys.setdefaultencoding() is only deleted at the end of site.py, sys.setdefaultencoding() can still be called inside sitecustomize.py. A minimal sketch follows.
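A minimal sitecustomize.py along these lines might look like this (a sketch of the approach, not an officially endorsed practice):

# sitecustomize.py - placed in <Python install dir>\Lib\site-packages
# site.py imports this module before it deletes sys.setdefaultencoding,
# so the call below works here without reload(sys).
import sys
sys.setdefaultencoding('utf-8')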
2. If a string literal's encoding is specified in Python code, for example str = u'中文', then the string is unicode (that is, Python's internal encoding).
(1) There is a common pitfall to watch out for here! Suppose a .py file contains the following code:
stra = u"中"
print stra.encode("gbk")
Since stra is in unicode form per the above, encoding it directly to GBK should be fine, right? However, actually running it raises "UnicodeEncodeError: 'gbk' codec can't encode character u'\xd6' in position 0: illegal multibyte sequence".
The reason: when the Python interpreter imports and executes a Python code file, it first looks at the file header for an encoding declaration (such as # coding: gbk). If a declaration is found, the string literals in the file are first decoded to unicode using that encoding (here, the source bytes of stra, 'd6 d0' under GBK (cp936), are decoded to the unicode code point for 中), and stra.encode('gbk') is then executed. Since stra is genuine unicode at that point and its code point lies within GBK's range, the encode step succeeds. If the file header has no encoding declaration, that decoding step is skipped and stra ends up holding the raw value u'\xd6' (the bytes are never interpreted as GBK); stra.encode('gbk') then fails because u'\xd6' is not within GBK's encoding range.
(2) To avoid this type of error, it is best to declare the encoding at the top of the code file, or to call setdefaultencoding() each time (which is more cumbersome). A sketch of the declaration fix follows.
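For instance, with the declaration in place the snippet above runs without error (a sketch, assuming the file is actually saved as GBK):

# -*- coding: gbk -*-
# Because the source encoding is declared (and matches how the file is saved),
# the literal below is decoded to unicode correctly before encode() runs.
stra = u"中"
print stra.encode("gbk")   # prints 中 on a GBK (cp936) console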
(3) In general, unicode is the Python interpreter's internal encoding: when a code file is imported and executed, the interpreter first decodes string literals to unicode using the encoding you declared, and only then performs the various operations. So for string operations, regular expressions, reading and writing files, and so on, it is best to work in unicode, decoding input and encoding output at the boundaries as in the sketch below.
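A minimal sketch of that pattern (the file names, encodings, and replacement text are assumptions for illustration):

# -*- coding: utf-8 -*-
import codecs

# decode on the way in ...
f = codecs.open('input_gbk.txt', 'r', encoding='gbk')
text = f.read()                  # text is a unicode object
f.close()

# ... work in unicode in the middle ...
text = text.replace(u'旧', u'新')

# ... and encode on the way out
out = codecs.open('output_utf8.txt', 'w', encoding='utf-8')
out.write(text)
out.close()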
"Other encodings in Python"
File system encoding: sys.getfilesystemencoding()
Input encoding of the terminal: sys.stdin.encoding
Output encoding of the terminal: sys.stdout.encoding
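These can all be inspected directly (a small sketch; the values shown in the comments are examples and vary by platform):

import sys

print sys.getfilesystemencoding()   # e.g. 'mbcs' on Windows, 'UTF-8' on Linux
print sys.stdin.encoding            # e.g. 'cp936' in a Chinese Windows console
print sys.stdout.encoding           # may be None when output is redirected to a file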