Python Encoding and Encoding Conversion
(1)
How do you avoid errors such as "UnicodeEncodeError: 'ascii' codec can't..."?
1. Declare the file content encoding in the header of the .py file, for example: # coding: utf8
2. The encoding declared in the header must match the encoding actually used when the .py file is saved.
3. When using decode and encode, always confirm the original encoding of the characters being converted.
For example, if a webpage declares its encoding (<meta http-equiv="content-type" content="text/html; charset=gb2312">), you need to decode with that encoding after fetching the page and obtaining its HTML:
    import urllib2
    html = urllib2.urlopen(url).read()
    html = html.decode('gb2312')
As long as these three rules are followed, encoding conversion errors should not occur.
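As a hedged illustration of rule 3, the sketch below reads the charset declared in a page's meta tag before decoding. It assumes Python 3 (the snippets in this article are Python 2), and `decode_html` and its `fallback` parameter are hypothetical names made up for this example, not part of any library.

```python
import re

def decode_html(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the charset declared in the page, if any."""
    # Look only near the top of the document, where the meta tag normally sits.
    match = re.search(rb'charset\s*=\s*["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    charset = match.group(1).decode("ascii") if match else fallback
    return raw.decode(charset)

page = '<meta http-equiv="content-type" content="text/html; charset=gb2312"><p>中文</p>'
raw_bytes = page.encode("gb2312")   # stands in for the bytes a download would return
text = decode_html(raw_bytes)
```

A real page may declare its charset in an HTTP header instead, so this is only a starting point.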
It is recommended that all string variables inside Python code be unicode. The process can be written like this: input (decoded to unicode) --> Python code --> output (encoded to the target encoding)
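The decode, process, encode flow can be sketched as follows. This is a minimal example assuming Python 3, where `bytes` plays the role of the Python 2 byte string and `str` the role of `unicode`; `convert_encoding` is a hypothetical helper name.

```python
def convert_encoding(data, source_encoding, target_encoding):
    """Decode at the input boundary, work on text, encode at the output boundary."""
    text = data.decode(source_encoding)      # bytes -> Unicode text
    text = text.strip()                      # all processing happens on text
    return text.encode(target_encoding)      # Unicode text -> bytes

gbk_bytes = " 中文 ".encode("gbk")
utf8_bytes = convert_encoding(gbk_bytes, "gbk", "utf-8")
```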
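sys.getdefaultencoding() returns the system default encoding (usually ascii), which Python uses for implicit conversions between byte strings and Unicode strings. Python also assumes ASCII for source files by default, which is why the encoding is declared in the header of the .py file: # coding: UTF-8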
Functions for obtaining the encoding settings in Python
- System default encoding (generally ascii): sys.getdefaultencoding()
- Current system encoding: locale.getdefaultlocale()
- Encoding temporarily changed in code (through locale.setlocale(locale.LC_ALL, "zh_CN.UTF-8")): locale.getlocale()
- File system encoding: sys.getfilesystemencoding()
- Terminal input encoding: sys.stdin.encoding
- Terminal output encoding: sys.stdout.encoding
- Default encoding of the source code: the file header # -*- coding: UTF-8 -*-
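The functions listed above can be collected into one small report. This sketch assumes Python 3, where the default encoding is utf-8 rather than ascii. It uses `locale.getpreferredencoding(False)` in place of `locale.getdefaultlocale()`, which is deprecated in recent Python 3 releases. `encoding_report` is a hypothetical name.

```python
import locale
import sys

def encoding_report():
    """Collect the encoding-related settings of the running interpreter."""
    return {
        "default": sys.getdefaultencoding(),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        # Streams may be replaced or missing, so read the attribute defensively.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }

for name, value in encoding_report().items():
    print(name, value)
```

The values depend on the platform and the locale settings, so the output differs from machine to machine.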
Source: http://justpy.com/archives/144
(2)
More:
http://www.cnblogs.com/itrust/archive/2010/05/14/1735185.html
Python has two types of strings
    byteString = "hello world! (in my default locale)"
    unicodeString = u"hello Unicode world!"
Converting between them
    s = "hello normal string"
    u = unicode(s, "utf-8")
    backToBytes = u.encode("utf-8")
    backToUtf8 = backToBytes.decode('utf-8')  # Same effect as the second line
Telling them apart
    isinstance(s, str)         # False for Unicode strings
    isinstance(s, unicode)     # True for Unicode strings
    isinstance(s, basestring)  # True for both string types
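The same kind of type check is often used to normalize input that may arrive as either string type. The sketch below assumes Python 3, where the two types are `bytes` and `str` and `basestring` no longer exists; `to_text` is a hypothetical helper name.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)

from_bytes = to_text(b"hello normal string")
from_text = to_text("hello Unicode world!")
```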
Run a test
    # -*- coding: utf-8 -*-
    import sys
    print 'default encoding: ', sys.getdefaultencoding()
    print 'file system encoding: ', sys.getfilesystemencoding()
    print 'stdout encoding: ', sys.stdout.encoding
    print u'u"中文" is unicode:', isinstance(u'中文', unicode)
    print u'"中文" is unicode:', isinstance('中文', unicode)
Check the output and note the following facts:
The default encoding of Python is ASCII. Python uses this default encoding whenever it converts strings implicitly. Here are two examples:
1. a = "abc" + u"bcd": Python converts "abc" with "abc".decode(sys.getdefaultencoding()) and then concatenates the two Unicode strings.
2. print unicode('中文'): executing this statement raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ..." because Python tries to decode with the default encoding and this string is not ASCII. The encoding must be given explicitly; if your source file is UTF-8, write: print unicode('中文', 'utf-8')
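The decoding failure in the second example can be reproduced with explicit byte values. This is a small sketch assuming Python 3, where the decode call has to be written out because byte strings and text are never mixed implicitly.

```python
utf8_bytes = b"\xe4\xb8\xad\xe6\x96\x87"   # UTF-8 bytes of the two characters 中文

try:
    utf8_bytes.decode("ascii")             # ASCII cannot represent byte 0xe4
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

decoded = utf8_bytes.decode("utf-8")       # naming the real encoding succeeds
```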
On Windows, getfilesystemencoding() returns mbcs (multi-byte character set, which Windows calls ANSI). It maps to a different encoding depending on the language version of Windows; on Chinese Windows it is an encoding of the GB family.
On Chinese Windows, the console encoding is cp936. When you print something to the console, Python converts it automatically. This leads to an interesting problem. Try this simple example, test.py:
    # -*- coding: utf-8 -*-
    s = u'中文'
    print s
Run python test.py and python test.py > 1.txt on the console, one after the other.
You will find that the latter raises an error. When printing to the console, Python automatically encodes the output to sys.stdout.encoding, but when the output is redirected to a file, no such conversion takes place in the write call and the default ASCII encoding is used instead. This issue is described in more detail in PrintFails.
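The effect of the output stream's encoding can be imitated without a console. The sketch below assumes Python 3 and wraps an in-memory buffer in `io.TextIOWrapper`; the "ascii" case stands in for the redirected stream described above, and `write_text` is a hypothetical helper name.

```python
import io

def write_text(text, encoding):
    """Write text through a stream with the given encoding and return the bytes."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()

utf8_output = write_text("中文", "utf-8")

try:
    write_text("中文", "ascii")            # behaves like the failing redirect
    ascii_failure = None
except UnicodeEncodeError as exc:
    ascii_failure = exc
```

Encoding explicitly before writing, or choosing the stream encoding yourself, avoids depending on where the output happens to go.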
Reading a file saved in UTF-8 format
    import codecs
    fileObj = codecs.open("someFile", "r", "utf-8")
    u = fileObj.read()  # Returns a Unicode string from the UTF-8 bytes in the file
Writing the BOM header yourself
    import codecs
    out = open("someFile", "w")
    out.write(codecs.BOM_UTF8)
    out.write(unicodeString.encode("utf-8"))
    out.close()
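For comparison, here is a hedged Python 3 sketch of the same idea using the standard `utf-8-sig` codec, which writes the BOM on output and removes it on input. The temporary directory is only an assumption to keep the example self-contained.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "someFile")   # throwaway location

with open(path, "w", encoding="utf-8-sig") as out:
    out.write("hello Unicode world!")                 # BOM is written first

with open(path, "rb") as raw:
    raw_bytes = raw.read()                            # starts with codecs.BOM_UTF8

with open(path, "r", encoding="utf-8-sig") as src:
    text = src.read()                                 # BOM is dropped on read
```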
Removing the BOM header yourself
For UTF-16, Python decodes the BOM as an empty string. For UTF-8, however, the BOM is decoded as a character, for example:
    >>> codecs.BOM_UTF16.decode("utf16")
    u''
    >>> codecs.BOM_UTF8.decode("utf8")
    u'\ufeff'
The two codecs behave differently here, so for UTF-8 you need to remove the BOM yourself when reading the file:
    import codecs
    if s.startswith(codecs.BOM_UTF8):
        # The byte string s begins with the BOM: do something.
        # For example, decode the string as UTF-8.
        u = s.decode("utf-8")
    if u[0] == unicode(codecs.BOM_UTF8, "utf8"):
        # The unicode string begins with the BOM: do something.
        # For example, remove the character.
        u = u[1:]
    # Strip the BOM from the beginning of the Unicode string, if it exists
    u = u.lstrip(unicode(codecs.BOM_UTF8, "utf8"))
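As background that the original text does not state: the utf-16 codec uses the BOM to determine byte order and consumes it, while the plain utf-8 codec has no byte order to detect and passes U+FEFF through as an ordinary character. The standard `utf-8-sig` codec drops a leading BOM if one is present. A minimal sketch, assuming Python 3, with `strip_bom` as a hypothetical helper name:

```python
import codecs

def strip_bom(data):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")

with_bom = codecs.BOM_UTF8 + b"abc"
kept = with_bom.decode("utf-8")                  # BOM survives as U+FEFF
dropped = strip_bom(with_bom)                    # BOM removed
utf16_bom = codecs.BOM_UTF16.decode("utf-16")    # BOM consumed as byte-order mark
```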
Encoding of source code files
PEP 0263 clearly describes how Python handles the encoding of source files. The following is a summary:
Python uses ASCII encoding by default.
You can add an encoding declaration to the first or second line of the source file to tell Python the encoding of the file, for example:
    # -*- coding: UTF-8 -*-
Note that this must match the encoding used when the file is saved.
For files that begin with a UTF-8 BOM:
- If the file does not contain an encoding declaration, Python treats it as UTF-8.
- If the encoding declaration is not UTF-8, Python reports an error.
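These rules can be checked with `tokenize.detect_encoding` from the standard library, which applies the PEP 263 detection to raw source bytes. The sketch assumes Python 3, whose default for undeclared files is utf-8 rather than the ASCII default described above; `source_encoding` is a hypothetical helper name.

```python
import codecs
import io
import tokenize

def source_encoding(source_bytes):
    """Return the encoding Python 3 would use for this source, per PEP 263."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

declared = source_encoding(b"# -*- coding: gbk -*-\nx = 1\n")
bom_only = source_encoding(codecs.BOM_UTF8 + b"x = 1\n")

try:
    # A UTF-8 BOM combined with a non-UTF-8 declaration is rejected.
    source_encoding(codecs.BOM_UTF8 + b"# -*- coding: gbk -*-\nx = 1\n")
    conflict = None
except SyntaxError as exc:
    conflict = exc
```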
(3)
Some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of the file when saving it as UTF-8. These bytes therefore need to be removed when reading the file. The codecs module in Python defines this constant:
    # coding=gbk
    import codecs
    data = open("Test.txt").read()
    if data[:3] == codecs.BOM_UTF8:
        data = data[3:]
    print data.decode("utf-8")