Python 2 has two types of strings: byte strings (str) and Unicode strings (unicode).
byteString = "hello world! (in my default locale)"
unicodeString = u"hello Unicode world!"
Mutual Conversion
s = "hello normal string"
u = unicode(s, "UTF-8")
backToBytes = u.encode("UTF-8")
backToUtf8 = backToBytes.decode("utf-8")  # same effect as the second line
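To make the round trip concrete, here is a minimal sketch with a two-character non-ASCII string. It assumes an interpreter where both the u"..." prefix and the bytes name exist (Python 2.7, where bytes is an alias of str, or Python 3.3+); the variable names are illustrative only.

```python
# Two CJK characters written as escapes, so this snippet itself stays pure ASCII.
text = u"\u4e2d\u6587"

# The same Unicode string yields different bytes under different encodings.
utf8_bytes = text.encode("utf-8")  # three bytes per character for this text
gbk_bytes = text.encode("gbk")     # two bytes per character for this text

# Decoding with the matching encoding restores the original string.
from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = gbk_bytes.decode("gbk")
```

The Unicode string has a single meaning, while the byte string only makes sense together with the name of the encoding that produced it.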
How to tell which type you have
isinstance(s, str)         # False when s is a Unicode string
isinstance(s, unicode)     # True when s is a Unicode string
isinstance(s, basestring)  # True for both kinds of string
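One common use of these checks is to normalize input at the edges of a program. The sketch below is only an illustration: to_text and to_bytes are hypothetical helper names, UTF-8 is an assumed default, and the byte-string type is spelled bytes (an alias of str on Python 2.6+) so that the code should run on both Python 2.7 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding Unicode strings explicitly."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


decoded = to_text(b"caf\xc3\xa9")  # UTF-8 bytes in, Unicode string out
encoded = to_bytes(u"caf\xe9")     # Unicode string in, UTF-8 bytes out
```

Because both helpers pass through values that already have the right type, they can be applied safely more than once.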
A quick test
Example:
import sys
print 'default encoding: ', sys.getdefaultencoding()
print 'file system encoding: ', sys.getfilesystemencoding()
print 'stdout encoding: ', sys.stdout.encoding
print u'u"Chinese" is unicode: ', isinstance(u'Chinese', unicode)
print u'"Chinese" is unicode: ', isinstance('Chinese', unicode)
Check the output result and pay attention to the following facts:
Python 2's default encoding is ASCII. Python uses this default whenever it converts implicitly between byte strings and Unicode strings. Here are two examples:
1. a = "abc" + u"bcd": Python first evaluates "abc".decode(sys.getdefaultencoding()) and then concatenates the two Unicode strings.
2. print unicode('中文'): executing this statement raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ..." because Python tries the default encoding and the string is not ASCII. The encoding has to be given explicitly; if your source file is saved as UTF-8, write: print unicode('中文', 'utf-8')
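The failure in the second example can be reproduced without relying on the interpreter's default by naming the ASCII codec explicitly. This is a minimal sketch, assuming the bytes are the UTF-8 encoding of two CJK characters; it should behave the same on Python 2.7 and Python 3.

```python
raw = b"\xe4\xb8\xad\xe6\x96\x87"  # UTF-8 bytes of two CJK characters

# Decoding as ASCII fails on the very first byte, 0xe4.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Naming the real encoding makes the same bytes decode cleanly.
text = raw.decode("utf-8")
```

The exception object records which codec failed and where, which helps when tracking down an implicit conversion.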
On Windows, getfilesystemencoding() returns mbcs (multi-byte character set, which Windows calls ANSI). The actual encoding depends on the language version of Windows; on Chinese Windows it is an encoding from the GB family.
On Chinese Windows, the console encoding is cp936. When you print something to the console, Python converts it automatically. This leads to an interesting problem. Try this simple example, test.py:
Example:
# -*- coding: utf-8 -*-
s = u'中文'
print s
Run python test.py and then python test.py > 1.txt from the console.
You will find that the second command reports an error. When printing to the console, Python automatically encodes the text to sys.stdout.encoding, but when the output is redirected to a file, sys.stdout.encoding is None and Python falls back to the ASCII default, which cannot encode these characters. This issue is described in more detail in PrintFails.
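One way around the redirect problem is to stop relying on the automatic conversion and encode explicitly before writing. The following is a sketch under assumptions: write_text is a hypothetical helper, UTF-8 is an arbitrary choice of output encoding, and io.BytesIO stands in for a redirected stdout or a file opened in binary mode.

```python
import io


def write_text(stream, text, encoding="utf-8"):
    """Encode text explicitly and write the resulting bytes to a binary stream."""
    stream.write(text.encode(encoding))


text = u"\u4e2d\u6587"

# The ASCII fallback cannot represent these characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With an explicit encoding the write succeeds wherever the bytes end up.
buffer = io.BytesIO()
write_text(buffer, text + u"\n")
written = buffer.getvalue()
```

The output then has the same bytes whether it goes to a console, a pipe, or a file, at the cost of choosing the encoding yourself.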
Reading and writing UTF-8 files
import codecs
fileObj = codecs.open("someFile", "r", "UTF-8")
u = fileObj.read()  # Returns a Unicode string decoded from the UTF-8 bytes in the file
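For a version that can be run end to end, the sketch below writes a small UTF-8 file and reads it back. Note the swap: it uses io.open rather than codecs.open, on the assumption that io.open is available (Python 2.6+; in Python 3 it is the built-in open). The read_utf8 helper and the temporary file are illustrative only.

```python
import io
import os
import tempfile


def read_utf8(path):
    """Read a whole file and return its contents as a Unicode string."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Round trip through a temporary file to show the decoding step.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, "wb") as handle:
        handle.write(u"\u4e2d\u6587".encode("utf-8"))
    content = read_utf8(path)
finally:
    os.remove(path)
```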
Writing the BOM yourself
out = open("someFile", "w")
out.write(codecs.BOM_UTF8)
out.write(unicodeString.encode("UTF-8"))
out.close()
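As a cross-check on the manual approach, the standard "utf-8-sig" codec produces the same bytes when encoding. The short sketch below compares the two in memory; the sample text is arbitrary.

```python
import codecs

text = u"hello"

# Manual approach: the BOM bytes followed by the UTF-8 payload.
manual = codecs.BOM_UTF8 + text.encode("utf-8")

# The "utf-8-sig" codec prepends the same three bytes when encoding.
with_sig = text.encode("utf-8-sig")

same_bytes = manual == with_sig
```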
Removing the BOM yourself
For UTF-16, Python decodes the BOM to an empty string. For UTF-8, however, the BOM is decoded to a character, for example:
>>> codecs.BOM_UTF16.decode("utf16")
u''
>>> codecs.BOM_UTF8.decode("utf8")
u'\ufeff'
I don't know why the two behave differently, but as a result you need to remove the UTF-8 BOM yourself when reading a file:
Remove BOM:
import codecs
if s.startswith(codecs.BOM_UTF8):
    # The byte string s begins with the BOM: do something.
    # For example, decode the string as UTF-8.
    u = s.decode("utf8")
if u[0] == unicode(codecs.BOM_UTF8, "utf8"):
    # The Unicode string begins with the BOM: do something.
    # For example, remove the character.
    u = u[1:]
# Strip the BOM from the beginning of the Unicode string, if it exists
u = u.lstrip(unicode(codecs.BOM_UTF8, "utf8"))
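The byte-level check can be wrapped in a small function. In this sketch, decode_utf8_without_bom is a hypothetical name and the sample bytes are made up; it also shows the standard "utf-8-sig" codec, which skips a leading BOM when decoding. The code should run on both Python 2.7 and Python 3.

```python
import codecs


def decode_utf8_without_bom(data):
    """Decode UTF-8 bytes, dropping one leading BOM if present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


sample = codecs.BOM_UTF8 + b"hello"

manual = decode_utf8_without_bom(sample)  # BOM removed by hand
via_codec = sample.decode("utf-8-sig")    # BOM skipped by the codec
plain = sample.decode("utf-8")            # BOM kept as U+FEFF
```

Removing the BOM at the byte level, before decoding, avoids ever having U+FEFF inside the Unicode string.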
Encoding of source code files
PEP 263 spells out how Python determines the encoding of source files. In summary:
- Python 2 assumes ASCII by default.
- You can add an encoding declaration on the first or second line of the source file to tell Python which encoding the file uses, for example:
# -*- coding: utf-8 -*-   # note: the file must actually be saved in this encoding
- On platforms such as Windows, a BOM (the three bytes \xef\xbb\xbf at the start of the file) is often used to mark a file as UTF-8. In this case:
  - If the file contains no encoding declaration, Python treats it as UTF-8.
  - If the file declares an encoding other than UTF-8, Python reports an error.
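The rules above can be sketched as a small function. This is a simplified illustration, not the interpreter's actual implementation: source_encoding is a hypothetical name, the regular expression is the one given in PEP 263, and the ASCII default reflects Python 2 behavior.

```python
import codecs
import re

# Pattern for the declaration line, as given in PEP 263.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def source_encoding(source_bytes, default="ascii"):
    """Guess the encoding of Python source bytes from its BOM and declaration."""
    has_bom = source_bytes.startswith(codecs.BOM_UTF8)
    if has_bom:
        source_bytes = source_bytes[len(codecs.BOM_UTF8):]

    declared = None
    # Only the first two lines may carry the declaration.
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            # codecs.lookup normalizes spellings such as "UTF-8" to "utf-8".
            declared = codecs.lookup(match.group(1).decode("ascii")).name
            break

    if has_bom:
        if declared not in (None, "utf-8"):
            raise ValueError("BOM says UTF-8 but declaration says %s" % declared)
        return "utf-8"
    return declared or default
```

For example, source_encoding(b"# -*- coding: gbk -*-\nx = 1\n") yields "gbk", while a file that starts with a BOM and declares gbk raises ValueError, mirroring the error case in the last bullet.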