Three articles, navigation: (One) (Two) (Three)
One
How do you avoid errors like UnicodeEncodeError: 'ascii' codec can't ...?
1. First, declare the file's content encoding in the header of the .py file, for example: # coding: utf8
2. Save the file in the same encoding as the one declared in the .py file header.
3. When using decode and encode, be sure you know which encoding the string being converted is actually in.
For example: the encoding is declared in the web page (<meta http-equiv=content-type content="text/html; charset=gb2312">). When you crawl this site and get its HTML, you need to take care of the encoding conversion:
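    import urllib2
    html = urllib2.urlopen(url).read()  # read() returns the raw bytes of the page
    html = html.decode('gb2312')        # bytes -> Unicode, using the charset the page declares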
If you follow the three rules above, you should not run into encoding conversion errors.
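The conversion step can be tried without a network call. The following is a minimal sketch, written for Python 3 (where bytes is the byte string and str is Unicode). The helper name decode_html and the sample page are hypothetical, and the regular expression only handles simple charset declarations:

```python
import re

# Matches a simple charset=... declaration near the top of an HTML document.
CHARSET_RE = re.compile(br"""charset\s*=\s*["']?\s*([\w-]+)""", re.IGNORECASE)

def decode_html(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the charset the page declares (hypothetical helper)."""
    match = CHARSET_RE.search(raw[:2048])
    charset = match.group(1).decode("ascii") if match else fallback
    return raw.decode(charset), charset

# Stand-in for the bytes a crawler would receive from the site.
page = u'<meta http-equiv=content-type content="text/html; charset=gb2312"><p>中文</p>'
raw_bytes = page.encode("gb2312")

text, charset = decode_html(raw_bytes)
print(charset)
print(text == page)
```

Reading the declared charset instead of hard-coding it keeps rule 3 honest: the bytes are decoded with the encoding they were actually written in.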
Python recommends keeping every string inside Python code as Unicode. The flow can be written like this: input (converted to Unicode) --> Python code --> output (converted to another encoding)
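A minimal sketch of that flow, written for Python 3; the function name convert_encoding and the sample text are hypothetical:

```python
def convert_encoding(raw, source_encoding, target_encoding):
    """Decode at the input boundary, work on Unicode, encode at the output boundary."""
    text = raw.decode(source_encoding)     # input bytes -> Unicode
    text = text.strip().upper()            # all processing happens on Unicode text
    return text.encode(target_encoding)    # Unicode -> output bytes

gb_bytes = u"  python 编码  ".encode("gb2312")   # stand-in for external input
utf8_bytes = convert_encoding(gb_bytes, "gb2312", "utf-8")
print(utf8_bytes == u"PYTHON 编码".encode("utf-8"))
```

Only the two boundary lines know about encodings; everything in between is independent of where the data came from or where it is going.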
sys.getdefaultencoding(): the system's default encoding (typically ASCII). Python's default source encoding is also ASCII, which is why the header of the .py file must declare an encoding whenever the file contains non-ASCII characters: # coding: utf-8
Functions for getting the system's encoding parameters in Python
- The system's default encoding (typically ASCII): sys.getdefaultencoding()
- The system's current encoding: locale.getdefaultlocale()
- The encoding temporarily changed in code (via locale.setlocale(locale.LC_ALL, "zh_CN.UTF-8")): locale.getlocale()
- File system encoding: sys.getfilesystemencoding()
- Terminal input encoding: sys.stdin.encoding
- Terminal output encoding: sys.stdout.encoding
- Source code encoding: the file header # -*- coding: utf-8 -*-
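The settings listed above can be collected in one place. This is a rough sketch for Python 3; the function name encoding_report is hypothetical, and locale.getpreferredencoding(False) is used here in place of locale.getdefaultlocale(), which is deprecated in recent Python 3 releases:

```python
import locale
import sys

def encoding_report():
    """Collect the encoding-related settings listed above into a dict."""
    return {
        "default": sys.getdefaultencoding(),
        "locale": locale.getlocale(),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        # stdin/stdout may be replaced or missing, so read the attribute defensively.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }

for name, value in sorted(encoding_report().items()):
    print("%-10s %r" % (name, value))
```

The values depend on the platform and on how the script is launched, so comparing a console run with a redirected run is a quick way to see which setting changes.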
Source: http://justpy.com/archives/144
Two
More:
http://www.cnblogs.com/itrust/archive/2010/05/14/1735185.html
Strings
Python has two types of strings:
    byteString = "hello world! (in my default locale)"
    unicodeString = u"hello Unicode world!"
Converting between them:
    s = "hello normal string"
    u = unicode(s, "utf-8")
    backToBytes = u.encode("utf-8")
    backToUtf8 = backToBytes.decode('utf-8')  # same effect as the second line
Telling them apart:
    isinstance(s, str)         # False for Unicode strings
    isinstance(s, unicode)     # True for Unicode strings
    isinstance(s, basestring)  # True for both kinds of string
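In Python 3 the type names are different: byte strings are bytes, Unicode strings are str, and unicode and basestring no longer exist. Under that assumption, the type check and the conversion can be combined into one small helper (to_text is a hypothetical name):

```python
def to_text(value, encoding="utf-8"):
    """Return Unicode text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already Unicode, nothing to do

print(to_text(b"hello normal string") == u"hello normal string")
print(to_text(u"hello Unicode world!") == u"hello Unicode world!")
```

Calling a helper like this at every input boundary means the rest of the program only ever sees one string type.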
Do an experiment:
    # -*- coding: utf-8 -*-
    import sys
    print 'default encoding: ', sys.getdefaultencoding()
    print 'file system encoding: ', sys.getfilesystemencoding()
    print 'stdout encoding: ', sys.stdout.encoding
    print u'u"中文" is unicode: ', isinstance(u'中文', unicode)
    print u'"中文" is unicode: ', isinstance('中文', unicode)
Look at the output and note the following facts:
Python's default system encoding is ASCII, and Python uses it whenever it converts a string implicitly. Two examples:
1. a = "abc" + u"bcd": Python first converts "abc" with "abc".decode(sys.getdefaultencoding()) and then concatenates the two Unicode strings.
2. print unicode('中文'): this statement fails with "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ..." because Python tries to decode with the default encoding and the string is not ASCII, so the encoding has to be given explicitly. If your source file is UTF-8, write: print unicode('中文', 'utf-8')
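The second example can be reproduced with explicit calls. This sketch is written for Python 3, which never decodes implicitly, so the ASCII decode that Python 2 performs behind the scenes is spelled out by hand:

```python
utf8_bytes = u"中文".encode("utf-8")   # first byte is 0xe4, as in the error message

try:
    utf8_bytes.decode("ascii")        # what the implicit conversion effectively attempts
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc.reason)

decoded = utf8_bytes.decode("utf-8")  # naming the real encoding succeeds
print(failed, len(decoded))
```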
On Windows, getfilesystemencoding() returns mbcs (multi-byte character set). Windows MBCS, also called ANSI, maps to a different encoding depending on the language of the Windows installation; on Chinese Windows it is a GB-family encoding.
On Chinese Windows the console encoding is cp936, and Python converts automatically when you print something to the console. Here is an interesting problem; try this simple example, test.py:
    # -*- coding: utf-8 -*-
    s = u'中文'
    print s
Run python test.py and python test.py > 1.txt separately in the console.
You will notice that the latter raises an error. When printing to the console, Python automatically encodes the text to sys.stdout.encoding. When the output is redirected to a file, sys.stdout.encoding is None, so Python does not convert to the console encoding in the write call and falls back to the default ASCII codec, which fails on the non-ASCII characters. This problem is described in more detail in PrintFails.
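One way around the redirect failure is to encode explicitly instead of relying on the stream's encoding. The sketch below runs on Python 3 and uses io.BytesIO as a stand-in for the redirected output; that stand-in is an assumption made so the example is self-contained:

```python
import codecs
import io

# io.BytesIO stands in for a redirected stdout, which only accepts bytes.
redirected = io.BytesIO()

# Wrap the byte stream so that Unicode text is encoded explicitly on every write.
writer = codecs.getwriter("utf-8")(redirected)
writer.write(u"中文\n")

print(redirected.getvalue() == u"中文\n".encode("utf-8"))
```

In a Python 2 script the same wrapper is commonly applied to sys.stdout itself, so that print behaves the same whether or not the output is redirected.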
UTF-8 encoding
Reading a file saved in UTF-8 format:
    import codecs
    fileObj = codecs.open("someFile", "r", "utf-8")
    u = fileObj.read()  # Returns a Unicode string from the UTF-8 bytes in the file
Writing the BOM header yourself:
    out = file("someFile", "w")
    out.write(codecs.BOM_UTF8)
    out.write(unicodeString.encode("utf-8"))
    out.close()
Removing the BOM header yourself
For UTF-16, Python decodes the BOM into an empty string. However, for UTF-8, the BOM is decoded to a single character, as in the example:
    >>> codecs.BOM_UTF16.decode("utf16")
    u''
    >>> codecs.BOM_UTF8.decode("utf8")
    u'\ufeff'
I don't know why the two behave so differently, but it means you need to remove the BOM yourself when you read a UTF-8 file:
    import codecs
    if s.startswith(codecs.BOM_UTF8):
        # The byte string s begins with the BOM: do something.
        # For example, decode the string as UTF-8
        u = s.decode("utf-8")
    if u[0] == unicode(codecs.BOM_UTF8, "utf8"):
        # The unicode string begins with the BOM: do something.
        # For example, remove the character.
        u = u[1:]
    # Strip the BOM from the beginning of the Unicode string, if it exists
    u = u.lstrip(unicode(codecs.BOM_UTF8, "utf8"))
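As an alternative to stripping the BOM by hand, Python also ships a "utf-8-sig" codec that drops a leading BOM while decoding. This is a small sketch of that swapped-in technique, not the approach used in the code above; the sample payload is made up:

```python
import codecs

payload = codecs.BOM_UTF8 + u"hello".encode("utf-8")

plain = payload.decode("utf-8")        # the BOM survives as U+FEFF
cleaned = payload.decode("utf-8-sig")  # the BOM is dropped if present

print(plain.startswith(u"\ufeff"), cleaned == u"hello")
# Input without a BOM is decoded unchanged.
print(u"hello".encode("utf-8").decode("utf-8-sig") == u"hello")
```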
Encoding of source files
PEP 0263 explains clearly how Python handles the encoding of source files. It is excerpted below:
By default, Python assumes that a source file is ASCII encoded.
You can add an encoding declaration on the first or second line of the file to tell Python which encoding the file uses, for example:
# -*- coding: utf-8 -*-
Note: make sure the editor you use saves the file in the declared encoding.
- On platforms like Windows, a BOM (the three bytes \xef\xbb\xbf at the start of the file) is used to mark the file as UTF-8 encoded. In this case:
- If there is no encoding declaration in the file, Python treats it as UTF-8
- If there is an encoding declaration but it is not utf-8, Python raises an error
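These rules can be checked with the standard library. The sketch below uses Python 3's tokenize.detect_encoding, which reads the declaration and the BOM from the first lines of a source file; the helper name declared_encoding and the sample sources are hypothetical. One difference from the text above: in Python 3 a file with neither a declaration nor a BOM defaults to UTF-8 rather than ASCII.

```python
import codecs
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 detects for a source file given as bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# A declaration on the first line is honoured.
print(declared_encoding(b"# -*- coding: gbk -*-\nx = 1\n"))
# A BOM without a declaration is treated as UTF-8.
print(declared_encoding(codecs.BOM_UTF8 + b"x = 1\n"))

# A BOM combined with a non-UTF-8 declaration is rejected.
try:
    declared_encoding(codecs.BOM_UTF8 + b"# -*- coding: gbk -*-\nx = 1\n")
    conflict = False
except SyntaxError:
    conflict = True
print(conflict)
```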
============== In addition, about the BOM ==============
Three
Some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when it saves the file as UTF-8. These bytes need to be removed when the file is read, and the codecs module in Python defines a constant for them:
    # coding=gbk
    import codecs
    data = open("Test.txt").read()
    if data[:3] == codecs.BOM_UTF8:
        data = data[3:]
    print data.decode("utf-8")
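The same read can be tried end to end with a temporary file. This is a sketch for Python 3 under stated assumptions: the file is a throwaway created by the example itself, and the BOM is skipped by opening the file with the "utf-8-sig" encoding rather than by slicing off the first three bytes:

```python
import codecs
import io
import os
import tempfile

# Create a throwaway file that starts with a BOM, as Notepad would write it.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(codecs.BOM_UTF8 + u"first line\n".encode("utf-8"))

# Opening with "utf-8-sig" skips a leading BOM while decoding.
with io.open(path, encoding="utf-8-sig") as handle:
    data = handle.read()
os.remove(path)

print(data == u"first line\n")
```

Letting the file object handle the BOM keeps the check in one place, and files saved without a BOM are read the same way.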
Python-related encoding [repost]