Encoding problems are a threshold that almost every Python novice runs into, but once you fully understand them you can jump over this pit no matter how it varies. I ran into this problem again recently, so let's look at it together.
It started when I was reviewing an upload feature written by a colleague and saw the following code, where self.fp is the handle of the uploaded file:
fpdata = [line.strip().decode('gbk').encode('utf-8').decode('utf-8') for line in self.fp]
data = [''.join(['(', self.game, ',', ','.join(map(lambda x: "'%s'" % x, d.split(','))), ')']) for d in fpdata[1:]]
This code exposes two issues:
1. The default encoding is GBK. Why not UTF-8?
2. .encode('utf-8').decode('utf-8') is completely unnecessary, because the result of decode('gbk') is already a unicode object.
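Point 2 is easy to check. The following is a minimal sketch, not the project's code: the sample text and variable names are made up for illustration, and it is written so that it should run on both Python 2.7 and Python 3.

```python
# Hypothetical sample: the text U+4E2D U+6587 encoded as GBK, standing in
# for one line of the uploaded file.
gbk_line = u"\u4e2d\u6587".encode("gbk")

decoded_once = gbk_line.decode("gbk")
round_tripped = gbk_line.decode("gbk").encode("utf-8").decode("utf-8")

# The extra encode/decode pair is an identity operation on the text.
same_text = (decoded_once == round_tripped)
print(same_text)
```

Since the two results compare equal, the shorter decode('gbk') form loses nothing.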
I recommended uploading the text encoded as UTF-8, so the code became this:
fpdata = [line.strip() for line in self.fp if line.strip()]
data = [''.join(['(', self.game, ',', ','.join(map(lambda x: "'%s'" % x, d.split(','))), ')']) for d in fpdata[1:]]
Testing it, however, raised UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128). This exception gives novices quite a headache. What does it actually mean?
It means that while a byte string was being decoded to Unicode with the ASCII codec, the decoder hit the byte 0xe4, which is outside the ASCII range (0 to 127), so the decode failed.
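The error is easy to reproduce in isolation. In this small sketch the character U+4E2D is just an example I picked because its UTF-8 encoding starts with the byte 0xe4; any non-ASCII text would fail in the same way.

```python
# UTF-8 bytes of the character U+4E2D; the first byte is 0xe4.
utf8_bytes = u"\u4e2d".encode("utf-8")
first_byte = bytearray(utf8_bytes)[0]

try:
    utf8_bytes.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(hex(first_byte))
print(error_message)
```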
Some project background: we use Python 2.7 in a Django project, and self.game is a unicode object. Anyone familiar with the issue will see at a glance that this is a sys.setdefaultencoding problem. In fact, the question is the following.
But I had clearly set the default encoding in settings.py! Looking a little closer, it turned out to be a joke.
See what is going on? Every item in fpdata is a UTF-8 encoded byte string, while self.game is a unicode object. The join therefore has to decode the UTF-8 byte strings into unicode objects, but the system does not know that they are UTF-8 encoded, so it decodes them with the default codec, ASCII. The error is then no surprise.
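The implicit step can be modelled in a few lines. This is only a sketch: join_like_python2 is a hypothetical helper that imitates what Python 2 does when str and unicode are mixed, and the values standing in for self.game and the file contents are made up.

```python
def join_like_python2(parts, default_encoding="ascii"):
    """Model of Python 2 mixing str and unicode: every byte string is
    decoded with the default encoding before the text is joined."""
    text_parts = []
    for part in parts:
        if isinstance(part, bytes):
            part = part.decode(default_encoding)  # the implicit step
        text_parts.append(part)
    return u"".join(text_parts)


game = u"demo"                            # hypothetical stand-in for self.game
field = u"\u4e2d\u6587".encode("utf-8")   # UTF-8 bytes as read from the file

try:
    join_like_python2([game, field])
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# Decoding explicitly with the real encoding avoids relying on the default.
explicit = join_like_python2([game, field.decode("utf-8")])
print(implicit_ok)
```

Decoding at the boundary, where the real encoding is known, is one way to avoid the error without touching the interpreter-wide default.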
This is why people commonly recommend adding sys.setdefaultencoding("utf8") when they run into encoding problems. But why does adding this call help? Let's look at what happens underneath.
When a byte string is concatenated with a unicode object, that is a + b, the string concatenation code in stringobject.c is invoked (PyString_Concat and string_concat):
/* stringobject.c */
void
PyString_Concat(register PyObject **pv, register PyObject *w)
{
    register PyObject *v;
    if (*pv == NULL)
        return;
    if (w == NULL || !PyString_Check(*pv)) {
        Py_CLEAR(*pv);
        return;
    }
    v = string_concat((PyStringObject *) *pv, w);
    Py_DECREF(*pv);
    *pv = v;
}

static PyObject *
string_concat(register PyStringObject *a, register PyObject *bb)
{
    register Py_ssize_t size;
    register PyStringObject *op;
    if (!PyString_Check(bb)) {
        if (PyUnicode_Check(bb))
            return PyUnicode_Concat((PyObject *)a, bb);
        if (PyByteArray_Check(bb))
            return PyByteArray_Concat((PyObject *)a, bb);
        PyErr_Format(PyExc_TypeError,
                     "cannot concatenate 'str' and '%.200s' objects",
                     Py_TYPE(bb)->tp_name);
        return NULL;
    }
    ...
}
If b is detected to be a unicode object, PyUnicode_Concat is called:
PyObject *
PyUnicode_Concat(PyObject *left, PyObject *right)
{
    PyUnicodeObject *u = NULL, *v = NULL, *w;

    /* Coerce the arguments */
    u = (PyUnicodeObject *)PyUnicode_FromObject(left);
    v = (PyUnicodeObject *)PyUnicode_FromObject(right);
    w = _PyUnicode_New(u->length + v->length);
    Py_DECREF(v);
    return (PyObject *)w;
}

PyObject *
PyUnicode_FromObject(register PyObject *obj)
{
    if (PyUnicode_Check(obj)) {
        /* For a Unicode subtype that's not a Unicode object,
           return a true Unicode object with the same data. */
        return PyUnicode_FromUnicode(PyUnicode_AS_UNICODE(obj),
                                     PyUnicode_GET_SIZE(obj));
    }
    return PyUnicode_FromEncodedObject(obj, NULL, "strict");
}
Because a is not a unicode object, PyUnicode_FromEncodedObject is called to convert a to a unicode object, and the encoding passed in is NULL:
PyObject *
PyUnicode_FromEncodedObject(register PyObject *obj,
                            const char *encoding,
                            const char *errors)
{
    const char *s = NULL;
    Py_ssize_t len;
    PyObject *v;

    /* Coerce object */
    if (PyString_Check(obj)) {
        s = PyString_AS_STRING(obj);
        len = PyString_GET_SIZE(obj);
    }

    /* Convert to Unicode */
    v = PyUnicode_Decode(s, len, encoding, errors);
    return v;
}

PyObject *
PyUnicode_Decode(const char *s, Py_ssize_t size,
                 const char *encoding, const char *errors)
{
    PyObject *buffer = NULL, *unicode;

    if (encoding == NULL)
        encoding = PyUnicode_GetDefaultEncoding();

    /* Shortcuts for common default encodings */
    if (strcmp(encoding, "utf-8") == 0)
        return PyUnicode_DecodeUTF8(s, size, errors);
    else if (strcmp(encoding, "latin-1") == 0)
        return PyUnicode_DecodeLatin1(s, size, errors);
    else if (strcmp(encoding, "ascii") == 0)
        return PyUnicode_DecodeASCII(s, size, errors);

    /* Decode via the codec registry */
    buffer = PyBuffer_FromMemory((void *)s, size);
    if (buffer == NULL)
        goto onError;
    unicode = PyCodec_Decode(buffer, encoding, errors);
    return unicode;

 onError:
    Py_XDECREF(buffer);
    return NULL;
}
We can see that when encoding is NULL, the encoding used is PyUnicode_GetDefaultEncoding(), which is exactly the value returned by sys.getdefaultencoding(). Python's default is ascii:
static char unicode_default_encoding[100 + 1] = "ascii";

const char *
PyUnicode_GetDefaultEncoding(void)
{
    return unicode_default_encoding;
}
Here unicode_default_encoding is a static variable with enough space allocated for you to specify a different encoding; 100 characters should surely be enough.
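The fallback logic in the C code above can be mirrored in Python. This is a rough model under my own naming (_default_encoding, get_default_encoding and decode are illustrative, not CPython names), and it skips the codec shortcuts, which do not change the result.

```python
import codecs

_default_encoding = "ascii"   # plays the role of unicode_default_encoding


def get_default_encoding():
    return _default_encoding


def decode(data, encoding=None, errors="strict"):
    """Rough model of PyUnicode_Decode: a NULL (None) encoding falls
    back to the default encoding."""
    if encoding is None:
        encoding = get_default_encoding()
    return codecs.decode(data, encoding, errors)


ascii_result = decode(b"abc")                              # default is enough
utf8_result = decode(u"\u4e2d".encode("utf-8"), "utf-8")   # explicit encoding

try:
    decode(u"\u4e2d".encode("utf-8"))   # no encoding given: ASCII is used
    fallback_failed = False
except UnicodeDecodeError:
    fallback_failed = True
```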
Next, let's look at getdefaultencoding and setdefaultencoding in the sys module:
static PyObject *
sys_getdefaultencoding(PyObject *self)
{
    return PyString_FromString(PyUnicode_GetDefaultEncoding());
}

static PyObject *
sys_setdefaultencoding(PyObject *self, PyObject *args)
{
    char *encoding;
    if (!PyArg_ParseTuple(args, "s:setdefaultencoding", &encoding))
        return NULL;
    if (PyUnicode_SetDefaultEncoding(encoding))
        return NULL;
    Py_INCREF(Py_None);
    return Py_None;
}
PyUnicode_SetDefaultEncoding, as you would guess, sets the unicode_default_encoding array; Python uses strncpy for this:
int
PyUnicode_SetDefaultEncoding(const char *encoding)
{
    PyObject *v;

    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);
    if (v == NULL)
        goto onError;
    Py_DECREF(v);
    strncpy(unicode_default_encoding,
            encoding,
            sizeof(unicode_default_encoding) - 1);
    return 0;

 onError:
    return -1;
}
The reason we have to call reload(sys) before sys.setdefaultencoding("utf8") is that Python's site.py contains this operation:
if hasattr(sys, "setdefaultencoding"):
    del sys.setdefaultencoding
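For reference, the familiar workaround looks roughly like this. Treat it as a sketch of the Python 2 idiom rather than a recommendation: it changes an interpreter-wide setting, and the version guard is my addition so that the snippet also runs on Python 3, where the function does not exist at all.

```python
import sys

# After start-up, site.py has removed the function (Python 3 never had it).
visible_after_startup = hasattr(sys, "setdefaultencoding")

if sys.version_info[0] == 2:
    # Python 2 only: reloading the module restores the attribute that
    # site.py deleted, so the call below becomes possible again.
    reload(sys)
    sys.setdefaultencoding("utf8")

default_encoding = sys.getdefaultencoding()
print(visible_after_startup, default_encoding)
```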
Of course, you can also customize site.py and modify setencoding so that it uses the locale setting, that is, change the first if 0 to if 1. On Windows the locale encoding is typically cp936, while on servers it is generally UTF-8.
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii"  # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding)  # Needs Python Unicode build !
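To see what the locale branch would pick up on a given machine, something like the following can be run. It is a sketch that uses locale.getpreferredencoding instead of the getdefaultlocale call in site.py, since newer Python 3 releases deprecate the latter; the printed locale value depends on the system.

```python
import codecs
import locale

# The encoding reported by the current locale (system dependent).
locale_encoding = locale.getpreferredencoding(False)

# Codec names are aliases in the registry: cp936 resolves to the gbk codec.
cp936_name = codecs.lookup("cp936").name

print(locale_encoding)
print(cp936_name)
```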
So Python encoding is not that difficult.
To handle Python encodings comfortably, you need to know:
1. The difference between Unicode and UTF-8 or GBK, and how to convert between Unicode and a specific encoding
2. A byte string is converted to unicode when it is concatenated with a unicode object, and str(unicode) converts back to a byte string
3. When the specific encoding is not known, the system default encoding, ASCII, is used; it can be changed with sys.setdefaultencoding
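The first point in the list can be made concrete with a single character. This is a small illustrative sketch; U+4E2D is an arbitrary example, and the byte values in the comments are its standard UTF-8 and GBK encodings.

```python
text = u"\u4e2d"                      # one Unicode code point, U+4E2D

utf8_bytes = text.encode("utf-8")     # 3 bytes: e4 b8 ad
gbk_bytes = text.encode("gbk")        # 2 bytes: d6 d0

# Converting between two encodings always goes through Unicode text.
gbk_to_utf8 = gbk_bytes.decode("gbk").encode("utf-8")

# Decoding with the wrong codec fails here, because d6 d0 is not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_ok = True
except UnicodeDecodeError:
    wrong_codec_ok = False

print(len(utf8_bytes), len(gbk_bytes), wrong_codec_ok)
```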
If you can explain the following, you should be able to handle Python's annoying encoding problems.
Playing with Python's annoying coding problems