Python's Annoying Encoding Problem
Encoding is a problem that almost every Python newcomer runs into, but once you understand it thoroughly, you can step over this pitfall for good. I ran into it again recently, so let's take a look.
It started when I reviewed an upload feature implemented by a colleague. Look at the following code; self.fp is the handle of the uploaded file.
fpdata = [line.strip().decode('gbk').encode('utf-8').decode('utf-8') for line in self.fp]
data = [''.join(['(', self.game, ',', ','.join(map(lambda x: "'%s'" % x, d.split(','))), ')']) for d in fpdata[1:]]
This code exposes two problems.
1. The upload is assumed to be gbk encoded. Why not use utf8?
2. encode('utf-8').decode('utf-8') is completely unnecessary. After decode('gbk'), the value is already a unicode object.
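To make the second point concrete, here is a minimal sketch. The sample text and the variable names are hypothetical, not taken from the project; the only claim is that the extra encode/decode pair changes nothing.

```python
# Hypothetical sample: two Chinese characters, as they would arrive
# from a GBK-encoded upload.
text = u"\u4e2d\u6587"
gbk_line = text.encode("gbk")

# Decoding once already yields a unicode object.
once = gbk_line.decode("gbk")

# The extra encode/decode pair is a round trip back to the same value.
round_trip = gbk_line.decode("gbk").encode("utf-8").decode("utf-8")

same = (once == round_trip)
print(same)
```

Since both expressions produce the same unicode object, the shorter one is enough.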
I suggested requiring uploaded text to be utf8 encoded, so the code became:
fpdata = [line.strip() for line in self.fp if line.strip()]
data = [''.join(['(', self.game, ',', ','.join(map(lambda x: "'%s'" % x, d.split(','))), ')']) for d in fpdata[1:]]
The result was UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128). This exception is probably a headache for newcomers. What does it mean?
It means that while a str was being decoded to unicode with the ascii codec, the byte 0xe4 was encountered. That byte is outside the ascii range (0 to 127), so the decode failed.
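The failure can be reproduced in isolation. The sketch below uses a hypothetical three-byte sample (the UTF-8 encoding of one Chinese character, whose first byte is 0xe4) and decodes it explicitly with the ascii codec, which is what Python 2 does implicitly in the situation described here.

```python
# UTF-8 bytes for U+4E2D; the first byte is 0xe4.
data = b"\xe4\xb8\xad"

try:
    data.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    # The ascii codec only accepts bytes in range(128).
    message = str(exc)

print(message)

# The same bytes decode cleanly once the right codec is named.
decoded = data.decode("utf-8")
```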
Now some project background. We use Python 2.7 in a Django project, and self.game is a unicode object. Readers who know this area will recognize it as a sys.setdefaultencoding issue.
But I had set the default encoding in settings.py. I checked, and what I found was a bit of a joke.
So what happened? fpdata holds utf8 encoded str objects, while self.game is a unicode object. During the join, the utf8 strings have to be decoded into unicode objects. However, Python does not know that they are utf8 encoded, so it decodes them with the default encoding, ascii. The error is therefore no surprise.
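One way to avoid the implicit decode altogether is to decode once, explicitly, as soon as the bytes are read. The sketch below is only an illustration under assumptions: game and raw_lines are hypothetical stand-ins for self.game and self.fp, and the upload is assumed to be utf8.

```python
# Hypothetical stand-ins for the values in the article.
game = u"demo_game"  # plays the role of self.game (a unicode object)
raw_lines = [
    b"name,city\n",                  # header line, skipped below
    b"\xe4\xb8\xad,\xe6\x96\x87\n",  # UTF-8 bytes, as read from the upload
]

# Decode at the boundary, naming the encoding explicitly, so that
# everything joined afterwards is already unicode.
fpdata = [line.strip().decode("utf-8") for line in raw_lines if line.strip()]

rows = []
for d in fpdata[1:]:
    quoted = u",".join(u"'%s'" % field for field in d.split(u","))
    rows.append(u"".join([u"(", game, u",", quoted, u")"]))

print(rows)
```

With every operand already unicode, the join never has to guess an encoding, so the default encoding no longer matters for this code path.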
This is why people recommend adding sys.setdefaultencoding("utf8") when you hit encoding problems. But why does that work? Let's look at what happens at the lowest level.
When a str is concatenated with a unicode object, that is, a + b, the work ends up in string_concat, which PyString_Concat also wraps.
/* stringobject.c */
void
PyString_Concat(register PyObject **pv, register PyObject *w)
{
    register PyObject *v;
    if (*pv == NULL)
        return;
    if (w == NULL || !PyString_Check(*pv)) {
        Py_CLEAR(*pv);
        return;
    }
    v = string_concat((PyStringObject *) *pv, w);
    Py_DECREF(*pv);
    *pv = v;
}

static PyObject *
string_concat(register PyStringObject *a, register PyObject *bb)
{
    register Py_ssize_t size;
    register PyStringObject *op;
    if (!PyString_Check(bb)) {
        if (PyUnicode_Check(bb))
            return PyUnicode_Concat((PyObject *)a, bb);
        if (PyByteArray_Check(bb))
            return PyByteArray_Concat((PyObject *)a, bb);
        PyErr_Format(PyExc_TypeError,
                     "cannot concatenate 'str' and '%.200s' objects",
                     Py_TYPE(bb)->tp_name);
        return NULL;
    }
    ...
}
If b (bb in the code) is detected to be a unicode object, PyUnicode_Concat is called.
PyObject *
PyUnicode_Concat(PyObject *left, PyObject *right)
{
    PyUnicodeObject *u = NULL, *v = NULL, *w;
    /* Coerce the two arguments */
    u = (PyUnicodeObject *)PyUnicode_FromObject(left);
    v = (PyUnicodeObject *)PyUnicode_FromObject(right);
    w = _PyUnicode_New(u->length + v->length);
    Py_DECREF(v);
    return (PyObject *)w;
}

PyObject *
PyUnicode_FromObject(register PyObject *obj)
{
    if (PyUnicode_Check(obj)) {
        /* For a Unicode subtype that's not a Unicode object,
           return a true Unicode object with the same data. */
        return PyUnicode_FromUnicode(PyUnicode_AS_UNICODE(obj),
                                     PyUnicode_GET_SIZE(obj));
    }
    return PyUnicode_FromEncodedObject(obj, NULL, "strict");
}
Since a is not a unicode object, PyUnicode_FromEncodedObject is called to convert a to a unicode object. The passed encoding is NULL.
PyObject *
PyUnicode_FromEncodedObject(register PyObject *obj,
                            const char *encoding,
                            const char *errors)
{
    const char *s = NULL;
    Py_ssize_t len;
    PyObject *v;
    /* Coerce object */
    if (PyString_Check(obj)) {
        s = PyString_AS_STRING(obj);
        len = PyString_GET_SIZE(obj);
    }
    /* Convert to Unicode */
    v = PyUnicode_Decode(s, len, encoding, errors);
    return v;
}

PyObject *
PyUnicode_Decode(const char *s, Py_ssize_t size,
                 const char *encoding, const char *errors)
{
    PyObject *buffer = NULL, *unicode;
    if (encoding == NULL)
        encoding = PyUnicode_GetDefaultEncoding();
    /* Shortcuts for common default encodings */
    if (strcmp(encoding, "utf-8") == 0)
        return PyUnicode_DecodeUTF8(s, size, errors);
    else if (strcmp(encoding, "latin-1") == 0)
        return PyUnicode_DecodeLatin1(s, size, errors);
    else if (strcmp(encoding, "ascii") == 0)
        return PyUnicode_DecodeASCII(s, size, errors);
    /* Decode via the codec registry */
    buffer = PyBuffer_FromMemory((void *)s, size);
    if (buffer == NULL)
        goto onError;
    unicode = PyCodec_Decode(buffer, encoding, errors);
    return unicode;
}
When encoding is NULL, it is set to PyUnicode_GetDefaultEncoding(), which is exactly the value that sys.getdefaultencoding() returns. In Python 2 the default is ascii.
static char unicode_default_encoding[100 + 1] = "ascii";

const char *
PyUnicode_GetDefaultEncoding(void)
{
    return unicode_default_encoding;
}
Here unicode_default_encoding is a static array with enough room for whatever encoding name you specify; 100 characters should be plenty.
Let's take a look at getdefaultencoding and setdefaultencoding of the sys module.
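static PyObject *
sys_getdefaultencoding(PyObject *self)
{
    return PyString_FromString(PyUnicode_GetDefaultEncoding());
}

static PyObject *
sys_setdefaultencoding(PyObject *self, PyObject *args)
{
    char *encoding;
    /* Extract the encoding name passed from Python */
    if (!PyArg_ParseTuple(args, "s:setdefaultencoding", &encoding))
        return NULL;
    if (PyUnicode_SetDefaultEncoding(encoding))
        return NULL;
    Py_INCREF(Py_None);
    return Py_None;
}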
As for PyUnicode_SetDefaultEncoding, you can guess without looking that it only has to fill in the unicode_default_encoding array. Python uses strncpy for that.
int
PyUnicode_SetDefaultEncoding(const char *encoding)
{
    PyObject *v;
    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);
    if (v == NULL)
        goto onError;
    Py_DECREF(v);
    strncpy(unicode_default_encoding,
            encoding,
            sizeof(unicode_default_encoding) - 1);
    return 0;

 onError:
    return -1;
}
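The mechanism in the C excerpts can be modeled in a few lines of Python. This is only a rough sketch of the idea, not CPython's implementation: the function names are hypothetical, and a module-level variable plays the role of the unicode_default_encoding array.

```python
import codecs

# Plays the role of the static unicode_default_encoding array.
_default_encoding = "ascii"


def get_default_encoding():
    return _default_encoding


def set_default_encoding(name):
    """Validate the codec name, then store it."""
    global _default_encoding
    codecs.lookup(name)  # raises LookupError for an unknown encoding
    _default_encoding = name


def decode(data, encoding=None, errors="strict"):
    """Fall back to the default encoding when none is given."""
    if encoding is None:
        encoding = get_default_encoding()
    return data.decode(encoding, errors)


data = b"\xe4\xb8\xad"  # UTF-8 bytes for U+4E2D

try:
    decode(data)
    failed_before = False
except UnicodeDecodeError:
    failed_before = True  # default is still "ascii"

set_default_encoding("utf-8")
after = decode(data)  # now succeeds without naming an encoding
```

The model shows why changing the default makes the error disappear: callers that pass no encoding silently pick up whatever name was stored last.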
The reason we have to call reload(sys) before sys.setdefaultencoding("utf8") is that Python's site.py does the following:
if hasattr(sys, "setdefaultencoding"):
    del sys.setdefaultencoding
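For reference, the reload idiom mentioned above looks roughly like this. It is a sketch that only does something on Python 2; the version guard is an addition for illustration, since on Python 3 sys.setdefaultencoding does not exist and the default is always utf-8.

```python
import sys

if sys.version_info[0] == 2:
    # site.py deleted sys.setdefaultencoding at startup. Reloading the
    # module re-creates the attribute so that it can be called again.
    reload(sys)  # builtin in Python 2 only
    sys.setdefaultencoding("utf-8")

default_encoding = sys.getdefaultencoding()
print(default_encoding)
```

Note that the setting is process-wide, so it affects every implicit str/unicode conversion in every module that is loaded.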
Of course, you can also customize site.py and modify setencoding so that it uses the locale setting, that is, change the first if 0 to if 1. On Chinese-locale Windows the locale encoding is generally cp936, while on servers it is generally utf8.
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
So Python encoding is not difficult.
To handle it well, you need to know the following.
1. The differences between unicode, utf8, and gbk, and how to convert between unicode and a specific encoding.
2. When a str is joined with unicode, the str is converted to unicode; str(unicode) converts the unicode object to a str.
3. When no specific encoding is given, Python uses the default, ascii, which can be changed through sys.setdefaultencoding.
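The first point can be illustrated with a small sketch. The sample text is hypothetical; the point is that one unicode value has different byte representations under utf8 and gbk, and that bytes can only be decoded with the codec that produced them.

```python
text = u"\u4e2d\u6587"  # two Chinese characters as a unicode object

utf8_bytes = text.encode("utf-8")  # three bytes per character here
gbk_bytes = text.encode("gbk")     # two bytes per character here

# Each byte string decodes back to the same unicode value with its own codec.
from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = gbk_bytes.decode("gbk")

# Decoding with the wrong codec fails (or, in other cases, yields garbage).
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```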
If you can explain the following phenomena, you should be able to handle Python's annoying encoding problems.