Mastering Python's annoying encoding problems

Source: Internet
Author: User
Tags: locale, locale setting, strcmp

Encoding problems in Python are a hurdle that almost every novice runs into, but once you fully understand the pitfall you can get past it no matter what form it takes. I ran into exactly this problem recently, so let's look at it together.

It started when I was reviewing a colleague's upload feature and saw the following code, where self.fp is the handle of the uploaded file:

fpdata = [line.strip().decode('gbk').encode('utf-8').decode('utf-8') for line in self.fp]
data = [''.join(['(', self.game, ',', ','.join(map(lambda x: "'%s'" % x, d.split(','))), ')']) for d in fpdata[1:]]

This code exposes two problems:
1. The default encoding is GBK. Why not UTF-8?
2. .encode('utf-8').decode('utf-8') is completely unnecessary, because the result of decode('gbk') is already a unicode object.
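The second point can be checked with a few lines. This is a minimal sketch; the sample bytes are an assumption chosen for illustration (the GBK encoding of the two characters U+4E2D U+6587), and the b'' and u'' prefixes are used so that it behaves the same on Python 2.7 and Python 3.

```python
# Illustrative sample: GBK bytes for the characters U+4E2D U+6587.
gbk_line = b'\xd6\xd0\xce\xc4'

# decode('gbk') already produces a text (unicode) object.
decoded = gbk_line.decode('gbk')

# Encoding to UTF-8 and decoding straight back changes nothing.
round_tripped = decoded.encode('utf-8').decode('utf-8')

is_text = isinstance(decoded, type(u''))
same = (round_tripped == decoded)
```

Both is_text and same come out true, so the extra encode/decode pair only costs time.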

I suggested that uploaded text be encoded as UTF-8, so the code became this:

fpdata = [line.strip() for line in self.fp if line.strip()]
data = [''.join(['(', self.game, ',', ','.join(map(lambda x: "'%s'" % x, d.split(','))), ')']) for d in fpdata[1:]]

Testing it raised UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128). A novice will probably find this exception a real headache. What does it mean?
It means that while a str was being decoded to unicode with the ASCII codec, the decoder hit the byte 0xe4, which is outside the ASCII range, so the decode failed.

For background: we use Python 2.7, this is a Django project, and self.game is a unicode object. An experienced reader will see at a glance that this is a sys.setdefaultencoding problem; at bottom, it is the problem of joining a byte string with a unicode object.

But I had clearly set the default encoding in settings.py! When I took a closer look, I found I had made a silly mistake.

See what is going on? Every item in fpdata is a UTF-8 encoded str, while self.game is a unicode object, so the join decodes each UTF-8 str into a unicode object. But Python does not know that the bytes are UTF-8; it decodes with ASCII by default, so the error is not surprising.
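The failure can be reproduced without Django by doing explicitly what the join does implicitly on Python 2. This is a minimal sketch; decode_with is a hypothetical helper, and the sample character is an assumption picked because its UTF-8 encoding starts with the byte 0xe4 from the error message.

```python
# UTF-8 bytes of U+4E2D are e4 b8 ad, so the first byte is 0xe4.
utf8_item = u'\u4e2d'.encode('utf-8')

def decode_with(encoding):
    """Return (ok, text_or_message) for decoding utf8_item with encoding."""
    try:
        return True, utf8_item.decode(encoding)
    except UnicodeDecodeError as exc:
        return False, str(exc)

# What Python 2 does implicitly when the default encoding is ASCII.
ascii_ok, ascii_msg = decode_with('ascii')

# What we actually wanted.
utf8_ok, utf8_text = decode_with('utf-8')
```

The ASCII attempt fails on the very first byte, while the UTF-8 attempt returns the expected character.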

This is why people recommend adding sys.setdefaultencoding("utf8") when they hit encoding problems. But why does adding it help? Let's look at what happens underneath.

When a str is concatenated with a unicode object, that is a + b, the work is done by string_concat, which is also what PyString_Concat calls.

/* stringobject.c */
void
PyString_Concat(register PyObject **pv, register PyObject *w)
{
    register PyObject *v;
    if (*pv == NULL)
        return;
    if (w == NULL || !PyString_Check(*pv)) {
        Py_CLEAR(*pv);
        return;
    }
    v = string_concat((PyStringObject *) *pv, w);
    Py_DECREF(*pv);
    *pv = v;
}

static PyObject *
string_concat(register PyStringObject *a, register PyObject *bb)
{
    register Py_ssize_t size;
    register PyStringObject *op;
    if (!PyString_Check(bb)) {
        if (PyUnicode_Check(bb))
            return PyUnicode_Concat((PyObject *)a, bb);
        if (PyByteArray_Check(bb))
            return PyByteArray_Concat((PyObject *)a, bb);
        PyErr_Format(PyExc_TypeError,
                     "cannot concatenate 'str' and '%.200s' objects",
                     Py_TYPE(bb)->tp_name);
        return NULL;
    }
    ...
}

If b is detected to be a unicode object, PyUnicode_Concat is called.

PyObject *PyUnicode_Concat(PyObject *left, PyObject *right)
{
    PyUnicodeObject *u = NULL, *v = NULL, *w;

    /* Coerce the arguments */
    u = (PyUnicodeObject *)PyUnicode_FromObject(left);
    v = (PyUnicodeObject *)PyUnicode_FromObject(right);
    w = _PyUnicode_New(u->length + v->length);
    Py_DECREF(v);
    return (PyObject *)w;
}

PyObject *PyUnicode_FromObject(register PyObject *obj)
{
    if (PyUnicode_Check(obj)) {
        /* For a Unicode subtype that's not a Unicode object,
           return a true Unicode object with the same data. */
        return PyUnicode_FromUnicode(PyUnicode_AS_UNICODE(obj),
                                     PyUnicode_GET_SIZE(obj));
    }
    return PyUnicode_FromEncodedObject(obj, NULL, "strict");
}

Because a is not a unicode object, PyUnicode_FromEncodedObject is called to convert a into a unicode object, and the encoding passed in is NULL.

PyObject *PyUnicode_FromEncodedObject(register PyObject *obj,
                                      const char *encoding,
                                      const char *errors)
{
    const char *s = NULL;
    Py_ssize_t len;
    PyObject *v;

    /* Coerce object */
    if (PyString_Check(obj)) {
        s = PyString_AS_STRING(obj);
        len = PyString_GET_SIZE(obj);
    }

    /* Convert to Unicode */
    v = PyUnicode_Decode(s, len, encoding, errors);
    return v;
}

PyObject *PyUnicode_Decode(const char *s,
                           Py_ssize_t size,
                           const char *encoding,
                           const char *errors)
{
    PyObject *buffer = NULL, *unicode;

    if (encoding == NULL)
        encoding = PyUnicode_GetDefaultEncoding();

    /* Shortcuts for common default encodings */
    if (strcmp(encoding, "utf-8") == 0)
        return PyUnicode_DecodeUTF8(s, size, errors);
    else if (strcmp(encoding, "latin-1") == 0)
        return PyUnicode_DecodeLatin1(s, size, errors);
    else if (strcmp(encoding, "ascii") == 0)
        return PyUnicode_DecodeASCII(s, size, errors);

    /* Decode via the codec registry */
    buffer = PyBuffer_FromMemory((void *)s, size);
    if (buffer == NULL)
        goto onError;
    unicode = PyCodec_Decode(buffer, encoding, errors);
    return unicode;
}

We can see that when encoding is NULL, it is replaced by PyUnicode_GetDefaultEncoding(), which is in fact the value returned by sys.getdefaultencoding(). Python's default is ASCII.
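The call chain traced so far can be modelled in a few lines of Python. This is only a sketch of the control flow, not CPython's implementation; DEFAULT_ENCODING, decode_bytes, to_text and concat are hypothetical names standing in for unicode_default_encoding, PyUnicode_Decode, PyUnicode_FromObject and string_concat.

```python
DEFAULT_ENCODING = 'ascii'  # stands in for unicode_default_encoding
TEXT_TYPE = type(u'')       # unicode on Python 2, str on Python 3

def decode_bytes(data, encoding=None, errors='strict'):
    """Model of PyUnicode_Decode: use the default when no encoding is given."""
    if encoding is None:
        encoding = DEFAULT_ENCODING
    return data.decode(encoding, errors)

def to_text(obj):
    """Model of PyUnicode_FromObject: text passes through, bytes are decoded."""
    if isinstance(obj, TEXT_TYPE):
        return obj
    return decode_bytes(obj, None, 'strict')

def concat(a, b):
    """Model of string_concat for a byte string a and an arbitrary b."""
    if isinstance(b, TEXT_TYPE):
        # Mixed case: both sides are coerced to text before joining.
        return to_text(a) + to_text(b)
    return a + b
```

With this model, concat(b'abc', u'd') succeeds because the bytes are pure ASCII, while concat(b'\xe4\xb8\xad', u'x') raises UnicodeDecodeError, which matches the behaviour described above.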

static char unicode_default_encoding[100 + 1] = "ascii";

const char *PyUnicode_GetDefaultEncoding(void)
{
    return unicode_default_encoding;
}

Here unicode_default_encoding is a static array with enough room for you to specify a different encoding; 100 characters should certainly be enough.

Next, let's look at getdefaultencoding and setdefaultencoding in the sys module.

static PyObject *
sys_getdefaultencoding(PyObject *self)
{
    return PyString_FromString(PyUnicode_GetDefaultEncoding());
}

static PyObject *
sys_setdefaultencoding(PyObject *self, PyObject *args)
{
    char *encoding;
    if (!PyArg_ParseTuple(args, "s:setdefaultencoding", &encoding))
        return NULL;
    if (PyUnicode_SetDefaultEncoding(encoding))
        return NULL;
    Py_INCREF(Py_None);
    return Py_None;
}

As you would expect, PyUnicode_SetDefaultEncoding sets the unicode_default_encoding array; Python uses strncpy to do it.

int PyUnicode_SetDefaultEncoding(const char *encoding)
{
    PyObject *v;

    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);
    if (v == NULL)
        goto onError;
    Py_DECREF(v);
    strncpy(unicode_default_encoding,
            encoding,
            sizeof(unicode_default_encoding) - 1);
    return 0;

 onError:
    return -1;
}
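The same two steps, validating the name through the codec registry and then copying at most 100 characters, can be sketched in Python. The names set_default_encoding, get_default_encoding and MAX_NAME_LEN are hypothetical; codecs.lookup is used here as the Python-level counterpart of _PyCodec_Lookup.

```python
import codecs

MAX_NAME_LEN = 100          # mirrors sizeof(unicode_default_encoding) - 1
_default_encoding = 'ascii'

def set_default_encoding(name):
    """Validate name via the codec registry, then store at most 100 chars."""
    global _default_encoding
    codecs.lookup(name)     # raises LookupError for an unknown encoding
    _default_encoding = name[:MAX_NAME_LEN]

def get_default_encoding():
    """Model of PyUnicode_GetDefaultEncoding."""
    return _default_encoding
```

An unknown name leaves the stored value untouched, just as the C function jumps to onError before reaching strncpy.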

The reason we used to call reload(sys) before sys.setdefaultencoding("utf8") is that Python's site.py contains this operation:

    if hasattr(sys, "setdefaultencoding"):
        del sys.setdefaultencoding
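Put together, the familiar workaround looks roughly like the sketch below. It is shown as the commonly used trick rather than a recommended practice; force_utf8_default is a hypothetical name, and the Python 3 branch is an addition so the snippet can run there too, since Python 3 has no sys.setdefaultencoding and its default is always UTF-8.

```python
import sys

def force_utf8_default():
    """Python 2 only: restore sys.setdefaultencoding and switch to UTF-8."""
    if sys.version_info[0] >= 3:
        # Nothing to do on Python 3; the default encoding is fixed.
        return sys.getdefaultencoding()
    reload(sys)  # Python 2 builtin; brings back the attribute site.py deleted
    sys.setdefaultencoding('utf-8')
    return sys.getdefaultencoding()

current_default = force_utf8_default()
```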

Of course, you can also customize site.py: modify setencoding so that it uses the locale setting, that is, change if 0 to if 1. On Windows the locale encoding is generally cp936, while on servers it is generally UTF-8.

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii"  # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding)  # Needs Python Unicode build !
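To see which encoding the locale branch would pick on a given machine without editing site.py, a sketch like the following can help. choose_default_encoding is a hypothetical helper, and it swaps in locale.getpreferredencoding(False) for the locale.getdefaultlocale() call used by site.py, because the latter is deprecated in recent Python 3 releases; the two can report different names on some systems.

```python
import locale

def choose_default_encoding(use_locale=False):
    """Mirror the first two branches of setencoding() without side effects."""
    encoding = 'ascii'
    if use_locale:
        # Assumption: getpreferredencoding stands in for getdefaultlocale()[1].
        loc_encoding = locale.getpreferredencoding(False)
        if loc_encoding:
            encoding = loc_encoding
    return encoding

without_locale = choose_default_encoding()
with_locale = choose_default_encoding(use_locale=True)
```

Printing with_locale on a Windows desktop and on a Linux server is a quick way to confirm the cp936 versus UTF-8 difference mentioned above.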

So Python encoding is not hard.
To master Python encoding you need to know:
1. The difference between unicode and UTF-8 or GBK, and how to convert between unicode and a specific encoding.
2. A str is converted to unicode when it is concatenated with a unicode object, and str(u) converts a unicode object u to a str.
3. When the specific encoding is not known, the system default encoding, ASCII, is used; it can be changed with sys.setdefaultencoding.
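Applied to the upload code, these points suggest decoding each line explicitly at the input boundary instead of relying on the default encoding. The sketch below is one way to do that under stated assumptions: build_rows is a hypothetical function, the upload is assumed to be UTF-8, the first line is assumed to be a header (the original skips it with fpdata[1:]), and the row layout follows the reconstructed join from the original snippet.

```python
def build_rows(lines, game):
    """Decode UTF-8 byte lines first, then build "(game,'a','b')" rows."""
    # Point 1: convert bytes to unicode explicitly, with a known encoding.
    fpdata = [line.strip().decode('utf-8') for line in lines if line.strip()]
    # Point 2: every piece joined below is already unicode, so nothing
    # is decoded implicitly with the ASCII default.
    return [
        u''.join([u'(', game, u',',
                  u','.join(u"'%s'" % x for x in d.split(u',')),
                  u')'])
        for d in fpdata[1:]  # skip the assumed header line
    ]

# Illustrative input: a header, one data line with non-ASCII bytes, a blank.
sample = [b'name,city\n', b'\xe4\xb8\xad,bj\n', b'\n']
rows = build_rows(sample, u'g1')
```

Because the decoding is explicit, the result does not depend on sys.getdefaultencoding() at all.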

If you can explain the following, you should be able to master Python's annoying encoding problems.
