When learning Python, converting between character encodings is a common stumbling block. If you do not fully understand encodings, sooner or later they will trip you up.
Python 2.x and Python 3.x also handle character encodings very differently (Python 3 is becoming the mainstream, so this article focuses on Python 3). Let's work through it together.
A previous article outlined the encodings commonly used with Python, so they are not repeated here. If you are unfamiliar with them, read that first: http://www.cnblogs.com/schut/p/8406897.html
I. The relationship between Unicode and UTF-8
Unicode plays two roles:
- It directly supports every language in the world, so each country no longer needs its own legacy encoding and can simply use Unicode (much as English serves as a common language).
- It contains mappings to and from the encodings used by countries around the world.
Unicode solves the correspondence between characters and binary values, but representing every character in a fixed-width Unicode form wastes space. For example, representing "Python" with two bytes per character takes 12 bytes, twice the 6 bytes that ASCII needs. Because memory is relatively plentiful and strings held in memory are usually not very large, text can be processed in memory as Unicode. Stored and transmitted data, however, is generally much larger, and doubling its size would be intolerable.
To solve the storage and transmission problem, the Unicode Transformation Formats (UTF) were introduced: they transform Unicode into byte sequences that save space in storage and on the network.
UTF-8: uses 1, 2, 3, or 4 bytes per character. One byte is preferred; if that is not enough, another byte is added, up to 4 bytes. English letters take 1 byte, most European-language characters take 2, East Asian characters take 3, and other rare or special characters take 4.
UTF-16: uses 2 or 4 bytes per character; 2 bytes are preferred, otherwise 4 are used.
UTF-32: uses 4 bytes for every character.
Summary: UTF is a family of encoding schemes designed for Unicode that saves space in storage and transmission.
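A quick way to check these sizes is to encode a few sample strings in Python 3 and count the bytes. This is only a minimal sketch: the sample characters and the helper name `encoded_sizes` are illustrative choices, not part of the original article. The `-le` codec variants are used so that no byte order mark is included in the count.

```python
def encoded_sizes(text):
    """Return how many bytes `text` occupies in each UTF encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


samples = [
    "Python",       # ASCII letters
    "\u00e9",       # e with acute accent (European)
    "\u8001",       # a Chinese character (East Asian)
    "\U0001F600",   # an emoji outside the Basic Multilingual Plane
]

for sample in samples:
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(sample), encoded_sizes(sample))
```

For "Python" this reports 6, 12, and 24 bytes, which matches the 12-byte figure quoted above for a two-byte-per-character form.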
II. How characters are stored on disk
The first thing to be clear about is that no matter which encoding is used in memory, what is stored on disk is always binary. It is important to understand this point.
For example:
ASCII encoding (United States):
    l    0b1101100
    o    0b1101111
    v    0b1110110
    e    0b1100101
GBK encoding (China):
    老   0b11000000 0b11001111
    男   0b11000100 0b11010000
    孩   0b10111010 0b10100010
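One more point to note: whatever encoding is used when data is written to disk must also be used when it is read back (through a declaration at the top of the file or an explicit conversion). Otherwise the text comes out garbled.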
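The write-then-read rule can be sketched in Python 3 as follows. The temporary file name `demo.txt` and the variable names are assumptions made for this illustration only.

```python
import os
import tempfile

text = "\u8001\u7537\u5b69"  # the three Chinese characters from the GBK table above

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name

    # Store the text on disk as GBK bytes.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Reading with the same encoding recovers the original text.
    with open(path, encoding="gbk") as f:
        same_encoding = f.read()

    # Reading with a different encoding fails here, because these
    # GBK bytes are not a valid UTF-8 sequence.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        wrong_encoding_failed = False
    except UnicodeDecodeError:
        wrong_encoding_failed = True

print(same_encoding == text, wrong_encoding_failed)
```

Not every mismatch raises an error. Some byte sequences happen to be valid in the wrong encoding too, in which case the result is silently garbled text rather than an exception.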
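III. Converting between encodings
Although Unicode and UTF-8 exist, for historical reasons many countries still make heavy use of their own encodings. Chinese editions of Windows, for example, still default to GBK rather than UTF-8.
As a result, Chinese software exported to the United States would show garbled text on American computers, because they do not have the GBK encoding.
So what can be done?
Recall that one of Unicode's roles is to hold the mappings to every country's encoding. This is where that comes in handy.
Whatever encoding your data is stored in, as long as your software converts it to Unicode when reading it from disk into memory for display, it works.
Since all systems and programming languages support Unicode by default, GBK software placed on an American computer is loaded into memory, converted to Unicode, and the Chinese text displays correctly.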
How Python 3 executes a file
- The interpreter finds the source file, decodes it using the encoding declared in the file header, and loads the code into memory as Unicode.
- It parses the code string according to the syntax rules.
- All string variables are held as Unicode.
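The steps above can be imitated in memory without touching the disk. The sketch below builds the bytes of a small GBK-encoded source file and hands them to the built-in `compile`, which honors the coding declaration when given bytes. The file label `<gbk_demo>` and the variable `s` are made up for this illustration.

```python
# Bytes of a tiny source file saved as GBK, with a matching header.
source_text = '# -*- coding: gbk -*-\ns = "\u8001\u7537\u5b69"\n'
source_bytes = source_text.encode("gbk")

# compile() decodes the bytes using the declared encoding, then parses them.
code = compile(source_bytes, "<gbk_demo>", "exec")

namespace = {}
exec(code, namespace)

s = namespace["s"]
# The string variable is Unicode (str), regardless of the file's encoding.
print(type(s).__name__, len(s))
```

Although the file bytes were GBK, the resulting variable is an ordinary `str` of three characters.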
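Write your code in UTF-8 under Python 3, save it, and run it on Windows. It runs normally!
But that is only Python 3. Not every programming language uses Unicode as its default in-memory encoding. The notorious Python 2, for example, does not: its default is ASCII. To write Chinese you must declare the coding in the file header as GBK or UTF-8. After that declaration, the Python 2 interpreter only uses the declared encoding to interpret your code; once the code is loaded into memory, it does not convert strings to Unicode for you. In other words, if your file is encoded as UTF-8, your string variables in memory are also UTF-8. What does this mean? It means that text from your UTF-8 encoded file shows up garbled on Windows.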
In fact, garbled output is the expected result here, and correct output would be the surprise. Python 2 does not automatically convert the file's encoding to Unicode in memory, and text displays correctly on Windows in only two situations:
- The string is GBK encoded.
- The string is Unicode.
So we have to convert manually. When Python 3 automatically converts the file encoding to Unicode, it must be calling some method to do so. Those methods are decode and encode.
They work as follows:
UTF-8 --> decode --> Unicode
Unicode --> encode --> GBK / UTF-8
For example:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Written by Congcong
# Note: str.decode exists only in Python 2, so this example targets Python 2.

s = 'hurried'
print(s)

s1 = s.decode("utf-8")   # UTF-8 to Unicode; decode must be told the current encoding
print(s1, type(s1))

s2 = s1.encode("gbk")    # Unicode to GBK; encode must be told the target encoding
print(s2, type(s2))

s3 = s1.encode("utf-8")  # Unicode to UTF-8; encode must be told the target encoding
print(s3, type(s3))
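The rules are as follows: decode converts bytes in the named encoding to Unicode, and encode converts Unicode to bytes in the named encoding.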
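In Python 3 the same chain starts from a bytes object, because str no longer has a decode method. A minimal sketch, using a two-character Chinese sample chosen for this illustration:

```python
# UTF-8 bytes, as they would arrive from a UTF-8 file or socket.
utf8_bytes = "\u8def\u98de".encode("utf-8")

s1 = utf8_bytes.decode("utf-8")  # UTF-8 bytes -> Unicode str
s2 = s1.encode("gbk")            # Unicode str -> GBK bytes
s3 = s1.encode("utf-8")          # Unicode str -> UTF-8 bytes

print(type(s1).__name__, type(s2).__name__, type(s3).__name__)
# Round trips through Unicode are lossless for these characters.
print(s2.decode("gbk") == s1, s3 == utf8_bytes)
```

Unicode acts as the pivot: to go from UTF-8 to GBK, decode first and then encode.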
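IV. How do you verify that the conversion is correct?
1. Look at the data type; Python 2 has a dedicated unicode type.
2. Look the bytes up in the Unicode encoding table.
Unicode characters can be identified by the dedicated unicode type, but characters encoded as UTF-8 or GBK are both str. So how can you tell which encoding the current string data is in? Some say you can judge by byte length, because a Chinese character takes 3 bytes in UTF-8 and 2 bytes in GBK.
Counting the bytes in the output gives a rough idea of the encoding. To verify a character's encoding precisely, match the hexadecimal values against the encoding table.
For the detailed process, see: https://www.luffycity.com/python-book/di-3-zhang-python-ji-chu-2014-wen-jian-cao-4f5c26-han-shu/33-zi-fu-bian-ma-zhuan-huan.html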
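Both checks, byte count and hexadecimal values, are easy to run in Python 3. This is a rough sketch; the helper name `describe` and the sample text are assumptions for this illustration.

```python
def describe(data):
    """Return the byte count and hexadecimal form of a bytes object."""
    return len(data), data.hex()


text = "\u8def\u98de"  # two Chinese characters

utf8 = text.encode("utf-8")
gbk = text.encode("gbk")

print("utf-8:", describe(utf8))  # 3 bytes per character
print("gbk:  ", describe(gbk))   # 2 bytes per character
```

The hexadecimal strings are what you would compare against the published encoding tables. Byte length alone is only a hint, since ASCII characters take one byte in both encodings.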
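V. The Python bytes type
A group of 8 bits is called a byte and is written in hexadecimal to make it easier to read. We call this the bytes type.
A Python 2 string is really better described as a byte string, as its storage form shows. But Python 2 also has a type called bytes. So is it called both bytes and string?
Yes. In Python 2, bytes == str; they are the same thing.
Besides that, Python 2 has a separate unicode type. Decoding a string turns it into unicode.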
>>> s
'\xe8\xb7\xaf\xe9\xa3\x9e'    # UTF-8
>>> s.decode('utf-8')
u'\u8def\u98de'               # the corresponding positions in the Unicode encoding table
>>> print(s.decode('utf-8'))
路飞                          # characters in Unicode form
Python 2's default encoding is ASCII. As demand grew for supporting Chinese characters, Japanese, French, and other languages, Python prepared to introduce Unicode. But changing the default encoding straight to Unicode was unrealistic: a great deal of software had been developed against the old ASCII default, and changing it would have garbled the text in all of that software. So Python 2 simply added a new character type, the unicode type. If you want your Chinese text to display properly on every computer in the world, you have to convert the string to the unicode type in memory.
>>> s = "路飞"
>>> s
'\xe8\xb7\xaf\xe9\xa3\x9e'
>>> s2 = s.decode("utf-8")
>>> s2
u'\u8def\u98de'
>>> type(s2)
<type 'unicode'>
Python 3 not only changed string encoding to Unicode, it also drew a clear line between str and bytes: str holds Unicode characters, and bytes is simply binary data.
In Python 3, characters must be Unicode; data in any other encoding is represented as bytes.
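The separation is visible directly in the interpreter. A short Python 3 sketch, with a sample string chosen for this illustration:

```python
text = "\u8def\u98de"       # str: a sequence of Unicode characters
data = text.encode("utf-8")  # bytes: a sequence of raw byte values

print(type(text).__name__, len(text))  # length counted in characters
print(type(data).__name__, len(data))  # length counted in bytes
print(list(data[:3]))                  # iterating bytes yields integers

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print("implicit mixing allowed:", mixed_ok)
```

Python 2 would have attempted an implicit ASCII conversion in the last step. Python 3 raises TypeError instead, which forces an explicit encode or decode.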
Whenever Python runs into encoding problems, the cause is always an encoding setting that is wrong somewhere.
Common encoding errors are caused by the following:
- the Python interpreter's default encoding
- the Python source file's encoding
- the encoding used by the terminal
- the operating system's language settings
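Most of these settings can be inspected from a running Python 3 interpreter, which is a reasonable first step when debugging. The helper name `encoding_report` is an assumption for this sketch. The source file encoding is not included because it comes from the coding declaration in the file header, not from a runtime setting.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding settings that most often cause trouble."""
    return {
        "interpreter default": sys.getdefaultencoding(),
        # stdout may be replaced by an object without an encoding attribute.
        "terminal (stdout)": getattr(sys.stdout, "encoding", None),
        "locale preferred": locale.getpreferredencoding(False),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

On Python 3 the interpreter default is UTF-8, while the terminal and locale values depend on the operating system and its language settings.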
Python 3: the default file encoding is UTF-8 and strings are Unicode. Code encoded as UTF-8 or GBK is automatically converted to Unicode when loaded into memory, so it displays normally.
Python 2: the default file encoding is ASCII and strings are also ASCII; if the file header declares GBK, strings are GBK. Code encoded as UTF-8 or GBK is not converted to Unicode when loaded into memory; it stays in UTF-8 or GBK.
The content above was adapted and modified from: https://www.luffycity.com/python-book/di-3-zhang-python-ji-chu-2014-wen-jian-cao-4f5c26-han-shu/33-zi-fu-bian-ma-zhuan-huan.html
Conversion between common character encodings in Python