EXT---python 3 encoding

Source: Internet
Author: User

A roundup of Python 3 encoding!!!
Python Encoding: The Ultimate Edition

Python encoding is a sad story. All told, I spent about two months wrestling with it on and off. Fortunately, I finally got it sorted out. As a communist, I must share it with everyone. If encoding still gives you headaches, come along with me and let's uncover the truth about Python encoding!

I. What is encoding?
The basic concept is simple. First, we start with a piece of information, a message, which exists in a form that humans can understand. I'll call this form "plaintext" (plain text). For English speakers, English words printed on paper or displayed on a screen count as plaintext.

Second, we need to be able to turn the plaintext message into some other representation, and to turn that encoded text back into plaintext. The conversion from plaintext to encoded text is called "encoding", and the conversion from encoded text back to plaintext is called "decoding".

Encoding is a big problem. If you don't solve it thoroughly, it is like a little snake in the jungle that bites you from time to time.
So what exactly is encoding?

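// ASCII
Remember one sentence: all data in a computer, whether text, images, video, or audio, is ultimately stored as binary, like 01010101. Put simply, computers only understand binary numbers! So the goal is clear: how do we map each symbol we recognize to a unique group of binary digits? Our comrades in America thought of using the high or low state of a voltage level to stand for 0 or 1. Eight levels taken as a group can represent 256 different states, and each state maps to exactly one character, for example A ---> 01000001. English has only 26 letters, and even counting upper and lower case, digits, and some special characters, 128 states are enough. Each level is called a bit, and by convention 8 bits make up one byte, so a computer can store English text using 128 different byte values. This is ASCII encoding.

// Extended ANSI encoding
As just mentioned, a byte has eight bits, but at first the highest bit was unused and defaulted to 0. Later, so that computers could also represent Latin (Western European) letters, that highest bit was put to use as well, and the values from 128 to 255 were mapped to those characters. At that point one byte was completely used up!

// GB2312
When computers crossed the ocean to China, a problem arose: the computer did not know Chinese and of course could not display it, and every state of a byte was already taken. The evil imperialists never gave up their wish to destroy us! Our Party was great too: relying on our own efforts, we wrote a new table ourselves, bluntly deleting all the Latin characters that had been mapped through the extended eighth bit. The rule: a byte no greater than 127 keeps its original meaning, but two bytes greater than 127 in a row represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7 and the second byte (the low byte) runs from 0xA1 to 0xFE, so we can put together about 7,000 Simplified Chinese characters and symbols. This Chinese character scheme is called "GB2312". GB2312 is a Chinese extension of ASCII.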
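
The byte layout described above can be checked with a minimal Python 3 sketch. The sample characters 'A' and '中' are chosen here for illustration and do not come from the original text.

```python
# ASCII: one character maps to one byte whose highest bit is 0.
ascii_bytes = 'A'.encode('ascii')
print(format(ascii_bytes[0], '08b'))   # 01000001

# GB2312: one Chinese character maps to two bytes, each greater than 127.
gb_bytes = '中'.encode('gb2312')
print(gb_bytes)                        # b'\xd6\xd0'
print([b > 127 for b in gb_bytes])     # [True, True]

# ASCII text is unchanged under GB2312, which is why GB2312 counts as
# an extension of ASCII.
print('A'.encode('gb2312') == ascii_bytes)   # True
```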

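// GBK and GB18030
But there are too many Chinese characters, and GB2312 was not enough either. So a new rule was made: as long as the first byte is greater than 127, it always marks the start of a Chinese character, regardless of whether the byte that follows belongs to the extended character set. The resulting extended scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.

// Unicode
Many other countries produced their own encoding standards, and none of them supported the others. This caused many problems. So, to unify encodings, international standards bodies proposed a single standard: Unicode. The early form of Unicode (UCS-2) represents each character with two bytes, which gives 65,536 possible values. That covers the commonly used characters of the world's languages, and the standard has since been extended beyond 16 bits, to more than a million code points.

// UTF-8
If Unicode unified everything, why do we still need UTF-8? Think about it: for the English-speaking world one byte is plenty. To store A, 01000001 used to be enough, but now, eating from Unicode's communal pot, it takes two bytes: 00000000 01000001. Far too wasteful! So American scientists came up with a brilliant idea: UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It uses 1 to 4 bytes per symbol, with the length depending on the symbol. Characters in the ASCII range take a single byte, so UTF-8 is compatible with ASCII. The clear benefit: although the data in memory is Unicode, when it has to be saved to disk or sent over the network, raw Unicode is far less space-efficient than UTF-8! That is why UTF-8 is the recommended encoding.

The relationship between Unicode and UTF-8, in one sentence: Unicode is the in-memory representation scheme (the specification), while UTF is the scheme for storing and transmitting Unicode (the implementation). That is the difference between UTF and Unicode.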
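
To make the variable-length behaviour of UTF-8 concrete, here is a small Python 3 sketch. The four sample characters (an ASCII letter, an accented Latin letter, a Chinese character, and an emoji outside the 16-bit range) are illustrative choices, not taken from the original text.

```python
# Illustrative samples: 'A', U+00E9 (e with acute accent), '苑', U+1F600 (emoji).
samples = ['A', '\u00e9', '苑', '\U0001F600']

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode('utf-8')
    utf8_lengths[ch] = len(encoded)
    # Show the code point, the encoded bytes, and how many bytes were needed.
    print('U+%04X' % ord(ch), encoded, len(encoded))

# Characters in the ASCII range keep their single-byte ASCII value.
assert 'A'.encode('utf-8') == 'A'.encode('ascii')
```

The lengths come out as 1, 2, 3, and 4 bytes respectively, matching the 1 to 4 byte range stated above.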

Supplement: how UTF-8 saves disk space and traffic
s = "I'm 苑昊"
In the Unicode character set, this string corresponds to an encoding table like this:

I 0049
' 0027
m 006d
(space) 0020
苑 82d1
昊 660a
Each character corresponds to a hexadecimal number.
Computers only understand binary, so stored strictly as Unicode (UCS-2) the string looks like this:

I 00000000 01001001
' 00000000 00100111
m 00000000 01101101
(space) 00000000 00100000
苑 10000010 11010001
昊 01100110 00001010
The string occupies 12 bytes in total. But look at the binary for the English characters: the first 9 bits are all 0! That wastes your hard drive and wastes your traffic. What to do? UTF-8:

I 01001001
' 00100111
m 01101101
(space) 00100000
苑 11101000 10001011 10010001
昊 11100110 10011000 10001010
UTF-8 uses 10 bytes, two fewer than Unicode (UCS-2). Since our programs contain far more English than Chinese, the savings add up considerably!

Remember: all of this is done to save your hard drive and your traffic.
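
The byte counts above can be reproduced with a short Python 3 sketch. It assumes the 'utf-16-be' codec as a stand-in for UCS-2, which gives the same two-byte form for every character in this string.

```python
s = "I'm 苑昊"

ucs2 = s.encode('utf-16-be')   # two bytes per character here, no byte-order mark
utf8 = s.encode('utf-8')

print(len(s))      # 6 characters
print(len(ucs2))   # 12 bytes
print(len(utf8))   # 10 bytes

# Print each character's bit pattern under both encodings.
for ch in s:
    wide = ' '.join(format(b, '08b') for b in ch.encode('utf-16-be'))
    narrow = ' '.join(format(b, '08b') for b in ch.encode('utf-8'))
    print(repr(ch), wide, '|', narrow)
```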

II. String encoding in Py2
In Py2 there are two string types: str and unicode. Note that these are just two names that Python defines; the key question is what data each of the two types actually holds in memory when the program runs.

Let's take a look:

#coding:utf8

s1 = '苑'

print type(s1)  # <type 'str'>
print repr(s1)  # '\xe8\x8b\x91'

s2 = u'苑'
print type(s2)  # <type 'unicode'>
print repr(s2)  # u'\u82d1'
The built-in function repr shows us what is actually stored. It turns out that str holds byte data and unicode holds Unicode data. So how are the two kinds of data related, and how do we convert between them? This is where encoding (encode) and decoding (decode) come in.

s1 = u'苑'
print repr(s1)  # u'\u82d1'

b = s1.encode('utf8')
print b
print type(b)  # <type 'str'>
print repr(b)  # '\xe8\x8b\x91'

s2 = '苑昊'
u = s2.decode('utf8')
print u        # 苑昊
print type(u)  # <type 'unicode'>
print repr(u)  # u'\u82d1\u660a'

# Note
u2 = s2.decode('gbk')
print u2  # 鑻戞槉

print len('苑昊')  # 6
UTF-8 and GBK are both just encoding rules, that is, rules for turning Unicode data into byte data. Bytes encoded with UTF-8 must therefore be decoded with UTF-8; otherwise you get garbled text or an error.
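
The rule that bytes must be decoded with the codec that produced them can be sketched as follows. This is Python 3 rather than the Python 2 used in this section, and `try_decode` is a hypothetical helper introduced only for this illustration.

```python
text = '苑昊'

utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# The same text becomes different bytes under different encoding rules.
print(utf8_bytes)   # b'\xe8\x8b\x91\xe6\x98\x8a'
print(gbk_bytes)

def try_decode(data, encoding):
    """Hypothetical helper: decode, or report the failure instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return 'error: ' + exc.reason

print(try_decode(utf8_bytes, 'utf-8'))   # matching codec: original text
print(try_decode(gbk_bytes, 'gbk'))      # matching codec: original text
print(try_decode(gbk_bytes, 'utf-8'))    # mismatched codec: garbled text or an error
```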

Features of Py2 encoding:

#coding:utf8

print '苑昊'        # 苑昊
print repr('苑昊')  # '\xe8\x8b\x91\xe6\x98\x8a'

print (u"hello" + "yuan")

print (u'苑昊' + '最帅')  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6
                     # in position 0: ordinal not in range(128)

Python 2 silently hides the byte-to-unicode conversion. As long as all the data is ASCII, every conversion is correct, but once a non-ASCII character sneaks into your program, the default decoding fails and raises a UnicodeDecodeError. Py2's approach makes ASCII easier to handle; the price you pay is that it fails on non-ASCII data.
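
What Python 2 does behind the scenes can be imitated in Python 3 with an explicit ASCII decode. The function `py2_style_concat` below is a hypothetical stand-in written for this illustration, not an actual Python 2 API.

```python
def py2_style_concat(text, raw):
    """Hypothetical helper mimicking Python 2: decode bytes as ASCII, then join."""
    return text + raw.decode('ascii')

# Pure ASCII bytes: the hidden conversion works and nobody notices it.
print(py2_style_concat('hello', b'yuan'))   # helloyuan

# Non-ASCII bytes: the same hidden conversion fails.
try:
    py2_style_concat('苑昊', '最帅'.encode('utf-8'))
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
```

The message printed in the second case reports byte 0xe6 at position 0, the same failure shown in the Py2 example above.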

III. String encoding in Py3
Python 3 renamed the unicode type to str, and the old str type was replaced by bytes.

Py3 also has two data types, str and bytes: str holds Unicode data and bytes holds byte data. Compared with Py2, it looks like just a change of names.

import json

s = '苑昊'
print(type(s))        # <class 'str'>
print(json.dumps(s))  # "\u82d1\u660a"

b = s.encode('utf8')
print(type(b))  # <class 'bytes'>
print(b)        # b'\xe8\x8b\x91\xe6\x98\x8a'

u = b.decode('utf8')
print(type(u))        # <class 'str'>
print(u)              # 苑昊
print(json.dumps(u))  # "\u82d1\u660a"

print(len('苑昊'))  # 2

Py3's encoding philosophy:
The most important new feature of Python 3 is probably its clearer distinction between text and binary data; it no longer decodes byte strings automatically. Text is always Unicode, represented by the str type, and binary data is represented by the bytes type. Python 3 never mixes str and bytes implicitly, which makes the distinction between them especially clear. You cannot concatenate a string and a byte string, search for a string inside a byte string (or vice versa), or pass a string to a function that expects a byte string (or vice versa).

print('alvin' + u'yuan')  # byte string + unicode in Py2: alvinyuan

print(b'alvin' + 'yuan')  # byte string + unicode in Py3: TypeError, bytes and str cannot be concatenated
Note: in both Py2 and Py3, Unicode data corresponds directly to plaintext; printing Unicode data displays the corresponding plaintext (both English and Chinese).
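
The three restrictions listed above (concatenating, and searching in either direction) can be observed directly in Python 3. The helper `raises_type_error` is a hypothetical name used only in this sketch.

```python
name = '苑昊'
raw = name.encode('utf-8')

def raises_type_error(operation):
    """Hypothetical helper: report whether calling operation raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False

print(raises_type_error(lambda: raw + name))    # True: cannot concatenate
print(raises_type_error(lambda: name in raw))   # True: cannot search for str in bytes
print(raises_type_error(lambda: raw in name))   # True: cannot search for bytes in str

# The explicit route: convert one side so both operands have the same type.
print(raw.decode('utf-8') + name)               # 苑昊苑昊
print(raw + name.encode('utf-8') == raw * 2)    # True
```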

IV. File encoding from disk to memory (******)
At last we have reached the main point!

Setting program execution aside for a moment, let me ask you something. Everyone has used a text editor; if you don't know what that is, you have surely used Word. When we edit text in Word, whether in Chinese or English, the computer itself does not understand it. So in what form is the data held in memory before it is saved? Right, as Unicode data. Why Unicode? Because its name says it all: the universal code! That is, whether it is English, Chinese, Japanese, or Latin, every character in the world has a unique code in it, so its compatibility is the best.

Okay, so in what form is the data stored when we save it to disk?

The answer is a byte string (bytes) encoded in some way. UTF-8, for example, is a variable-length encoding that saves space; there are also historical products such as GBK. So every text editor has a default encoding for saving files, UTF-8 in some cases and GBK in others. When we click Save, the editor has "silently" done the encoding for us.

When we open the file again, the software silently does the decoding for us, turning the data back into Unicode so it can be shown to the user as plaintext! So Unicode is the data closer to the user, and bytes is the data closer to the computer.
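
The save and open steps an editor performs can be imitated in Python 3 as a rough sketch. The helper `save_and_measure` and the use of a temporary file are assumptions made for this illustration; a real editor does more than this.

```python
import os
import tempfile

text = '苑昊'  # Unicode data in memory

def save_and_measure(content, encoding):
    """Hypothetical helper: save text with an encoding, return the raw bytes on disk."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        with open(path, 'w', encoding=encoding) as f:   # "Save": encode to bytes
            f.write(content)
        with open(path, 'rb') as f:                     # inspect what is on disk
            return f.read()
    finally:
        os.remove(path)

on_disk_utf8 = save_and_measure(text, 'utf-8')
on_disk_gbk = save_and_measure(text, 'gbk')

print(len(on_disk_utf8))   # 6 bytes on disk
print(len(on_disk_gbk))    # 4 bytes on disk

# "Open": decoding with the encoding used at save time restores the text.
print(on_disk_utf8.decode('utf-8') == text)   # True
print(on_disk_gbk.decode('gbk') == text)      # True
```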

What does this have to do with running our programs?

Let's first establish one idea: the Python interpreter is itself a piece of software, just like a text editor!

Now let's trace the encoding steps a .py file goes through from creation to execution:

Open PyCharm, create the file hello.py, and write:

ret = 1 + 1
s = '苑昊'
print(s)
When we save, hello.py is written to disk using PyCharm's default encoding. When we close the file and open it again, PyCharm reads the content and decodes it with its default encoding, turning it into Unicode in memory, and we see our plaintext.

If instead we click the Run button or run the file from the command line, the Python interpreter is invoked. It opens the file and decodes the bytes stored on disk into Unicode data, just as the editor does. The difference is that the interpreter then compiles that Unicode source into bytecode and executes it on its virtual machine, which in turn drives the CPU through the operating system. Only then is the whole process complete.

So here is the question: our text editor has its own default encoding and decoding scheme. Does the interpreter have one too?

Of course. Py2 defaults to ASCII and Py3 defaults to UTF-8. You can check it like this:

import sys
print(sys.getdefaultencoding())
Do you remember this declaration?

#coding:utf8

Yes. If the Py2 interpreter executes a UTF-8 encoded file, it decodes the UTF-8 bytes with its default ASCII, and as soon as the program contains Chinese, decoding fails. So declaring #coding:utf8 at the top of the file tells the interpreter: do not decode this file with your default encoding; use UTF-8 instead. The Py3 interpreter is much more convenient because it defaults to UTF-8.

Note: the string encoding discussed earlier concerns how data is stored in memory while the program runs. That is a separate process from decoding the source file; do not confuse the two!
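
To see how a coding declaration is picked up, the sketch below uses `tokenize.detect_encoding` from the Python 3 standard library, which reads the declaration from the first lines of a source file. The helper `declared_encoding` and the two sample sources are illustrative assumptions; the gbk declaration is used only so the result differs from the default.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Hypothetical helper: return the encoding Python 3 would use for this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_declaration = b"#coding:gbk\nprint('hi')\n"
without_declaration = b"print('hi')\n"

print(declared_encoding(with_declaration))      # gbk
print(declared_encoding(without_declaration))   # utf-8
```

Without a declaration the Python 3 result is utf-8, in line with the default described above; Python 2 would fall back to ASCII instead.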

V. Common encoding problems
1. Garbled output under cmd
hello.py

#coding:utf8

print('苑昊')
The file is also saved with UTF-8 encoding.

Think about it: why does this run correctly in the IDE with both Python 2 and Python 3, while under cmd.exe Python 3 is correct but Python 2 is garbled?

We run it in the Windows terminal, cmd.exe. Note that cmd.exe is itself a piece of software. When we run python2 hello.py, the Python 2 interpreter (default encoding ASCII) decodes the file using the declared utf8 encoding, and since the file was saved as UTF-8, that part is fine. The problem arises at print '苑昊': the interpreter executes normally and raises no error, but the printed content is handed to cmd.exe for display. In Py2 that content is UTF-8 encoded byte data, whereas cmd.exe's default encoding is GBK, so when cmd.exe decodes UTF-8 bytes as GBK, the result is naturally garbled.

Py3 is correct because what gets passed to cmd is Unicode data, which cmd.exe can recognize, so the display is fine.

Once you understand the principle, there are many ways to fix it, for example:

print(u'苑昊')
After this change, Python 2 under cmd no longer has the problem.
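
The difference between the two interpreters can be modelled in Python 3 as a simplified sketch. It assumes a console whose encoding is GBK and treats the console as something that just decodes the bytes it receives; `console_shows` is a hypothetical model, not how cmd.exe is actually implemented.

```python
CONSOLE_ENCODING = 'gbk'   # assumption: cmd.exe on a Chinese-locale Windows

def console_shows(data):
    """Hypothetical model of the console: decode incoming bytes with its own encoding."""
    return data.decode(CONSOLE_ENCODING, errors='replace')

text = '苑昊'

# Python 2 style: print '苑昊' sends the UTF-8 bytes from the source file unchanged.
py2_output = console_shows(text.encode('utf-8'))

# Python 3 style (and Python 2 with u'苑昊'): the Unicode text is encoded
# for the console before it is sent.
py3_output = console_shows(text.encode(CONSOLE_ENCODING))

print(py2_output)   # garbled
print(py3_output)   # 苑昊
```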

2. Encoding issues in open()
Create a text file named hello and save it as UTF-8:

苑昊，你最帅！
Create index.py in the same directory:

f = open('hello')
print(f.read())
Why is the result normal under Linux (苑昊，你最帅！) but garbled under Windows (output beginning 鑻戞槉) with the Py3 interpreter?

Because a Chinese-locale Windows installation uses GBK as its default encoding, while Linux defaults to UTF-8.

When open() is called without an encoding, Python decodes the file using the operating system's default encoding. On Windows that means decoding a UTF-8 file as GBK, which naturally produces garbled text.

Workaround:

f = open('hello', encoding='utf8')
print(f.read())
If your file is saved as GBK, you don't need to specify encoding under Windows.

Also, if on your Windows machine you don't need to pass encoding='utf8', then your system was installed with UTF-8 as its default encoding or has been switched to UTF-8 by command.

Note: the open function differs between Py2 and Py3; in Py3 it has an encoding=None parameter.
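
The effect of the encoding parameter can be tried on any platform with the Python 3 sketch below. It writes a temporary UTF-8 file instead of the hello file above, and passes encoding='gbk' explicitly to imitate a GBK system default; both choices are assumptions made so the example is self-contained.

```python
import locale
import os
import tempfile

# The encoding open() falls back to in text mode when none is given
# (platform dependent; Python's UTF-8 mode can override it).
print(locale.getpreferredencoding(False))

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, 'w', encoding='utf8') as f:
        f.write('苑昊')

    with open(path, encoding='utf8') as f:                    # matches the file
        right = f.read()
    with open(path, encoding='gbk', errors='replace') as f:   # what a GBK default does
        wrong = f.read()
finally:
    os.remove(path)

print(right)   # 苑昊
print(wrong)   # garbled
```

Passing encoding explicitly, as in the workaround above, makes the result independent of the platform default.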
