004 - Python Basics - Character encoding and transcoding

Source: Internet
Author: User

I. Three ways to encode

    1. ASCII: a character encoding based on the Latin alphabet, used mainly for modern English. It is a 7-bit code, so it defines 2**7 = 128 characters (code points 0 to 127); when stored in one 8-bit byte, the high bit is unused. The 8-bit "extended ASCII" code pages that add Western European letters can represent at most 2**8 = 256 symbols.
    2. Unicode (also called the unified code or universal code): a character set that assigns every character and symbol a unique number (a code point) in the range U+0000 to U+10FFFF. Early versions assumed that 16 bits (2 bytes, 2**16 = 65536 values) per character would be enough, which is why Unicode is often described as a 2-byte encoding; Unicode itself does not fix how many bytes a character occupies.
    3. UTF-8: a compact, variable-length encoding of Unicode. Instead of a fixed 2 bytes per character, it sorts characters into classes: ASCII characters take 1 byte, most European characters take 2 bytes, most East Asian characters take 3 bytes, and characters outside the Basic Multilingual Plane take 4 bytes. Because it is variable-length, UTF-8 saves a lot of space when storing mostly-ASCII files, and it is compatible with ASCII.
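To make the byte counts above concrete, here is a small sketch that measures how many bytes a few sample characters take in UTF-8. The sample characters are my own picks for illustration, not from the original article.

```python
# Sample characters (illustrative choices): ASCII, Western European,
# East Asian, and one outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s) in UTF-8")

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")

# ASCII itself stops at code point 127; anything above cannot be encoded.
try:
    "\u00e9".encode("ascii")
    ascii_rejects_e_acute = False
except UnicodeEncodeError:
    ascii_rejects_e_acute = True
```

The 1, 2, 3 and 4 byte results line up with the classes described in item 3.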

Detailed article:

http://www.cnblogs.com/yuanchenqi/articles/5956943.html (Python Encoding, Ultimate Edition)

In a word: Unicode is a specification that defines which number represents each character (it is also how text is represented in memory), while UTF is a scheme for how to store and transmit Unicode (an implementation). That is the difference between UTF and Unicode.
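The distinction between the Unicode number of a character and its stored bytes can be seen directly in Python 3. This is only a sketch; the character used is an arbitrary example.

```python
ch = "\u4e2d"  # the character 中, picked as an example

code_point = ord(ch)             # the abstract Unicode number
utf8_bytes = ch.encode("utf-8")  # one concrete way to store that number

print(hex(code_point))  # 0x4e2d
print(utf8_bytes)       # b'\xe4\xb8\xad'

# Decoding the stored bytes recovers the same character.
round_trip = utf8_bytes.decode("utf-8")
```

The code point 0x4E2D is what Unicode specifies; the three bytes are what UTF-8 writes to disk or sends over the network.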

II. Encoding of files from disk to memory

How is the data that we save on disk actually stored?

The answer is a byte string (bytes) encoded in some way. UTF-8, for example, is a variable-length encoding that saves space; there are also legacy encodings such as GBK. So every text editor has a default encoding for saving files, such as UTF-8 or GBK. When we click Save, the editor silently does the encoding for us.

When we open the file again, the software silently does the decoding for us: the bytes are decoded into Unicode, and only then can the text be shown to the user.

So Unicode is the form closer to the user, and bytes are the form closer to the computer.
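The save and open steps can be imitated in a few lines of Python 3. This is a rough sketch of what an editor does, not the code of any real editor; the temporary file and the sample text are assumptions made for the demo.

```python
import os
import tempfile

text = "\u4f60\u597d"  # 你好, sample text for the demo

# Create a scratch file so the demo does not touch real files.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # "Saving" encodes the in-memory str into bytes on disk (GBK here).
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # What actually sits on disk is a byte string.
    with open(path, "rb") as f:
        raw = f.read()
    print(raw)  # b'\xc4\xe3\xba\xc3'

    # "Opening" decodes those bytes back into Unicode text.
    decoded_ok = raw.decode("gbk")

    # Decoding with the wrong codec fails for these bytes.
    try:
        raw.decode("utf-8")
        wrong_codec_failed = False
    except UnicodeDecodeError:
        wrong_codec_failed = True
finally:
    os.remove(path)
```

An editor that guesses the wrong encoding on open hits the same problem, which shows up either as an error or as garbled characters.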

What does that have to do with running our program?

Let's start by establishing one idea: the Python interpreter is itself a piece of software, just like a text editor!

Now let's walk through the encoding steps a .py file goes through from creation to execution:

Open PyCharm, create a file named hello.py, and write:

    ret = 1 + 1
    s = '苑昊'
    print(s)

When we save, PyCharm writes hello.py to disk using its default encoding. When the file is closed and opened again, PyCharm reads the content and decodes it with that same default encoding, turning it into Unicode in memory, and we see our plain text.

If we click the Run button, or run the file from the command line, the Python interpreter is invoked. It opens the file and decodes the bytes on disk into Unicode, just as the editor does. The difference is that the interpreter then compiles that Unicode source into bytecode, and its virtual machine (written in C in CPython) executes the bytecode, ultimately as machine instructions that the operating system schedules on the CPU. That completes the whole process.
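The interpreter's steps described above (bytes, then Unicode text, then executable code) can be reproduced by hand with built-ins. This is a simplified sketch of the idea, not how CPython is implemented internally; the tiny script and the name `<demo>` are made up for the example.

```python
# Bytes as they might sit on disk for a tiny script (UTF-8 assumed).
source_bytes = "s = '\u4e2d\u6587'\nlength = len(s)\n".encode("utf-8")

# Step 1: decode the bytes into Unicode source text.
source_text = source_bytes.decode("utf-8")

# Step 2: compile the text into a code object (bytecode in CPython).
code_obj = compile(source_text, "<demo>", "exec")

# Step 3: execute the code object in a fresh namespace.
namespace = {}
exec(code_obj, namespace)

print(namespace["length"])  # 2
```

If step 1 used the wrong codec, the failure would happen before any of the program's own logic ran, which is exactly the situation the next paragraphs describe.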

So here is the question: our text editor has its own default encoding and decoding method; does the interpreter have one too?

Of course. Python 2 defaults to ASCII and Python 3 defaults to UTF-8, which can be checked like this:

    import sys
    print(sys.getdefaultencoding())

Do you remember this declaration?

    # coding:utf8

Yes: when the Python 2 interpreter executes a UTF-8 encoded file, it decodes the file as ASCII by default, so as soon as the program contains Chinese, decoding fails. Declaring # coding:utf8 at the beginning of the file (on the first or second line) tells the interpreter not to decode this file with the default encoding, but to use UTF-8 instead. The Python 3 interpreter is much more convenient because it decodes source files as UTF-8 by default.
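Both halves of that explanation can be checked from Python 3: decoding UTF-8 source bytes as ASCII fails, and the standard library's `tokenize.detect_encoding` reports which encoding a declaration selects. A sketch under assumptions: the helper name `declared_encoding` and the sample source lines are invented for this example, and the behaviour shown is Python 3's, used here to illustrate the rule.

```python
import codecs
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the normalized source encoding Python 3 would use for these bytes."""
    enc, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return codecs.lookup(enc).name  # normalize spellings such as "utf8"

utf8_source = "s = '\u4e2d\u6587'\n".encode("utf-8")

# Decoding UTF-8 bytes as ASCII is what goes wrong in Python 2
# when the file contains Chinese and has no declaration.
try:
    utf8_source.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

with_cookie = b"# coding:utf8\n" + utf8_source
print(declared_encoding(with_cookie))                    # utf-8
print(declared_encoding(b"# -*- coding: gb2312 -*-\n"))  # gb2312
print(declared_encoding(b"x = 1\n"))                     # utf-8 (Python 3 default)
```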

III. Transcoding

Need to know:

1. In Python 2 the default encoding is ASCII; in Python 3 it is UTF-8.

2. Unicode has several encoding forms: UTF-32 (4 bytes per character), UTF-16 (2 or 4 bytes) and UTF-8 (1 to 4 bytes), so UTF-8 is one encoding of Unicode.

3. In Python 3, encode not only transcodes but also turns the string (str) into bytes; decode turns bytes back into a string.
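Points 2 and 3 can be checked with a quick Python 3 sketch. The sample characters are arbitrary, and the little-endian codec variants are used only so that no byte-order mark inflates the counts.

```python
samples = {"ascii": "A", "cjk": "\u4e2d", "emoji": "\U0001F600"}

# Bytes per character under UTF-32, UTF-16 and UTF-8, in that order.
# The "-le" codec variants do not prepend a byte-order mark.
widths = {
    name: tuple(len(ch.encode(codec)) for codec in ("utf-32-le", "utf-16-le", "utf-8"))
    for name, ch in samples.items()
}
for name, w in widths.items():
    print(name, w)

# encode turns str into bytes; decode turns bytes back into str.
encoded = samples["cjk"].encode("utf-8")
decoded = encoded.decode("utf-8")
print(type(encoded).__name__, type(decoded).__name__)  # bytes str
```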

  

In Python 2:

    # -*- coding: utf-8 -*-
    __author__ = 'Alex Li'

    import sys
    print(sys.getdefaultencoding())

    # In Python 2 this literal is a byte string in the file's encoding (UTF-8).
    msg = "我爱北京天安门"
    # Decode to unicode first, then encode to the target encoding.
    msg_gb2312 = msg.decode("utf-8").encode("gb2312")
    gb2312_to_gbk = msg_gb2312.decode("gbk").encode("gbk")

    print(msg)
    print(msg_gb2312)
    print(gb2312_to_gbk)
In Python 3:

    # -*- coding: gb2312 -*-   # only valid if the file is saved as GB2312; this line can also be removed
    __author__ = 'Alex Li'

    import sys
    print(sys.getdefaultencoding())

    msg = "我爱北京天安门"
    # msg_gb2312 = msg.decode("utf-8").encode("gb2312")
    # str is already Unicode by default, so there is no need to decode first.
    msg_gb2312 = msg.encode("gb2312")
    gb2312_to_unicode = msg_gb2312.decode("gb2312")
    gb2312_to_utf8 = msg_gb2312.decode("gb2312").encode("utf-8")

    print(msg)
    print(msg_gb2312)
    print(gb2312_to_unicode)
    print(gb2312_to_utf8)
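Both examples follow the same pattern: decode the bytes into Unicode, then encode to the target. In Python 3 that pattern might be wrapped in a small helper like the sketch below. The function name `transcode` is my own and is not part of any library.

```python
def transcode(data, src, dst):
    """Re-encode bytes from codec `src` to codec `dst`, passing through str."""
    return data.decode(src).encode(dst)

text = "\u6211\u7231\u5317\u4eac\u5929\u5b89\u95e8"  # 我爱北京天安门
utf8_bytes = text.encode("utf-8")

gb2312_bytes = transcode(utf8_bytes, "utf-8", "gb2312")

# GBK is a superset of GB2312, so GB2312 bytes decode cleanly as GBK.
gbk_bytes = transcode(gb2312_bytes, "gbk", "gbk")

print(len(utf8_bytes), len(gb2312_bytes))  # 21 14
```

Unicode acts as the hub: there is no direct route from one byte encoding to another, so every conversion passes through str.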
