004 - Python Basics - Character Encoding and Transcoding

Last Update:2016-11-14 Source: Internet

Author: User


I. Three encodings

ASCII: a character encoding based on the Latin alphabet, designed mainly for modern English. It uses 7 bits per character, so it defines 2**7 = 128 characters (codes 0 to 127). Each character is stored in one byte (8 bits); the 8-bit "extended ASCII" code pages that add Western European letters use the spare bit to reach 2**8 = 256 values.
Unicode (also called the universal code or unified code): a character set that assigns every character and symbol a unique number, called a code point, in the range U+0000 to U+10FFFF. The early design used a fixed 16 bits (2 bytes) per character, which allows only 2**16 = 65536 values; that limit has since been lifted, and Unicode by itself does not fix how many bytes a character occupies.
UTF-8: a variable-length encoding of Unicode. Instead of a fixed 2 bytes per character, it sorts characters into groups: ASCII characters take 1 byte, most European and Middle Eastern letters take 2 bytes, most East Asian characters take 3 bytes, and characters outside the Basic Multilingual Plane (emoji, for example) take 4 bytes. Because it is variable-length, UTF-8 saves a lot of space when storing ASCII-heavy files, and it is compatible with ASCII.
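The byte counts above can be checked directly. The following is a minimal Python 3 sketch; the sample characters and the names samples and utf8_lengths are illustrative choices, not part of the original article.

```python
# Python 3: how many bytes UTF-8 needs for characters from different scripts.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Western European letter (e with acute accent)",
    "中": "common Chinese character",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

utf8_lengths = {}
for char, description in samples.items():
    encoded = char.encode("utf-8")
    utf8_lengths[char] = len(encoded)
    print(f"{char!r} ({description}): {len(encoded)} byte(s) -> {encoded!r}")

# ASCII text is byte-for-byte identical in ASCII and in UTF-8,
# which is what "compatible with ASCII" means in practice.
assert "hello".encode("ascii") == "hello".encode("utf-8")
```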

Detailed article:

http://www.cnblogs.com/yuanchenqi/articles/5956943.html (Python encoding, "ultimate edition")

In a word: Unicode is the scheme for representing text in memory (a specification), while UTF is a scheme for storing and transmitting Unicode (an implementation). That is the difference between UTF and Unicode.
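To make the distinction concrete, here is a small Python 3 sketch that shows one Unicode code point and three different UTF byte layouts for it. The character chosen and the variable names are assumptions made for illustration only.

```python
# One character, one Unicode code point, several possible byte encodings.
char = "中"
code_point = ord(char)              # the abstract Unicode number
print(f"U+{code_point:04X}")

# The "-be" variants are used so that no byte order mark is prepended.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded_forms = {name: char.encode(name) for name in encodings}
for name, data in encoded_forms.items():
    print(f"{name}: {data.hex()} ({len(data)} bytes)")

# Every encoded form decodes back to the same character.
assert all(data.decode(name) == char for name, data in encoded_forms.items())
```

The code point stays the same in every case; only the bytes used to store or transmit it differ.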


II. Encoding of files, from disk to memory

How is the data we save on disk actually stored?

The answer is as a byte string (bytes) encoded in some way. UTF-8, for example, is a variable-length encoding that saves space; there are also legacy encodings such as GBK. Every text editor therefore has a default encoding for saving files, such as UTF-8 or GBK. When we click Save, the editor silently does the encoding for us.

When we open the file again, the editor silently does the decoding: it turns the bytes back into Unicode so that the text can be shown to the user as readable characters.

So Unicode is the form closer to the user, and bytes are the form closer to the computer.
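The save and open steps can be imitated in a few lines of Python 3. This is only a sketch of what an editor does behind the scenes: the sample text, the temporary file, and the choice of GBK as the "editor default" are assumptions for illustration.

```python
import os
import tempfile

text = "你好, world"

# Reserve a temporary file path to play the role of the saved document.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    path = tmp.name

# "Save": the Unicode text is encoded into bytes (GBK in this example).
with open(path, "w", encoding="gbk") as f:
    f.write(text)

# What is really on disk: raw bytes, not characters.
with open(path, "rb") as f:
    raw = f.read()
print(raw)

# "Open": decoding with the same encoding restores the Unicode text.
with open(path, "r", encoding="gbk") as f:
    restored = f.read()
print(restored)

# Decoding the same bytes with a different encoding does not work.
try:
    raw.decode("utf-8")
    wrong_decoding_failed = False
except UnicodeDecodeError:
    wrong_decoding_failed = True

os.remove(path)
```

The last step is the source of most garbled-text problems: the bytes are fine, but they are read back with an encoding other than the one used to write them.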

What does that have to do with the execution of our program?

Let's start with one idea: the Python interpreter is itself a piece of software, just like a text editor.

Now let's trace the encoding steps a .py file goes through, from creation to execution.

Open PyCharm, create a file named hello.py, and write:

    ret = 1 + 1
    s = '苑昊'
    print(s)

When we save, hello.py is written to disk using PyCharm's default encoding. When the file is closed and opened again, PyCharm decodes the bytes it reads with that same default encoding and turns them into Unicode in memory, which is the readable text we see.

If we click the Run button, or run the file from the command line, the Python interpreter is invoked. It opens the file and decodes the bytes on disk into Unicode, just as the editor does. The difference is what happens next: the interpreter compiles the Unicode source into bytecode and executes that bytecode in its virtual machine (in CPython, a program written in C), which ultimately runs on the CPU through the operating system. That completes the process.
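The interpreter's steps can be imitated by hand with built-in functions. The sketch below is a simplification, not a description of CPython internals; the source text and the file name passed to compile() are made up for the example.

```python
# Bytes as they might sit on disk, assuming the editor saved the file as UTF-8.
source_bytes = "ret = 1 + 1\ns = '你好'\n".encode("utf-8")

# Step 1: decode the bytes into Unicode text, as the interpreter does on load.
source_text = source_bytes.decode("utf-8")

# Step 2: compile the Unicode source into a code object holding bytecode.
code_obj = compile(source_text, "hello.py", "exec")

# Step 3: execute the bytecode in a fresh namespace.
namespace = {}
exec(code_obj, namespace)
print(namespace["ret"], namespace["s"])
```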

So the question is: a text editor has its own default encoding and decoding; does the interpreter have one too?

It does. Python 2 defaults to ASCII and Python 3 defaults to UTF-8. You can check it like this:

    import sys
    print(sys.getdefaultencoding())

Do you remember this declaration?

    # coding:utf8

It is there because when the Python 2 interpreter runs a UTF-8 encoded file, it decodes the file with its default encoding, ASCII. As soon as the source contains Chinese, decoding fails. Declaring # coding:utf8 at the top of the file tells the interpreter not to use the default encoding for this file but to decode it as UTF-8. The Python 3 interpreter is more convenient because it already decodes source files as UTF-8 by default.
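The effect of the declaration can be observed from Python 3 with the standard tokenize module, which exposes the same detection rule the interpreter applies to source files. The two source snippets below are invented for the example, and the final step only imitates the Python 2 default by decoding as ASCII.

```python
import io
import tokenize

# A source file saved as GBK that declares its encoding on the first line.
declared = "# coding: gbk\ns = '你好'\n".encode("gbk")
# A source file saved as UTF-8 with no declaration at all.
undeclared = "s = '你好'\n".encode("utf-8")

declared_encoding, _ = tokenize.detect_encoding(io.BytesIO(declared).readline)
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(undeclared).readline)
print(declared_encoding)   # encoding taken from the declaration
print(default_encoding)    # Python 3 fallback when nothing is declared

# Imitating the Python 2 default: ASCII cannot decode Chinese text.
try:
    undeclared.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```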

III. Transcoding

Need to know:

1. In Python 2 the default encoding is ASCII; in Python 3 it is UTF-8.

2. Unicode can be encoded as UTF-32 (4 bytes per character), UTF-16 (2 or 4 bytes), or UTF-8 (1 to 4 bytes), so UTF-8 is one encoding of Unicode.

3. In Python 3, encode() converts a string (str) to the bytes type as it transcodes, and decode() converts bytes back to a string.
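Point 3 can be seen by looking at the types involved. A minimal Python 3 sketch, with a sample word chosen only for illustration:

```python
text = "编码"                      # str: a sequence of Unicode characters
data = text.encode("utf-8")        # bytes: the UTF-8 encoded form
roundtrip = data.decode("utf-8")   # back to str

print(type(text).__name__, type(data).__name__, type(roundtrip).__name__)
print(len(text), "characters,", len(data), "bytes")
```

Note that len() counts characters for str and bytes for bytes, so the two lengths differ for non-ASCII text.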

In Python 2:

    # -*- coding: utf-8 -*-
    __author__ = 'Alex Li'

    import sys
    print(sys.getdefaultencoding())

    # A Python 2 str holds bytes; here they are the UTF-8 bytes from the source file.
    msg = "我爱北京天安门"
    # Decode to unicode first, then encode to the target encoding.
    msg_gb2312 = msg.decode("utf-8").encode("gb2312")
    # GBK is a superset of GB2312, so GB2312 bytes can be decoded as GBK.
    gb2312_to_gbk = msg_gb2312.decode("gbk").encode("gbk")

    print(msg)
    print(msg_gb2312)
    print(gb2312_to_gbk)

In Python 3:

    # -*- coding: gb2312 -*-
    # The declaration above is only correct if the file really is saved as GB2312;
    # it can also be removed, in which case the file should be saved as UTF-8.
    __author__ = 'Alex Li'

    import sys
    print(sys.getdefaultencoding())

    msg = "我爱北京天安门"
    # msg_gb2312 = msg.decode("utf-8").encode("gb2312")
    # A Python 3 str is already Unicode, so no decode is needed before encoding.
    msg_gb2312 = msg.encode("gb2312")
    gb2312_to_unicode = msg_gb2312.decode("gb2312")
    gb2312_to_utf8 = msg_gb2312.decode("gb2312").encode("utf-8")

    print(msg)
    print(msg_gb2312)
    print(gb2312_to_unicode)
    print(gb2312_to_utf8)
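Both listings follow the same pattern: decode the bytes to Unicode with the source encoding, then encode to the target encoding. That pattern can be wrapped in a small helper. The function name transcode and the sample phrase are hypothetical, chosen only for this Python 3 sketch.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another, going through Unicode."""
    return data.decode(source).encode(target)


phrase = "我爱北京天安门"
utf8_bytes = phrase.encode("utf-8")

gb2312_bytes = transcode(utf8_bytes, "utf-8", "gb2312")
back_to_utf8 = transcode(gb2312_bytes, "gb2312", "utf-8")

# The same seven characters take a different number of bytes in each encoding.
print(len(utf8_bytes), len(gb2312_bytes))

# GBK is a superset of GB2312, so GB2312 bytes also decode correctly as GBK.
assert gb2312_bytes.decode("gbk") == phrase
```

There is no direct byte-to-byte shortcut between two encodings: Unicode is always the intermediate form.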
