Python Full-Stack Development, Part 6: Python Character Encoding

Last Update:2018-02-18 Source: Internet

Author: User

1. Both memory and the hard disk are used for storage.

Memory: fast access, but contents are lost when the power goes off

Hard disk: permanent storage

2. How a text editor accesses files (Notepad++, PyCharm, Word)

Opening an editor starts a process, and that process lives in memory, so whatever you type in the editor is also held in memory and is lost if the power goes out. To keep it, you have to save it to the hard disk: clicking Save (or pressing the shortcut key) flushes the data from memory to disk. At this point a .py file that has not been executed is no different from any other file: it is just a sequence of characters.

3. How the Python interpreter executes a .py file, for example python test.py

Stage one: The Python interpreter starts, which is comparable to launching a text editor.

Stage two: Acting like a text editor, the Python interpreter opens test.py and reads its contents from the hard disk into memory.

Stage three: The Python interpreter executes the code from test.py that was just loaded into memory. Only at this stage is Python syntax recognized; when execution reaches a string, a new memory space is allocated to hold that string.

Summary: similarities and differences between the Python interpreter and a text editor

Similarity: the Python interpreter interprets and executes the contents of a file, so it must be able to read a .py file, just as a text editor can.

Difference: a text editor reads the file into memory in order to display and edit it, whereas the Python interpreter reads the file into memory in order to execute it (recognizing Python syntax).
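The stages above can be imitated in a few lines of Python 3. This is only a rough sketch of the idea, not how the interpreter is actually implemented; the temporary test.py file and the variable names are assumptions made for illustration.

```python
import os
import tempfile

# Assumption: a throwaway test.py created just for this illustration.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "test.py")
with open(path, "w", encoding="utf-8") as f:
    f.write('x = "hello"\n')

# Stage two: like a text editor, read the file's bytes from disk into memory
# and decode them into characters.
with open(path, "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")

# Stage three: unlike a text editor, treat those characters as Python code
# and execute them.
namespace = {}
exec(compile(text, path, "exec"), namespace)
print(namespace["x"])  # hello
```

Up to the decode step, an editor and the interpreter do the same work; only the final exec call differs.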

4. What is character encoding?

A computer needs electricity to work, and electricity has only high and low levels (high is binary 1, low is binary 0), which means a computer only understands numbers. So how does a computer read human characters?

It has to go through a process:

Character ------ (translation process) ------ Number

This process is really a standard that specifies which number each character corresponds to, and that standard is called a character encoding.
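The character-to-number mapping can be observed directly in Python 3 with the built-in ord and chr functions. A minimal illustration:

```python
# Each character corresponds to a number, and each number back to a character.
for ch in ["A", "a", "0"]:
    print(ch, ord(ch))

number = ord("A")
print(number)       # 65
print(chr(number))  # A
print(bin(number))  # 0b1000001, the binary form the hardware works with
```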

5. Character encoding matters in the following two scenarios:

1. The contents of a Python file are made up of characters (when the Python file is not being executed)

2. The string data type in Python is made up of characters (when the Python file is being executed)

6. History of character encoding

Stage one: Modern computers originated in the United States, so the first encoding to appear, ASCII, was designed with only English in mind.

ASCII: one byte represents one character (English letters and all the other characters on the keyboard). 1 byte = 8 bits, and 8 bits can take values from 0 to 2**8-1, that is, 256 different values.

ASCII originally used only the low seven bits, giving 2**7 = 128 values (0 to 127), which was already enough to represent every character on the keyboard.

Later, to fit Latin characters into the table as well, the highest bit was put to use too.
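A small Python 3 sketch of the 7-bit limit. The choice of the letter é and of the latin-1 codec as an example of an 8-bit extension is an assumption made for illustration.

```python
# ASCII originally defined 2**7 = 128 code values (0 to 127).
ascii_size = 2 ** 7
print(ascii_size)  # 128

encoded = "A".encode("ascii")
print(list(encoded))  # [65]

# A Latin letter such as e-acute (U+00E9) lies outside 7-bit ASCII...
try:
    "\u00e9".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)  # False

# ...but fits in one byte once the highest bit is used (here: Latin-1).
latin1_byte = "\u00e9".encode("latin-1")[0]
print(latin1_byte, latin1_byte >= 128)  # 233 True
```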

Stage two: To support Chinese, China defined GBK.

GBK: 2 bytes represent one Chinese character. Other countries likewise defined their own encodings: Japan put Japanese into Shift_JIS, and South Korea put Korean into EUC-KR.

Stage three: Every country had its own standard, so conflicts were inevitable; text that mixed several languages would display as garbled characters.
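The conflict can be reproduced in Python 3 by decoding the same bytes under different national encodings. This is a sketch; the sample character 林 (U+6797) and the errors="replace" setting are choices made for illustration.

```python
# Bytes written by a program that assumes GBK.
raw = "\u6797".encode("gbk")

# The same bytes, interpreted under three different national standards.
as_gbk = raw.decode("gbk")
as_shift_jis = raw.decode("shift_jis", errors="replace")
as_euc_kr = raw.decode("euc_kr", errors="replace")

for name, text in [("gbk", as_gbk), ("shift_jis", as_shift_jis), ("euc_kr", as_euc_kr)]:
    # ascii() keeps the output printable on any terminal.
    print(name, ascii(text), text == "\u6797")
```

Only the decoder that matches the encoder recovers the original character; the others produce unrelated characters.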

This led to Unicode, which (in its original fixed-width form) uses 2 bytes for every character. 2**16 = 65536 values are enough for more than 60,000 characters, which made it compatible with the world's languages.

But for text that is entirely in English, this encoding doubles the storage space needed (the binary is ultimately stored on the medium as electrical or magnetic states).

This led to UTF-8, which uses only 1 byte for English characters and 3 bytes for Chinese characters.
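The byte counts mentioned above can be checked in Python 3. In this sketch the utf-16-le codec stands in for the fixed 2-byte Unicode form described in the text; that substitution is an assumption, and it only holds for characters that fit in 2 bytes.

```python
# An English letter and the Chinese character U+6797.
samples = ["a", "\u6797"]

sizes = {}
for ch in samples:
    sizes[ch] = {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16-le": len(ch.encode("utf-16-le")),
        "gbk": len(ch.encode("gbk")),
    }
    print(ascii(ch), sizes[ch])
```

The English letter takes 1 byte in UTF-8 but 2 in the fixed-width form, while the Chinese character takes 3 bytes in UTF-8 and 2 in both GBK and the fixed-width form.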

It should be emphasized that:

Unicode: simple and blunt. Every character takes 2 bytes; the advantage is fast conversion between characters and numbers, the disadvantage is that it takes up more space.

UTF-8: precise and variable-length. The advantage is that it saves space; the disadvantage is slower conversion, because each conversion has to work out how many bytes are needed to represent the character.

1. Memory uses Unicode, trading space for time (a program has to be loaded into memory to run, so memory should be as fast as possible).

2. The hard disk and network transmission use UTF-8, which keeps the amount of data small and the transfer reliable.

Every program eventually has to be loaded into memory. Programs saved on disk may use different encodings in different countries, but once in memory a single fixed encoding, Unicode, is used so that everything is compatible (this is why a computer can run programs from any country). You might object that UTF-8 would be just as compatible, and it would indeed work. The reason for preferring Unicode in memory is efficiency: Unicode uses a fixed 2 bytes per character, whereas UTF-8 has to compute each character's length. Unicode does waste more space; it is a trade of space for time. When data is stored to the hard disk or sent over the network, Unicode is converted to UTF-8, because data transfer values stability and efficiency, and the less data there is to transfer, the more reliable it is.

7. Character encoding conversion

Unicode ------> encode ------> UTF-8

UTF-8 ------> decode ------> Unicode
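In Python 3 the two directions correspond to str.encode and bytes.decode. A minimal round trip, using the character 林 (U+6797) as sample data:

```python
text = "\u6797"                  # a str, held in memory as Unicode
data = text.encode("utf-8")      # Unicode -> encode -> UTF-8 bytes
print(data)                      # b'\xe6\x9e\x97'

restored = data.decode("utf-8")  # UTF-8 bytes -> decode -> Unicode
print(restored == text)          # True
```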

Flushing a file from memory to the hard disk is called saving the file.

Reading a file from the hard disk into memory is called reading the file.

Garbled text: either the file was already garbled when it was saved, or the stored file is fine but it is garbled when read.

Summary:

Whatever the editor, to avoid garbled files (note that a file containing code is just an ordinary file; this refers to garbling when we open the file before it is ever executed), the core rule is: open a file with the same encoding it was saved in.
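The core rule can be demonstrated with an ordinary text file in Python 3. This is a sketch; the file name note.txt and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

# Assumption: a throwaway file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Save: flush the in-memory string to disk, encoded as GBK.
with open(path, "w", encoding="gbk") as f:
    f.write("\u6797")

# Open with the same encoding it was saved in: the text comes back intact.
with open(path, "r", encoding="gbk") as f:
    same = f.read()

# Open with a different encoding: the read fails or yields different text.
try:
    with open(path, "r", encoding="utf-8") as f:
        other = f.read()
except UnicodeDecodeError:
    other = None

print(same == "\u6797")   # True
print(other == "\u6797")  # False
```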

8. The Python interpreter as a text editor

The file test.py is saved in GBK format with the following contents:

x = '林'

Whether you run

python2 test.py

or

python3 test.py

an error is raised (because Python 2 defaults to ASCII and Python 3 defaults to UTF-8),

unless you specify #coding:gbk at the beginning of the file.
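The effect of the coding declaration can be reproduced inside Python 3 without creating a file, by compiling raw source bytes. This is a sketch under stated assumptions: the helper run_source is hypothetical, and the source bytes stand in for a GBK-encoded file that assigns one Chinese character to x.

```python
# Hypothetical stand-in for a GBK-encoded source file: one assignment.
body = "x = '\u6797'\n".encode("gbk")
with_header = b"# coding: gbk\n" + body


def run_source(source_bytes):
    """Compile and run raw source bytes; return x, or None if decoding fails."""
    namespace = {}
    try:
        exec(compile(source_bytes, "test.py", "exec"), namespace)
    except (SyntaxError, UnicodeDecodeError):
        return None
    return namespace["x"]


without_declaration = run_source(body)      # decoded with the UTF-8 default
with_declaration = run_source(with_header)  # decoded as GBK

print(without_declaration is None)
print(with_declaration == "\u6797")
```

Without the declaration the GBK bytes are not valid UTF-8, so the source is rejected; with it, the interpreter decodes the same bytes correctly.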

9. Program execution

python3 test.py or python2 test.py (the first step in executing test.py is always to read the file's contents into memory)

Stage one: Start the Python interpreter.

Stage two: The Python interpreter now acts as a text editor. It opens test.py and reads its contents from the hard disk into memory.

At this point the Python interpreter reads the first line of test.py, #coding:utf-8, to decide which encoding to use when reading the code into memory. This line sets the encoding the interpreter uses to decode the file; without it, Python 2 defaults to ASCII and Python 3 to UTF-8.

Stage three: The interpreter reads the code that has been loaded into memory (as Unicode-encoded binary) and executes it. New memory may be allocated during execution, for example for x = "egon".

Saying that memory uses Unicode does not mean that everything in memory is Unicode-encoded binary.

Before the program executes, what is in memory is indeed Unicode-encoded binary. For example, when the line x = "egon" is read from the file, the x, the equals sign, and the quotes all have the same status: they are ordinary characters, stored in memory as Unicode-encoded binary.

During execution, however, the program requests more memory (separate from the memory that holds the program's code), and that memory can hold data in any encoding. For example, x = "egon" is recognized by the Python interpreter as a string, so memory is allocated to hold "egon" and x is made to point to that address. The newly allocated memory also holds egon in Unicode form. If the code is changed to x = "egon".encode('utf-8'), the newly allocated memory holds egon as a UTF-8 encoded string instead.
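In Python 3 terms, the two cases at the end of this section produce objects of different types. A minimal sketch, reusing the name egon from the text; the second sample character is an addition made only to show a case where the byte count differs.

```python
x = "egon"                  # a str: Unicode text
y = "egon".encode("utf-8")  # bytes: the same text encoded as UTF-8

print(type(x).__name__, type(y).__name__)  # str bytes
print(y)                                   # b'egon'

# For a non-ASCII character the difference is visible in the length too.
z = "\u6797"
print(len(z), len(z.encode("utf-8")))      # 1 3
```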

10. The difference between Python 2 and Python 3

Python 2 has two string types: str and unicode

In Python 2, str is the encoded result, that is, bytes. So in Python 2 the result of encoding a unicode string is str/bytes.

#coding: utf-8
s = '林'  # at run time, '林' is stored in new memory as UTF-8 bytes, per coding:utf-8

print repr(s)  # '\xe6\x9e\x97' three bytes, proving it really is UTF-8
print type(s)  # <type 'str'>

s.decode('utf-8')
# s.encode('utf-8')  # raises an error: s is already encoded bytes, so it can only be decoded

When the Python interpreter executes code that creates a string (for example s = u'林'), it allocates new memory and stores '林' there in Unicode form, so s can only be encoded, not decoded.

s = u'林'
print repr(s)  # u'\u6797'
print type(s)  # <type 'unicode'>

# s.decode('utf-8')  # raises an error: s is unicode, so it can only be encoded
s.encode('utf-8')

Data in Unicode form is never garbled, however it is printed.

Strings in Python 3 and u'...' strings in Python 2 are both Unicode, so they are never garbled when printed.

Python 3 also has two string types: str and bytes

str is Unicode

#coding: utf-8
s = '林'  # at run time, even without the u prefix, '林' is stored in new memory as Unicode

# s can be encoded directly into any encoding
s.encode('utf-8')
s.encode('gbk')

print(type(s))  # <class 'str'>
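The division of labor between the two Python 3 types can be checked directly: str only has encode, and bytes only has decode. A short sketch, using the character 林 (U+6797) as sample data:

```python
s = "\u6797"

as_utf8 = s.encode("utf-8")
as_gbk = s.encode("gbk")
print(type(as_utf8).__name__, len(as_utf8), len(as_gbk))  # bytes 3 2

# str has no decode method, and bytes has no encode method.
print(hasattr(s, "decode"), hasattr(as_utf8, "encode"))   # False False

# Each byte string decodes back to the same str with its own codec.
print(as_utf8.decode("utf-8") == as_gbk.decode("gbk") == s)  # True
```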

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
