One. Background knowledge for understanding character encoding
1. How a text editor accesses a file (Notepad++, PyCharm, Word)
Opening an editor starts a process, and that process lives in memory, so content typed into the editor is also held in memory and is lost if the power goes out.
To keep it, you have to save it to the hard disk: clicking the Save button flushes the data from memory to the disk.
At this point, writing a .py file (without executing it) is no different from writing any other file: we are just writing a bunch of characters.
2. How the Python interpreter executes a .py file, for example python test.py
Stage one: The Python interpreter starts, which is comparable to launching a text editor.
Stage two: The interpreter acts like a text editor: it opens test.py and reads its contents from the hard disk into memory.
Stage three: The interpreter interprets and executes the code of test.py that was just loaded into memory.
Summary:
- The Python interpreter interprets and executes the contents of the file, so it must be able to read .py files, just as a text editor can.
- Unlike a text editor, the Python interpreter not only reads the contents of the file but also executes them.
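The three stages can be imitated in a few lines of Python 3. This is only a rough sketch: the file name test.py comes from the text, while the temporary directory and the variable names are assumptions made for illustration, and the real interpreter does considerably more than this.

```python
import os
import tempfile

# Write a tiny script to disk, the way an editor's Save button would.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "test.py")
with open(path, "w", encoding="utf-8") as f:
    f.write("result = 1 + 2\n")

# Stage two: read the file from disk into memory as plain characters.
with open(path, "r", encoding="utf-8") as f:
    source_text = f.read()

# Up to here this is all a text editor does: the content is just characters.
assert isinstance(source_text, str)

# Stage three: interpret the characters as Python code and execute them.
namespace = {}
exec(compile(source_text, path, "exec"), namespace)
print(namespace["result"])
```

Until compile() and exec() run, the content is handled exactly like any other text file.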
Two. What is character encoding
A computer must be powered to work: electricity drives it, and what matters about the electricity is its high or low voltage level (a high level is the binary digit 1, a low level is the binary digit 0). In other words, the computer only understands numbers.
The purpose of programming is to make the computer work, and the result of programming is just a bunch of characters. So what programming achieves is: a bunch of characters driving the computer to work.
So a process is required:
character --------(translation process)-------> number
This process follows a standard that specifies which number each character corresponds to, and that standard is called a character encoding.
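In Python 3 the character-to-number correspondence can be inspected with the built-in ord() and chr(). A minimal sketch (the sample characters are chosen only for illustration):

```python
# Each character corresponds to a number; show it in decimal and in binary.
for ch in "Aa0":
    print(ch, ord(ch), format(ord(ch), "08b"))

# The mapping works in both directions.
number = ord("A")
char = chr(number)
print(number, char)
```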
Three. The history of character encoding
Stage one: Modern computers originated in the United States, so the earliest encoding, ASCII, was designed with English in mind.
ASCII: one byte represents one character (English letters and all the other characters on the keyboard). 1 byte = 8 bits, and 8 bits can take 2**8 = 256 values (0 to 2**8-1), so they can represent 256 characters.
ASCII originally used only the low seven bits, 128 values (0 to 127), which was already enough to represent every character on the keyboard.
Later, to bring Latin characters into the table as well, the highest bit was also put to use.
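The 7-bit limit and the use of the highest bit can be checked directly. The sketch below assumes Latin-1 as the example of an 8-bit extension; the text above does not name a specific one.

```python
# Every ASCII character fits in the low 7 bits (values 0 to 127).
ascii_bytes = "Hello!".encode("ascii")
all_seven_bit = all(b < 128 for b in ascii_bytes)

# A Latin letter such as 'é' is outside ASCII ...
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ... but fits in one byte once the highest bit is used (Latin-1 assumed here).
latin1_byte = "é".encode("latin-1")
print(all_seven_bit, ascii_ok, latin1_byte)
```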
Stage two: To support Chinese, the Chinese defined GBK.
GBK: 2 bytes represent one Chinese character (ASCII characters still take 1 byte).
To support their own languages, other countries defined their own encodings too:
Japan encoded Japanese in Shift_JIS, and South Korea encoded Korean in EUC-KR.
Stage three: Each country has its own standard, so conflicts are inevitable, and the result is that text mixing several languages is displayed garbled.
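The conflict is easy to reproduce: bytes written under one national standard mean something else under another. A small sketch, using the character 林 and the gbk and shift_jis codecs that ship with Python (the variable names are illustrative):

```python
text = "林"

# Bytes produced by the Chinese standard.
gbk_bytes = text.encode("gbk")

# Reading them with the Japanese code table gives the wrong text.
# errors="replace" keeps the sketch from raising on invalid byte sequences.
wrong = gbk_bytes.decode("shift_jis", errors="replace")

# Reading them with the matching code table gives the original back.
right = gbk_bytes.decode("gbk")

# ascii() is used so the output is safe on any terminal encoding.
print(gbk_bytes, ascii(wrong), ascii(right))
```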
This led to Unicode, which in its original design used a uniform 2 bytes per character: 2**16 = 65536 values, enough for more than 60,000 characters and therefore for the world's major languages.
But for text that is entirely English, this encoding doubles the storage space (binary data is ultimately stored on the medium as electricity or magnetism).
This led to UTF-8, which uses only 1 byte for English characters and 3 bytes for most Chinese characters.
One thing to emphasize:
Unicode (in the fixed 2-byte form described here): simple and blunt, every character is 2 bytes. The advantage is fast character-to-number conversion; the disadvantage is that it takes up a lot of space.
UTF-8: precise, using different lengths for different characters. The advantage is that it saves space; the disadvantage is slower character-to-number conversion, because each time it has to work out how many bytes the character needs.
- In memory the encoding used is Unicode, trading space for time (a program must be loaded into memory to run, so memory access should be as fast as possible).
- On the hard disk and in network transmission UTF-8 is used: network or disk I/O latency is far greater than the UTF-8 conversion cost, and I/O should save as much bandwidth as possible to keep data transmission stable.
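The space trade-off can be measured. In the sketch below, utf-16-le stands in for the fixed 2-bytes-per-character form described above; that choice, like the sample strings, is an assumption for illustration and says nothing about how a particular interpreter lays out strings internally.

```python
english = "hello world"
chinese = "你好"

# Fixed 2 bytes per character versus variable-length UTF-8.
sizes = {
    "english utf-16-le": len(english.encode("utf-16-le")),
    "english utf-8": len(english.encode("utf-8")),
    "chinese utf-16-le": len(chinese.encode("utf-16-le")),
    "chinese utf-8": len(chinese.encode("utf-8")),
}
for name, size in sizes.items():
    print(name, size)

# Text in memory is encoded to UTF-8 for disk or network,
# and decoded back to text when it is read in again.
on_disk = chinese.encode("utf-8")
back_in_memory = on_disk.decode("utf-8")
```

For English text UTF-8 halves the size; for Chinese text it is somewhat larger than the 2-byte form, which is the price of the variable-length scheme.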
Four. Character encoding classification
Computers were invented by Americans, so the earliest character encoding was ASCII, which only defines the correspondence between numbers and English letters, digits, and some special characters.
ASCII uses 1 byte (8 binary bits) to represent one character.
Unicode commonly uses 2 bytes (16 binary bits) to represent one character; uncommon characters need 4 bytes.
If all of our documents are in English, Unicode takes twice as much space as ASCII, which is wasteful for storage and transmission.
In the spirit of economy, the "variable-length" UTF-8 encoding appeared as a way of converting Unicode. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its number (the original design allowed up to 6 bytes): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very uncommon characters take 4 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 saves space:
character | ASCII | Unicode | UTF-8
A | 01000001 | 00000000 01000001 | 01000001
中 | x | 01001110 00101101 | 11100100 10111000 10101101
It can also be seen from the table above that UTF-8 has an added benefit: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can keep working under UTF-8.
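The bit patterns in the table can be reproduced in Python 3. In this sketch the helper name bits is made up for illustration, and utf-16-be is used as the 2-byte Unicode form shown in the table.

```python
def bits(data):
    """Return the bytes as space-separated 8-bit binary strings."""
    return " ".join(format(b, "08b") for b in data)

for ch in ("A", "中"):
    fixed = bits(ch.encode("utf-16-be"))  # 2-byte form used in the table
    utf8 = bits(ch.encode("utf-8"))
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(ch), "|", fixed, "|", utf8)

# ASCII text is byte-for-byte identical in UTF-8.
same = "ASCII text".encode("ascii") == "ASCII text".encode("utf-8")
print(same)
```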
Five. Using character encoding
5.1 Text editors in general
5.1.2 Text editor: Notepad++
Summary:
Whatever the editor, to avoid garbled files (note that a file holding a piece of code is just an ordinary file; the garbling meant here is what appears when we open the file, before it is ever executed),
the core rule is: a file must be opened with the same encoding it was saved in.
The file is saved with the encoding shown in the lower right corner of the editor, whereas it is decoded with the encoding declared at the top of the document; when the two differ, garbled text is very likely.
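The core rule can be tried out with open() in Python 3. A minimal sketch, in which the file name note.txt and the temporary directory are assumptions for illustration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Save the file as GBK.
with open(path, "w", encoding="gbk") as f:
    f.write("林")

# Opening it with the same encoding gives back the original text.
with open(path, "r", encoding="gbk") as f:
    same_encoding = f.read()

# Opening it with a different encoding fails (or, for other byte
# sequences, silently yields garbled text).
try:
    with open(path, "r", encoding="utf-8") as f:
        other_encoding = f.read()
except UnicodeDecodeError as exc:
    other_encoding = None
    print("cannot decode as UTF-8:", exc.reason)
```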
5.2 Execution of the program
python test.py (to stress it again: the first step in executing test.py is always to read the contents of the file into memory)
Stage one: Start the Python interpreter.
Stage two: The Python interpreter now acts as a text editor and opens test.py, reading its contents from the hard disk into memory.
At this point the interpreter reads the first line of test.py, #coding:utf-8, to decide which encoding to use when reading the file into memory (the declaration must be on the first or second line). This line sets the encoding that the interpreter uses for this source file.
The interpreter's default can be viewed with sys.getdefaultencoding(); if the header #-*-coding:utf-8-*- is not specified in the Python file, the default is used.
The default is ASCII in Python 2 and UTF-8 in Python 3.
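In Python 3, the standard library can report which encoding would be picked for a given source file. The sketch below uses tokenize.detect_encoding; the helper name declared_encoding and the sample sources are made up for illustration.

```python
import io
import sys
import tokenize

# The interpreter-wide default string encoding in Python 3.
print(sys.getdefaultencoding())

def declared_encoding(source_bytes):
    """Return the source encoding detected for these raw file bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_header = b"# -*- coding: gbk -*-\nx = 1\n"
without_header = b"x = 1\n"
print(declared_encoding(with_header))
print(declared_encoding(without_header))
```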
Stage three: The interpreter reads the code that has been loaded into memory (Unicode-encoded binary) and executes it. During execution it may allocate new memory, for example for x="Egon".
That memory uses Unicode does not mean that everything in memory is Unicode-encoded binary.
Before the program executes, memory does hold Unicode-encoded binary: for a line x="Egon" read from the file, the x, the equals sign, and the quotation marks all have the same status. They are ordinary characters, stored in memory as Unicode-encoded binary.
During execution, however, the program allocates memory (separate from the memory that holds the program's code), and that memory can hold data in any encoding. For x="Egon", the interpreter recognizes a string, allocates memory to hold "Egon", and makes x point to that address; the newly allocated memory also holds Unicode-encoded Egon. If the code is changed to x="Egon".encode('utf-8'), the newly allocated memory holds the UTF-8 encoded string Egon.
The description above is for Python 3.
When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser.
If the server encodes with UTF-8, what the client receives in memory is also UTF-8 encoded binary.
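Both points can be seen in Python 3. In this sketch the "server" and "client" are just two variables in one process, an assumption made to keep the example self-contained; no real network is involved.

```python
x = "Egon"                  # text (Unicode) held in memory
y = "Egon".encode("utf-8")  # UTF-8 encoded bytes held in memory
print(type(x), type(y))

# Hypothetical server side: encode the generated text before sending it.
page = "<p>你好, Egon</p>"
payload = page.encode("utf-8")

# Hypothetical client side: what arrives is UTF-8 bytes, which are
# decoded with the same encoding to get the text back.
received = payload.decode("utf-8")
print(type(payload), received == page)
```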
5.3 The difference between Python 2 and Python 3
5.3.1 Python 2 has two string types: str and unicode
str type
When the Python interpreter executes code that produces a string (for example s='林'), it allocates new memory and stores '林' there, encoded in the format declared at the top of the file. The value is already the result of an encode, so s can only be decoded.
    #!/usr/bin/env python
    # -*- coding: gbk -*-
    # Python 2 only; the file itself must be saved as GBK.

    x = '林'
    # print x.encode('gbk')  # raises an error: x is already encoded
    print x.decode('gbk')    # result: 林
So the important point is:
In Python 2, str is the encoded result, that is, bytes (str == bytes). So in Python 2, the result of encoding a unicode string is str/bytes.
    # coding: utf-8
    s = '林'             # at run time '林' is stored in new memory in the declared coding: utf-8 form
    print repr(s)        # '\xe6\x9e\x97', three bytes, proving it really is UTF-8
    print type(s)        # <type 'str'>
    s.decode('utf-8')
    # s.encode('utf-8')  # raises an error: s is already encoded bytes, so it can only be decoded
unicode type
When the Python interpreter executes code that produces a string (for example s=u'林'), it allocates new memory and stores '林' there in Unicode form, so s can only be encoded and cannot be decoded.
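The Python 2 snippets in this section do not run on Python 3, but the same split can be observed there if one assumes the usual correspondence: Python 2 str behaves like Python 3 bytes, and Python 2 unicode behaves like Python 3 str. A small sketch under that assumption:

```python
text = "林"                  # plays the role of u'林' in Python 2
data = text.encode("utf-8")  # plays the role of '林' in a UTF-8 Python 2 file

# Text can only be encoded; encoded bytes can only be decoded.
print(hasattr(text, "encode"), hasattr(text, "decode"))
print(hasattr(data, "encode"), hasattr(data, "decode"))

# The encoded form is three bytes, and decoding restores the text.
print(data)
print(data.decode("utf-8") == text)
```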
Printing to the terminal
A special note about print:
When the program executes, for example
    x = '林'
    print(x)
this step prints the new memory that x points to (not the memory holding the code) to the terminal. The terminal itself also runs in memory, so this printing can be understood as going from memory to memory, that is, unicode -> unicode.
Data in Unicode form is not garbled however it is printed, provided the terminal can display the characters.
Strings in Python 3 and u'string' in Python 2 are both Unicode, so printing them is not garbled.
In PyCharm
In the Windows terminal
However, Python 2 has another, non-Unicode string type. For it, print x hands the bytes to the terminal, which interprets them according to its own encoding, in effect x.decode('terminal encoding'), turning them into Unicode before display. When the terminal encoding differs from the encoding declared at the top of the file, garbled output results.
In PyCharm (terminal encoding UTF-8, file encoding UTF-8: not garbled)
In the Windows terminal (terminal encoding GBK, file encoding UTF-8: garbled)
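The two terminal cases can be simulated in Python 3 without changing any terminal settings. This is only a sketch: decoding the bytes by hand stands in for what the terminal does, and the variable names are illustrative.

```python
import sys

# The encoding this process uses for standard output, if it reports one.
print(getattr(sys.stdout, "encoding", None))

# What a Python 2 str would hold for '林' in a file declared as UTF-8.
raw = "林".encode("utf-8")

# A terminal that uses UTF-8 recovers the original character.
utf8_terminal = raw.decode("utf-8")

# A terminal that uses GBK interprets the same bytes differently;
# errors="replace" keeps the sketch from raising on invalid sequences.
gbk_terminal = raw.decode("gbk", errors="replace")

# ascii() keeps the output printable whatever the real terminal uses.
print(ascii(utf8_terminal), ascii(gbk_terminal))
```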
Study questions
Verify the following print results in PyCharm and in cmd, respectively
    # coding: utf-8
    s = u'林'       # at run time '林' is stored in new memory in Unicode form
    # s points to Unicode, so it can be encoded into any format without an encode error
    s1 = s.encode('utf-8')
    s2 = s.encode('gbk')
    print s1        # does it print correctly?
    print s2        # does it print correctly?
    print repr(s)   # u'\u6797'
    print repr(s1)  # '\xe6\x9e\x97', UTF-8 encodes a Chinese character in 3 bytes
    print repr(s2)  # '\xc1\xd6', GBK encodes a Chinese character in 2 bytes
    print type(s)   # <type 'unicode'>
    print type(s1)  # <type 'str'>
    print type(s2)  # <type 'str'>
5.3.2 Python 3 also has two string types: str and bytes
str is Unicode
    # coding: utf-8
    s = '林'  # at run time, with no u prefix needed, '林' is stored in new memory in Unicode form
    # s can be encoded directly into any encoding format
    s.encode('utf-8')
    s.encode('gbk')
    print(type(s))  # <class 'str'>
bytes is bytes
    # coding: utf-8
    s = '林'  # at run time, with no u prefix needed, '林' is stored in new memory in Unicode form
    # s can be encoded directly into any encoding format
    s1 = s.encode('utf-8')
    s2 = s.encode('gbk')
    print(s)         # 林
    print(s1)        # b'\xe6\x9e\x97', in Python 3 bytes are printed exactly as they are
    print(s2)        # b'\xc1\xd6', same as above
    print(type(s))   # <class 'str'>
    print(type(s1))  # <class 'bytes'>
    print(type(s2))  # <class 'bytes'>