One. Background knowledge for understanding character encoding
1. How a text editor accesses a file (Notepad++, PyCharm, Word)
Opening an editor launches a process, which lives in memory, so content written in the editor is also stored in memory and is lost after a power outage.
That is why you need to save to the hard disk: clicking the Save button flushes the data from memory to the hard disk.
At this point, writing a py file (without executing it) is no different from writing any other file; we are just writing a bunch of characters.
2. How the Python interpreter executes a py file, e.g. python test.py
Stage one: the Python interpreter starts, which is equivalent to launching a text editor
Stage two: the Python interpreter, acting like a text editor, opens the test.py file and reads its contents from the hard disk into memory
Stage three: the Python interpreter interprets and executes the test.py code that was just loaded into memory
Summary:
- The Python interpreter interprets and executes a file's contents, so like a text editor it must be able to read the py file
- Unlike a text editor, the Python interpreter can not only read the contents of the file, but also execute them
Two. What is character encoding
A computer must be powered on to work; that is, electricity drives the computer, and 'electricity' means high and low voltage levels (a high level is the binary digit 1, a low level is the binary digit 0). In other words, the computer only understands numbers.
The purpose of programming is to make the computer work, yet the product of programming is simply a bunch of characters. In other words, what we want programming to achieve is: a bunch of characters driving a computer to work.
So there must be a translation process:
character --------(translation)--------> number
The standard that defines which number each character corresponds to is what we call a character encoding.
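As a quick illustration (a minimal Python 3 sketch, not part of the original text), the built-ins ord and chr expose exactly this character-to-number correspondence:

print(ord('A'))   # 65     the number the ASCII/Unicode standard assigns to 'A'
print(chr(65))    # 'A'    and the reverse direction, number to character
print(ord('中'))  # 20013  the Unicode code point of the Chinese character 中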
Three. The history of character encoding
Stage one: modern computers originated in the United States, so the earliest encoding to be born, ASCII, was designed around English
ASCII: one byte represents one character (English letters and all the other characters on the keyboard); 1 byte = 8 bits, and 8 bits can represent 2**8 = 256 variations, i.e. 256 characters
ASCII originally used only the low seven bits, 2**7 = 128 values, which was already enough to represent all the characters on the keyboard (English letters plus all the other keyboard characters)
Later, in order to fit Latin characters into the ASCII table as well, the highest bit was used too
Stage two: to support Chinese, China defined its own GBK
GBK: 2 bytes represent one character
To support other countries' languages, each country went on to define its own encoding: Japan put Japanese into Shift_JIS, and South Korea put Korean into EUC-KR
Stage three: with every country having its own national standard, conflicts were inevitable; the result is that text mixing multiple languages displays as mojibake (garbled characters)
Hence Unicode was born: a unified 2 bytes per character, giving 2**16 = 65536 values, enough to represent more than 60,000 characters and therefore compatible with every language
But for text that is entirely English, this encoding doubles the storage space (binary data is ultimately stored in the storage medium in electrical or magnetic form)
Thus UTF-8 was produced: an English character takes only 1 byte, a Chinese character 3 bytes
One thing to emphasize:
Unicode: simple and crude, every character is 2 bytes; the advantage is fast character-to-number conversion, the disadvantage is wasted space
UTF-8: precise, using different lengths for different characters; the advantage is saved space, the disadvantage is slower character-to-number conversion, because every time it must be calculated how many bytes a character needs in order to be represented accurately
- The encoding used in memory is Unicode, trading space for time (a program has to be loaded into memory to run, so memory access should be as fast as possible)
- For hard-disk storage and network transmission, UTF-8 is used: network I/O or disk I/O latency is far larger than the UTF-8 conversion delay, and I/O should save as much bandwidth as possible to keep data transmission stable (see the sketch below)
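To make the space trade-off concrete, here is a small Python 3 sketch (the sample strings are made up; UTF-16 stands in for the fixed 2-bytes-per-character scheme described above):

english = 'hello' * 100
chinese = '你好' * 100
print(len(english.encode('utf-8')))   # 500  bytes: 1 byte per English letter
print(len(english.encode('utf-16')))  # 1002 bytes: 2 bytes per character, plus a 2-byte BOM
print(len(chinese.encode('utf-8')))   # 600  bytes: 3 bytes per Chinese character
print(len(chinese.encode('utf-16')))  # 402  bytes: 2 bytes per character, plus the BOM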
Four. Character encoding classification (easy to understand)
The computer was invented by Americans, so the earliest character encoding was ASCII, which only defined correspondences between numbers and English letters, digits, and some special characters. It uses at most 8 bits (one byte), i.e. 2**8 = 256, so ASCII can represent at most 256 symbols.
Of course, our programming languages have no problem with English; ASCII is enough for them. But when processing data, different countries use different languages: the Japanese add Japanese to their programs, the Chinese add Chinese.
And to represent Chinese, a single byte is simply not enough (even an elementary-school student knows more than 2,000 Chinese characters). There is only one solution: use more than 8 binary bits per character; the more bits, the more variations they can represent, and the more Chinese characters can be expressed.
So the Chinese defined their own standard, the GB2312 encoding, which specifies correspondences for characters including Chinese.
The Japanese defined their own Shift_JIS encoding.
The Koreans defined their own EUC-KR encoding (on top of that, the Koreans claim the computer was invented by them and demand that the world unify on the Korean encoding).
At this point the problem appears: when Xiao Zhou, fluent in 18 languages, modestly writes a document in just 8 of them, which country's standard should that document follow? Whichever is chosen, garbling appears (because each standard at this point covers only its own country's text, i.e. the correspondences between its own characters and numbers; if one country's encoding format is used, the remaining languages will come out garbled when the text is parsed).
So a world standard that could contain all the world's languages was urgently needed, and thus Unicode was born (the Koreans said no, but nobody cared).
ASCII uses 1 byte (8 binary bits) to represent one character
Unicode commonly uses 2 bytes (16 binary bits) to represent one character; rare characters need 4 bytes
Examples:
The letter x in ASCII is decimal 120, binary 0111 1000.
The Chinese character 中 is beyond the ASCII encoding range; its Unicode encoding is decimal 20013, binary 01001110 00101101.
The letter x in Unicode is binary 0000 0000 0111 1000, so Unicode is compatible with ASCII as well as with all nations' encodings; it is the world standard.
At this point the garbling problem disappears from all our documents, but a new problem appears: if a document is entirely in English, Unicode takes twice the space of ASCII, which is very wasteful for storage and transmission.
In the spirit of saving space, the 'variable-length' UTF-8 encoding appeared, converting from Unicode. UTF-8 encodes a Unicode character into 1 to 6 bytes depending on its numeric size: common English letters are encoded in 1 byte, Chinese characters usually in 3 bytes, and only very rare characters are encoded into 4 to 6 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 encoding saves space:
character | ASCII    | Unicode           | UTF-8
A         | 01000001 | 00000000 01000001 | 01000001
中        | (none)   | 01001110 00101101 | 11100100 10111000 10101101
From the table above you can also see an added benefit of UTF-8: ASCII encoding can in fact be regarded as a subset of UTF-8 encoding, so the large amount of legacy software that only supports ASCII encoding can continue to work under UTF-8 encoding.
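The table can be reproduced with a short Python 3 sketch (assuming a UTF-8-capable terminal):

print('{:08b}'.format(ord('A')))    # 01000001          the ASCII/Unicode number of 'A'
print(' '.join('{:08b}'.format(b) for b in 'A'.encode('utf-8')))   # 01000001  the very same byte: ASCII is a subset of UTF-8
print('{:016b}'.format(ord('中')))  # 0100111000101101  the Unicode code point of 中
print(' '.join('{:08b}'.format(b) for b in '中'.encode('utf-8')))  # 11100100 10111000 10101101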
Five. Character encoding in practice
5.1 Text editors (one story covers them all)
5.1.2 Text editor Notepad++
Let's analyze the process: what exactly is garbled text (mojibake)?
Flushing a file from memory to the hard disk is called, for short, saving the file.
Reading a file from the hard disk into memory is called, for short, reading the file.
Garbling case one: the file is garbled at save time
When saving the file, the document contains text from several countries, yet suppose we save it as Shift_JIS.
Essentially this can be tested by writing with the open function (the sample string is illustrative):

f = open('a.txt', 'w', encoding='shift_jis')
f.write('你瞅啥て\n')  # '你瞅啥' has no correspondence in Shift_JIS, so saving it fails; only 'て\n' can be saved successfully

Because Shift_JIS lacks the correspondences, storing the other country's text fails outright. But when we save with a text editor, the editor does the conversion for us, making sure the Chinese also gets stored under Shift_JIS (stored by brute force, it is bound to come out garbled). This means the file is already garbled at the save stage.
In this case, when we open the file as Shift_JIS, the Japanese displays normally while the Chinese is garbled.
Or, when saving the file like this (the sample strings are illustrative):

f = open('a.txt', 'wb')
f.write('何を見て\n'.encode('shift_jis'))  # Japanese, written as Shift_JIS bytes
f.write('你愁啥\n'.encode('gbk'))          # Chinese, written as GBK bytes
f.write('你愁什么\n'.encode('utf-8'))      # Chinese, written as UTF-8 bytes
f.close()

Whichever single encoding you open a.txt with, the lines in the remaining two encodings will not display correctly.
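To see why, here is a minimal Python 3 sketch that reads the mixed-encoding file from the previous example back as one codec (errors='replace' substitutes a marker for undecodable bytes):

with open('a.txt', 'rb') as f:
    data = f.read()
# only the UTF-8 line survives intact; the Shift_JIS and GBK lines come out mangled
print(data.decode('utf-8', errors='replace'))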
Garbling case two: the file is saved without garbling but comes out garbled when read
If the file is saved with UTF-8 encoding, it is compatible with all languages and nothing is garbled. But if the wrong decoding method is used when reading the file, e.g. GBK, then the garbling happens at the read stage. Garbling at the read stage can be fixed: just choose the correct decoding method. A file garbled at the save stage, however, is a form of data corruption.
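A minimal Python 3 sketch of this case (b.txt and the sample string are assumptions for illustration):

# save with utf-8: nothing is garbled on disk
with open('b.txt', 'w', encoding='utf-8') as f:
    f.write('你好\n')

# wrong decoding method: the UTF-8 bytes are reinterpreted as GBK -> mojibake
print(open('b.txt', 'r', encoding='gbk').read())    # something like 浣犲ソ

# correct decoding method: no garbling, because the data was never corrupted
print(open('b.txt', 'r', encoding='utf-8').read())  # 你好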
5.1.3 Text editor PyCharm
Save in GBK format
Open in UTF-8 format
Analyze the process.
Summary:
No matter which editor you use, to prevent garbled files (note: a file storing a piece of code is just an ordinary file; here we mean garbling when we open the file, before it is ever executed),
the core rule is: open the file with whatever encoding it was saved in.
5.2 Executing a program
python test.py (to emphasize again: the first step of executing test.py must be to read the file's contents into memory)
Stage one: the Python interpreter starts
Stage two: the Python interpreter, acting as a text editor, opens test.py and reads its contents from the hard disk into memory
At this point, the Python interpreter reads the first line of test.py, #coding: utf-8, to decide which encoding format to use when reading the file into memory; this line sets the encoding that the Python interpreter (itself just a piece of software) uses for this source code.
It can be inspected with sys.getdefaultencoding(); if you do not put the header #-*-coding:utf-8-*- in the Python file, the default is used:
Python 2 defaults to ASCII, Python 3 to UTF-8.
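This can be checked directly (output shown for a stock interpreter):

import sys
print(sys.getdefaultencoding())  # 'utf-8' on Python 3, 'ascii' on Python 2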
Stage three: the code that was just loaded into memory (as Unicode-encoded binary) is read and executed; during execution, new memory space may be allocated, e.g. for x = 'egon'
Saying that memory uses the Unicode encoding does not mean all of memory holds Unicode-encoded binary:
Before the program executes, memory does indeed hold Unicode-encoded binary. For example, when the line x = 'egon' is read from the file, the x, the equals sign, and the quotes all have the same status: they are all ordinary characters, stored in memory in Unicode-encoded binary form.
However, during execution the program allocates additional memory (separate from the memory holding the program code), which can store data in any encoding format. For example, x = 'egon' is recognized by the Python interpreter as a string; memory space is requested to hold 'egon', and x is made to point to that memory address. The newly allocated memory holds the Unicode-encoded 'egon'. If the code were instead x = 'egon'.encode('utf-8'), the newly allocated memory space would hold the UTF-8-encoded bytes of 'egon'.
In Python 3, for example:
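A minimal sketch of the two newly allocated memory spaces just described (variable names are illustrative):

x = 'egon'                  # a str: its space holds the Unicode-encoded 'egon'
y = 'egon'.encode('utf-8')  # a bytes: its space holds the UTF-8-encoded 'egon'
print(type(x), x)           # <class 'str'> egon
print(type(y), y)           # <class 'bytes'> b'egon'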
When you browse the web, the server converts dynamically generated Unicode content to UTF-8 and then transmits it to the browser.
If the server side encodes with UTF-8, what the client receives in memory is likewise UTF-8-encoded binary.
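A heavily simplified sketch of that hand-off (no real web framework; the page string is a made-up example):

page = '<h1>你好, egon</h1>'  # dynamically generated content: a str, Unicode in memory
body = page.encode('utf-8')   # the bytes that actually travel over the network
# the browser receives these UTF-8 bytes and decodes them (per the declared charset) for display
print(body)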
5.3 The differences between Python 2 and Python 3
5.3.1 Python 2 has two string types: str and unicode
The str type
When the Python interpreter executes code that produces a string (e.g. s = '林'), it requests a new memory address and then stores '林' encoded in the format specified at the top of the file. This is already the result of an encode, so s can only be decoded:

# -*- coding: gbk -*-
#!/usr/bin/env python

x = '林'
# print x.encode('gbk')  # raises an error
print x.decode('gbk')    # result: 林
So the important point is:
In Python 2, str is the encoded result, i.e. bytes; str = bytes. So in Python 2, the result of encoding Unicode characters is of type str/bytes:

#coding: utf-8
s = '林'  # at execution time, '林' is saved to the new memory space in its coding: utf-8 form

print repr(s)  # '\xe6\x9e\x97'  three bytes, proving it really is utf-8
print type(s)  # <type 'str'>

s.decode('utf-8')
# s.encode('utf-8')  # raises an error: s is the encoded result (bytes), so it can only be decoded
The unicode type
When the Python interpreter executes code that produces a string with a u prefix (e.g. s = u'林'), it requests a new memory address and then stores '林' in Unicode form in the new memory space, so s can only be encoded and cannot be decoded.
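A sketch in Python 2 syntax, mirroring the str example above:

#coding: utf-8
s = u'林'            # stored in Unicode form in the new memory space

print repr(s)        # u'\u6797'
print type(s)        # <type 'unicode'>

s.encode('utf-8')    # fine: Unicode can be encoded into any format
# s.decode('utf-8')  # raises an error: s is already Unicode, it can only be encoded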
Printing to the terminal
A special note about print:
When the program executes, e.g.
x = '林'
print(x)  # this step prints the new memory space that x points to (not the memory space where the code itself resides) to the terminal; the terminal also runs in memory, so this printing can be understood as memory-to-memory printing, i.e. memory -> memory, unicode -> unicode
For data in Unicode format, no matter how it is printed, it will not be garbled.
Strings in Python 3, like u'...' strings in Python 2, are all Unicode, so they print without garbling no matter what.
In PyCharm
In the Windows terminal
However, Python 2 also has the other string type, which is not Unicode. In that case, print x executes x.decode('the terminal encoding') according to the terminal's encoding, obtains Unicode, and then prints it. When the terminal encoding is inconsistent with the encoding specified at the top of the file, garbling is produced.
In PyCharm (terminal encoding utf-8, file encoding utf-8: no garbling)
In the Windows terminal (terminal encoding GBK, file encoding utf-8: garbling is produced)
Study questions
Verify the following print results in PyCharm and cmd respectively:

#coding: utf-8
s = u'林'  # when the program executes, '林' is stored in Unicode form in the new memory space

# s points to Unicode, so it can be encoded into any format without raising an encode error
s1 = s.encode('utf-8')
s2 = s.encode('gbk')

print s1  # prints normally?
print s2  # prints normally?

print repr(s)   # u'\u6797'
print repr(s1)  # '\xe6\x9e\x97'  utf-8 encodes one Chinese character with 3 bytes
print repr(s2)  # '\xc1\xd6'      gbk encodes one Chinese character with 2 bytes

print type(s)   # <type 'unicode'>
print type(s1)  # <type 'str'>
print type(s2)  # <type 'str'>
5.3.2 Python 3 has two string types: str and bytes
str is Unicode:

#coding: utf-8
s = '林'  # when the program executes, no u prefix is needed; '林' is still stored in Unicode form in the new memory space

# s can be encoded directly into any format
s.encode('utf-8')
s.encode('gbk')

print(type(s))  # <class 'str'>
bytes is the bytes type:

#coding: utf-8
s = '林'  # when the program executes, no u prefix is needed; '林' is still stored in Unicode form in the new memory space

# s can be encoded directly into any format
s1 = s.encode('utf-8')
s2 = s.encode('gbk')

print(s)   # 林
print(s1)  # b'\xe6\x9e\x97'  in Python 3, what you ask to print is exactly what gets printed
print(s2)  # b'\xc1\xd6'      same as above
print(type(s))   # <class 'str'>
print(type(s1))  # <class 'bytes'>
print(type(s2))  # <class 'bytes'>