I. Background knowledge for character encoding
1. How a text editor accesses a file (Notepad++, PyCharm, Word)
Opening an editor starts a process, and that process lives in memory, so whatever you type into the editor is also held in memory. Data in memory is lost when the power goes out, so it has to be saved to the hard disk: clicking Save flushes the data from memory to disk. At this point, writing a py file (without executing it) is no different from writing any other file; we are just writing a bunch of characters.
In other words, until you click Save, everything you type exists only in memory. This is important to remember! Only when you click Save is the content flushed to the hard disk.
So an editor does two things: it writes the content to memory, and it flushes that content from memory to the hard disk. These are two separate steps.
2. How the Python interpreter executes a py file, for example python test.py
Stage one: the Python interpreter starts, which is comparable to launching a text editor.
Stage two: the interpreter acts like a text editor: it opens test.py and reads its contents from the hard disk into memory.
Stage three: the interpreter interprets and executes the test.py code that was just loaded into memory.
In short, the Python interpreter executes a py file in two steps: 1. read the file into memory, 2. interpret and execute it.
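The two steps can be imitated from inside Python. The sketch below is only an analogy, not a description of how the interpreter is implemented: it saves a small stand-in for test.py in a temporary directory, reads the bytes back from disk into memory, and only then compiles and executes them. The file contents and variable names are made up for illustration.

```python
import os
import tempfile

# Hypothetical source text standing in for test.py.
source_text = "x = 1 + 2\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.py")

    # "Clicking Save": the characters go from memory to disk as bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write(source_text)

    # Step 1: read the file from disk into memory. Nothing runs yet.
    with open(path, "rb") as f:
        raw_bytes = f.read()

# Step 2: interpret and execute what was loaded into memory.
namespace = {}
exec(compile(raw_bytes, "test.py", "exec"), namespace)
print(namespace["x"])  # 3
```

Until the second step, the file is just a sequence of bytes; it only becomes a program when it is interpreted.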
II. Introduction to character encoding
To understand character encoding, the first question to answer is: what is a character encoding?
We all know that a computer needs power to work, that is, electricity drives the computer. The characteristic of electricity is high and low voltage levels (a high level is binary 1, a low level is binary 0), so the computer only understands numbers (010101). To store data, we first have to process it so that it ends up as 010101, which the computer can recognize.
So there has to be a process:
character --------(translation process)-------> number
This process is really a standard for how each character corresponds to a particular number, and that standard is called a character encoding.
So what does that involve? An encoding scheme has two problems to solve:
A. How bits are grouped, for example 8 bits or 16 bits to a group; such a group is also called a code unit.
B. The mapping between code units and characters. For example, in ASCII, decimal 65 maps to the letter A.
ASCII was one of the most popular encoding systems of the last century, at least in the West. The ASCII table defines how each of its code units maps to a character.
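The character-to-number mapping can be inspected directly in Python 3 with the built-ins ord and chr. This is a small illustration of the mapping described above; the variable names are only for this example.

```python
# character -> number
code_point = ord("A")

# number -> character
character = chr(65)

# Encoding to ASCII produces one 8-bit code unit holding that same number.
encoded = "A".encode("ascii")

print(code_point)     # 65
print(character)      # A
print(list(encoded))  # [65]
```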
III. The history of character encoding
Phase one: modern computers originated in the United States, so the earliest encoding, ASCII, was designed with only English in mind.
As computers became more popular and competition between vendors grew more intense, converting data between different computer systems became a real headache, and people grew tired of the confusion caused by vendor-specific schemes. Eventually, computer manufacturers worked together on a standard way to describe characters. They defined the low 7 bits of a byte to represent a character, and drew up a lookup table mapping each seven-bit value to one character. For example, the letter A is 65, c is 99, ~ is 126, and so on; thus ASCII was born. The original ASCII standard defines characters 0 through 127, which fit exactly in seven bits.
Why 7 bits rather than 8 per character? It does not matter here. But a byte has 8 bits, which means 1 bit went unused: codes 128 to 255 were not defined by the ASCII standard, whose American authors knew little, or cared little, about the rest of the world. People in other countries took the opportunity to use codes 128 to 255 for the characters of their own languages. For example, 144 means one character in the Arabic flavor of ASCII and a different character in the Russian flavor. The problem with ASCII is that although everyone agrees on characters 0-127, there are many different interpretations of 128-255. You have to tell the computer which flavor of ASCII to use before it can display characters 128-255 correctly.
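The ambiguity of codes 128-255 is easy to reproduce. The sketch below decodes the single byte 144 with two Windows code pages, cp1256 (Arabic) and cp1251 (Cyrillic). These two code pages are my choice for illustration; the text does not say which "flavors" it has in mind.

```python
raw = bytes([144])

# Same byte, two different interpretations (code pages chosen as an assumption).
as_arabic = raw.decode("cp1256")
as_cyrillic = raw.decode("cp1251")

# Show the resulting code points rather than the glyphs themselves.
print(hex(ord(as_arabic)), hex(ord(as_cyrillic)))
print(as_arabic == as_cyrillic)  # False

# A byte in the 0-127 range means the same thing in both.
low = bytes([65])
print(low.decode("cp1256"), low.decode("cp1251"))  # A A
```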
Summary: in ASCII, one byte represents one character (English letters and all the other characters on the keyboard). 1 byte = 8 bits, and 8 bits can represent the values 0 to 2**8-1, that is, 256 characters. ASCII originally used only the low seven bits, giving 128 values (0-127), which is fully enough for every character on the keyboard. Later, to fit Latin characters into the table, the highest bit was put to use as well.
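The arithmetic in the summary can be checked in a few lines. The use of Latin-1 as the example of an 8-bit extension is an assumption on my part; the text only says that Latin characters were added using the highest bit.

```python
import string

seven_bit_values = 2 ** 7   # values 0..127
eight_bit_values = 2 ** 8   # values 0..255
print(seven_bit_values, eight_bit_values)  # 128 256

# Every printable keyboard character fits in the 7-bit range.
all_fit_in_7_bits = all(ord(ch) < 128 for ch in string.printable)
print(all_fit_in_7_bits)  # True

# An accented Latin letter needs the highest bit (Latin-1 used as an example).
latin1_byte = "\u00e9".encode("latin-1")  # the letter e with an acute accent
print(list(latin1_byte))  # [233]

# Plain ASCII has no code for it.
try:
    "\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```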
Phase two: to support Chinese, the Chinese created GBK.
GBK: 2 bytes represent one Chinese character. Likewise, every other country had to define its own encoding for its own language: Japan put Japanese into Shift_JIS, South Korea put Korean into EUC-KR, and so on.
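As a rough illustration, Python ships codecs for all three national encodings mentioned here. The sample characters below are my own picks; each one takes 2 bytes in its national encoding, and bytes written in one encoding are not read back correctly with another.

```python
# One sample character per national encoding (samples chosen for illustration).
samples = {"gbk": "林", "shift_jis": "あ", "euc_kr": "가"}

encoded = {codec: ch.encode(codec) for codec, ch in samples.items()}
for codec, data in encoded.items():
    print(codec, data, len(data))

# Reading GBK bytes with the Japanese codec does not give back the original.
wrong = encoded["gbk"].decode("shift_jis", errors="replace")
right = encoded["gbk"].decode("gbk")
print(wrong == samples["gbk"], right == samples["gbk"])  # False True
```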
Phase three: Unicode, the universal code
Later, people began to feel that having so many encodings made the world too complicated and gave everyone a headache, so they sat down together and came up with one idea: represent the characters of every language in a single character set. That is Unicode.
Unicode originally used a uniform 2 bytes per character; 2**16 = 65,536 values, enough for more than 60,000 characters and therefore for the languages of the world. (The standard has since grown beyond 16 bits, but the 2-byte picture is sufficient for this discussion.) For purely English text, however, this doubles the storage space: one byte is enough for an English letter, so using two is a waste. Hence UTF-8, in which English characters take only 1 byte and common Chinese characters take 3 bytes. UTF-8 is a very elegant design: it is backward compatible with ASCII, which ensured that Unicode could be widely accepted.
In UTF-8, characters 0-127 are represented by 1 byte, using the same encoding as US-ASCII. This means an ASCII document written in the 1980s opens as UTF-8 without any problem. Only characters 128 and above take 2, 3, or 4 bytes. That is why UTF-8 is called a variable-length encoding. Take the following byte stream:
0100100001000101010011000100110001001111
This byte stream represents the same characters in both ASCII and UTF-8: HELLO
Other encodings, such as UTF-16, are not covered here.
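The 40-bit stream shown above can be checked by splitting it into 8-bit groups and decoding the resulting bytes both ways. A minimal sketch:

```python
bits = "0100100001000101010011000100110001001111"

# Split into groups of 8 bits and turn each group into one byte.
data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(list(data))            # [72, 69, 76, 76, 79]
print(data.decode("ascii"))  # HELLO
print(data.decode("utf-8"))  # HELLO
```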
To sum up:
Unicode: simple and crude. Every character takes 2 bytes. The advantage is fast character-to-number conversion; the disadvantage is that it takes up more space.
UTF-8: precise. Characters of different kinds get different lengths. The advantage is that it saves space; the disadvantage is slower character-to-number conversion, because each time it has to work out how many bytes the character needs.
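The space trade-off can be measured. In the sketch below, utf-16-le is used as a stand-in for the fixed 2-byte form described here; that substitution is an assumption for illustration, and it holds only for characters that fit in 16 bits, like the three samples.

```python
# Byte counts per character: (UTF-8 length, 2-byte form length).
sizes = {}
for ch in ["A", "\u00e9", "林"]:
    utf8_len = len(ch.encode("utf-8"))
    two_byte_len = len(ch.encode("utf-16-le"))
    sizes[ch] = (utf8_len, two_byte_len)
    print(ascii(ch), utf8_len, two_byte_len)

# English text: UTF-8 needs half the space of the 2-byte form.
english = "hello world"
print(len(english.encode("utf-8")), len(english.encode("utf-16-le")))  # 11 22
```

For Chinese text the comparison goes the other way (3 bytes against 2), which is part of why the choice of encoding depends on where the data lives.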
Therefore, memory uses Unicode, trading space for time (a program must be loaded into memory to run, so memory should be as fast as possible), while the hard disk and network transmission use UTF-8: network or disk I/O latency is far greater than the UTF-8 conversion delay, and I/O should save as much bandwidth as possible to keep data transmission stable.
Every program ultimately has to be loaded into memory. Programs are saved to disk in different encodings in different countries, but in memory, in order to be compatible with all of them (this is why a computer can run programs from any country), Unicode is used uniformly. You might say: to be compatible with everyone I could use UTF-8 in memory as well. You could, and it would work perfectly well; the reason not to is simply that Unicode is more efficient than UTF-8 (Unicode uses a fixed 2 bytes, while UTF-8 has to calculate the length). Unicode does waste more space; that is trading space for time. When storing to disk or transmitting over the network, Unicode is converted to UTF-8, because data transmission is about stability and efficiency: the less data is sent, the more reliable the transfer. So everything is converted to UTF-8 rather than kept as Unicode.
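The memory-to-disk conversion can be observed by writing a string to a file as UTF-8 and then looking at the raw bytes. This is a minimal sketch using a temporary file; the file name demo.txt is made up.

```python
import os
import tempfile

text = "x = '林'"  # a str in memory (Unicode)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")

    # Memory -> disk: the str is encoded to UTF-8 bytes on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # What is actually stored on disk.
    with open(path, "rb") as f:
        on_disk = f.read()

    # Disk -> memory: the bytes are decoded back into a str.
    with open(path, "r", encoding="utf-8") as f:
        back_in_memory = f.read()

print(on_disk)                 # b"x = '\xe6\x9e\x97'"
print(back_in_memory == text)  # True
```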
IV. Using character encodings
Whatever the type of file, just remember one thing: open a file with the same encoding it was saved in.
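A short experiment shows what happens when the rule is broken. The sketch saves one character as GBK and then opens the file first with GBK and then with UTF-8; the file name is made up.

```python
import os
import tempfile

text = "林"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "saved_as_gbk.txt")

    # Save with GBK.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Open with the same encoding: the text comes back intact.
    with open(path, "r", encoding="gbk") as f:
        same_codec = f.read()

    # Open with a different encoding: here the bytes are not valid UTF-8.
    try:
        with open(path, "r", encoding="utf-8") as f:
            other_codec = f.read()
    except UnicodeDecodeError:
        other_codec = None

print(same_codec == text)   # True
print(other_codec is None)  # True
```

In other combinations the mismatch may not raise an error at all and instead silently produce garbled characters, which is harder to notice.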
Now let's look at how encoding issues show up in Python:
If a Python file does not declare the header # -*- coding: utf-8 -*-, the default is used: Python 2 defaults to ASCII, Python 3 defaults to UTF-8.
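In Python 3 the effect of the header can be tried without creating files, because compile accepts source code as bytes and applies the same rules. The helper run below is hypothetical and exists only for this illustration.

```python
def run(source_bytes):
    """Compile and execute source given as raw bytes; return the value of x."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["x"]

# No header: Python 3 assumes the bytes are UTF-8.
utf8_source = "x = '林'\n".encode("utf-8")

# GBK bytes need the header so the interpreter knows how to read them.
gbk_source = "# -*- coding: gbk -*-\nx = '林'\n".encode("gbk")

print(run(utf8_source) == "林")  # True
print(run(gbk_source) == "林")   # True
```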
The interpreter reads the code that has been loaded into memory (Unicode-encoded binary) and then executes it; during execution it may allocate new memory, for example for x = "Hello".
That memory uses Unicode does not mean everything in memory is Unicode-encoded binary. Before the program executes, memory does hold Unicode-encoded binary: for example, a line x = "Hello" read from the file, in which x, the equals sign, and the quotes all have the same status. They are all ordinary characters, stored in memory in Unicode-encoded binary form. During execution, however, the program allocates memory (separate from the memory that holds the program code) that can store data in any encoding. For example, x = "Hello" is recognized by the Python interpreter as a string; it allocates memory to hold "Hello" and then makes x point to that address. This newly allocated memory also holds Unicode-encoded Hello. If the code is changed to x = "Hello".encode('utf-8'), then the newly allocated memory holds the UTF-8 encoded Hello.
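In Python 3 the two cases from the paragraph above are visible as two different types. A brief sketch:

```python
x = "Hello"                  # text held as Unicode: type str
y = "Hello".encode("utf-8")  # UTF-8 encoded data: type bytes

print(type(x).__name__, type(y).__name__)  # str bytes
print(y)                                   # b'Hello'

# With a non-ASCII character the difference shows up in the length too.
z = "林"
print(len(z), len(z.encode("utf-8")))      # 1 3
```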
When you browse the web, the server converts dynamically generated Unicode content to UTF-8 and then sends it to the browser.
If the server encodes in UTF-8, what the client receives in memory is also UTF-8 encoded binary.
V. Encoding differences between Python 2 and Python 3
1. Python 2 has two string types: str and unicode
str type
When the Python interpreter executes code that produces a string (for example, s = '林'), it allocates new memory and encodes '林' into the encoding declared at the top of the file. This is already the result of an encode, so s can only be decoded; calling encode again raises an error.
    #!/usr/bin/env python
    #_*_coding:gbk_*_
    x = '林'
    # print x.encode('gbk')  # raises an error
    print x.decode('gbk')    # result: 林
In Python 2, str is the encoded result, that is, bytes: str == bytes. So in Python 2, the result of encoding a unicode string is str/bytes.
    #coding:utf-8
    s = '林'  # at execution, '林' is saved to new memory in the form given by coding:utf-8
    print repr(s)  # '\xe6\x9e\x97' three bytes, proving it really is utf-8
    print type(s)  # <type 'str'>
    s.decode('utf-8')
    # s.encode('utf-8')  # raises an error: s is already the encoded result (bytes), so it can only be decoded
unicode type
When the Python interpreter executes code that produces a string (for example, s = u'林'), it allocates new memory and stores '林' there in Unicode form, so s can only be encoded, not decoded.
    #coding:utf-8
    s = u'林'
    print repr(s)  # u'\u6797'
    print type(s)  # <type 'unicode'>
    # s.decode('utf-8')  # raises an error: s is unicode, so it can only be encoded
    s.encode('utf-8')
Special note:
When data is printed to the terminal, there are a few things to watch out for.
When the program executes, for example x = '林'; print(x), this step prints the contents of the new memory that x points to (not the memory that holds the code) to the terminal. The terminal itself runs in memory, so this printing can be understood as printing from memory to memory, that is, Unicode -> Unicode. Data in Unicode form is never garbled, no matter how it is printed. Strings in Python 3 and u'...' strings in Python 2 are Unicode, so printing them never produces garbled text. Garbling does occur, for example, in a Windows terminal when the terminal encoding is GBK and the file encoding is UTF-8.
    #coding:utf-8
    # Check the following print results in both PyCharm and cmd
    s = u'林'  # at execution, '林' is stored in new memory in Unicode form
    # s points to unicode, so it can be encoded to any format without an encode error
    s1 = s.encode('utf-8')
    s2 = s.encode('gbk')
    print s1  # does it print correctly?
    print s2  # does it print correctly?
    print repr(s)   # u'\u6797'
    print repr(s1)  # '\xe6\x9e\x97' utf-8 encodes one Chinese character in 3 bytes
    print repr(s2)  # '\xc1\xd6' gbk encodes one Chinese character in 2 bytes
    print type(s)   # <type 'unicode'>
    print type(s1)  # <type 'str'>
    print type(s2)  # <type 'str'>
2. Python 3 also has two string types: str and bytes
The str type is now the Unicode type
    #coding:utf-8
    s = '林'  # at execution, no u prefix is needed; '林' is saved to new memory in Unicode form
    # s can be encoded directly to any encoding
    s.encode('utf-8')
    s.encode('gbk')
    print(type(s))  # <class 'str'>
bytes type
    #coding:utf-8
    s = '林'  # at execution, no u prefix is needed; '林' is saved to new memory in Unicode form
    # s can be encoded directly to any encoding
    s1 = s.encode('utf-8')
    s2 = s.encode('gbk')
    print(s)   # 林
    print(s1)  # b'\xe6\x9e\x97' in Python 3, bytes are printed exactly as they are
    print(s2)  # b'\xc1\xd6' same as above
    print(type(s))   # <class 'str'>
    print(type(s1))  # <class 'bytes'>
    print(type(s2))  # <class 'bytes'>
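One practical consequence of the Python 3 split is that the two types no longer mix implicitly. The following sketch, which is my own addition to the listings above, shows that str has no decode, bytes has no encode, and combining the two raises an error instead of guessing an encoding.

```python
s = "林"
b = s.encode("utf-8")

# In Python 3 each type supports only one direction.
str_can_decode = hasattr(s, "decode")
bytes_can_encode = hasattr(b, "encode")
print(str_can_decode, bytes_can_encode)  # False False

# Mixing str and bytes is refused instead of converted implicitly.
try:
    s + b
    mixing_error = None
except TypeError as exc:
    mixing_error = type(exc).__name__
print(mixing_error)  # TypeError

# The explicit round trip always works.
round_trip = b.decode("utf-8")
print(round_trip == s)  # True
```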