Python (14): Character Encoding
I. Background Knowledge for Character Encoding
1. How a text editor accesses files (Notepad++, PyCharm, Word)
Opening an editor starts a process, and a process lives in memory, so everything you type in the editor is also held in memory; if the power fails before you save, that data is lost. Clicking the Save button flushes the content from memory to the hard disk. At this point, a .py file we have written (but not executed) is no different from any other file: it is just a pile of characters.
In other words, until you click Save, everything we write exists only in memory. This is important!! When we click Save, the content is flushed to the hard disk.
So two things happen here: writing content into memory, and flushing it from memory to the hard disk. These are two separate steps.
2. How the python interpreter runs a py file, for example python test.py
Stage 1: Start the python interpreter, which is equivalent to starting a text editor.
Stage 2: Like a text editor, the interpreter opens test.py and reads its content from the hard disk into memory.
Stage 3: The interpreter executes the test.py code that was just loaded into memory.
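These stages can be imitated in a few lines of Python 3 (a sketch only; the file name test.py and its one-line content are made up here to keep the example self-contained):

```python
import os
import tempfile

# Stand-in for a test.py already on disk (hypothetical content)
path = os.path.join(tempfile.mkdtemp(), 'test.py')
with open(path, 'w') as f:
    f.write("result = 1 + 1\n")

# Stage 2: read the source from the hard disk into memory
with open(path) as f:
    source = f.read()

# Stage 3: compile and execute the text that now sits in memory
namespace = {}
exec(compile(source, path, 'exec'), namespace)
print(namespace['result'])  # 2
```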
So the python interpreter executes a py file in two steps: 1. read the file into memory; 2. interpret and execute its content.

II. Character Encoding
To clarify character encoding, the first problem to be solved is: what is character encoding?
We all know that a computer needs electricity to work; in other words, electricity drives the computer, and electricity has only high and low levels (a high level represents the binary digit 1, a low level the binary digit 0). That is, the computer only understands numbers (010101...), so if we want to store our data, it must first be processed and ultimately converted into 0s and 1s the computer can recognize.
Therefore, a process is required:
Character -------- (translation process) -------> Number
This process is really a standard for how each character corresponds to a specific number, and that standard is called a character encoding.
So here is the question: an encoding scheme has to solve two problems:
A. How bytes are grouped, for example into 8-bit or 16-bit units, known as code units.
B. The mapping between code units and characters. For example, in ASCII the decimal value 65 maps to the letter A.
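The mapping in (B) can be inspected directly in Python 3 through the built-ins ord() and chr():

```python
# ASCII maps each number 0-127 to one character; ord() and chr()
# expose the two directions of that mapping.
print(ord('A'))  # 65: the letter A is stored as the number 65
print(chr(65))   # A: and 65 maps back to the letter A
print(ord('~'))  # 126
```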
ASCII was one of the most popular encoding systems of the last century, at least in the West: it defines exactly how each code unit maps to a character.

III. Development History of Character Encoding
Stage 1: Modern computers originated in the United States, so the earliest encoding, ASCII, was designed with only English in mind.
As computers became increasingly popular and competition among manufacturers intensified, moving data between different computer systems became very painful, and people grew tired of the chaos such customization caused. Eventually the computer manufacturers worked out a standard way to describe characters: use the low 7 bits of a byte to represent a character, and build a lookup table mapping each 7-bit value to one character. For example, the letter A is 65, the letter c is 99, ~ is 126, and so on. Thus ASCII was born. The original ASCII standard defines the characters 0 through 127, which fit exactly in seven bits.
Why seven bits rather than eight for a single character? That hardly matters here. But a byte is eight bits, which means one bit went unused: the codes from 128 to 255 were left unspecified by the authors of the ASCII standard, and those Americans neither knew nor cared about the rest of the world. People in other countries took the opportunity to use the range 128-255 for characters in their own languages; the same value, 144 for instance, stood for different characters in the Arabic and Russian extensions. The problem with these extended ASCII codes is that although everyone agreed on the characters 0-127, there were many conflicting interpretations of 128-255: you had to tell the computer which extension was in use before it could display the characters 128-255 correctly.
Summary: in ASCII, 1 byte represents one character (an English character or any other character on the keyboard). 1 byte = 8 bits, and 8 bits can represent 2**8 = 256 values, i.e. 256 characters. ASCII originally used only the lower seven bits, the values 0-127, which is enough for every character on the keyboard; later, to fit Latin characters into the ASCII table, the highest bit was used as well.
Stage 2: GBK, customized for Chinese
GBK: 2 bytes represent one character. To meet their own needs, other countries likewise defined their own encodings: Japan encoded Japanese into Shift_JIS, and South Korea encoded Korean into EUC-KR.
Stage 3: Unicode, the code for all nations
Later, people began to feel that so many encodings made the world too complicated and too painful, so they sat down together and came up with a method: represent the characters of every language in one and the same character set. That is Unicode.
Unicode uses 2 bytes per character, i.e. 2**16 = 65536 possible values, enough to represent more than sixty thousand characters, so Unicode can accommodate every language. For purely English text, however, this encoding doubles the storage (an English letter needs only one byte, yet is stored in two), which wastes space. Hence UTF-8 was created: it uses only 1 byte for an English character and 3 bytes for a Chinese character. UTF-8 is a remarkable design: it beautifully achieves backward compatibility with ASCII, which helped Unicode gain public acceptance.
In UTF-8, the characters 0-127 are represented in one byte, using the same encoding as US-ASCII. This means a document written in the 1980s opens under UTF-8 with no problem at all. Only the characters 128 and above use 2, 3 or 4 bytes; that is why UTF-8 is called a variable-length encoding. For example, the byte stream 48 45 4C 4C 4F represents the same characters, HELLO, in both ASCII and UTF-8.
As for UTF-16, it is not covered here.
To sum up: unicode is simple and crude. Every character takes 2 bytes. The advantage is that the character -----> number conversion is fast; the disadvantage is that it occupies a lot of space.
UTF-8: precise. Different characters are represented with different lengths. The advantage is that it saves space; the disadvantage is that the character -> number conversion is slower, because the number of bytes each character needs must be computed every time.
Therefore, memory uses unicode, trading space for time (a program must be loaded into memory to run, so memory access should be as fast as possible); the hard disk and network transmission use UTF-8, because network I/O and disk I/O latency far exceed the cost of the UTF-8 conversion, and I/O should save as much bandwidth as possible to keep data transmission efficient and stable.
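The space trade-off can be measured in Python 3; in this sketch UTF-16 stands in for the fixed 2-bytes-per-character unicode described above:

```python
# A fixed 2-byte encoding doubles the size of ASCII text, which is
# why UTF-8 is preferred for disk storage and network transfer.
text = 'hello' * 100                         # 500 English characters
utf16_size = len(text.encode('utf-16-le'))   # 2 bytes per character
utf8_size = len(text.encode('utf-8'))        # 1 byte per ASCII character
print(utf16_size)  # 1000
print(utf8_size)   # 500
```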
All programs are eventually loaded into memory. On the hard disk, programs from different countries are stored in their various encodings, but in memory we uniformly and invariably use unicode, to stay compatible with every nation's text (this is why a computer can run programs from any country). You might say: I could fix memory to UTF-8 instead and still be compatible with other countries. Yes, that would work. Unicode is not necessarily more efficient than UTF-8 in every respect (unicode is a fixed 2-byte encoding, while UTF-8 requires per-character calculation), and unicode does waste space; that is exactly the space-for-time trade-off. When data is stored to the hard disk or transmitted over the network, it is converted from unicode to UTF-8, because data transmission favors small, stable payloads: the smaller the data, the more reliable the transfer. That is why unicode is converted to UTF-8 rather than transmitted as unicode.

IV. Use of Character Encoding
Whatever kind of file it is, remember one rule: a file must be opened with the same encoding it was saved with.
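A small Python 3 sketch of this rule (the file name demo.txt is made up): a file saved in GBK reads back correctly as GBK, but fails when opened as UTF-8:

```python
import os
import tempfile

# Save a Chinese character with the GBK encoding (bytes b'\xc1\xd6')
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w', encoding='gbk') as f:
    f.write('林')

# Opening with the SAME encoding recovers the text
with open(path, encoding='gbk') as f:
    print(f.read())  # 林

# Opening with a DIFFERENT encoding fails: 0xC1 is not valid UTF-8
try:
    with open(path, encoding='utf-8') as f:
        f.read()
except UnicodeDecodeError:
    print('decode failed')  # prints "decode failed"
```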
Let's take a look at the Encoding Problems in python:
If a python file does not specify the header line # -*- coding: utf-8 -*-, a default is used: ascii in python2, and UTF-8 in python3.
The interpreter then reads the code that has been loaded into memory (unicode-encoded binary) and executes it; new memory may be allocated during execution, for example by x = "hello".
"Memory uses unicode" does not mean every byte in memory is unicode-encoded binary. Before the program runs, that is indeed true: when the line x = "hello" is read from the file, x, the equals sign, the quotation marks and hello all have the same status; they are all just ordinary characters, stored in memory as unicode-encoded binary. During execution, however, the program allocates new memory (separate from the memory holding the program code), and that memory can hold data in any encoding. For x = "hello", the python interpreter recognizes a string, allocates memory to store "hello", and points x at that address; the newly allocated memory holds hello encoded as unicode. If the code were instead x = "hello".encode('utf-8'), the newly allocated memory would hold the UTF-8-encoded bytes of hello.
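In Python 3 terms, the two cases above look like this:

```python
# A plain string is unicode text; calling encode() produces a new
# object holding UTF-8 bytes instead.
x = 'hello'
y = 'hello'.encode('utf-8')
print(type(x))  # <class 'str'>: unicode text
print(type(y))  # <class 'bytes'>: UTF-8-encoded bytes
print(y)        # b'hello'
```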
When browsing the Web page, the server will convert the dynamically generated Unicode content into a UTF-8 and then transmit it to the browser
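This hand-off can be sketched in a few lines of Python 3 (the page content here is made up):

```python
# The server encodes its unicode text to UTF-8 bytes for transmission,
# and the client decodes those bytes back with the same encoding.
page = '<h1>林</h1>'              # unicode content generated on the server
wire = page.encode('utf-8')       # bytes actually sent over the network
received = wire.decode('utf-8')   # browser decodes with the matching codec
print(received == page)           # True
```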
If the server encodes with UTF-8, then what the client receives into memory is also UTF-8-encoded binary.

V. Differences Between Python2 and Python3 Encoding
1. In python2 there are two string types: str and unicode
When the python2 interpreter executes code that produces a string (for example s = '林'), it allocates new memory and then encodes '林' in the encoding declared at the top of the file. s is therefore already an encoded result, so s can only be decoded; calling encode on it again raises an error.
# _*_ coding: gbk _*_
#!/usr/bin/env python

x = '林'
# print x.encode('gbk')  # error
print x.decode('gbk')  # result: 林
In python2, str is the result of encoding, i.e. bytes; str = bytes. So in python2, str/bytes is the encoded result of unicode characters.
# coding: utf-8
s = '林'  # when this runs, '林' is stored in the new memory space in utf-8 form

print repr(s)  # '\xe6\x9e\x97': three bytes, proving it really is utf-8
print type(s)  # <type 'str'>

s.decode('utf-8')
# s.encode('utf-8')  # error: s is already an encoded result (bytes), so it can only be decoded
When the python2 interpreter executes code that produces a unicode string (for example s = u'林'), it allocates new memory and stores '林' there in unicode form. s can therefore only be encoded, not decoded.
s = u'林'
print repr(s)  # u'\u6797'
print type(s)  # <type 'unicode'>

# s.decode('utf-8')  # error: s is unicode, so it can only be encoded
s.encode('utf-8')
There are some issues to watch out for when printing data to a terminal.
When the program runs, for example x = '林'; print(x), this step points x at a new memory space (not the space where the code itself lives) and prints its contents to the terminal. The terminal, too, runs in memory, so this printing is effectively memory-to-memory: unicode -> unicode. Unicode data never comes out garbled, however it is printed. Strings in python3, like u'...' strings in python2, are unicode, so they print correctly everywhere. On a Windows terminal, however, the terminal encoding is gbk; if the file (and hence the printed bytes) is UTF-8-encoded, the output is garbled.
# verify the printed results below in both pycharm and cmd
s = u'林'  # when the program runs, '林' is stored in the new memory space in unicode form

# s is unicode, so it can be encoded into any format without error
s1 = s.encode('utf-8')
s2 = s.encode('gbk')

print s1  # does this print correctly?
print s2  # does this print correctly?

print repr(s)   # u'\u6797'
print repr(s1)  # '\xe6\x9e\x97': utf-8 encodes this Chinese character in 3 bytes
print repr(s2)  # '\xc1\xd6': gbk encodes this Chinese character in 2 bytes

print type(s)   # <type 'unicode'>
print type(s1)  # <type 'str'>
print type(s2)  # <type 'str'>
2. In python3 there are also two string types: str and bytes
In python3 the str type is the unicode type.
# coding: utf-8
s = '林'  # when the program runs, '林' is stored as unicode in the new memory space even without the u prefix

# s can be encoded directly into any format
s.encode('utf-8')
s.encode('gbk')

print(type(s))  # <class 'str'>
# coding: utf-8
s = '林'  # stored as unicode in the new memory space even without the u prefix

s1 = s.encode('utf-8')
s2 = s.encode('gbk')

print(s)   # 林
print(s1)  # b'\xe6\x9e\x97': python3 prints bytes as-is
print(s2)  # b'\xc1\xd6': same here

print(type(s))   # <class 'str'>
print(type(s1))  # <class 'bytes'>
print(type(s2))  # <class 'bytes'>
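One more Python 3 sketch to close the loop: bytes produced by encode() return to the original str only when decoded with the same codec:

```python
# Round trip: encode and decode must use the same codec.
s = '林'
s1 = s.encode('utf-8')  # b'\xe6\x9e\x97'
s2 = s.encode('gbk')    # b'\xc1\xd6'
print(s1.decode('utf-8') == s)  # True
print(s2.decode('gbk') == s)    # True
# The wrong codec garbles the text (replacement characters, not 林):
print(s2.decode('utf-8', errors='replace') == s)  # False
```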