Full-stack python development - Day 6: character encoding


I. Knowledge about character encoding

1. Basic computer knowledge

 

2. How text editors (notepad++, pycharm, word) access files

#1. Opening the editor starts a process, and a process lives in memory, so the content written in the editor is also stored in memory and is lost after a power failure.
#2. To save the content permanently, click the Save button: the editor flushes the data from memory to the hard disk.
#3. Writing a py file (without executing it) is no different from writing any other file; it is just a bunch of characters.

3. How the python interpreter runs a py file, for example python test.py

# Phase 1: Start the python interpreter, which is equivalent to starting a text editor.
# Phase 2: The python interpreter, acting like a text editor, opens test.py and reads the content of test.py from the hard disk into memory. (Quick review: python being interpreted means the interpreter only cares about the file content and does not care about the file suffix.)
# Phase 3: The python interpreter executes the test.py code that has just been loaded into memory. (ps: only in this phase, when the code is actually executed, is python syntax recognized; when execution reaches name = "egon", memory space is allocated to store the string "egon".)

4. Similarities and differences between the python interpreter and a text editor

#1. Similarity: the python interpreter interprets and executes the file content, so like a text editor it must first be able to read the py file into memory.
#2. Difference: after a text editor reads the file content into memory, it merely displays or edits the content and ignores python syntax; after the python interpreter reads the file content into memory, its purpose is not to show you the python code but to execute it, so it must recognize python syntax.
II. Character encoding

1. What is character encoding?

A computer must be powered on to work; in other words, the characteristics of electricity determine the characteristics of the computer. Electricity has only high and low levels (humans logically map the binary digit 1 to a high level and the binary digit 0 to a low level), and the magnetic properties of a disk behave the same way. Conclusion: a computer only recognizes numbers.

When we use a computer, however, we use characters that humans understand (the result of programming in a high-level language is nothing more than writing a pile of characters into a file). How can the computer read human characters? There must be a translation process:

# character -------- (translation process) -------> number
# This process, i.e. the standard that maps each character to a specific number, is called character encoding.
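As a quick illustration (a minimal Python 3 sketch added here, not part of the original text), the built-in functions ord() and chr() expose exactly this character <-> number correspondence:

# ord() returns the number (code point) behind a character, chr() converts back
print(ord('a'))       # 97
print(chr(97))        # 'a'
print(ord('中'))      # 20013
print(chr(20013))     # '中'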

2. Two scenarios that involve character encoding:

#1. The content of a python file is made up of a bunch of characters, and any access to it involves character encoding (this covers the case where the python file is not executed, i.e. the first two phases above).
#2. The string data type in python is made up of characters (this applies when the python file is executed, i.e. the third phase).

3. Development history and classification of character encoding (for understanding)

The computer was invented by Americans, and the earliest character encoding was ASCII, which only defined the correspondence between numbers and English letters plus some special characters. It uses at most 8 bits (one byte), that is 2 ** 8 = 256, so ASCII can represent at most 256 symbols.

Of course, English alone is enough for programming languages themselves, so ASCII would suffice. But when processing data, different countries use different languages: Japanese programmers put Japanese in their programs, and Chinese programmers put in Chinese.

Representing Chinese with a single byte is impossible (even a primary school student knows more than two thousand Chinese characters). The only solution is for one character to be represented by more than 8 binary bits: the more bits, the more combinations can be represented, and thus the more Chinese characters can be expressed.

Therefore, the Chinese defined their own standard, the gb2312 encoding, which specifies the character -> number correspondence including Chinese characters.

The Japanese have defined their own Shift_JIS encoding.

Koreans defined their own Euc-kr encoding (the Koreans also claimed that computers were invented by them and demanded that the whole world uniformly use Korean encoding, but the rest of the world ignored them).

 

At this point a problem arises. Suppose Zhou, who is proficient in 18 languages, modestly writes a document using 8 of them: no matter which country's standard is used to save it, garbled characters will appear (because each standard so far only defines the correspondence between its own country's characters and numbers, so if one country's encoding format is used, the characters of the other countries will be garbled when parsed).

Therefore a world standard was urgently needed, one that could contain all the languages of the world, and unicode came into being (the Koreans expressed their dissatisfaction, to no avail).

Ascii represents a character in 1 byte (8-bit binary)

Unicode usually uses two bytes (16-bit binary) to represent one character. For uncommon characters, it must use four bytes.

Example:

The letter x is represented in ascii as decimal 120, binary 0111 1000.

The Chinese character 中 is beyond the range of ASCII; its unicode encoding is decimal 20013, binary 01001110 00101101.

In unicode, the letter x is represented as the binary 0000 0000 0111 1000. Unicode is therefore compatible with ascii as well as with every other language, and it is the world standard.
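The binary values above are easy to verify; a small Python 3 sketch (added for illustration):

# format(..., '016b') prints a unicode code point as 16 binary digits
print(format(ord('x'), '016b'))    # 0000000001111000  -> ascii 120, zero-padded to 16 bits
print(format(ord('中'), '016b'))   # 0100111000101101  -> decimal 20013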

 

At this point the garbled-character problem disappears: every document can use unicode. But a new problem appears: if a document is entirely in English, unicode takes twice as much space as ascii, which is inefficient for storage and transmission.

In the spirit of saving space, unicode was turned into a variable-length encoding, UTF-8. UTF-8 encodes a unicode character into 1-6 bytes depending on its number: common English letters are encoded into 1 byte, a Chinese character is usually 3 bytes, and only very uncommon characters are encoded into 4-6 bytes. If the text to be transferred contains a large number of English characters, UTF-8 encoding saves space:

Character    ASCII       Unicode             UTF-8
A            01000001    00000000 01000001   01000001
中           ×           01001110 00101101   11100100 10111000 10101101

From the table above we can also see that UTF-8 has an additional benefit: ASCII encoding can actually be regarded as part of UTF-8 encoding, so a large amount of legacy software that only supports ASCII can continue to work under UTF-8.
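A minimal Python 3 sketch (illustrative, not from the original text) showing the variable-length behaviour of UTF-8 described above:

# ascii letters stay 1 byte under utf-8, a Chinese character becomes 3 bytes
print('A'.encode('utf-8'))     # b'A'
print('中'.encode('utf-8'))    # b'\xe4\xb8\xad'
print(len('A'.encode('utf-8')), len('中'.encode('utf-8')))   # 1 3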

4. Summary: the three stages in the development of character encoding (important)

Given the current situation, the encoding used in memory is fixed as unicode; the only thing we can change is the character encoding used on the hard disk.
At this point you may think: if we all use unicode in future software development, wouldn't everything be unified? The idea of unification is right, but we cannot use unicode to store program files, because when a file is mostly English the space consumed almost doubles, which increases the number of I/O operations when the software reads data into memory or writes it to disk and therefore lowers program efficiency. So when saving program files we should use the more economical UTF-8 character encoding (1 byte for an English letter, 3 bytes for a Chinese character). To emphasize once more: the encoding in memory is fixed as unicode.
1. When saving data to disk, unicode needs to be converted into the more economical UTF-8 (full name: Unicode Transformation Format), which keeps the amount of data as small as possible.

2. When reading data back into memory, UTF-8 is converted back to unicode.
So we must be clear: unicode is used in memory in order to be compatible with software from every country; even if the hard disk holds software written in many countries, each has a mapping to unicode. In current development, however, programmers generally use UTF-8 encoding; presumably, when all the old software has been retired some day, the picture will become: memory UTF-8 <-> hard disk UTF-8.
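A minimal Python 3 sketch of this memory <-> hard disk conversion (the file name demo.txt is just an example):

# saving: the unicode str in memory is encoded to utf-8 bytes on disk
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('hello 中')

# reading: the utf-8 bytes on disk are decoded back to a unicode str in memory
with open('demo.txt', 'r', encoding='utf-8') as f:
    print(f.read())   # hello 中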

III. Character encoding application - file editors

3.1 Text editor - notepad++

 

3.2 Text Editor-pycharm

Open in UTF-8 format (select reload)

3.3 Text editor - the python interpreter
Suppose the file test.py is saved in gbk format and its content is x = '上'. Then both python2 test.py and python3 test.py report an error (because python2 decodes with ascii by default and python3 with UTF-8 by default), unless the encoding is declared at the beginning of the file with # coding: gbk.
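For example, such a gbk-saved test.py would start like this (a minimal sketch, assuming the file really is saved as gbk):

# coding: gbk
x = '上'
print(x)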
3.4 Conclusion

!!! Two important points are summarized !!!

#1. The core principle is to guarantee that characters are decoded with the same standard that was used to encode them; the standard here is the character encoding.
#2. All characters typed into memory are taken to be unicode encoded. For example, when we open an editor and type "你", we cannot yet say that "你" is a Chinese character: at this point it is only a symbol that may be used in several countries, and its appearance may differ depending on the input method we use. Only when we save it to the hard disk or transmit it over the network do we determine whether "你" counts as a Chinese character or a Japanese character; that is the process of converting unicode into some other encoding format.

Unicode -----> encode --------> UTF-8

UTF-8 --------> decode ----------> unicode

# Supplement: when browsing a web page, the server converts the dynamically generated unicode content to UTF-8 and then transmits it to the browser; if the server's encode format is UTF-8, what the client receives in memory is also a UTF-8 encoded result.
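In Python 3 the encode/decode directions above map directly onto str.encode() and bytes.decode() (an illustrative sketch):

s = '你'                      # a unicode str in memory
b = s.encode('utf-8')         # unicode -----> encode --------> utf-8 bytes
print(b)                      # b'\xe4\xbd\xa0'
print(b.decode('utf-8'))      # utf-8 bytes --> decode --------> unicode: 你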

 

IV. Character encoding application - python program execution

4.1 The three phases of executing a python program

python test.py (the first step in executing test.py is to read the file content into memory)

Suppose the file test.py is saved in gbk format.

Phase 1: Start the python Interpreter

Phase 2: The python interpreter, acting like a text editor, opens the file test.py and reads the content of test.py from the hard disk into memory.

At this point the python interpreter reads the first line of test.py, # coding: utf-8, to determine which encoding format to use when reading the file into memory; this line sets the encoding used by the python interpreter. The interpreter's default can be checked with sys.getdefaultencoding(): if the header # -*- coding: utf-8 -*- is not specified in the python file, python2 defaults to ascii and python3 defaults to utf-8.
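For example (a small sketch; the printed value depends on the interpreter version):

import sys
print(sys.getdefaultencoding())   # 'utf-8' under python3, 'ascii' under python2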

 

The correct approach: specify a file header in test.py declaring that the character encoding is gbk:

# coding: gbk

Phase 3: Interpret and execute the code that has just been loaded into memory (in unicode format); new memory space may be allocated during execution, for example by x = "egon".

The memory uses unicode, but this does not mean that everything in memory is unicode. Before the program is executed, memory is indeed all unicode; for example, when the line x = "egon" is read from the file, the x, the equals sign, the quotation marks and the letters all have the same status: they are just ordinary characters, stored in memory in unicode format. During execution, however, additional memory is requested (a space separate from the memory holding the program code) to store the values of python data types. The python string type involves the concept of characters: for x = "egon", the python interpreter recognizes a string and requests memory space to store the value of that string, and the encoding used for the value of a string type depends on the interpreter; the string types of python2 and python3 differ.
4.2 differences between python2 and python3 string types

There are two string types in python2: str and unicode.

Str type

When the python interpreter executes code that creates a str (for example x = '上'), it requests a new memory address and then stores '上' encoded in the format specified at the beginning of the file.

To see how x is really stored in memory, put it into a list and print the list instead of printing x directly, because a direct print() automatically converts the encoding (more on that later).

# coding: gbk
x = '上'
y = '下'
print([x, y])  # ['\xc9\xcf', '\xcf\xc2']
# \x indicates hexadecimal; c9cf here is four hexadecimal digits, each hexadecimal digit is 4 bits,
# so four hexadecimal digits are 16 bits, i.e. 2 bytes. This proves that a Chinese character
# encoded with gbk occupies 2 bytes.
print(type(x), type(y))  # (<type 'str'>, <type 'str'>)

The key to understanding character encoding !!!

Data in memory is usually displayed in hexadecimal. Two hexadecimal digits represent one byte; for example, \xc9 is two hexadecimal digits, i.e. one byte.

gbk needs two bytes to store a Chinese character but only one byte to store an English letter. How is that achieved???!!!

gbk uses the first bit of each byte as a flag: if the flag is 1, the byte belongs to a Chinese character; if the flag is 0, the byte represents an English character.

x = '你a好' converted into gbk binary is 8bit + 8bit + 8bit + 8bit + 8bit = (1 + 7bit) + (1 + 7bit) + (0 + 7bit) + (1 + 7bit) + (1 + 7bit)

In this way, the computer reads from left to right:

# The flag bit of each of the first two bytes is 1, so together they form one Chinese character: 你
# The flag bit of the third byte is 0, so that byte alone represents an English character: a
# The flag bit of each of the last two bytes is 1, so together they form one Chinese character: 好

In other words, each byte reserves its leading bit as a flag, so only 7 bits are left to store the real value; the unicode table keeps only those 7 valid bits, while the leading flag bit is specific to the particular encoding. That is, in unicode the gbk string above corresponds to:

(7bit)+(7bit)+(7bit)+(7bit)+(7bit)
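This simplified flag-bit model can be sketched in Python 3 as follows (illustrative only; real gbk decoding is of course done by .decode('gbk')):

# scan the gbk bytes from left to right: a byte whose leading bit is 1 starts a
# 2-byte Chinese character, a byte whose leading bit is 0 is a 1-byte ascii character
data = '你a好'.encode('gbk')   # b'\xc4\xe3a\xba\xc3'
i = 0
while i < len(data):
    if data[i] & 0x80:                        # leading flag bit is 1
        print(data[i:i + 2].decode('gbk'))    # prints 你, then 好
        i += 2
    else:                                     # leading flag bit is 0
        print(chr(data[i]))                   # prints a
        i += 1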

 

Based on this translated result, we can look up the correspondence between Chinese characters and unicode here: https://pan.baidu.com/s/1dEV3RYp

 

We can see that the gbk encoding (the G0 column represents gbk) corresponding to 上 is 494F, which is exactly our result \xc9\xcf with the flag bit of each byte removed (C9 - 80 = 49, CF - 80 = 4F), and that the corresponding unicode encoding is 4E0A. We can therefore convert gbk --> decode --> unicode:

# coding: gbk
x = '上'.decode('gbk')
y = '下'.decode('gbk')
print([x, y])  # [u'\u4e0a', u'\u4e0b']

Unicode type

When the python interpreter executes code that creates a unicode string (for example s = u'上'), it requests a new memory address and stores '上' in unicode format in the new memory space. Therefore s can only be encoded, not decoded.

# coding: gbk
x = u'上'  # equivalent to x = '上'.decode('gbk')
y = u'下'  # equivalent to y = '下'.decode('gbk')
print([x, y])  # [u'\u4e0a', u'\u4e0b']
print(type(x), type(y))  # (<type 'unicode'>, <type 'unicode'>)

Print to Terminal

Note the following for print:

When the program executes, for example:

x = '上'  # under gbk, the string is stored as \xc9\xcf

print(x)  # this step prints, to the terminal, the contents of the new memory space that x points to (not the memory space where the code lives). Strictly speaking it should print exactly what is stored, but printing \xc9\xcf would immediately baffle programmers unfamiliar with encodings, so print(x) takes a liberty: it uses the terminal's encoding format to convert the \xc9\xcf in memory into characters for display. This means the terminal encoding must be gbk; otherwise the original content 上 cannot be displayed correctly.

For unicode data, no matter how it is printed, it will not be garbled.

Unicode is so good that it never produces garbled characters, so why is python2 so awkward? Because at the time python was born, unicode was not nearly as popular as it is today. The better approach was obviously adopted later: python3 stores str directly as unicode, so when we define a str no u prefix is needed; it is already unicode.

 

In python3, there are also two string types, str and bytes.

Str is unicode

# coding: gbk
x = '上'  # when the program runs, no u prefix is needed; '上' is still stored as unicode in the new memory space
print(type(x))  # <class 'str'>

# x can be encoded directly into any encoding format
print(x.encode('gbk'))        # b'\xc9\xcf'
print(type(x.encode('gbk')))  # <class 'bytes'>

It is very important to notice that the result of x.encode('gbk'), \xc9\xcf, is exactly the value that a str holds in python2; in python3 this result has type bytes, whereas in python2 it has type str.
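A short Python 3 sketch of this str/bytes relationship (illustrative):

x = '上'                    # str, stored as unicode in memory
b = x.encode('gbk')         # str -> bytes
print(b)                    # b'\xc9\xcf'
print(b.decode('gbk'))      # bytes -> str: 上
print(type(x), type(b))     # <class 'str'> <class 'bytes'>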

So here is a bold guess: the str type of python2 is the bytes type of python3. Checking the str() source code of python2 supports this: in python2, bytes is in fact just an alias for str.
