Python character encoding: the underlying principles

Source: Internet
Author: User

1. Both memory and the hard disk are used for storage.

Memory: fast access, but its contents are lost when the power is cut

Hard disk: slower, but data is saved permanently

2. How a text editor accesses a file (Notepad++, PyCharm, Word)

Opening an editor starts a process, and that process lives in memory. Whatever you type in the editor is therefore also held in memory and is lost if the power fails. To keep it you must save it to the hard disk: clicking Save or pressing the shortcut key writes the in-memory data to disk. At this point a .py file that has not been executed is no different from any other file: it is just a sequence of characters.

3. How the Python interpreter executes a .py file, for example python test.py

Stage one: the Python interpreter starts, which is comparable to launching a text editor.

Stage two: the Python interpreter acts like a text editor: it opens test.py and reads its contents from the hard disk into memory.

Stage three: the Python interpreter executes the test.py code that was just loaded into memory. Only at this stage is Python syntax recognized; when execution reaches a string, a new memory space is allocated to hold it.

Summary: similarities and differences between the Python interpreter and a text editor

Similarity: the Python interpreter interprets and executes the contents of a file, so it must be able to read a .py file, just as a text editor does.

Difference: a text editor reads the file into memory in order to display and edit it, whereas the Python interpreter reads the file into memory in order to execute it (recognizing Python syntax).
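The contrast can be sketched in a few lines of Python 3. This is only an illustration: the file contents and the names `source_text`, `loaded_text` and `namespace` are made up for the example, and `exec` stands in for what the real interpreter does internally.

```python
import os
import tempfile

# Hypothetical file contents used only for this illustration.
source_text = "x = 1 + 2\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write(source_text)

    # What a text editor does: read the characters into memory to display them.
    with open(path, "r", encoding="utf-8") as f:
        loaded_text = f.read()

# What the interpreter does in addition: treat those characters as Python code.
namespace = {}
exec(compile(loaded_text, "test.py", "exec"), namespace)

print(repr(loaded_text))  # the file is just characters
print(namespace["x"])     # the value produced by executing those characters
```

Until `exec` runs, `loaded_text` is an ordinary string; the name `x` only exists after the text has been executed as code.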

4. What is encoding?

A computer needs electricity to work, and its circuits distinguish a high level (binary 1) from a low level (binary 0), which means a computer only understands numbers. So how does a computer read human characters?

It has to go through a translation step:

 character ---------(translation)---------> number

The standard that defines which number each character corresponds to is called a character encoding.
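As a small illustration in Python 3, the built-ins `ord` and `chr` expose the character-to-number mapping directly. The variable names are chosen only for this example.

```python
# Each character maps to a number (its Unicode code point) and back again.
for ch in ("A", "a", "林"):
    code_point = ord(ch)
    assert chr(code_point) == ch
    print(ch, code_point, hex(code_point))

a_number = ord("A")      # 65
lin_number = ord("林")   # 0x6797
```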

5. Two scenarios involve character encoding:

1. The contents of a Python file are a sequence of characters (before the file is executed)

2. The string data type in Python is a sequence of characters (while the file is executing)

6. History of character encoding

Stage one: modern computers originated in the United States, so the earliest encoding, ASCII, was designed around English.

ASCII: one byte represents one character (English letters and the other characters on the keyboard). 1 byte = 8 bits, and 8 bits can take 2**8 = 256 values (0 to 2**8-1).

ASCII originally used only the low seven bits, giving 128 values (0 to 127), which was enough for every character on the keyboard.

Later, to fit Latin characters into the table, the highest bit was used as well (the 8-bit extensions such as Latin-1).
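A short Python 3 sketch of the seven-bit limit and the eighth-bit extension. The sample strings are arbitrary, and Latin-1 is used here as one representative 8-bit extension.

```python
# ASCII defines code points 0..127 (7 bits), stored one character per byte.
ascii_bytes = "Python".encode("ascii")
all_seven_bit = all(b < 128 for b in ascii_bytes)

# Latin-1 uses the eighth bit to add characters such as "é" (code 233).
latin1_bytes = "é".encode("latin-1")

# ASCII itself has no number for "é", so encoding fails.
try:
    "é".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(len(ascii_bytes), all_seven_bit, latin1_bytes, ascii_can_encode)
```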

Stage two: to support Chinese, China defined GBK.

GBK: a Chinese character takes 2 bytes. Other countries likewise defined their own encodings: Japan put Japanese into Shift_JIS, and South Korea put Korean into EUC-KR.

Stage three: with every country using its own standard, conflicts were inevitable, and text that mixed several languages displayed as garbled characters.

Unicode was created in response. It originally used a uniform 2 bytes per character, giving 2**16 = 65536 possible values, enough for more than 60,000 characters and thus for the world's major languages. (Unicode has since grown beyond 16 bits, but the 2-byte picture is kept here for simplicity.)

For text that is entirely English, however, this encoding doubles the storage space (the binary is ultimately stored on the medium as electrical or magnetic states).

This led to UTF-8, which uses 1 byte for English characters and 3 bytes for most Chinese characters.
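The byte counts can be checked in Python 3. In this sketch the `utf-16-le` codec stands in for the fixed 2-byte form described above; that choice is an assumption made for the illustration.

```python
samples = ["a", "林"]

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    # "-le" fixes the byte order so no 2-byte BOM prefix is added.
    utf16_len = len(ch.encode("utf-16-le"))
    print(ch, "utf-8:", utf8_len, "bytes;", "2-byte form:", utf16_len, "bytes")

english_utf8 = len("a".encode("utf-8"))        # 1
chinese_utf8 = len("林".encode("utf-8"))       # 3
english_utf16 = len("a".encode("utf-16-le"))   # 2
```

For all-English text the 2-byte form is twice the size of UTF-8, which is the storage cost mentioned above.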

It should be emphasized that:

Unicode (fixed width): simple and crude, every character takes 2 bytes. The advantage is fast conversion between characters and numbers; the disadvantage is that it takes more space.

UTF-8: compact and variable length. The advantage is that it saves space; the disadvantage is slower conversion, because each conversion has to work out how many bytes are needed to represent the character.

1. The encoding used in memory is Unicode, trading space for time (a program must be loaded into memory to run, so memory access should be as fast as possible).

2. UTF-8 is used on the hard disk and for network transmission, where smaller data is transferred more reliably and efficiently.

All programs are eventually loaded into memory. Programs are saved to disk in different encodings in different countries, but in memory, in order to be compatible with all of them (this is why a computer can run a program from any country), Unicode is used uniformly.

You might object that UTF-8 is also compatible with every language, and that is true: it would work. The reason for choosing Unicode is efficiency: fixed-width Unicode uses 2 bytes per character, while UTF-8 requires a calculation. Unicode does waste space, which is the price of trading space for time.

When data is stored to disk or sent over a network, Unicode is converted to UTF-8, because transmission favors stability and efficiency, and the smaller the data, the more reliable the transfer.

7. Character encoding conversion

Unicode ------> encode --------> UTF-8

UTF-8 --------> decode --------> Unicode
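In Python 3 the two directions correspond to `str.encode` and `bytes.decode`. A minimal round trip, with the sample character chosen only for illustration:

```python
text = "林"                         # str: Unicode text in memory

utf8_bytes = text.encode("utf-8")   # Unicode -> UTF-8 bytes
gbk_bytes = text.encode("gbk")      # Unicode -> GBK bytes

# Decoding with the matching codec restores the original text.
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text

print(utf8_bytes)   # b'\xe6\x9e\x97'
print(len(gbk_bytes), "bytes in GBK")
```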

Flushing data from memory to the hard disk is called saving (writing) the file.

Reading data from the hard disk into memory is called reading the file.

Garbled text has two causes: either the file was already garbled when it was saved, or the stored file is fine and it is garbled when read.

Summary:

Whatever the editor, the way to avoid garbled files is the same. (Note that a file containing code is just an ordinary file; this refers to garbling when we open the file, before it is executed.)

The core rule: open a file with the same encoding it was saved in.
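The core rule can be demonstrated with a temporary file in Python 3. The file name `demo.txt` and the variable names are hypothetical. With this particular character the mismatched read raises an error; with other data it could instead silently produce garbled text.

```python
import os
import tempfile

text = "林"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # Save the file as GBK.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Opening with the same encoding gives back the original text.
    with open(path, "r", encoding="gbk") as f:
        read_ok = f.read()

    # Opening with a different encoding fails for these bytes.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        wrong_encoding_failed = False
    except UnicodeDecodeError:
        wrong_encoding_failed = True

print(read_ok == text, wrong_encoding_failed)
```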

8. The Python interpreter as a text editor

The file test.py is saved in GBK format with the following contents:

x = '林'

Whether you run

python2 test.py

or

python3 test.py

an error is raised (because Python 2 defaults to ASCII and Python 3 defaults to UTF-8),

unless you specify #coding:gbk at the beginning of the file.
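The effect of the declaration can be sketched in Python 3 without creating a file, because `compile()` accepts source as bytes and applies the encoding declaration when decoding it. The in-memory `source_bytes` below is an assumption that stands in for a test.py saved as GBK.

```python
# The same program, held as GBK bytes; the first line is the encoding
# declaration that the interpreter looks for.
source_bytes = "# coding: gbk\nx = '林'\n".encode("gbk")

# The GBK bytes for '林' are not valid UTF-8, so the Python 3 default
# encoding cannot decode this source.
try:
    source_bytes.decode("utf-8")
    utf8_works = True
except UnicodeDecodeError:
    utf8_works = False

# compile() honours the declaration and decodes the bytes as GBK.
namespace = {}
exec(compile(source_bytes, "test.py", "exec"), namespace)

print(utf8_works, namespace["x"])
```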

9. Program execution

python3 test.py or python2 test.py (the first step in executing test.py is always to read the file's contents into memory)

Stage one: start the Python interpreter.

Stage two: the Python interpreter now acts as a text editor and opens test.py, reading its contents from the hard disk into memory.

At this point the interpreter looks at the first line of test.py, for example #coding:utf-8, to decide which encoding to use when reading the file into memory. This line tells the interpreter which encoding to decode the source with; without it, Python 2 defaults to ASCII and Python 3 to UTF-8.

Stage three: the interpreter reads the code that has been loaded into memory (as Unicode) and executes it. Execution may allocate new memory, for example for x = "Egon".

Saying that memory uses Unicode does not mean that everything in memory is Unicode-encoded binary.

Before the program executes, what is in memory is indeed Unicode: if the line x = "Egon" is read from the file, then the x, the equals sign and the quotation marks all have the same status. They are ordinary characters, stored in memory as Unicode.

During execution, however, the program requests memory of its own (separate from the memory holding the program's source code), and that memory can hold data in any encoding. For example, x = "Egon" is recognized by the interpreter as a string: it requests memory to hold "Egon" and then makes x point to that address. The newly allocated memory also holds Egon as Unicode. If the code is changed to x = "Egon".encode('utf-8'), the newly allocated memory instead holds Egon encoded as UTF-8.
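In Python 3 the two cases are visible as two different object types. A minimal sketch, with `x` and `y` as illustrative names:

```python
x = "Egon"                   # str object: Unicode text
y = "Egon".encode("utf-8")   # bytes object: UTF-8 encoded data

print(type(x), type(y))
print(y)                     # b'Egon': ASCII letters are one byte each in UTF-8
print(y.decode("utf-8") == x)
```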

10. Differences between Python 2 and Python 3

Python 2 has two string types: str and unicode.

In Python 2, str is the encoded result, i.e. bytes, so in Python 2 the result of encoding a unicode string is str/bytes.

    #coding:utf-8
    # Python 2
    s = '林'  # when executed, '林' is stored in the new memory space as UTF-8, following the coding line

    print repr(s)  # '\xe6\x9e\x97': three bytes, which shows it is indeed UTF-8
    print type(s)  # <type 'str'>

    s.decode('utf-8')
    # s.encode('utf-8')  # error: s is already the encoded result (bytes), so it can only be decoded

When the Python interpreter executes code that creates a unicode string (for example, s = u'林'), it requests a new memory address and stores '林' there in Unicode form, so s can only be encoded and cannot be decoded.

    # Python 2
    s = u'林'
    print repr(s)  # u'\u6797'
    print type(s)  # <type 'unicode'>

    # s.decode('utf-8')  # error: s is unicode, so it can only be encoded
    s.encode('utf-8')

Data in Unicode form is never garbled when printed.

A string in Python 3 and a u'string' in Python 2 are both Unicode, so printing them is never garbled.

Python 3 also has two string types: str and bytes.

str is Unicode.

    #coding:utf-8
    # Python 3
    s = '林'  # no u prefix is needed; '林' is stored in the new memory space as Unicode

    # s can be encoded directly into any encoding
    s.encode('utf-8')
    s.encode('gbk')

    print(type(s))  # <class 'str'>
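The one-way nature of each type in Python 3 can be checked directly: str offers only encode, and bytes offers only decode. A brief sketch, with variable names chosen for the example:

```python
s = "林"                   # str: Unicode text
b = s.encode("utf-8")      # bytes: the encoded result

str_has_decode = hasattr(s, "decode")     # str cannot be decoded
bytes_has_encode = hasattr(b, "encode")   # bytes cannot be encoded

print(type(s), type(b))
print(str_has_decode, bytes_has_encode)
print(b.decode("utf-8") == s)
```

This mirrors the Python 2 rule above, where unicode can only be encoded and str can only be decoded, except that Python 3 enforces it by leaving the opposite method off each type.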
