Python's character encoding

Source: Internet
Author: User

(a) What is character encoding:

All data in a computer, including files, pictures, video, and audio, is stored in binary, and the computer itself only understands machine language made of 0s and 1s. For the computer to act on the instructions we give it, we need a way to translate human language into something it can understand. That translation relies on a standard that maps each character to a particular number, and this standard is called a character encoding.
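The character-to-number mapping can be seen directly in Python 3. The following is a minimal sketch using the built-in ord() and chr() functions; the sample characters and variable names are chosen only for illustration.

```python
# Each character corresponds to a number (its code point),
# and that number is what ends up stored as binary.
for ch in ["A", "a", "0", "中"]:
    code_point = ord(ch)
    print(ch, code_point, bin(code_point))

# The mapping works in both directions.
code_point_of_a_upper = ord("A")
char_from_65 = chr(65)
```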

(b) The development of character encoding:

Stage one, ASCII: one byte represents one character (English letters and the other characters on the keyboard). 1 byte = 8 bits, and 8 bits can represent 2**8 = 256 different values (ASCII itself defines only the first 128).
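As a rough illustration of the one-byte limit, the sketch below (Python 3) encodes a short English string as ASCII and shows that a Chinese character cannot be encoded that way. The sample strings are arbitrary.

```python
# One byte has 8 bits, so it can hold 2**8 distinct values.
values_per_byte = 2 ** 8

# English text fits: one byte per character.
ascii_bytes = "Hi!".encode("ascii")
byte_values = list(ascii_bytes)

# A Chinese character has no ASCII representation.
try:
    "中".encode("ascii")
    ascii_rejected_chinese = False
except UnicodeEncodeError:
    ascii_rejected_chinese = True
```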

Stage two, contending standards: to support their own scripts, China formulated GBK, Japan developed Shift_JIS, Korea developed EUC-KR, and so on.

Stage three, one standard for the world: because each country had its own standard, conflicts were inevitable, and text that mixed several languages would inevitably come out garbled. Unicode was created as a universal code that covers the characters of every national encoding. In its original fixed-width form, Unicode uses 2 bytes for every character, which gives 2**16 = 65536 possible values. For text that is entirely English this wastes space, because 1 byte is enough for an English character. This led to UTF-8, a variable-length encoding in which an English character takes 1 byte and a Chinese character usually takes 3 bytes. A variable-length encoding has to work out where each character begins and ends, so it is slower to process than the simple fixed-width form. This is a trade-off between time and space.

To run a program, the computer must first load its data into memory. To avoid garbled text there, memory uses Unicode, while data on the hard disk, which may come from different countries, can be stored as UTF-8.
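The space trade-off can be checked with a small Python 3 sketch. It uses "utf-16-le" as a stand-in for the fixed 2-byte form described above; that choice, and the sample strings, are assumptions made for illustration.

```python
text_en = "hello"
text_zh = "你好"

sizes = {
    # UTF-8: 1 byte per English letter, 3 bytes per common Chinese character.
    "en_utf8": len(text_en.encode("utf-8")),
    "zh_utf8": len(text_zh.encode("utf-8")),
    # 2-byte form: every character here takes 2 bytes (no byte-order mark).
    "en_utf16": len(text_en.encode("utf-16-le")),
    "zh_utf16": len(text_zh.encode("utf-16-le")),
}
print(sizes)
```

For English text the 2-byte form doubles the size, while for Chinese text it is actually smaller than UTF-8.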

(c) Using character encoding:

Character encoding issues come up in the following two scenarios:

1, the contents of a Python file, which are just a sequence of characters (when the file is not executed)

2, string data inside a running Python program (when the file is executed)

When the file is not executed: saving and opening an ordinary file works as follows. The text editor first edits the content (in memory) and then saves it (to the hard disk). Reading is the reverse: the data is loaded from the hard disk into memory, and the text editor displays it to us. Before it is run, the code in a Python file is no different from any other document: it is just a sequence of characters. The process can be described with a simple diagram.

Unicode ----------> encode ----------> UTF-8

UTF-8 ----------> decode ----------> Unicode
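The save and load steps in the diagram can be sketched in Python 3 by doing the encode and decode by hand on a file opened in binary mode. The file name demo.py and the sample text are hypothetical.

```python
import os
import tempfile

# Text as it exists in memory (Unicode).
text = "print('你好')"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.py")  # hypothetical file name

    # Save: Unicode --encode--> UTF-8 bytes on disk.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Load: UTF-8 bytes on disk --decode--> Unicode in memory.
    with open(path, "rb") as f:
        raw = f.read()

restored = raw.decode("utf-8")
```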

When a Python file is executed, the Python interpreter starts first, then loads the Python file and decodes its contents. Note that the default source encoding is ASCII in Python 2 and UTF-8 in Python 3. The Windows terminal (on a Chinese-language system) uses GBK.

Garbled text: a file is first edited in memory, where the text is Unicode, and saving writes the data to the hard disk. If it is saved as Shift_JIS and the text contains characters that Shift_JIS cannot represent, the error has already happened at save time. The file will be garbled when opened, and this cannot be repaired, because the file itself is damaged. If the text is garbled only while reading, the file is still intact, and choosing the correct decoding fixes it.
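The two kinds of garbling can be told apart with a short Python 3 sketch. The sample text is an assumption; it was picked because its first character is a simplified Chinese character that Shift_JIS does not cover. errors="replace" stands in for an editor that saves anyway and substitutes a placeholder.

```python
text = "这是中文"

# Case 1: saved correctly, but read with the wrong decoder.
good_bytes = text.encode("utf-8")
wrong_view = good_bytes.decode("gbk", errors="replace")  # looks garbled
recovered = good_bytes.decode("utf-8")                   # bytes are intact

# Case 2: saved with an encoding that cannot represent the text.
try:
    text.encode("shift_jis")
    strict_save_failed = False
except UnicodeEncodeError:
    strict_save_failed = True

# Forcing the save replaces the unsupported character with "?".
damaged_bytes = text.encode("shift_jis", errors="replace")
damaged_text = damaged_bytes.decode("shift_jis")  # original character is gone
```

In case 1 the correct decoder brings the text back. In case 2 no decoder can, because the information was lost when the bytes were written.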

Summary:

To prevent garbled files, whatever editor you use, follow one core rule: open a file with the same encoding it was saved with. (A file that contains code is just an ordinary file here. This refers to garbling when we open the file, before it is executed.)

Data held as Unicode in memory does not become garbled. Problems appear only when it is encoded to bytes, or decoded from them, with the wrong encoding.
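In Python 3 the rule maps onto the encoding argument of the built-in open(). The sketch below writes a file as GBK, reads it back as GBK, and then shows that reading it as UTF-8 fails. The file name and text are hypothetical.

```python
import os
import tempfile

text = "你好, world"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "note.txt")  # hypothetical file name

    # Save as GBK.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Open with the same encoding: the text comes back unchanged.
    with open(path, "r", encoding="gbk") as f:
        same_encoding_result = f.read()

    # Open with a different encoding: these GBK bytes are not valid UTF-8.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        mismatch_raised = False
    except UnicodeDecodeError:
        mismatch_raised = True
```

Passing encoding explicitly avoids depending on the platform default, which differs between systems.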

(d) Python program execution:

Stage one: start the Python interpreter.

Stage two: load the Python file into memory.

Stage three: read and execute the code that has been loaded into memory (as Unicode). New memory is allocated during execution.

Note: Python reads the encoding declaration on the first or second line of the .py file to decide which encoding to use when decoding the file. If the file has no header such as # -*- coding: utf-8 -*-, the default is used: ASCII in Python 2, UTF-8 in Python 3.
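The header lookup can be observed in Python 3 with the standard library function tokenize.detect_encoding, which applies the same declaration rules to raw source bytes. The helper declared_encoding and the sample sources are hypothetical names introduced for this sketch.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode these source bytes."""
    encoding, _first_lines = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline
    )
    return encoding

with_header = b"# -*- coding: gbk -*-\nx = 1\n"
second_line_header = b"#!/usr/bin/env python\n# -*- coding: gbk -*-\nx = 1\n"
without_header = b"x = 1\n"

print(declared_encoding(with_header))
print(declared_encoding(second_line_header))
print(declared_encoding(without_header))
```

With no header the function reports the Python 3 default, utf-8.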

(e) The difference between Python 2 and Python 3:

Python 2 has two string types, str and unicode, and str is the same as bytes (str = bytes). A value prefixed with u is defined as unicode. In Python 2, an unprefixed string value is stored in memory as encoded bytes, whereas Python 3 stores string values directly as Unicode. Python 3 also has two types, str and bytes: str is equivalent to a u-prefixed string in Python 2, and bytes is what a str value becomes when it is encoded, for example s1 = s.encode('utf-8') or s2 = s.encode('gbk'). The default encoding is UTF-8 in Python 3 and ASCII in Python 2.
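The Python 3 side of this can be sketched as follows, reusing the names s, s1, and s2 from the text. The sample value of s is an assumption.

```python
s = "中文"                # str: Unicode text in memory

s1 = s.encode("utf-8")    # bytes: 3 bytes per character here
s2 = s.encode("gbk")      # bytes: 2 bytes per character here

print(type(s), len(s))
print(type(s1), len(s1))
print(type(s2), len(s2))

# Different bytes on the outside, the same text once decoded correctly.
round_trip_utf8 = s1.decode("utf-8")
round_trip_gbk = s2.decode("gbk")
```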

Summary:

1, whatever the file and however it is opened, memory always uses Unicode (which avoids garbled text), while the hard disk can use whichever encoding you prefer.

2, data is first created in memory in Unicode form. To store it or send it over a network, it must be converted to bytes.

Unicode ----------> encode ----------> bytes

bytes ----------> decode ----------> Unicode
