Python BASICS (8): character encoding and python Encoding

Last Update:2017-07-25 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Today, we are going to learn about character encoding. Character encoding is a knowledge point with many theories and few conclusions. I will summarize many knowledge points. We only need to read through as an understanding. At last, I will summarize the key points that need to be understood.

I. Basic Computer reserves for learning character encoding

1. Basic computer software operation diagram

2. Principle of file access in the text editor. (Nodepad ++, pycharm, word)

When you open the editor, a process is opened in the memory, so the content written in the editor is also stored in the memory, and data is lost after the power is cut off. Therefore, you need to save the data to the hard disk, click the Save button, and then fl the data from the memory to the hard disk.

At this point, we compile a py file (not executed), which is no different from writing other files. It is just a bunch of characters.

3.How the python interpreter runs the py file, for example, python test. py

Phase 1: Start the python interpreter, which is equivalent to starting a text editor and entering the memory

Stage 2: The python interpreter is equivalent to a text editor to open test. py file, test. the file content of py is read into the memory (small review: the interpretation of pyhon determines that the interpreter only cares about the file content and does not care about the file suffix name)

Stage 3: The python interpreter has just loaded it into the memory for test. py code (ps: at this stage, that is, during execution, the python syntax is recognized, the code in the file is executed, and the name = "egon ", will open up memory space to store the string "egon ")

In the first two phases, the python interpreter and the text editor are the same. The only difference is that in the third phase, the text editor reads content into the memory for display/editing, the python interpreter is used for execution.

2. What is character encoding?

A computer only recognizes numbers consisting of 0 and 1. While we usually use computers, we use characters that humans can understand (the result of programming in advanced languages is nothing more than writing a bunch of characters in a file ). To make the computer understand what we have written, we must go through a process:

Character ----------> translation process ----------> Number

This process is actually how a character corresponds to the standard of a specific number. This standard is called character encoding.

Character encoding is involved in the following two scenarios:

1. The content in a python file is composed of a bunch of characters (when the python file is not executed)

2. The data type string in python is composed of a string of characters (during python File Execution)

Iii. Development of character encoding

1. ASCII

Modern computers originated in the United States and were first born Based on ASCII

ASCII: One Bytes represents one character (English character/all other characters on the keyboard), 1 Bytes = 8 bit, 8 bit can represent 0-2 ** 8-1 changes, it can contain 256 characters

ASCII originally only uses the last seven digits and 127 digits. It can represent all the characters on the keyboard (English letters/all other characters on the keyboard) later, in order to encode the Latin into the ASCII table, the maximum bit is also occupied.

2. Flowers bloom

To meet Chinese requirements, GBK has been customized by the Chinese

GBK: 2Bytes represents a character

To meet the needs of other countries, various countries have customized their own codes.

Compile JapaneseShift_JISSouth Korea makes up KoreanEuc-kr......

3. Minutes must be combined and eventually unified

The standards of various countries will inevitably conflict with each other. As a result, garbled characters are displayed in multi-language texts. As a result, unicode is generated, which uses 2 bytes to represent a single character. 2 ** 16-1 = 65535 characters can represent more than 60 thousand characters, so it is compatible with the universal language.

However, for all the texts in English, this encoding method is undoubtedly twice the storage space. Therefore, a UTF-8 is generated, and only 1 Bytes is used for English characters, and 3 Bytes is used for Chinese characters.

4. Note:

Unicode: simple and crude. All characters are 2 Bytes. The advantage is that character-> Number conversion is fast, but the disadvantage is that it occupies a large space.

UTF-8: accurate. Different characters are expressed with different lengths. The advantage is space saving. The disadvantage is that character-> Number conversion is slow, because each time, the length of Bytes required to calculate the characters can be accurately expressed.

1) The encoding used in the memory is unicode and the space is used for time (the program must be loaded to the memory to run, so the memory should be as fast as possible)

2) UTF-8 is used for hard disks or network transmission. The network I/O latency or disk I/O latency is much higher than the UTF-8 conversion latency, and I/O should save the bandwidth as much as possible, ensure data transmission stability.

Iv. Practical Application of character encoding

Unicode -----> encode --------> UTF-8

UTF-8 --------> decode ----------> unicode

1. garbled characters

Operations for flushing files from memory to Hard Disk

File Read operations (read files) from hard disk to memory

1) gibberish 1: Garbled characters are found during file storage.

For example, when saving a file, because the file contains texts from various countries, we will save it with the Japanese Code shiftjis,

In essence, text in other countries fails to be stored because no corresponding relationship is found in shiftjis, 'You can't find the ing relation in shiftjis. 'Why? \ n' can find the ing. However, when we use the file editor to store the file, the editor will help us with the conversion to ensure that Chinese characters can also be stored with shiftjis (hard storage, which must be garbled), which leads, garbled characters have already occurred in the file storage phase. At this time, when we open the file with shiftjis, the Japanese can be properly displayed, while the Chinese text is garbled. Or:

F=open('a.txt ', 'wb') f. write ('He can see limit \ n '. encode ('shift _ jis ') f. write ('What are you worried about \ n '. encode ('gbk') f. write ('What are you worried about \ n '. encode ('utf-8') f. close ()

When you open file a.txt in any way, the remaining two problems cannot be displayed normally (open, write is the knowledge point of the next file processing is not everyone's concern, just look at the encoding part)

2) Garbled text 2: Garbled text when the file is read without Garbled text

The file is encoded in UTF-8 to ensure compatibility with all nations and no garbled characters. When the file is read, an incorrect decoding method, such as gbk, is used, garbled characters in the read phase can be solved. If you select the correct decoding method, It will be OK. If you save the file with garbled characters, it will be a type of data corruption.

Summary:

No matter what the editor is, the core rule to prevent file garbled characters is to open the file in any encoding mode when the file is saved in any encoding mode.

2. Note:

The test. py file is saved in gbk format and the content is: x = 'line'

Whether it is python2 test. py

Or python3 test. py

Errors are reported (because python2 defaults to ascii and python3 defaults to UTF-8)

Unless specified at the beginning of the file # coding: gbk

V. Summary

1. What encoding should be used for storage?
Ps: the memory uses unicode encoding,
The encoding we can control is stored on the hard disk or selected Encoding Based on Network Transmission.

2. Data is first generated in memory and is in unicode format. to transmit data, it must be converted to bytes format.
# Unicode -----> encode (UTF-8) ------> bytes
After obtaining bytes, you can put it in the file memory or transmit it over the network.
# Bytes ------> decode (gbk) -------> unicode

3. Strings in python3 are recognized as unicode
The string encode in python3 gets bytes.

4. Understand
The string in python2 is bytes.
In python2, u is added before the string, which is unicode.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Python BASICS (8): character encoding and python Encoding

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support