Python (character encoding)

Source: Internet
Author: User

One. Background knowledge for understanding character encoding

  1. How a text editor accesses a file (Notepad++, PyCharm, Word)

Opening an editor starts a process, and that process lives in memory. Whatever you type into the editor is therefore also held in memory, and it is lost if the power goes out.

To keep it, you click Save, which flushes the data from memory to the hard disk.

At this point, writing a .py file (without executing it) is no different from writing any other file: we are just writing a bunch of characters.

  2. How the Python interpreter executes a .py file, for example python test.py

Stage one: the Python interpreter starts, which is comparable to launching a text editor.

Stage two: the Python interpreter acts like a text editor: it opens test.py and reads its contents from the hard disk into memory.

Stage three: the Python interpreter interprets and executes the test.py code that was just loaded into memory.

  Summary:

    1. The Python interpreter interprets and executes the contents of the file, so it must be able to read .py files, just as a text editor can.
    2. Unlike a text editor, the Python interpreter can not only read the contents of the file but also execute them.
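The two roles described above can be sketched in Python 3. The file name test.py and its one-line contents are made up for illustration, and the compile/exec pair is only a rough stand-in for what the interpreter does internally, not its actual startup code.

```python
import os
import tempfile

# Hypothetical script contents, standing in for test.py.
source_text = 'x = "egon"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.py")

    # "Save": flush the characters from memory to disk as bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write(source_text)

    # Stage two: read the file back into memory, as a text editor would.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

# Stage three: unlike an editor, the interpreter also executes what it read.
namespace = {}
exec(compile(loaded, "test.py", "exec"), namespace)

print(loaded == source_text)  # the text round-trips unchanged
print(namespace["x"])         # the name x exists only because the code ran
```

Reading the file only yields characters; the variable x comes into existence in the final step, which is the part a text editor never performs.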
Two. What is character encoding

A computer needs power to work; electricity drives it. What the hardware actually distinguishes is high and low voltage levels (a high level is the binary digit 1, a low level is the binary digit 0), so a computer only understands numbers.

The purpose of programming is to make the computer do work, yet the result of programming is just a bunch of characters. In other words, what programming achieves is: a bunch of characters driving a computer.

So a translation step is required:

character --------(translation process)-------> number

This translation follows a standard that specifies which number each character corresponds to. That standard is called a character encoding.
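As a small illustration of the character-to-number mapping, Python 3's built-in ord() and chr() convert in both directions. The numbers shown here are Unicode code points; how such a number is laid out as bytes is a separate question that depends on the encoding chosen. The sample characters are arbitrary.

```python
# Map characters to numbers and back with the built-ins ord() and chr().
for ch in ("A", "a", "0", "中"):
    number = ord(ch)
    print(ch, number, format(number, "b"))  # character, decimal, binary

code_a = ord("A")               # 65
bits_a = format(code_a, "08b")  # '01000001'
back = chr(code_a)              # 'A'
```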

Three. The history of character encoding

Stage one: modern computers originated in the United States, so the earliest encoding, ASCII, was designed with English in mind.

ASCII: one byte represents one character (English letters and the other characters on the keyboard). 1 byte = 8 bits, and 8 bits can take the values 0 to 2**8-1, that is, 256 different values.

ASCII originally used only the low seven bits, giving 128 values (0 to 127), which was already enough for every character on the keyboard.

Later, the highest bit was also put to use so that Latin characters could be added to the table.
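The bit counts above can be checked with a few lines of Python 3. Latin-1 is used here as one example of an 8-bit Latin extension; that choice is an assumption, since the text does not name a specific table.

```python
# 7 bits give 128 values, 8 bits give 256.
seven_bit = 2 ** 7   # 128
eight_bit = 2 ** 8   # 256

# Every ASCII character fits in the low seven bits of one byte.
a_byte = "A".encode("ascii")
print(a_byte[0], format(a_byte[0], "08b"))  # 65 01000001

# Latin-1 (assumed here as the Latin extension) uses the highest bit as well.
e_byte = "é".encode("latin-1")
print(e_byte[0], format(e_byte[0], "08b"))  # 233 11101001

# Plain ASCII cannot represent that character.
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```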

Stage two: to support Chinese, China defined GBK.

GBK: 2 bytes represent one Chinese character.

Other countries likewise defined encodings of their own.

Japan encoded Japanese in Shift_JIS, and South Korea encoded Korean in EUC-KR.

Stage three: every country had its own standard, so conflicts were inevitable. The result was that text mixing several languages displayed as garbled characters.
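The conflict between national standards can be reproduced in Python 3 by writing bytes with one encoding and reading them with another. This is only a sketch: the sample text is arbitrary, and errors="replace" is used so that undecodable bytes show up as replacement characters instead of raising an exception.

```python
text = "中文"

# Under GBK each Chinese character becomes 2 bytes.
gbk_bytes = text.encode("gbk")
print(len(gbk_bytes))  # 4 bytes for 2 characters

# Reading the same bytes with another country's standard yields other characters.
garbled = {
    name: gbk_bytes.decode(name, errors="replace")
    for name in ("shift_jis", "euc_kr")
}
for name, result in garbled.items():
    print(name, repr(result))

# Decoding with the encoding that was used for writing restores the text.
restored = gbk_bytes.decode("gbk")
```

The bytes themselves never change; only the table used to interpret them does, which is why mixed-language text could not be displayed reliably.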

Unicode was the result. It originally used a uniform 2 bytes per character; 2**16 = 65536 values can represent more than 60,000 characters, enough to cover the languages in common use. (Unicode has since grown beyond 16 bits, so rare characters need more than 2 bytes.)

For text that is entirely English, however, this encoding doubles the storage space (binary data is ultimately stored on a medium as electricity or magnetism).

UTF-8 was created to address this: English characters take only 1 byte, and Chinese characters take 3 bytes.

One thing to emphasize:

Unicode: simple and crude. Every character takes 2 bytes. The advantage is fast character-to-number conversion; the disadvantage is that it takes up more space.

UTF-8: precise. Different characters use different lengths. The advantage is that it saves space; the disadvantage is slower character-to-number conversion, because each time the number of bytes needed to represent the character has to be worked out.

    1. Memory uses Unicode, trading space for time (a program must be loaded into memory to run, so memory access should be as fast as possible).
    2. Hard disks and network transmission use UTF-8. Network and disk I/O latency is far greater than the UTF-8 conversion cost, and I/O should save as much bandwidth as possible to keep data transmission stable.
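The space trade-off can be measured directly in Python 3. In this sketch UTF-16-LE stands in for the fixed 2-byte form described above; that is an assumption made for illustration, and it holds only for characters that fit in 16 bits, as both sample strings do.

```python
english = "hello world"
chinese = "你好世界"

def sizes(text):
    """Return the byte length of text in a 2-byte form and in UTF-8."""
    fixed = len(text.encode("utf-16-le"))  # 2 bytes per character for these samples
    variable = len(text.encode("utf-8"))   # 1 byte per ASCII char, 3 per Chinese char
    return fixed, variable

english_sizes = sizes(english)  # (22, 11)
chinese_sizes = sizes(chinese)  # (8, 12)
print(english_sizes, chinese_sizes)
```

For English text UTF-8 halves the size, while for purely Chinese text it is larger than the 2-byte form, so the saving depends on what the text contains.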

Four. Character encoding classification

The computer was invented by Americans, so the earliest character encoding was ASCII, which only defines the mapping between numbers and English letters, digits, and some special characters.

ASCII uses 1 byte (8 binary bits) to represent one character.

Unicode commonly uses 2 bytes (16 binary bits) to represent one character; uncommon characters need 4 bytes.

If all of our documents are in English, Unicode takes twice the space of ASCII, which is wasteful for storage and transmission.

In the spirit of economy, Unicode is converted into the variable-length encoding UTF-8. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its number (the original design allowed up to 6 bytes, but the current standard stops at 4): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very uncommon characters take 4 bytes. If the text you want to transmit contains a large proportion of English characters, UTF-8 saves space:

character    ASCII       Unicode              UTF-8
A            01000001    00000000 01000001    01000001
中           X           01001110 00101101    11100100 10111000 10101101

The table above also shows an added benefit of UTF-8: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can keep working with UTF-8 encoded text.
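The rows of the table above can be reproduced in Python 3. The helper name bits is made up for this sketch, and UTF-16-BE is used to obtain the 2-byte Unicode column, which is an assumption about what that column represents.

```python
def bits(data):
    """Format bytes as space-separated 8-bit binary groups."""
    return " ".join(format(b, "08b") for b in data)

for ch in ("A", "中"):
    try:
        ascii_bits = bits(ch.encode("ascii"))
    except UnicodeEncodeError:
        ascii_bits = "X"  # not representable in ASCII
    unicode_bits = bits(ch.encode("utf-16-be"))
    utf8_bits = bits(ch.encode("utf-8"))
    print(ch, ascii_bits, unicode_bits, utf8_bits, sep=" | ")

# ASCII text is byte-for-byte identical in UTF-8.
same = "Hello".encode("ascii") == "Hello".encode("utf-8")
```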

Five. Using character encoding

5.1 Text editors

5.1.2 Text editor Notepad++

Summary:

Whatever the editor, the goal is to prevent a file from appearing garbled (note that a file containing code is just an ordinary file; "garbled" here means garbled when we open the file, before anything is executed).

The core rule: open a file with the same encoding it was saved in.

The editor saves the file using the encoding shown in its lower-right corner, whereas the file is decoded using the encoding declared at the top of the document. When the two differ, garbled text is very likely.
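The core rule can be demonstrated in Python 3 by saving a file with one encoding and opening it twice. The file name demo.txt is arbitrary, and errors="replace" is passed so that the mismatched read produces visible garbage rather than an exception.

```python
import os
import tempfile

text = "林"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Save the file as UTF-8 (the encoding the editor writes with).
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Open with the same encoding: the text comes back intact.
    with open(path, encoding="utf-8") as f:
        same_encoding = f.read()

    # Open with a different encoding: the bytes are misread.
    with open(path, encoding="gbk", errors="replace") as f:
        other_encoding = f.read()

print(repr(same_encoding), repr(other_encoding))
```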

5.2 Execution of the program

python test.py (to stress it again: the first step in executing test.py is always to read the contents of the file into memory)

Stage one: start the Python interpreter.

Stage two: the Python interpreter now acts as a text editor. It opens test.py and reads its contents from the hard disk into memory.

At this point the interpreter reads the first line of test.py, # coding: utf-8, to decide which encoding to use when reading the file into memory. This line sets the encoding the interpreter uses for this source file.

The default can be viewed with sys.getdefaultencoding(). If the Python file does not specify the header # -*- coding: utf-8 -*-, the default is used.

The default is ASCII in Python 2 and UTF-8 in Python 3.
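In Python 3 the effect of the header line can be observed with tokenize.detect_encoding from the standard library, which applies the same coding-declaration rules to a source file's first lines. The two sample sources and the helper name declared_encoding are made up for this sketch.

```python
import io
import sys
import tokenize

print(sys.getdefaultencoding())  # 'utf-8' on Python 3

def declared_encoding(source_bytes):
    """Return the encoding Python would use to decode this source file."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_header = b"# -*- coding: gbk -*-\nx = 1\n"
without_header = b"x = 1\n"

gbk_result = declared_encoding(with_header)         # 'gbk'
default_result = declared_encoding(without_header)  # 'utf-8'
print(gbk_result, default_result)
```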

Stage three: the interpreter reads the code that has been loaded into memory (Unicode-encoded binary) and executes it. Execution may allocate new memory, for example for x = "egon".

Saying that memory uses Unicode does not mean that everything in memory is Unicode-encoded binary.

Before the program executes, what is in memory is indeed Unicode-encoded binary. For example, a line x = "egon" read from the file is held in memory as Unicode, and the x, the equals sign, and the quotes all have the same status: they are ordinary characters.

During execution, however, the program requests memory of its own (separate from the memory holding the program's source text), and that memory can hold data in any encoding. For example, x = "egon" is recognized by the Python interpreter as a string, so it requests memory to hold "egon" and makes x point to that address; the newly requested memory also holds Unicode-encoded egon. If the code is changed to x = "egon".encode('utf-8'), the newly requested memory holds the UTF-8 encoded bytes of egon.

In Python 3, for example, a string literal is Unicode text (str), and calling encode() on it produces bytes.
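A short Python 3 sketch of the two kinds of objects described above. The variable names y and z are introduced here only for illustration.

```python
x = "egon"                  # str: Unicode text in memory
y = "egon".encode("utf-8")  # bytes: the UTF-8 encoded form
z = "林".encode("utf-8")     # a Chinese character becomes 3 bytes

print(type(x), type(y))
print(y)                    # b'egon'
print(z)                    # b'\xe6\x9e\x97'

round_trip = y.decode("utf-8") == x
```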

When you browse the web, the server converts dynamically generated Unicode content to UTF-8 and then sends it to the browser.

If the server encodes with UTF-8, what the client receives in memory is also UTF-8 encoded binary.
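The exchange can be modelled in Python 3 without any real network. The functions server_send and client_receive are hypothetical names for the two ends, and the page content is made up.

```python
def server_send(content):
    """Hypothetical server step: encode Unicode text before transmission."""
    return content.encode("utf-8")

def client_receive(payload, encoding="utf-8"):
    """Hypothetical client step: decode the received bytes back to text."""
    return payload.decode(encoding)

page = "<p>你好, egon</p>"
payload = server_send(page)      # bytes travel over the wire
shown = client_receive(payload)  # the client must decode with the same encoding

print(type(payload).__name__, shown == page)
```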

5.3 The difference between Python 2 and Python 3

5.3.1 Python 2 has two string types: str and unicode

str type

When the Python interpreter executes code that produces a string (for example, s = '林'), it requests a new memory address and stores '林' there encoded with the encoding declared at the top of the file. That is already the result of an encode, so s can only be decoded.

    #!/usr/bin/env python
    # _*_ coding: gbk _*_

    x = '林'
    # print x.encode('gbk')  # raises an error
    print x.decode('gbk')    # result: 林

So the important point is:

In Python 2, str is the result of encoding, i.e. bytes (str = bytes). So in Python 2, encoding a unicode string produces a str/bytes.

    # coding: utf-8
    s = '林'  # at run time, '林' is stored in the new memory space as UTF-8, per the coding: utf-8 header

    print repr(s)  # '\xe6\x9e\x97': three bytes, confirming it really is UTF-8
    print type(s)  # <type 'str'>

    s.decode('utf-8')
    # s.encode('utf-8')  # raises an error: s is already an encoded result (bytes), so it can only be decoded

unicode type

When the Python interpreter executes code that produces a string (for example, s = u'林'), it requests a new memory address and stores '林' there in Unicode form, so s can only be encoded and cannot be decoded.

Printing to the terminal

A special note about print:

When a program executes, for example

    x = '林'
    print(x)  # prints the new memory space that x points to (not the memory holding the source code) to the terminal

the terminal itself is also running in memory, so this printing can be understood as going from memory to memory, that is, Unicode -> Unicode.

Data in Unicode form is not garbled when printed, as long as the terminal can display the characters.

Strings in Python 3 and u'string' in Python 2 are both Unicode, so they print correctly in both PyCharm and the Windows terminal.

Python 2 also has the non-Unicode str, however. With print x for such a string, the terminal interprets the bytes using its own encoding, in effect x.decode('terminal encoding'), and displays the resulting Unicode. When the terminal encoding differs from the encoding declared at the top of the file, garbled text results.

In PyCharm (terminal encoding is UTF-8 and the file is encoded as UTF-8): not garbled.

In the Windows terminal (terminal encoding is GBK and the file is encoded as UTF-8): garbled.
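The two cases above can be simulated in Python 3 without switching terminals. The function shown_by_terminal is a made-up stand-in for a terminal that interprets raw bytes with its own encoding; errors="replace" keeps the mismatch visible as garbage instead of raising an exception.

```python
import sys

print(sys.stdout.encoding)  # the encoding this terminal or console reports

file_bytes = "林".encode("utf-8")  # the bytes a UTF-8 source file stores

def shown_by_terminal(data, terminal_encoding):
    """Simulate a terminal interpreting raw bytes with its own encoding."""
    return data.decode(terminal_encoding, errors="replace")

utf8_terminal = shown_by_terminal(file_bytes, "utf-8")  # matches the file encoding
gbk_terminal = shown_by_terminal(file_bytes, "gbk")     # mismatch, garbled

print(repr(utf8_terminal), repr(gbk_terminal))
```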

Study Questions

Check the output of the following code in both PyCharm and cmd:

    # coding: utf-8
    s = u'林'  # at run time, '林' is stored in the new memory space in Unicode form

    # s refers to Unicode, so it can be encoded into any encoding that supports the character without an encode error
    s1 = s.encode('utf-8')
    s2 = s.encode('gbk')
    print s1  # does this print correctly?
    print s2  # does this print correctly?

    print repr(s)   # u'\u6797'
    print repr(s1)  # '\xe6\x9e\x97': UTF-8 encodes one Chinese character in 3 bytes
    print repr(s2)  # '\xc1\xd6': GBK encodes one Chinese character in 2 bytes

    print type(s)   # <type 'unicode'>
    print type(s1)  # <type 'str'>
    print type(s2)  # <type 'str'>

5.3.2 Python 3 also has two string types: str and bytes

str is Unicode

    # coding: utf-8
    s = '林'  # at run time, no u prefix is needed; '林' is stored in the new memory space in Unicode form

    # s can be encoded directly into any encoding that supports the character
    s.encode('utf-8')
    s.encode('gbk')

    print(type(s))  # <class 'str'>

bytes is bytes

    # coding: utf-8
    s = '林'  # at run time, no u prefix is needed; '林' is stored in the new memory space in Unicode form

    # s can be encoded directly into any encoding that supports the character
    s1 = s.encode('utf-8')
    s2 = s.encode('gbk')

    print(s)   # 林
    print(s1)  # b'\xe6\x9e\x97': in Python 3, bytes are printed exactly as they are
    print(s2)  # b'\xc1\xd6': same as above

    print(type(s))   # <class 'str'>
    print(type(s1))  # <class 'bytes'>
    print(type(s2))  # <class 'bytes'>
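The strict separation of the two Python 3 types can be checked directly. This sketch only inspects which methods each type offers; the variable names are arbitrary.

```python
s = "林"
b = s.encode("utf-8")

# str can only be encoded; bytes can only be decoded.
print(hasattr(s, "encode"), hasattr(s, "decode"))  # True False
print(hasattr(b, "decode"), hasattr(b, "encode"))  # True False

# Mixing the two types is an error rather than an implicit conversion.
try:
    s + b
    mixed_error = None
except TypeError as exc:
    mixed_error = type(exc).__name__

print(mixed_error)  # TypeError
```

This is the main practical difference from Python 2, where a str could be both encoded and decoded through implicit conversions.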
