One. Background knowledge for understanding character encoding
1. How a text editor accesses a file (Notepad++, PyCharm, Word)
Opening an editor launches a process, which lives in memory, so content written in the editor is also stored in memory and is lost after a power outage.
That is why you need to save to the hard disk: clicking the Save button flushes the data from memory to the hard disk.
At this point, writing a py file (without executing it) is no different from writing any other file; we are just writing a bunch of characters.
2. How the Python interpreter executes a py file, e.g. python test.py
Stage one: the Python interpreter starts, which is equivalent to launching a text editor
Stage two: the Python interpreter, acting like a text editor, opens the test.py file and reads its contents from the hard disk into memory
Stage three: the Python interpreter interprets and executes the test.py code that was just loaded into memory
Summary:
- The Python interpreter interprets and executes a file's contents, so like a text editor it must be able to read the py file
- Unlike a text editor, the Python interpreter can not only read the contents of the file, but also execute them
Two. What is character encoding
A computer must be powered on to work; that is, electricity drives the computer, and 'electricity' means high and low voltage levels (a high level is the binary digit 1, a low level is the binary digit 0). In other words, the computer only understands numbers.
The purpose of programming is to make the computer work, yet the product of programming is simply a bunch of characters. In other words, what we want programming to achieve is: a bunch of characters driving a computer to work.
So there must be a translation process:
character --------(translation)--------> number
The standard that defines which number each character corresponds to is what we call a character encoding.
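As a quick illustration (a minimal Python 3 sketch, not part of the original text), the built-ins ord and chr expose exactly this character-to-number correspondence:

print(ord('A'))   # 65     the number the ASCII/Unicode standard assigns to 'A'
print(chr(65))    # 'A'    and the reverse direction, number to character
print(ord('中'))  # 20013  the Unicode code point of the Chinese character 中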
Three. The history of character encoding
Stage one: modern computers originated in the United States, so the earliest encoding to be born, ASCII, was designed around English
ASCII: one byte represents one character (English letters and all the other characters on the keyboard); 1 byte = 8 bits, and 8 bits can represent 2**8 = 256 variations, i.e. 256 characters
ASCII originally used only the low seven bits, 2**7 = 128 values, which was already enough to represent all the characters on the keyboard (English letters plus all the other keyboard characters)
Later, in order to fit Latin characters into the ASCII table as well, the highest bit was used too
Stage two: to support Chinese, China defined its own GBK
GBK: 2 bytes represent one character
To support other countries' languages, each country went on to define its own encoding: Japan put Japanese into Shift_JIS, and South Korea put Korean into EUC-KR
Stage three: with every country having its own national standard, conflicts were inevitable; the result is that text mixing multiple languages displays as mojibake (garbled characters)
Hence Unicode was born: a unified 2 bytes per character, giving 2**16 = 65536 values, enough to represent more than 60,000 characters and therefore compatible with every language
But for text that is entirely English, this encoding doubles the storage space (binary data is ultimately stored in the storage medium in electrical or magnetic form)
Thus UTF-8 was produced: an English character takes only 1 byte, a Chinese character 3 bytes
One thing to emphasize:
Unicode: simple and crude, every character is 2 bytes; the advantage is fast character-to-number conversion, the disadvantage is wasted space
UTF-8: precise, using different lengths for different characters; the advantage is saved space, the disadvantage is slower character-to-number conversion, because every time it must be calculated how many bytes a character needs in order to be represented accurately
- The encoding used in memory is Unicode, trading space for time (a program has to be loaded into memory to run, so memory access should be as fast as possible)
- For hard-disk storage and network transmission, UTF-8 is used: network I/O or disk I/O latency is far larger than the UTF-8 conversion delay, and I/O should save as much bandwidth as possible to keep data transmission stable (see the sketch below)
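To make the space trade-off concrete, here is a small Python 3 sketch (the sample strings are made up; UTF-16 stands in for the fixed 2-bytes-per-character scheme described above):

english = 'hello' * 100
chinese = '你好' * 100
print(len(english.encode('utf-8')))   # 500  bytes: 1 byte per English letter
print(len(english.encode('utf-16')))  # 1002 bytes: 2 bytes per character, plus a 2-byte BOM
print(len(chinese.encode('utf-8')))   # 600  bytes: 3 bytes per Chinese character
print(len(chinese.encode('utf-16')))  # 402  bytes: 2 bytes per character, plus the BOM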
Four. Character encoding classification (easy to understand)
The computer was invented by Americans, so the earliest character encoding was ASCII, which only defined correspondences between numbers and English letters, digits, and some special characters. It uses at most 8 bits (one byte), i.e. 2**8 = 256, so ASCII can represent at most 256 symbols.
Of course, our programming languages have no problem with English; ASCII is enough for them. But when processing data, different countries use different languages: the Japanese add Japanese to their programs, the Chinese add Chinese.
And to represent Chinese, a single byte is simply not enough (even an elementary-school student knows more than 2,000 Chinese characters). There is only one solution: use more than 8 binary bits per character; the more bits, the more variations they can represent, and the more Chinese characters can be expressed.
So the Chinese defined their own standard, the GB2312 encoding, which specifies correspondences for characters including Chinese.
The Japanese defined their own Shift_JIS encoding.
The Koreans defined their own EUC-KR encoding (on top of that, the Koreans claim the computer was invented by them and demand that the world unify on the Korean encoding).
At this point the problem appears: when Xiao Zhou, fluent in 18 languages, modestly writes a document in just 8 of them, which country's standard should that document follow? Whichever is chosen, garbling appears (because each standard at this point covers only its own country's text, i.e. the correspondences between its own characters and numbers; if one country's encoding format is used, the remaining languages will come out garbled when the text is parsed).
So a world standard that could contain all the world's languages was urgently needed, and thus Unicode was born (the Koreans said no, but nobody cared).
ASCII uses 1 byte (8 binary bits) to represent one character
Unicode commonly uses 2 bytes (16 binary bits) to represent one character; rare characters need 4 bytes
Examples:
The letter x in ASCII is decimal 120, binary 0111 1000.
The Chinese character 中 is beyond the ASCII encoding range; its Unicode encoding is decimal 20013, binary 01001110 00101101.
The letter x in Unicode is binary 0000 0000 0111 1000, so Unicode is compatible with ASCII as well as with all nations' encodings; it is the world standard.
At this point the garbling problem disappears from all our documents, but a new problem appears: if a document is entirely in English, Unicode takes twice the space of ASCII, which is very wasteful for storage and transmission.
In the spirit of saving space, the 'variable-length' UTF-8 encoding appeared, converting from Unicode. UTF-8 encodes a Unicode character into 1 to 6 bytes depending on its numeric size: common English letters are encoded in 1 byte, Chinese characters usually in 3 bytes, and only very rare characters are encoded into 4 to 6 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 encoding saves space:
character | ASCII    | Unicode           | UTF-8
A         | 01000001 | 00000000 01000001 | 01000001
中        | (none)   | 01001110 00101101 | 11100100 10111000 10101101
From the table above you can also see an added benefit of UTF-8: ASCII encoding can in fact be regarded as a subset of UTF-8 encoding, so the large amount of legacy software that only supports ASCII encoding can continue to work under UTF-8 encoding.
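The table can be reproduced with a short Python 3 sketch (assuming a UTF-8-capable terminal):

print('{:08b}'.format(ord('A')))    # 01000001          the ASCII/Unicode number of 'A'
print(' '.join('{:08b}'.format(b) for b in 'A'.encode('utf-8')))   # 01000001  the very same byte: ASCII is a subset of UTF-8
print('{:016b}'.format(ord('中')))  # 0100111000101101  the Unicode code point of 中
print(' '.join('{:08b}'.format(b) for b in '中'.encode('utf-8')))  # 11100100 10111000 10101101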
Five. Character encoding in practice
5.1 Text editors (one story covers them all)
5.1.2 Text editor Notepad++
Let's analyze the process: what exactly is garbled text (mojibake)?
Flushing a file from memory to the hard disk is called, for short, saving the file.
Reading a file from the hard disk into memory is called, for short, reading the file.
Garbling case one: the file is garbled at save time
When saving the file, the document contains text from several countries, yet suppose we save it as Shift_JIS.
Essentially this can be tested by writing with the open function (the sample string is illustrative):

f = open('a.txt', 'w', encoding='shift_jis')
f.write('你瞅啥て\n')  # '你瞅啥' has no correspondence in Shift_JIS, so saving it fails; only 'て\n' can be saved successfully

Because Shift_JIS lacks the correspondences, storing the other country's text fails outright. But when we save with a text editor, the editor does the conversion for us, making sure the Chinese also gets stored under Shift_JIS (stored by brute force, it is bound to come out garbled). This means the file is already garbled at the save stage.
In this case, when we open the file as Shift_JIS, the Japanese displays normally while the Chinese is garbled.
Or, when saving the file like this (the sample strings are illustrative):

f = open('a.txt', 'wb')
f.write('何を見て\n'.encode('shift_jis'))  # Japanese, written as Shift_JIS bytes
f.write('你愁啥\n'.encode('gbk'))          # Chinese, written as GBK bytes
f.write('你愁什么\n'.encode('utf-8'))      # Chinese, written as UTF-8 bytes
f.close()

Whichever single encoding you open a.txt with, the lines in the remaining two encodings will not display correctly.
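To see why, here is a minimal Python 3 sketch that reads the mixed-encoding file from the previous example back as one codec (errors='replace' substitutes a marker for undecodable bytes):

with open('a.txt', 'rb') as f:
    data = f.read()
# only the UTF-8 line survives intact; the Shift_JIS and GBK lines come out mangled
print(data.decode('utf-8', errors='replace'))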
Garbling case two: the file is saved without garbling but comes out garbled when read
If the file is saved with UTF-8 encoding, it is compatible with all languages and nothing is garbled. But if the wrong decoding method is used when reading the file, e.g. GBK, then the garbling happens at the read stage. Garbling at the read stage can be fixed: just choose the correct decoding method. A file garbled at the save stage, however, is a form of data corruption.
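A minimal Python 3 sketch of this case (b.txt and the sample string are assumptions for illustration):

# save with utf-8: nothing is garbled on disk
with open('b.txt', 'w', encoding='utf-8') as f:
    f.write('你好\n')

# wrong decoding method: the UTF-8 bytes are reinterpreted as GBK -> mojibake
print(open('b.txt', 'r', encoding='gbk').read())    # something like 浣犲ソ

# correct decoding method: no garbling, because the data was never corrupted
print(open('b.txt', 'r', encoding='utf-8').read())  # 你好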
5.1.3 Text editor PyCharm
Save in GBK format
Open in UTF-8 format
Analyze the process.
Summary:
No matter which editor you use, to prevent garbled files (note: a file storing a piece of code is just an ordinary file; here we mean garbling when we open the file, before it is ever executed),
the core rule is: open the file with whatever encoding it was saved in.
5.2 Executing a program
python test.py (to emphasize again: the first step of executing test.py must be to read the file's contents into memory)
Stage one: the Python interpreter starts
Stage two: the Python interpreter, acting as a text editor, opens test.py and reads its contents from the hard disk into memory
At this point, the Python interpreter reads the first line of test.py, #coding: utf-8, to decide which encoding format to use when reading the file into memory; this line sets the encoding that the Python interpreter (itself just a piece of software) uses for this source code.
It can be inspected with sys.getdefaultencoding(); if you do not put the header #-*-coding:utf-8-*- in the Python file, the default is used:
Python 2 defaults to ASCII, Python 3 to UTF-8.
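This can be checked directly (output shown for a stock interpreter):

import sys
print(sys.getdefaultencoding())  # 'utf-8' on Python 3, 'ascii' on Python 2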
Stage three: the code that was just loaded into memory (as Unicode-encoded binary) is read and executed; during execution, new memory space may be allocated, e.g. for x = 'egon'
Saying that memory uses the Unicode encoding does not mean all of memory holds Unicode-encoded binary:
Before the program executes, memory does indeed hold Unicode-encoded binary. For example, when the line x = 'egon' is read from the file, the x, the equals sign, and the quotes all have the same status: they are all ordinary characters, stored in memory in Unicode-encoded binary form.
However, during execution the program allocates additional memory (separate from the memory holding the program code), which can store data in any encoding format. For example, x = 'egon' is recognized by the Python interpreter as a string; memory space is requested to hold 'egon', and x is made to point to that memory address. The newly allocated memory holds the Unicode-encoded 'egon'. If the code were instead x = 'egon'.encode('utf-8'), the newly allocated memory space would hold the UTF-8-encoded bytes of 'egon'.
In Python 3, for example:
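A minimal sketch of the two newly allocated memory spaces just described (variable names are illustrative):

x = 'egon'                  # a str: its space holds the Unicode-encoded 'egon'
y = 'egon'.encode('utf-8')  # a bytes: its space holds the UTF-8-encoded 'egon'
print(type(x), x)           # <class 'str'> egon
print(type(y), y)           # <class 'bytes'> b'egon'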
When you browse the web, the server converts dynamically generated Unicode content to UTF-8 and then transmits it to the browser.
If the server side encodes with UTF-8, what the client receives in memory is likewise UTF-8-encoded binary.
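A heavily simplified sketch of that hand-off (no real web framework; the page string is a made-up example):

page = '<h1>你好, egon</h1>'  # dynamically generated content: a str, Unicode in memory
body = page.encode('utf-8')   # the bytes that actually travel over the network
# the browser receives these UTF-8 bytes and decodes them (per the declared charset) for display
print(body)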
5.3 The differences between Python 2 and Python 3
5.3.1 Python 2 has two string types: str and unicode
The str type
When the Python interpreter executes code that produces a string (e.g. s = '林'), it requests a new memory address and then stores '林' encoded in the format specified at the top of the file. This is already the result of an encode, so s can only be decoded:

# -*- coding: gbk -*-
#!/usr/bin/env python

x = '林'
# print x.encode('gbk')  # raises an error
print x.decode('gbk')    # result: 林
So the important point is:
In Python 2, str is the encoded result, i.e. bytes; str = bytes. So in Python 2, the result of encoding Unicode characters is of type str/bytes:

#coding: utf-8
s = '林'  # at execution time, '林' is saved to the new memory space in its coding: utf-8 form

print repr(s)  # '\xe6\x9e\x97'  three bytes, proving it really is utf-8
print type(s)  # <type 'str'>

s.decode('utf-8')
# s.encode('utf-8')  # raises an error: s is the encoded result (bytes), so it can only be decoded
The unicode type
When the Python interpreter executes code that produces a string with a u prefix (e.g. s = u'林'), it requests a new memory address and then stores '林' in Unicode form in the new memory space, so s can only be encoded and cannot be decoded.
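A sketch in Python 2 syntax, mirroring the str example above:

#coding: utf-8
s = u'林'            # stored in Unicode form in the new memory space

print repr(s)        # u'\u6797'
print type(s)        # <type 'unicode'>

s.encode('utf-8')    # fine: Unicode can be encoded into any format
# s.decode('utf-8')  # raises an error: s is already Unicode, it can only be encoded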
Printing to the terminal
A special note about print:
When the program executes, e.g.
x = '林'
print(x)  # this step prints the new memory space that x points to (not the memory space where the code itself resides) to the terminal; the terminal also runs in memory, so this printing can be understood as memory-to-memory printing, i.e. memory -> memory, unicode -> unicode
For data in Unicode format, no matter how it is printed, it will not be garbled.
Strings in Python 3, like u'...' strings in Python 2, are all Unicode, so they print without garbling no matter what.
In PyCharm
In the Windows terminal
However, Python 2 also has the other string type, which is not Unicode. In that case, print x executes x.decode('the terminal encoding') according to the terminal's encoding, obtains Unicode, and then prints it. When the terminal encoding is inconsistent with the encoding specified at the top of the file, garbling is produced.
In PyCharm (terminal encoding utf-8, file encoding utf-8: no garbling)
In the Windows terminal (terminal encoding GBK, file encoding utf-8: garbling is produced)
Study questions
Verify the following print results in PyCharm and cmd respectively:

#coding: utf-8
s = u'林'  # when the program executes, '林' is stored in Unicode form in the new memory space

# s points to Unicode, so it can be encoded into any format without raising an encode error
s1 = s.encode('utf-8')
s2 = s.encode('gbk')

print s1  # prints normally?
print s2  # prints normally?

print repr(s)   # u'\u6797'
print repr(s1)  # '\xe6\x9e\x97'  utf-8 encodes one Chinese character with 3 bytes
print repr(s2)  # '\xc1\xd6'      gbk encodes one Chinese character with 2 bytes

print type(s)   # <type 'unicode'>
print type(s1)  # <type 'str'>
print type(s2)  # <type 'str'>
5.3.2 Python 3 has two string types: str and bytes
str is Unicode:

#coding: utf-8
s = '林'  # when the program executes, no u prefix is needed; '林' is still stored in Unicode form in the new memory space

# s can be encoded directly into any format
s.encode('utf-8')
s.encode('gbk')

print(type(s))  # <class 'str'>
bytes is the bytes type:

#coding: utf-8
s = '林'  # when the program executes, no u prefix is needed; '林' is still stored in Unicode form in the new memory space

# s can be encoded directly into any format
s1 = s.encode('utf-8')
s2 = s.encode('gbk')

print(s)   # 林
print(s1)  # b'\xe6\x9e\x97'  in Python 3, what you ask to print is exactly what gets printed
print(s2)  # b'\xc1\xd6'      same as above
print(type(s))   # <class 'str'>
print(type(s1))  # <class 'bytes'>
print(type(s2))  # <class 'bytes'>