Python Day 3: Basics of Character Encoding

I. Prerequisite knowledge for understanding character encoding
1. How a text editor accesses files (Notepad++, PyCharm, Word)

Opening an editor starts a process, and that process lives in memory, so the content you type in the editor is also held in memory; the data is lost when the power goes off.

Therefore you need to save to the hard disk: clicking the save button flushes the data from memory to the hard disk.

At this point, writing a .py file (not yet executed) is no different from writing any other file; it is just writing a bunch of characters.

2. How the Python interpreter executes a .py file, e.g. python test.py

Stage 1: the Python interpreter starts up, which is equivalent to launching a text editor.

Stage 2: the Python interpreter, acting as a text editor, opens the file test.py and reads its contents from the hard disk into memory.

Stage 3: the Python interpreter interprets and executes the test.py code that was just loaded into memory.



To sum up:

The Python interpreter interprets and executes a file's contents, so like a text editor it must be able to read .py files.
What sets it apart from a text editor is that the Python interpreter can not only read the file's contents but also execute them.
II. What is a character encoding
For a computer to work it must be powered on; in other words, electricity drives the computer, and electricity has exactly two states: high voltage (binary 1) and low voltage (binary 0). This means the computer understands only numbers.

 

The purpose of programming is to make the computer work, yet the product of programming is just a bunch of characters. In other words, what we want from programming is: a bunch of characters driving the computer to work.

 

So it must go through a process:

Character -------- (translation process) -------> number

This process follows a standard that defines which specific number each character corresponds to; that standard is called a character encoding.
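This character-to-number mapping can be seen directly in Python, whose built-in ord() and chr() expose the Unicode correspondence (a quick illustration added here, not part of the original text):

```python
# ord() maps a character to its number; chr() maps the number back.
print(ord('A'))     # 65
print(ord('林'))    # 26519
print(chr(26519))   # 林
```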



 The following two scenarios involve character encoding issues:

1. The content of a Python source file is composed of a bunch of characters.

2. Strings, a data type in Python, are composed of characters.

III. The history of character encodings
Phase 1: modern computers originated in the United States, so the earliest encoding, ASCII, was designed with only English in mind.

ASCII: one byte represents one character (English characters plus all the other characters on a keyboard). 1 byte = 8 bits, and 8 bits give 2**8 = 256 combinations, so up to 256 characters can be represented.

ASCII originally used only the low 7 bits, i.e. codes 0-127, which is enough to represent every character on a keyboard.

Later, to fit Latin-script characters into the ASCII table as well, the highest bit was also put to use.
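The 7-bit claim is easy to verify in Python 3 (a small check added for illustration):

```python
# 'A' is code 65; its binary form needs only 7 bits.
print('A'.encode('ascii'))  # b'A'  (one byte)
print(bin(ord('A')))        # 0b1000001  (fits in 7 bits)
# the full original ASCII range is 0-127, i.e. the low 7 bits
print((127).bit_length())   # 7
```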

 

Phase 2: to support Chinese, China defined its own encoding, GBK.

GBK: 2 bytes represent one Chinese character.

 

To support other languages, each country defined its own encoding:

Japan encoded Japanese into Shift_JIS, and South Korea encoded Korean into EUC-KR.

 

Phase 3: with every country running its own standard, conflicts were inevitable; in multilingual text, the result was mojibake (garbled characters).

So unicode was born. It uses 2 bytes per character; 2**16 = 65536 combinations, i.e. more than 60,000 characters, enough to cover the scripts of all nations.

But for text that is entirely English, this encoding doubles the storage space for no benefit (the binary data is ultimately stored on the medium electrically or magnetically).

So UTF-8 was created, which uses only 1 byte for an English character and 3 bytes for a Chinese character.
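These byte counts can be verified directly in Python 3 (an illustrative check added here):

```python
# English letters take 1 byte in utf-8; common Chinese characters take 3.
print(len('a'.encode('utf-8')))       # 1
print(len('林'.encode('utf-8')))      # 3
# for comparison, utf-16 (little-endian, no BOM) uses fixed 2-byte units
print(len('林'.encode('utf-16-le')))  # 2
```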

 

One thing to emphasize is:

unicode: simple and brute-force; every character is 2 bytes. The advantage is that character -> number conversion is fast; the disadvantage is that it wastes space.

UTF-8: precise; different characters use different lengths. The advantage is that it saves space; the disadvantage is that character -> number conversion is slower, because for each character you must work out how many bytes are needed to represent it accurately.



The encoding used in memory is unicode: space is traded for time (programs must be loaded into memory to run, so memory access should be as fast as possible).
UTF-8 is used on the hard disk and for network transmission: network I/O and disk I/O latency far exceed the cost of UTF-8 conversion, and I/O should save as much bandwidth as possible to keep data transfer stable.
All programs are ultimately loaded into memory. On disk they may be saved in the encodings of different countries, but in memory, for compatibility with every language (this is why computers can run programs from any country), we uniformly and permanently use unicode. You might object: UTF-8 can also represent every language, so why not use it in memory? Indeed, it would work. The reason we don't is that unicode is faster than UTF-8 (fixed 2-byte characters need no length calculation, whereas UTF-8 lengths must be computed), at the cost of more space: a classic space-for-time trade. Conversely, when storing to the hard disk or transmitting over the network, unicode is converted to UTF-8, because data transfer values stability and efficiency: the smaller the volume, the more reliable the transfer. That is why everything is converted to UTF-8, not unicode, for storage and transmission.
 

 

IV. Character encoding classification (a simple overview)
The computer was invented by Americans, and the earliest character encoding was ASCII, which only defined the correspondence between numbers and English letters, digits, and some special symbols. It used at most 8 bits (one byte), i.e. 2**8 = 256, so ASCII can represent at most 256 symbols.

Of course, all our programming languages use English, and for that ASCII is sufficient. But when processing data, different countries use different languages: Japanese developers add Japanese to their programs, and Chinese developers add Chinese.

To represent Chinese, one byte alone is clearly not enough (even an elementary school student knows more than 2,000 Chinese characters). The only solution is to use more than 8 bits per character: the more bits, the more combinations can be represented, and thus the more Chinese characters can be expressed.

Therefore the Chinese defined their own standard, the GB2312 encoding, which specifies the correspondence between Chinese characters and numbers.

The Japanese defined their own Shift_JIS encoding.

The Koreans defined their own EUC-KR encoding (the Koreans also claim the computer was invented by them and demand that the whole world uniformly adopt the Korean encoding).

 

At this point a problem appeared. Suppose Xiao Zhou, fluent in 18 languages, modestly writes a document using 8 of them; that document will come out garbled under any single country's standard, because each standard only defines the character-to-number mapping for its own script. If you parse the file with one country's encoding, the characters of every other language become mojibake.

So a world standard that could contain every language was urgently needed, and thus unicode came into being (the Koreans expressed dissatisfaction, but nothing came of it).

ASCII uses 1 byte (8 bits of binary) to represent a character.

Unicode usually uses 2 bytes (16 bits of binary) to represent a character; rare characters need 4 bytes.
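Python 3 can confirm both cases via UTF-16, which stores common characters in one 2-byte unit; the emoji below is just a stand-in for any rare (astral-plane) character:

```python
# a common character fits in one 2-byte UTF-16 unit...
print(len('林'.encode('utf-16-be')))          # 2
# ...while a rare character beyond U+FFFF needs a surrogate pair: 4 bytes
print(len('\U0001F600'.encode('utf-16-be')))  # 4
```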

example:

The letter x is 120 in decimal ASCII, 0111 1000 in binary.

The Chinese character 中 is beyond the range of ASCII; its Unicode code point is 20013 in decimal, 01001110 00101101 in binary.

The letter x in Unicode is 0000 0000 0111 1000 in binary, so Unicode is backward compatible with ASCII while also covering all nations' scripts; it is the world standard.
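These numbers can be double-checked in Python 3 (an added sketch):

```python
print(ord('x'))        # 120
print(bin(ord('x')))   # 0b1111000  -> 0111 1000
print(ord('中'))       # 20013
print(bin(ord('中')))  # 0b100111000101101 -> 01001110 00101101
```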

 

At this point the mojibake problem disappeared and everyone could use the same documents, but a new problem appeared: if your document is entirely English, Unicode consumes twice the space of ASCII, which is very inefficient for storage and transmission.

In the spirit of economy, UTF-8, which turns Unicode into a variable-length encoding, appeared. UTF-8 encodes each Unicode character into 1-6 bytes depending on its code point: common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters are encoded into 4-6 bytes. If the text you transmit contains mostly English characters, UTF-8 saves space:

Character   ASCII       Unicode             UTF-8
A           01000001    00000000 01000001   01000001
中          (none)      01001110 00101101   11100100 10111000 10101101
The table above also reveals an extra benefit of UTF-8: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can keep working under UTF-8.
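A quick sketch (added here) confirming the compatibility claim and the table's last row:

```python
# For pure-ASCII text the ascii and utf-8 encodings are byte-identical.
print('A'.encode('ascii') == 'A'.encode('utf-8'))  # True
# matching the table: 中 becomes three utf-8 bytes
print('中'.encode('utf-8'))  # b'\xe4\xb8\xad' = 11100100 10111000 10101101
```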

V. Uses of character encoding

5.1 Text editors
 

5.1.2 The Notepad++ text editor
 

 

Analysis: what causes mojibake (garbled text)?

Flushing a file from memory to the hard disk is called saving the file.

Reading a file from the hard disk into memory is called opening the file.

Mojibake case one: garbled at save time

Suppose the file contains text from several countries, but when saving it we simply use Shift_JIS.

In essence, the text of the other languages fails to be stored because Shift_JIS has no mapping for it. You can test this with open's write function:

f = open('a.txt', 'w', encoding='shift_jis')

f.write('你瞅啥\n何を見て\n')  # '你瞅啥' has no mapping in Shift_JIS, so it cannot be saved; only '何を見て\n' can be stored successfully

But when we save from a text editor, the editor converts for us, forcing the Chinese to be stored under Shift_JIS anyway (hard-forced storage, inevitably garbled). This produces mojibake at the save stage: when we later open the file as Shift_JIS, the Japanese displays normally but the Chinese is garbled.

 

Or, when saving a file:

f = open('a.txt', 'wb')

f.write('何を見て\n'.encode('shift_jis'))
f.write('你瞅啥\n'.encode('gbk'))
f.write('你瞅啥\n'.encode('utf-8'))
f.close()
 

Whatever single encoding you then open a.txt with, the content written in the other two encodings will not display properly.

 

Mojibake case two: saved correctly but garbled at read time

If UTF-8 is used when saving, every language is covered and nothing is garbled on disk. If the wrong decoding (for example GBK) is then chosen when reading, mojibake appears at the read stage; this kind of mojibake is fixable simply by choosing the right decoding. Mojibake introduced at save time, by contrast, is genuine data corruption.
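The key point, that read-time mojibake is reversible because the bytes are intact, can be sketched like this (an added illustration):

```python
raw = '你好'.encode('utf-8')  # correctly saved as utf-8
bad = raw.decode('gbk')       # wrong codec chosen at read time -> mojibake
print(bad)                    # garbled text, but no data was lost

# undo the wrong decode, then decode with the right codec
fixed = bad.encode('gbk').decode('utf-8')
print(fixed)                  # 你好
```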

5.1.3 The PyCharm text editor
Save in GBK format.

Open in UTF-8 format (reload).

The difference between reload and convert:

PyCharm is very powerful and offers automatic conversion, i.e. correctly re-encoding the characters into the chosen format.

To explore the nature of character encoding yourself, don't use that feature.

We choose reload instead, i.e. reloading the file's bytes according to some encoding.

Try analyzing the process yourself.

 

To sum up:

Whichever editor you use, you must prevent files from being garbled (note that a file storing code is, before execution, just an ordinary file; we mean mojibake when opening the file, before it is ever run).

The core rule: open a file with the same encoding it was saved in.
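That rule translates directly into Python 3's open(), whose encoding parameter must match on save and load (a sketch using a throwaway file name a.txt):

```python
# save with gbk...
with open('a.txt', 'w', encoding='gbk') as f:
    f.write('林\n')

# ...so we must read back with gbk as well
with open('a.txt', encoding='gbk') as f:
    print(f.read())  # 林
```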

5.1.4 The Python interpreter as a text editor
Suppose the file test.py is saved in GBK format with the content:

x = '林'

Then whether you run

python2 test.py

or

python3 test.py

it will report an error (because Python 2 defaults to ASCII and Python 3 defaults to UTF-8),

unless #coding: gbk is specified at the top of the file.

 

5.2 Program execution
python test.py (to stress it again: the first step of executing test.py must be to read the file's content into memory)

 

Stage 1: start the Python interpreter.

Stage 2: the Python interpreter now acts as a text editor: it opens test.py, i.e. reads the content of test.py from the hard disk into memory.

At this point the Python interpreter reads the first line of test.py, #coding: utf-8, to decide which encoding to use when reading the file into memory; this line configures the encoding the interpreter uses for the source file.

If no header such as # -*- coding: utf-8 -*- is specified in the file, the interpreter falls back to its default, which you can inspect with sys.getdefaultencoding(): ASCII by default in Python 2 and UTF-8 in Python 3.
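On Python 3 the default can be checked directly:

```python
import sys

# Python 3's default is utf-8 (Python 2's used to be ascii)
print(sys.getdefaultencoding())  # utf-8
```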

 

 

 

Stage 3: the interpreter reads the code that was just loaded into memory (unicode-encoded binary) and executes it; during execution, new memory may be allocated, for example by x = "egon".

Memory uses unicode encoding, but that does not mean everything in memory is unicode-encoded binary.

Before the program runs, memory does hold only unicode-encoded binary. For example, when the line x = "egon" is read from the file, the x, the equals sign, the quotes, and egon all have the same status: they are all plain characters, stored in memory as unicode-encoded binary.

During execution, however, the program allocates new memory (separate from the memory holding the program's code), and that memory can hold data in any encoding. Take x = "egon": the interpreter recognizes the quoted text as a string, allocates memory to store "egon", and makes x point to that address; the newly allocated memory holds egon in unicode. If the code were instead x = "egon".encode('utf-8'), the newly allocated memory would hold the UTF-8-encoded bytes of egon.
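The two situations described above correspond to Python 3's two string types (a small added sketch):

```python
x = 'egon'                  # text: held in memory as unicode
y = 'egon'.encode('utf-8')  # the utf-8 encoded bytes of the same text

print(type(x))  # <class 'str'>
print(type(y))  # <class 'bytes'>
print(y)        # b'egon'  (pure ASCII looks identical in utf-8)
```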

 

 

The same applies to Python 3.

 

 

When you browse the web, the server converts dynamically generated unicode content to UTF-8 before transmitting it to the browser.

 

If the server encodes in UTF-8, then what arrives in the client's memory is also UTF-8-encoded binary.

 

5.3 Differences between Python 2 and Python 3

5.3.1 Python 2 has two string types: str and unicode
str type

When the Python 2 interpreter executes code that creates a str (for example s = '林'), it allocates new memory and stores '林' encoded in the format declared at the top of the file. Since this is already the result of an encode, s can only be decoded.

#!/usr/bin/env python
# _*_ coding: gbk _*_

x = '林'
# print x.encode('gbk')  # error: x is already gbk-encoded bytes
print x.decode('gbk')  # result: 林
 

So it is important to:

In Python 2, str is the encoded result, i.e. bytes (str is bytes); so in Python 2, the result of encoding unicode is str/bytes.

 

#coding: utf-8
s = '林'  # when executed, '林' is stored in the new memory space in utf-8 form (per the coding header)

print repr(s)  # '\xe6\x9e\x97': three bytes, proving it really is utf-8
print type(s)  # <type 'str'>

s.decode('utf-8')
# s.encode('utf-8')  # error: s is the encoded result (bytes), so it can only be decoded
 

unicode type

When the Python 2 interpreter executes code that creates a unicode string (for example s = u'林'), it allocates new memory and stores '林' in unicode form; so s can only be encoded, not decoded.

s = u'林'
print repr(s)  # u'\u6797'
print type(s)  # <type 'unicode'>

# s.decode('utf-8')  # error: s is unicode, so it can only be encoded
s.encode('utf-8')
 

 

Print to terminal

What needs special explanation for print is:

When the program executes, such as

x = '林'

print(x)  # this step prints, to the terminal, the content of the new memory space (not the code space) that x points to; the terminal itself also runs in memory, so this print can be understood as memory-to-memory, i.e. unicode -> unicode

 

For unicode data, no matter how it is printed, it will not be garbled

Strings in Python 3, and u'...' strings in Python 2, are unicode, so they never come out garbled no matter how they are printed.

In pycharm

In windows terminal

 

 

However, Python 2 has the other, non-unicode str type. For such an x, print x effectively performs x.decode(terminal encoding) and prints the resulting unicode; if the terminal's encoding differs from the encoding declared at the top of the file, mojibake is produced.

In PyCharm (terminal encoding utf-8, file encoding utf-8: no mojibake)

 

In the Windows terminal (terminal encoding gbk, file encoding utf-8: mojibake)

 

 

 

Thinking questions:

Verify the following print results in pycharm and cmd, respectively

#coding: utf-8
s = u'林'  # when the program runs, '林' is stored in the new memory space in unicode form


# s points to unicode, so it can be encoded into any format without an encode error
s1 = s.encode('utf-8')
s2 = s.encode('gbk')
print s1  # does this print normally?
print s2  # does this print normally?


print repr(s)   # u'\u6797'
print repr(s1)  # '\xe6\x9e\x97': utf-8 encodes this Chinese character with 3 bytes
print repr(s2)  # '\xc1\xd6': gbk encodes this Chinese character with 2 bytes

print type(s)   # <type 'unicode'>
print type(s1)  # <type 'str'>
print type(s2)  # <type 'str'>
 

5.3.2 Python 3 also has two string types: str and bytes
str is unicode

#coding: utf-8
s = '林'  # when the program runs, no u prefix is needed; '林' is still stored in the new memory space in unicode form

# s can be encoded directly into any encoding format
s.encode('utf-8')
s.encode('gbk')

print(type(s))  # <class 'str'>
 

bytes is bytes

#coding: utf-8
s = '林'  # when the program runs, no u prefix is needed; '林' is still stored in the new memory space in unicode form

# s can be encoded directly into any encoding format
s1 = s.encode('utf-8')
s2 = s.encode('gbk')



print(s)   # 林
print(s1)  # b'\xe6\x9e\x97': in python3, what you store is what is printed
print(s2)  # b'\xc1\xd6': likewise

print(type(s))   # <class 'str'>
print(type(s1))  # <class 'bytes'>
print(type(s2))  # <class 'bytes'>


To sum up:
1. Memory always uses unicode; the hard-disk encoding is configurable (you can change it in your software's settings).
2. Whatever encoding a file was saved to the hard disk in, that same encoding must be used to read it back.
3. A program runs in two phases: (1) the file is read from the hard disk into memory; (2) the Python interpreter runs the code that has been read into memory.
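Rule 2 above amounts to saying that decode must use the codec that encode used; a sketch in Python 3:

```python
s = '林'
# whichever codec wrote the bytes is the one that must read them back
for codec in ('utf-8', 'gbk', 'big5'):
    assert s.encode(codec).decode(codec) == s
print('round-trip ok')
```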

