Python Encoding: The Ultimate Edition

Source: Internet
Author: User


Speaking of Python encoding: it's a sad story. I wrestled with it for two months, and fortunately I finally sorted it all out. Being the sharing type, I have to pass it on to everyone. If encoding still gives you a headache, follow along and let's uncover the truth about Python encoding!

1. What is encoding?

The basic concept is simple. First, we start with a piece of information, that is, a message that humans can understand. I will call this representation "plain text". For English speakers, English words printed on paper or displayed on the screen count as plain text.

Second, we need to be able to convert a plain-text message into some other representation, and we also need to be able to convert that representation back into plain text. The conversion from plain text to encoded text is called "encoding", and the conversion from encoded text back to plain text is "decoding".

Encoding is a big problem. If it is not completely solved, it will be like a snake hidden in the jungle, biting you from time to time.

// ASCII

Remember one sentence: all the data in a computer, whether text, images, video, or audio, is ultimately stored in binary form like 01010101. Simply put, computers only understand binary numbers! So the goal is clear: how do we uniquely map the characters we recognize to groups of binary numbers? The Americans decided that one high or low voltage level can represent a 0 or a 1, and eight levels taken as a group can represent 256 different states, with each state corresponding to a single character, for example A ---> 01000001. English has only 26 letters; even with special characters and digits included, 128 states are enough. Each level is called a bit, and eight bits make up a byte, so the computer could store English text using the byte values 0 through 127. This is the ASCII code.

// Extended ANSI encoding

As just mentioned, a byte has eight bits, but at first the highest bit went unused and defaulted to 0. Later it was pressed into service so the computer could also represent Latin characters: the states from 128 to 255 were mapped to Latin letters. At that point, the byte was completely full!

// GB2312

After the computer crossed the ocean to China, a problem arose: the computer did not know Chinese and certainly could not display it, and all the states of a byte were already taken. But that did not stop us; self-reliant as always, we rewrote the table ourselves, dropping all the Latin characters mapped to the extended eighth bit. It was stipulated that a character below 127 keeps its original meaning, but when two bytes that are both larger than 127 appear together, they represent one Chinese character. The first byte (the high byte) runs from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE, which lets us combine more than 7000 simplified Chinese characters. This Chinese character scheme is called GB2312. GB2312 is a Chinese extension of ASCII.

// GBK and GB18030

But there are too many Chinese characters, and GB2312 was still not enough. So it was further specified that as long as the first byte is greater than 127, it marks the beginning of a Chinese character, regardless of whether the following byte falls in the extended character set. This expanded scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20000 new characters (including traditional Chinese characters) and symbols.

// UNICODE

Many other countries developed their own encoding standards, and they were mutually incompatible, which caused plenty of problems. So a standardization organization set out to unify them, producing one standard encoding: UNICODE. UNICODE represents a character with two bytes, which can combine 65535 different characters, enough to cover all the symbols in the world (oracle-bone script included).

// UTF-8

Since Unicode already covers the whole world, why is there a UTF-8 encoding? For people in the English-speaking world, one byte is enough: to store A, the single byte 01000001 will do. With Unicode, A needs two bytes: 00000000 01000001. What a waste! Out of this observation came the clever idea of utf8. UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode: it uses 1 to 4 bytes to represent one character, with the length depending on the character. When the character falls within the ASCII range, it is represented by a single byte, so UTF-8 is compatible with ASCII.

The obvious advantage: although the data in memory is unicode, when the data is saved to disk or sent over the network, using utf8 instead of raw unicode saves space! That is why utf8 is the recommended encoding.

The relationship between Unicode and utf8 in one sentence: Unicode is an in-memory representation scheme (a standard), while UTF is how Unicode is saved and transferred (an implementation). That is also the difference between UTF and Unicode.
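The variable-length behavior just described is easy to see from Python 3 itself. A quick sketch (the byte values follow from the UTF-8 rules above):

```python
# UTF-8 is variable-length: ASCII characters take 1 byte,
# most Chinese characters take 3 bytes.
print(len('A'.encode('utf8')))       # 1 byte
print('A'.encode('utf8'))            # b'A' -- identical to ASCII
print(len('苑'.encode('utf8')))      # 3 bytes
print('苑'.encode('utf8'))           # b'\xe8\x8b\x91'

# A fixed two-byte scheme (UCS-2 style) spends 2 bytes even on 'A':
print(len('A'.encode('utf-16-be')))  # 2 bytes
```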

Supplement: how does utf8 save disk space and traffic?

s = "I'm 苑昊"

In the unicode character set, this string appears in the encoding table like this:

I 0049
' 0027
m 006d
(space) 0020
苑 82d1
昊 660a

Each character corresponds to a hexadecimal number.
The computer only understands binary, so stored as raw unicode (UCS-2 style) it looks like this:

I (space) 00000000 01001001
' (space) 00000000 00100111
m (space) 00000000 01101101
(space) 00000000 00100000
苑 10000010 11010001
昊 01100110 00001010

This string occupies 12 bytes in total, but comparing the binary for the Chinese and English characters, we find that for the English characters the first 9 bits are always 0! What a waste of disk space and traffic. What can be done? UTF-8:

I 01001001
' 00100111
m 01101101
(space) 00100000
苑 11101000 10001011 10010001
昊 11100110 10011000 10001010

utf8 needs only 10 bytes, two fewer than unicode. And since our programs usually contain far more English than Chinese, the space savings add up!

Remember: it is all about saving hard disk space and traffic.
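These byte counts can be checked directly in Python 3. A small sketch using the same example string:

```python
s = "I'm 苑昊"                       # 6 characters in total
print(len(s))                        # 6

# UCS-2 style: every character costs 2 bytes -> 12 bytes
print(len(s.encode('utf-16-be')))    # 12

# UTF-8: 4 ASCII characters at 1 byte each,
# plus 2 Chinese characters at 3 bytes each -> 10 bytes
print(len(s.encode('utf8')))         # 10

# The code points match the table above
print(hex(ord('苑')), hex(ord('昊')))  # 0x82d1 0x660a
```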

2. String encoding in py2

In py2, there are two string types: str and unicode. Note that these are just two names Python defines; what actually sits at each type's memory address while the program runs?

Let's take a look:

#coding:utf8

s1 = '苑'
print type(s1)  # <type 'str'>
print repr(s1)  # '\xe8\x8b\x91'

s2 = u'苑'
print type(s2)  # <type 'unicode'>
print repr(s2)  # u'\u82d1'
 

The built-in function repr shows us what is actually stored: it turns out str holds byte data and unicode holds unicode data. So what is the relationship between these two kinds of data, and how do we convert between them? That is where encoding and decoding come in.

s1 = u'苑'
print repr(s1)    # u'\u82d1'

b = s1.encode('utf8')
print b
print type(b)     # <type 'str'>
print repr(b)     # '\xe8\x8b\x91'

s2 = '苑昊'
u = s2.decode('utf8')
print u           # 苑昊
print type(u)     # <type 'unicode'>
print repr(u)     # u'\u82d1\u660a'

# Note:
u2 = s2.decode('gbk')
print u2          # mojibake -- wrong codec
print len('苑昊')  # 6

utf8 and gbk are just encoding rules for turning unicode data into bytes. Bytes encoded with UTF-8 must therefore be decoded with the utf8 rule; otherwise you get garbled characters or errors.
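The same rule stated in py3 syntax, as a sketch (errors='replace' is used here so the wrong-codec decode shows garbage instead of possibly raising an error):

```python
b = '苑昊'.encode('utf8')   # unicode -> bytes under the utf8 rule
print(b)                    # b'\xe8\x8b\x91\xe6\x98\x8a'

# Decoding with the same rule recovers the plaintext
print(b.decode('utf8'))     # 苑昊

# Decoding with a different rule produces mojibake
wrong = b.decode('gbk', errors='replace')
print(wrong == '苑昊')      # False
```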

Py2 encoding features:

#coding:utf8

print '苑昊'        # 苑昊
print repr('苑昊')  # '\xe8\x8b\x91\xe6\x98\x8a'

print(u"hello" + "qi")   # helloqi -- ASCII bytes are implicitly decoded
print(u'苑昊' + '最帅')  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6
                         # in position 0: ordinal not in range(128)

Python 2 quietly hides the conversion from bytes to unicode. As long as all the data is ASCII, every conversion works. But as soon as a non-ASCII character sneaks into your program, the default decoding fails, producing a UnicodeDecodeError. Py2 encoding makes it easier to process ASCII data; the price is that it blows up when processing non-ASCII data.

3. String encoding in py3

Python 3 renamed the unicode type to str; the old str type has been replaced by bytes.

Py3 likewise has two data types, str and bytes: str stores unicode data and bytes stores byte data. Compared with py2, only the names differ.

import json

s = '苑昊'
print(type(s))        # <class 'str'>
print(json.dumps(s))  # "\u82d1\u660a"

b = s.encode('utf8')
print(type(b))        # <class 'bytes'>
print(b)              # b'\xe8\x8b\x91\xe6\x98\x8a'

u = b.decode('utf8')
print(type(u))        # <class 'str'>
print(u)              # 苑昊
print(json.dumps(u))  # "\u82d1\u660a"

print(len('苑昊'))     # 2

Py3 coding philosophy:

The most important new feature of Python 3 is the clearer distinction between text and binary data: it never automatically decodes bytes. Text is always Unicode, represented by the str type, and binary data is represented by the bytes type. Python 3 does not mix str and bytes in any implicit way, which makes the distinction very clean. You cannot concatenate a string with a bytes object, search for a string inside a bytes object (or vice versa), or pass a string to a function that expects bytes (or vice versa).

#print('alvin' + u'yuan')  # byte string + unicode, py2 result: alvinyuan
print(b'alvin' + 'yuan')   # byte string + unicode, py3: error: can't concat bytes to str
 

Note: whether in py2 or py3, unicode data corresponds directly to plaintext, and printing unicode data displays the corresponding plaintext (English and Chinese alike).

4. Encoding from disk to memory (******)

Having said all that, we finally come to the point!

Setting program execution aside for a moment: have you ever used a text editor? If that term means nothing to you, you have at least used Word. When we edit text in Word, whether Chinese or English, the computer does not actually "know" it. So in what form does the data exist in memory before it is saved? Right: as unicode data. Why store unicode? Because, as the name suggests, it is the universal code! It gives a unique encoding to English, Chinese, Japanese, Latin, and every other character in the world, so its compatibility is the best.

Well then, in what form is the data stored on the disk?

The answer: as a bytes string encoded in some particular way. It might be utf8, a variable-length encoding that saves space, or gbk, a product of history, among others. So our text editing software has a default encoding for saving files, such as utf8 or gbk. When we click save, the software has "silently" done the encoding for us.

When we open the file again, the software quietly decodes the data back into unicode and presents the plaintext to the user. So: unicode is closer to the user, bytes is closer to the computer.
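The save/open cycle the editor performs "silently" can be sketched in a few lines, with the disk replaced by a plain bytes object for clarity:

```python
# In memory while editing: unicode (str in py3)
text = '苑昊'

# "Save": the editor encodes unicode -> bytes before writing to disk
saved = text.encode('utf8')
print(saved)             # b'\xe8\x8b\x91\xe6\x98\x8a'

# "Open": the editor decodes bytes -> unicode before showing plaintext
reopened = saved.decode('utf8')
print(reopened == text)  # True -- round trip is lossless
```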

After all this talk, what does any of it have to do with executing our programs?

First, be clear on one concept: the py interpreter itself is software, software quite similar to a text editor!

Now let's restore the encoding process from creation to execution of a py file:

Open pycharm, create a hello.py file, and write:

ret = 1 + 1
s = '苑昊'
print(s)

When we save it, the hello.py file is written to disk in pycharm's default encoding. When the file is closed and opened again, pycharm decodes the content it reads with that same default encoding, converts it to unicode in memory, and we see our plaintext.

If you click the run button or run the file on the command line, the py interpreter is called: it opens the file and decodes the bytes on disk into unicode data, the same process as the editor. The difference is that the interpreter then translates the unicode data into C code, then into a binary instruction stream, and finally completes the whole process by having the operating system schedule the cpu to execute that binary data.

So here is the question: our text editor has its own default encoding and decoding method. Does our interpreter have one too?

Of course: py2 defaults to ASCII and py3 defaults to utf8, which can be queried as follows:

import sys
print(sys.getdefaultencoding())

Do you still remember this statement?

#coding:utf8

Yes. If the py2 interpreter executes a utf8-encoded file, it tries to decode the utf8 bytes with its default ASCII codec; as soon as the program contains Chinese characters, decoding naturally fails. So we declare #coding:utf8 at the top of the file to tell the interpreter: do not decode this file with your default encoding, use utf8 instead. The py3 interpreter is much more convenient, because its default is already UTF-8.

Note: the string encoding discussed in sections 2 and 3 describes the state in memory while the cpu executes the program; what we covered here is a different process. Do not confuse the two!

5. Common encoding problems

1. Garbled output in cmd

hello.py:

#coding:utf8
print('苑昊')

The file itself is also saved with utf8 encoding.

Think about it: why is there no problem running it from the IDE with either py2 or py3, yet under cmd.exe py3 is correct while py2 prints garbage?

We are running hello.py under cmd.exe on win. In other words, cmd.exe is itself a piece of software, and its default codec is GBK. python2 hands it utf8-encoded bytes, cmd.exe decodes them as GBK, and garbled characters naturally appear.

py3, by contrast, passes unicode data to the user, content that cmd.exe can recognize, so it displays fine.

Once you understand the principle, there are many ways to fix it, for example:

print(u'苑昊')

Written this way, running it with py2 in cmd is no longer a problem.

2. Encoding problems with open()

Create a text file named hello, save it as utf8, with the content:

苑昊, you are the most handsome!

Create index.py in the same directory:

f = open('hello')
print(f.read())

Why is the result normal on linux (苑昊) but garbled on win (interpreter: py3)?

Because your Windows operating system defaults to gbk encoding, while the linux operating system defaults to utf8.

When the open function runs without an explicit encoding, the operating system's default encoding is used: win decodes the utf8 file with its default gbk, so the output is naturally garbled.

Solution:

f = open('hello', encoding='utf8')
print(f.read())

Conversely, if your file is stored with gbk encoding, you do not need to specify the encoding on win.

In addition, if you find that you do not need to specify encoding='utf8' on a Windows machine, then its default was utf8 at install time, or the default has since been changed to utf8 via a command.

Note: the open function differs between py2 and py3; py3 has the encoding=None parameter, py2 does not.
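The mismatch and its fix can be reproduced on any OS by being explicit about encodings. A sketch using a temporary file (the tempfile path is illustrative):

```python
import os
import tempfile

# Write a utf8-encoded file, like the 'hello' file above
path = os.path.join(tempfile.mkdtemp(), 'hello')
with open(path, 'w', encoding='utf8') as f:
    f.write('苑昊')

# Reading with the matching codec recovers the plaintext
with open(path, encoding='utf8') as f:
    print(f.read())            # 苑昊

# Reading with a mismatched codec (what win's gbk default does)
# yields garbage instead of the original text
with open(path, encoding='gbk', errors='replace') as f:
    print(f.read() == '苑昊')  # False
```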
