19. Python Encoding
Before the formal description, I would like to give you a reference: click here.
This article is based on and summarized from that document. To avoid gaps or errors in the summary, refer to the original if anything is unclear.
The following applies to Python 2.x, because Python 3.x uses Unicode by default.
Now let's describe encoding in Python, starting with the encodings themselves.
1. ASCII
ASCII represents each character with one byte. A byte is composed of eight bits, giving 2 ** 8 = 256 possible values (ASCII itself only defines 128 of them, using seven bits). In the era when computers were born, this was more than enough for the 26 English letters in upper and lower case plus punctuation. ASCII is also the default source encoding in Python 2.x, so Chinese characters cannot appear in Python 2.x source code unless an encoding is declared.
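As a quick illustration (a minimal sketch; demo.py is a hypothetical file name), a Python 2 source file containing non-ASCII bytes and no declaration fails before it even runs, with an error roughly like

SyntaxError: Non-ASCII character '\xe4' in file demo.py, but no encoding declared

while adding one declaration line at the top fixes it:

# coding: utf-8
print '中文'   # legal now: the parser decodes the file as UTF-8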
2. MBCS
With the development of the times, ASCII was no longer enough: a computer that could only display English was too limited. People noticed that ASCII left half of the byte's 256 values unused, and everyone wanted to occupy that remaining range, but it still wasn't enough, certainly not for the tens of thousands of Chinese characters. So the scheme was extended again: if one byte is not enough, use two. To stay compatible with ASCII, the rule is roughly this: if the first byte is below 0x80, it still represents an ASCII character; if it is 0x80 or above, it is combined with the next byte (two bytes in total) to represent one character, scanning then continues after that pair, and the judgment repeats. Encodings such as GB2312/GBK and BIG5 all follow rules of this kind.
However, this got too messy. At this point IBM stepped in and shouted: all of these should be managed in a unified way! Hence the concept of the code page: these character sets are indexed and organized into pages, and the generic name for these pages is MBCS (Multi-Byte Character Set). For example, GBK is on page 936, so it is also called cp936. And since everyone was using two bytes, it is also called DBCS (Double-Byte Character Set).
But obviously, MBCS is a collection of character sets; you cannot simply say "encode this with MBCS", because which concrete character set it stands for must be specified. That task is handled by the operating system itself (Linux does not support this), and the operating system chooses based on region: the Simplified Chinese edition of Windows selects GBK, while other locales select other code pages. Therefore, if you use MBCS/DBCS in a Python coding declaration, errors are inevitable when the code runs on another system or in another region. The encoding declaration must name a specific encoding, such as UTF-8, which will not produce encoding errors due to system or regional differences.
In Windows, Microsoft uses the alias ANSI for this, which is actually MBCS.
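As a small demonstration (Windows-only; on Linux the 'mbcs' codec simply does not exist), Python 2 exposes this alias as a codec, and on a Simplified Chinese Windows it behaves like GBK:

# -*- coding: utf-8 -*-
u = u'中国'
print repr(u.encode('gbk'))    # '\xd6\xd0\xb9\xfa'
# 'mbcs' delegates to the system ANSI code page; on a Simplified
# Chinese Windows this matches GBK, elsewhere it may differ or fail.
print repr(u.encode('mbcs'))   # Windows only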
3. Unicode
Although MBCS solved the encoding chaos to some extent, each encoding was still region-specific and could only display that region's characters, which makes it very hard to develop a program that works across multiple languages. So people wondered: is there one encoding that can represent all characters? After some research, Unicode was born. The idea was to stop extending ASCII in incompatible regional variants and instead represent every character in two bytes, giving 256 * 256 = 65536 characters, which seemed enough; this is the UCS-2 standard. Later some said that was still not enough, so it was simply doubled to four bytes, 256 ** 4 = 4294967296 values; even if we need to encode alien scripts someday, that should last a while (this is UCS-4). In practice, UCS-2 is the commonly used standard.
The Unicode character set is only a table that maps code points to characters; for example, the code point of the character 汉 is 6C49. How characters are actually transmitted and stored, that is, the byte layout, is the responsibility of UTF (Unicode Transformation Format). (Note: the code point is not necessarily equal to the stored bytes; even though a character is identified by a two-byte code point, it is not necessarily saved as exactly those two bytes.)
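A minimal Python 2 sketch of that distinction between code point and stored bytes:

u = u'\u6c49'                      # the character 汉, by code point
print hex(ord(u))                  # 0x6c49 -- the code point itself
print repr(u.encode('utf-16-be'))  # 'lI' -- i.e. the two bytes 6C 49, stored as-is
print repr(u.encode('utf-8'))      # '\xe6\xb1\x89' -- three different bytes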
At first, characters were saved directly as their UCS code points; this is UTF-16. For example, 汉 is saved directly as \x6C\x49 (UTF-16-BE, big-endian) or as \x49\x6C (UTF-16-LE, little-endian). But the Americans were unhappy: text that used to take one byte per character in ASCII now took two, doubling its length and, across large amounts of English text, wasting enormous amounts of disk space. Disk space is not free, and to satisfy this complaint, UTF-8 was born.
UTF-8 is an awkward-looking but clever encoding: it is variable-length and compatible with ASCII, so ASCII characters still take one byte. There is a cost, though: in UTF-8, East Asian text, including Chinese, takes three bytes per character, and some rarely used characters take four. Storage costs went up for other countries while they went down for the United States, which once again got its way. But what can you do? Who told them to be the boss of the computer industry?
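The size trade-off is easy to verify (a minimal Python 2 sketch):

for ch in (u'A', u'\u6c49'):   # the letter A and the character 汉
    print repr(ch), len(ch.encode('utf-8')), len(ch.encode('utf-16-be'))
# u'A'      -> 1 byte  in UTF-8, 2 bytes in UTF-16
# u'\u6c49' -> 3 bytes in UTF-8, 2 bytes in UTF-16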
What is BOM?
When a text editor opens a file, it faces a problem: there are so many encodings in the world, so which one should it use to decode the bytes? The file has to tell it somehow!
For this, UTF introduces the BOM (Byte Order Mark) to indicate the encoding. The BOM is a marker at the start of the file identifying the encoding used; like a Python coding declaration, it tells the text editor "here is the encoding I use; decode what follows accordingly."
The text editor reads the BOM at the beginning of the file and can then decode the contents correctly.
The following is a summary of BOM:
BOM_UTF8     '\xef\xbb\xbf'
BOM_UTF16_LE '\xff\xfe'
BOM_UTF16_BE '\xfe\xff'
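Python ships these constants in the standard codecs module, so you can check for a BOM without hard-coding the bytes (a small sketch; some_file.txt is a hypothetical name):

import codecs

data = open('some_file.txt', 'rb').read()
if data.startswith(codecs.BOM_UTF8):
    text = data[len(codecs.BOM_UTF8):].decode('utf-8')
elif data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
    text = data.decode('utf-16')     # the utf-16 codec consumes the BOM itself
else:
    text = data.decode('utf-8')      # assume BOM-less UTF-8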
Likewise, for a file's encoding to be recognized correctly later, the BOM should be written when the file is saved; this is generally done by the editor. You can save without one, but then you may have to choose the decoding manually when opening the file.
However, there is also a "UTF-8 without BOM" mode. What is that about?
Because UTF-8 is so popular, text editors try to decode with UTF-8 first by default. Even Notepad, which saves as ANSI (MBCS) by default, test-decodes a file as UTF-8 when opening it, and if that succeeds, it uses UTF-8. This guessing leads to a famous Notepad bug: create a text file, type certain short Chinese strings (the classic example is 联通), save it as ANSI (MBCS), and reopen it; the saved bytes happen to look like valid UTF-8, so the text comes out garbled.
(The original summarized this with a diagram of the relationships among ASCII, MBCS, and Unicode.)
At this point some people may confuse MBCS with UCS-2: both use two bytes, so what is the difference?
MBCS encodings were each extended independently, so the same bytes may very well represent different characters under different MBCS encodings. Unicode is a single unified extension: every code point corresponds to exactly one character, which guarantees uniqueness and improves compatibility.
OK. With character encodings covered, let's look at Python itself:
What do coding declarations such as # coding: gbk and # coding=utf-8 mean to Python?
Here is a tip:
A declaration written as # coding : utf-8 or # coding = utf-8 will report an error. The problem is not "=" versus ":" but the whitespace: there must be no space between coding and the symbol, while zero or more spaces are allowed between the symbol and the encoding name utf-8, and also between # and coding.
With a space between coding and =:

#!/usr/bin/env python
# coding = utf-8
print '中国'

An error is reported (the declaration is not recognized, so the non-ASCII bytes in the source trigger a SyntaxError).
With no space between coding and =:

#!/usr/bin/env python
# coding=utf-8
print '中国'

This executes normally.
This is not an IDE quirk but Python's own rule: PEP 263 defines the coding declaration via a regular expression that requires coding to be immediately followed by : or =. Few tutorials mention this detail, so it is worth pointing out here; try it yourself.
# -*- coding: utf-8 -*- follows the same rule.
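For the curious, this is roughly the pattern from PEP 263 that the interpreter matches against the first two lines of a file (a simplified sketch, not the exact CPython source):

import re

# Simplified from PEP 263: '#', anything, then 'coding', then ':' or '='
# with NO space before the symbol, optional space after, then the name.
coding_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

print coding_re.match('# coding=utf-8').group(1)          # utf-8
print coding_re.match('# -*- coding: utf-8 -*-').group(1) # utf-8
print coding_re.match('# coding = utf-8')                 # None -- space before '='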
Now, let's go to the topic:
#!/usr/bin/env python
# coding=utf-8
print '中文'
print str('中国')
print repr(u'中国')
Here, by the way, is the difference between the str() and repr() functions when creating or converting strings: str() aims for human readability, and print calls it by default, while repr() produces a machine-readable representation that exposes the actual encoding of the string.
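Assuming the file really is saved as UTF-8 and the terminal displays UTF-8, the output would look roughly like this (a sketch; the exact bytes depend on the file's actual encoding):

>>> print '中文'
中文
>>> print str('中国')
中国
>>> print repr(u'中国')
u'\u4e2d\u56fd'
>>> print repr('中国')   # for comparison: the raw UTF-8 bytes
'\xe4\xb8\xad\xe5\x9b\xbd'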
The output under a different coding declaration is as follows:
#!/usr/bin/env python
# coding=gbk
print '中文'
print str('中国')
print repr(u'中国')
The first two lines come out garbled, but that is my IDE's problem: my IDE uses UTF-8 by default, while this file declares (and must be saved as) GBK.
Change the IDE's encoding to GBK and try again, and the output is correct.
This raises another question: what is the file's storage encoding, and what is the running (display) encoding? If one of them does not match, the output is wrong; once both are set consistently, it works.
In fact, the storage encoding, that is, the encoding set in the IDE, is what is used when the file is saved to and opened from disk. Suppose I write code in the Windows text editor, meaning it is saved as ANSI; which encoding another editor later opens it with has nothing to do with running it. The running encoding affects the console display. I mentioned this problem with my first Python program: when I run a Python file in Windows cmd, why is the output still garbled even though the file declares UTF-8 and contains Chinese? Because although Python outputs UTF-8-encoded bytes, cmd decodes them as GBK by default, so the display is garbled. The bytes themselves are not truly corrupted; show them in a UTF-8-capable environment and they are fine.
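You can ask Python what the console expects (a sketch; 'cp936' is what a default Simplified Chinese Windows cmd typically reports):

# coding=utf-8
import sys

print sys.stdout.encoding              # e.g. 'cp936' in Windows cmd
s = '中文'                              # UTF-8 bytes, per the declaration
print s                                 # garbled under a cp936 console
print s.decode('utf-8').encode('gbk')  # re-encoded to match that console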
Well, there are more pitfalls like these, but such small mistakes can stall you for a long time, so they are worth mentioning here.
Now back to the main topic. Observe the following phenomenon:
UTF-8: (screenshot in the original: the repr of a string under # coding=utf-8)
GBK: (screenshot in the original: the same repr under # coding=gbk)
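What those screenshots showed can be reproduced with two small files (a sketch; each file must actually be saved in the encoding it declares):

# coding=utf-8
print repr('中国')   # '\xe4\xb8\xad\xe5\x9b\xbd'  (UTF-8 bytes)

# coding=gbk  -- in a separate file, saved as GBK
print repr('中国')   # '\xd6\xd0\xb9\xfa'          (GBK bytes)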
We can see that the bytes of an ordinary string vary with the Python coding declaration, so we reach a conclusion:
The coding declaration in Python determines the encoding of ordinary (byte) strings, whether they are created with the factory function str() or with plain quotation marks; their bytes follow the declared encoding.
Now let's look at this phenomenon again:
Here we find that the Unicode string is identical in all the above cases: no matter what you declare, a unicode object always stores Unicode code points; the declaration only tells Python how to decode the source bytes into it.
To create a Unicode string, that is, a unicode object, use the factory function unicode() or put a u prefix before the quotation marks.
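For example (a minimal sketch, assuming a UTF-8 source file):

# coding=utf-8
u1 = u'中国'
u2 = unicode('中国', 'utf-8')   # decode explicit bytes with an explicit encoding
print repr(u1)                  # u'\u4e2d\u56fd' -- same under any declaration
print u1 == u2                  # True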
Therefore, we arrive at the rule for encoding conversion in Python (the original showed it as a diagram, with a second diagram expanding it to the general case): decode() converts a byte string (str) into a unicode object, and encode() converts a unicode object back into a byte string. Unicode is the bridge in the middle, so converting between any two encodings means decoding from the source encoding and then encoding to the target.
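A sketch of that round trip (the byte values are the GBK and UTF-8 encodings of 中国):

gbk_bytes = '\xd6\xd0\xb9\xfa'    # '中国' in GBK
u = gbk_bytes.decode('gbk')       # str -> unicode
utf8_bytes = u.encode('utf-8')    # unicode -> str in another encoding
print repr(utf8_bytes)            # '\xe4\xb8\xad\xe5\x9b\xbd'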
As long as Python supports the character sets involved, this decode/encode logic can convert between any of them within Python.
Knowing this will spare you many errors in file handling. We'll talk about Python file processing in the next article.