Before the formal explanation, a reference: this article is a summary of another post (linked in the original). Since my summary may be incomplete or wrong in places, consult that article if anything below raises questions.
Let's start with the encoding problem in Python. First of all, what is an encoding?
1. ASCII
ASCII represents each character as a single byte. A byte is eight binary bits, so it can hold 2**8 = 256 distinct values (ASCII itself only defines the first 128). In the era when computers were born, that was more than enough for the 26 English letters in upper and lower case, plus digits and punctuation. ASCII is also the default encoding in Python 2.x, which is why Chinese cannot appear in Python 2.x source by default unless you add an encoding declaration.
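To make this concrete, here is a small sketch in Python 3 syntax (the article's own examples are Python 2, but the concepts are the same):

```python
# ASCII: one character = one byte (Python 3 syntax).
text = "Hello"
data = text.encode("ascii")
print(data, len(data))          # b'Hello' 5 -- one byte per character
print(ord("A"), hex(ord("A")))  # 65 0x41

# A byte has 2**8 = 256 possible values, but ASCII only defines 0-127,
# so a Chinese character cannot be encoded:
try:
    "汉".encode("ascii")
except UnicodeEncodeError:
    print("U+%04X is outside ASCII" % ord("汉"))
```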
2. MBCS
As times progressed, ASCII was no longer enough: a computer that could only display English was too limited. Every region saw the unused byte values above the ASCII range and wanted to claim them, but even all 256 values of a single byte are nowhere near enough for, say, Chinese. So the scheme was extended: if one byte is not enough, use two. To stay compatible with ASCII, a rule was defined: if a byte is below \x80, it still represents an ASCII character on its own; if it is \x80 or above, it is a lead byte, and together with the next byte (two bytes in total) it represents one character; decoding then skips past that second byte and continues. Encodings such as GBK (Simplified Chinese) and Big5 (Traditional Chinese) follow this kind of rule.
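A minimal sketch of the lead-byte rule, using Python 3 and its gbk codec (the character 汉 happens to be the two GBK bytes \xba\xba):

```python
# Bytes below 0x80 are plain ASCII; a byte >= 0x80 is a lead byte that
# consumes the following byte too, so two bytes form one character.
data = b"A\xba\xbaB"            # 'A', the GBK byte pair for 汉, then 'B'
text = data.decode("gbk")
print(text)                      # A汉B
print(len(data), len(text))      # 4 bytes but only 3 characters
```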
However, this was still too chaotic, so IBM stepped in: these things must be managed uniformly! Hence the concept of the code page: the character sets were collected and assigned pages, and collectively these paged encodings are called MBCS (multi-byte character set). GBK, for example, is code page 936, hence the name cp936. And since in practice everyone used two bytes, they are also called DBCS (double-byte character set).
But obviously, MBCS is a collection of many character sets; you cannot say "encode this with MBCS", because that does not specify which of the contained encodings to use, and the system cannot just pick one for you at random. So one must be chosen, and that job is done by the operating system itself (Linux does not support this), which typically chooses based on region: the Simplified Chinese edition of Windows picks GBK, while other locales and editions pick differently. Consequently, if you ever write mbcs/dbcs in a Python encoding declaration, errors are unavoidable as soon as the code runs on another system or in another region. An encoding declaration must therefore name a concrete encoding, such as the common utf-8, so that differences between systems and regions cannot cause encoding errors.
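Python's codec registry reflects this code-page naming: gbk and cp936 resolve to the same encoding. A small Python 3 check:

```python
# GBK is code page 936, so both names map to the same byte sequences.
print("汉".encode("gbk"))     # b'\xba\xba'
print("汉".encode("cp936"))   # b'\xba\xba' -- same bytes, same code page
print("汉".encode("gbk") == "汉".encode("cp936"))  # True
```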
On Windows, Microsoft gave this mechanism an alias: ANSI. It is really just MBCS; knowing that is enough.
3. Unicode
Although MBCS solved the coding chaos to some extent, each such encoding can still only represent its own region's characters, which makes developing a program that adapts to multiple languages very difficult. People wondered: is there one encoding that can represent every character? After some study, Unicode was born. Instead of extending ASCII over and over, producing all those chaotic regional versions, the idea was: from now on, everyone uses two bytes per character, giving 256*256 = 65536 representable characters, which ought to be enough. This is the UCS-2 standard. Later some said that was still not enough, so it was simply doubled to four bytes, 256**4 = 4294967296 values, enough to support even alien scripts for some time to come; that is UCS-4. Of course, the UCS-2 standard is still in use today.
UCS (Unicode Character Set) is merely a table mapping characters to code points; for example, the code point of the character 汉 is 6c49. How characters are transmitted and stored is the responsibility of UTF (UCS Transformation Format), which defines the bytes actually saved. (Note: the code point is not the same as the saved bytes; even though a character is identified by a 2-byte code point, the bytes stored for it are not necessarily those two bytes.)
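The 汉/6c49 example can be checked directly in Python 3, where ord() returns the code point:

```python
print(hex(ord("汉")))          # 0x6c49 -- the code point from the text
print("\u6c49")                 # 汉
# The code point is not the stored bytes: under UTF-8 the same character
# is saved as three bytes, none of which is 6c or 49.
print("汉".encode("utf-8"))     # b'\xe6\xb1\x89'
```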
The most direct approach is to save the UCS code point as-is; this is UTF-16. For example, 汉 is saved directly as \x6c\x49 (utf-16-be), or with the bytes swapped as \x49\x6c (utf-16-le). But the Americans were unhappy: text that used to take one byte per character in ASCII now takes two, doubling its length, as if disk space cost nothing. To satisfy that complaint, UTF-8 was born.
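The two byte orders can be observed with Python 3's codecs, a quick sketch:

```python
# UTF-16 stores the UCS-2 code point directly; only the byte order differs.
be = "汉".encode("utf-16-be")
le = "汉".encode("utf-16-le")
print(be == b"\x6c\x49")   # True -- code point 6c49 as-is (big-endian)
print(le == b"\x49\x6c")   # True -- the same two bytes swapped (little-endian)
```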
UTF-8 is an awkward variable-length encoding, but it is compatible with ASCII: ASCII characters still take one byte. Something has to give, though: in UTF-8, East Asian characters, Chinese included, take three bytes each, and some less common characters take four. So storage costs went up for other countries while the Americans' went down; once again others took the hit for their convenience. But what can you do? That is what it means to be the boss of the computer industry.
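The variable lengths are easy to verify in Python 3:

```python
# UTF-8 is variable-length: ASCII stays 1 byte, CJK characters take 3 bytes.
for ch in "A汉":
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# And UTF-8 stays byte-compatible with ASCII:
print("ABC".encode("utf-8") == "ABC".encode("ascii"))  # True
```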
What is a BOM
When a text editor opens a file, it is at a loss: with so many encodings in the world, which one should it use to decode this file? You have to tell it!
For this, UTF introduces the BOM (byte order mark) to announce the encoding. The BOM is an identifier at the start of the file, much like a Python coding declaration, telling the text editor which encoding was used, so that it decodes the rest with that encoding.
Likewise, only if the text editor reads the BOM at the beginning of the file can it decode the content correctly.
Here is a summary of the common BOMs:
BOM_UTF8      b'\xef\xbb\xbf'
BOM_UTF16_LE  b'\xff\xfe'
BOM_UTF16_BE  b'\xfe\xff'
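Python's codecs module exposes these constants, and the utf-8-sig codec handles the BOM automatically; a Python 3 sketch:

```python
import codecs

print(codecs.BOM_UTF8)       # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_LE)   # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)   # b'\xfe\xff'

# 'utf-8-sig' prepends the BOM on encode and strips it on decode:
data = "汉".encode("utf-8-sig")
print(data.startswith(codecs.BOM_UTF8))  # True
print(data.decode("utf-8-sig"))          # 汉
```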
Similarly, for files we edit ourselves to be recognized correctly, a BOM has to be written too; usually the editor does this for us. It can also be omitted, in which case you manually choose the decoding when opening the file.
However, there is also a mode called "UTF-8 without BOM". What is that about?
Because UTF-8 is so popular, text editors try UTF-8 decoding first by default. Even Notepad, which saves in ANSI (MBCS) by default, first tries UTF-8 when reading a file, and uses it if decoding succeeds. This awkward behaviour causes a bug: create a new text file, type 姹塧, save it as ANSI, and when you reopen it the content has become 汉a.
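The bug can be reproduced without Notepad: the GBK bytes of 姹塧 happen to be exactly the UTF-8 bytes of 汉a, so a UTF-8-first editor misreads the file. A Python 3 check:

```python
# The ANSI (GBK) bytes of one string collide with the UTF-8 bytes of another.
gbk_bytes = "姹塧".encode("gbk")
utf8_bytes = "汉a".encode("utf-8")
print(gbk_bytes == utf8_bytes)    # True -- identical byte sequences
print(gbk_bytes.decode("utf-8"))   # 汉a -- what a UTF-8-first editor shows
```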
(The original post summarizes the above with a diagram, not reproduced here.)
At this point some people are confused about MBCS versus UCS-2: both use two bytes, so what is the difference?
MBCS is a collection of separate regional extensions, so the same byte sequence can decode to different results under different MBCS code pages. Unicode is a single unified scheme, guaranteeing that each binary representation corresponds to a unique character, which ensures uniqueness and improves compatibility.
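The ambiguity is easy to demonstrate: the same two bytes mean different characters under different MBCS code pages. A Python 3 sketch using the big5 and gbk codecs:

```python
data = b"\xa4\xa4"
print(data.decode("big5"))   # 中 under the Traditional Chinese code page
print(data.decode("gbk"))    # a different character under code page 936
print(data.decode("big5") == data.decode("gbk"))  # False -- same bytes, two meanings
```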
OK, with character encodings covered, let's look at what declarations like # coding:gbk and # coding=utf-8 mean to Python.
Here's a small tip first:
Writing the declaration as # coding : utf-8 or # coding = utf-8 (with a space before the : or =) raises an error. The problem is not the choice of = versus :, but the whitespace: no space is allowed between coding and the : or = symbol, while zero or more spaces are allowed between the symbol and the encoding name such as utf-8, and likewise between # and coding. I did not know why at first, but it really does fail in practice.
#!/usr/bin/env python
# coding = utf-8
print '英文'
Here there is a space between coding and the = sign:
An error is raised.
#!/usr/bin/env python
# coding= utf-8
print '英文'
No space between coding and =:
It executes normally.
I am not sure whether this is my IDE's quirk or Python's own syntax, but few places discuss this detail, so I mention it here; experiment with it yourself.
# -*- coding: utf-8 -*- behaves the same way.
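This behaviour matches the rule PEP 263 specifies: the interpreter searches the first two lines for a pattern roughly like coding[:=]\s*&lt;name&gt;. A sketch with an approximation of that regex (the exact pattern in CPython may differ slightly):

```python
import re

# Approximation of the PEP 263 declaration pattern: 'coding' must be
# immediately followed by ':' or '=', then optional spaces, then the name.
pat = re.compile(r"coding[:=]\s*([-\w.]+)")

for line in ("# coding: utf-8",
             "# coding=utf-8",
             "# coding= utf-8",
             "# -*- coding: utf-8 -*-",
             "# coding = utf-8"):   # space before '=' -> not recognized
    m = pat.search(line)
    print("%-28r -> %s" % (line, m.group(1) if m else "no declaration"))
```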
OK, let's get to the point:
#!/usr/bin/env python
# coding=utf-8
print '英文'
print str('中文')
print repr('中文')
print repr(u'中文')
Here, by the way, is the difference between the str() and repr() functions when creating or converting strings: str() produces a string that is more readable to humans, and print calls it by default, while repr() produces a string that is more readable to the machine, which is where the encoding shows up.
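In Python 3 terms (where byte strings are a separate bytes type) the same idea looks like this; a sketch, not the article's Python 2 output:

```python
s = "中文"
print(str(s))     # 中文   -- human-readable form, what print uses
print(repr(s))    # '中文' -- unambiguous form, quotes included

# repr() on the encoded bytes exposes the actual stored encoding:
b = s.encode("utf-8")
print(repr(b))    # b'\xe4\xb8\xad\xe6\x96\x87'
```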
The following is the output under another encoding declaration:
#!/usr/bin/env python
# coding=gbk
print '英文'
print str('中文')
print repr('中文')
print repr(u'中文')
The first two lines are garbled, but that is my IDE's fault: my IDE defaults to UTF-8.
Change the IDE's encoding to GBK and try again.
Now it works.
This raises another question: what is the file's save encoding, and what is the run encoding?
If it still fails, adjust the setting once more.
And it's fine again.
In fact, the save encoding, i.e. the encoding set in the IDE, is the encoding used to store the file on disk and to open it. Suppose I write code in Windows Notepad, that is, save it as ANSI: opening it under any other encoding gives garbage, and that has nothing to do with running the code. The run encoding, on the other hand, is what affects the Python interpreter's output. As I mentioned for the first Python program: when I run a Python file from cmd on Windows, why, even though I declared utf-8 in the source and can use Chinese, does the output still show garbage? Because although Python outputs UTF-8-encoded bytes, cmd decodes them as GBK by default, so garbage is displayed. It is not truly corrupted, though; switch to an environment that displays UTF-8 and it looks fine.
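The cmd garbage is just a decode with the wrong codec, and it is reversible because the underlying bytes are intact. A Python 3 sketch:

```python
# UTF-8 bytes wrongly decoded as GBK: garbage on screen, but no data loss.
utf8_bytes = "汉字".encode("utf-8")
mojibake = utf8_bytes.decode("gbk")
print(mojibake)                               # garbled characters
print(mojibake.encode("gbk").decode("utf-8")) # 汉字 -- fully recoverable
```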
OK, that digression ran a bit long, but these small mistakes can keep you stuck for ages, so they are worth mentioning.
Now to the point. Observe the following behaviour (the original post shows the output as screenshots, first under a utf-8 declaration, then under gbk).
We can see that as the Python coding declaration changes, the encoding of the string changes with it, so we reach a conclusion: the encoding declaration in Python determines the encoding of ordinary (byte) strings, i.e. strings created with the factory function str() or with plain quotes; their bytes differ under different declarations.
Now look at this phenomenon: a Unicode string turns out to be identical in every case, because no matter what you declare, it is always stored as Unicode.
A Unicode string, i.e. a unicode object, is created with the factory function unicode() or by prefixing the quotes with u.
So we can state Python's encoding-conversion rule: decode a byte string into Unicode, then encode the Unicode string into the target encoding.
Expanding this to the general case: for any character set Python supports, internal encoding conversion follows this same logic, with Unicode in the middle.
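The rule in code (Python 3, where the Python 2 str/unicode pair becomes bytes/str; byte values for 汉 as in the text):

```python
# Conversion always goes through Unicode:
#   source bytes --decode--> unicode string --encode--> target bytes
gbk_bytes = b"\xba\xba"                 # 汉 in GBK
text = gbk_bytes.decode("gbk")          # bytes -> unicode
utf8_bytes = text.encode("utf-8")       # unicode -> bytes in the target charset
print(text)                             # 汉
print(utf8_bytes == b"\xe6\xb1\x89")    # True
# Python 2 spelling of the same logic: 'gbk str'.decode('gbk').encode('utf-8')
```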
Knowing this lets you avoid many mistakes in file handling, and we'll talk about Python files in the next chapter.
(This post: 19. Python coding issues.)