about the historical evolution of encodings, how UTF-8 developed, why Windows still keeps GBK encoding ...
And so on. A search online turns up plenty, but most of it is the same content forwarded and reshared, and it still could not resolve my doubts ...
Encoding is a real pain. If you don't get it straight, how are you supposed to get by in China?
By reading through many documents and running in-depth experiments, I have built up a basic world view of encoding.
Please Google the basics yourself. Enough talk, let's get straight to the good stuff!!
Below, a few simple code snippets explain, step by step, the "encoding" and "decoding" problems in code!! (Everything is run on Linux.)
"Code One":
    import sys, locale

    s = "小甲"
    print(s)
    print(type(s))
    print(sys.getdefaultencoding())
    print(locale.getdefaultlocale())

    # Save the same string to three files, each with a different encoding.
    with open("utf1", "w", encoding="utf-8") as f:
        f.write(s)
    with open("gbk1", "w", encoding="gbk") as f:
        f.write(s)
    with open("jis1", "w", encoding="shift-jis") as f:
        f.write(s)
The code is simple; anyone who has learned Python should understand what it does ~~
Let's look at the output:
"Code One" output:
    小甲
    <class 'str'>
    utf-8
    ('en_US', 'UTF-8')
As you would expect, it prints "小甲" as is, and then saves "小甲" to three files.
(Shift-JIS is a Japanese encoding.)
I don't know whether code this silly will put anyone off.
(The important part comes later.)
Here is what the two "utf-8" values in the output mean:
The first utf-8 is the system default encoding (sys.getdefaultencoding()).
- Note: do not read "system" as the operating system; here it means the Python 3 interpreter itself.
The second UTF-8 is the local default encoding (locale.getdefaultlocale()).
- Note: this one is the operating system's encoding. (Run on a Chinese-language Windows, it becomes GBK.)
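To check the two defaults on your own machine, a minimal sketch like the following may help. It uses locale.getpreferredencoding(False) rather than locale.getdefaultlocale(); that swap is my own choice, since recent Python versions deprecate the latter. The variable names are made up for illustration, and the locale value depends on the platform, so no output is shown.

```python
import locale
import sys

# Encoding the Python 3 interpreter itself uses by default,
# for example when str.encode() is called with no argument.
interpreter_default = sys.getdefaultencoding()

# Encoding preferred by the operating system's locale; varies by platform.
locale_default = locale.getpreferredencoding(False)

print("interpreter default:", interpreter_default)
print("locale default:", locale_default)

# str.encode() with no argument produces UTF-8 bytes in Python 3.
default_bytes = "小甲".encode()
print(default_bytes)
```

The point is that the two values come from different places: one from Python, one from the operating system.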
Now let's look at the contents of the three files utf1, gbk1, and jis1:
utf1: 小甲
gbk1: С???
jis1: ???b
Question:
Why is the content of utf1 readable with no encoding problem, while the contents of gbk1 and jis1 are garbled?
Explanation:
Because those two files were stored with encodings other than utf-8, yet when I read them I used "utf-8", the default encoding of my Linux system.
What was written to disk was not utf-8, but it was read back as utf-8, so of course it cannot be read correctly.
(Here you need to understand what encoding really does.)
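The same mismatch can be reproduced in memory without touching any files. This is only a sketch: errors="replace" is my stand-in for what a terminal or viewer does when it substitutes a placeholder for bytes it cannot decode, and the variable names are illustrative.

```python
text = "小甲"

# Encode ("write") the same text with three different codecs.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
jis_bytes = text.encode("shift-jis")

# Decoding with the matching codec recovers the text.
assert gbk_bytes.decode("gbk") == text

# The GBK and Shift-JIS bytes are not valid UTF-8; errors="replace"
# swaps each undecodable byte for the U+FFFD replacement character.
gbk_as_utf8 = gbk_bytes.decode("utf-8", errors="replace")
jis_as_utf8 = jis_bytes.decode("utf-8", errors="replace")

print(utf8_bytes.decode("utf-8"))
print(gbk_as_utf8)
print(jis_as_utf8)
```

Only the pair that encodes and decodes with the same codec gets the original text back.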
"Code Two":
    # coding=gbk
    import sys, locale

    s = "小甲"
    print(s)
    print(type(s))
    print(sys.getdefaultencoding())
    print(locale.getdefaultlocale())

    # Same three writes as before; only the declaration on line 1 is new.
    with open("utf2", "w", encoding="utf-8") as f:
        f.write(s)
    with open("gbk2", "w", encoding="gbk") as f:
        f.write(s)
    with open("jis2", "w", encoding="shift-jis") as f:
        f.write(s)
The code structure is just as simple.
But please note: I added an encoding declaration at the top.
Before running the code, try to guess the results yourself ~~~
"Code Two" output:
    灏忕敳
    <class 'str'>
    utf-8
    ('en_US', 'UTF-8')
    Traceback (most recent call last):
      File "2", line 15, in <module>
        f.write(s)
    UnicodeEncodeError: 'shift_jis' codec can't encode character '\u704f' in position 0: illegal multibyte sequence
Here come the questions:
1. The code clearly says s = "小甲", so why did it become "灏忕敳"??
2. Why did the JIS encoding fail with an error? (Before, the worst case was garbled text, never an error. What happened inside?)
3. What does "coding=gbk" mean?
4. I clearly wrote the "coding=gbk" declaration, so why did neither the system default encoding nor the local default encoding change? (What was the point of writing it?)
Explanation: all of the questions above come down to not understanding what the "coding=gbk" declaration at the top of the file means!!
1. It tells the Python 3 interpreter which encoding to use to "decode" the .py file when reading it. It only concerns reading, so once you know for certain which encoding your editor saves your code in, write that encoding in the declaration at the top of the file.
(In this example, I wrote the code in an editor using the Linux default encoding, utf-8, but then asked Python to decode it with GBK at run time. Naturally that does not match, which is why s = "小甲" came out garbled.)
(Keep in mind that encoding and decoding are two separate steps, and they must correspond one-to-one for the text to decode correctly! We usually just say "encoding format", which is somewhat misleading.
The other half is the "decoding format". Consciously distinguish "encoding" from "decoding", and do not conflate the two the way some articles online do!!)
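To see the declaration at work without creating a .py file, here is a small sketch that builds the source as bytes in memory. The use of tokenize.detect_encoding and of compile() on raw bytes is my own way of illustrating the point, not the method used in the experiments above; the names source_bytes, namespace and so on are hypothetical.

```python
import io
import tokenize

# Bytes of a tiny source file: saved as UTF-8, but declared as GBK.
source_bytes = "# coding=gbk\ns = '小甲'\n".encode("utf-8")

# tokenize.detect_encoding reports which codec Python will use to decode
# the file, based on a declaration in the first two lines.
declared, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
print(declared)

# Compiling the raw bytes makes Python decode them with the declared codec,
# so the string literal is already garbled by the time the code runs.
namespace = {}
exec(compile(source_bytes, "<demo>", "exec"), namespace)
garbled = namespace["s"]
print(garbled)

# Without a declaration, UTF-8 is assumed.
plain_bytes = "s = '小甲'\n".encode("utf-8")
default_codec, _ = tokenize.detect_encoding(io.BytesIO(plain_bytes).readline)
print(default_codec)
```

The declaration changes only how the source bytes are decoded; nothing else about the program is affected.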
2. From the explanation above, it should be clear that writing the declaration does not change the local default encoding or the system default encoding.
(The local default encoding depends only on the operating system: utf-8 on Linux, GBK on Windows.)
(The system default encoding is really a difference between Python 3 and Python 2: utf-8 in Python 3, ASCII in Python 2.)
3. So what are the two default encodings for?
Pay attention, this is the key point:
System default encoding means:
When the Python 3 interpreter reads a .py file that has no encoding declaration at the top, it decodes the file with "utf-8" by default. Likewise, when encode() is called without an argument, it defaults to "utf-8". (Keep this distinct from the "encoding" parameter of open() described next; the two are easy to confuse!!!)
Local default encoding means:
When you write a Python 3 program and call open() without passing the "encoding" parameter, the local default encoding is used automatically. Yes, on a Windows system that default is GBK!!!
(This puzzled me for a long time. Wasn't the default supposed to be utf-8, always? After I switched to Windows, it let me down again and again. So take note: on Linux you can get away without passing "encoding", but on Windows you must never forget it ~~~)
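A small sketch of the habit this suggests: always pass encoding to open(), and inspect the file object's .encoding attribute to see what you would get otherwise. The temporary path and variable names are only for illustration, and the default that is printed depends on your platform and Python version.

```python
import locale
import os
import tempfile

text = "小甲"
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Passing encoding explicitly gives the same bytes on Linux and Windows.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Without the argument, open() picks a platform-dependent default;
# the file object's .encoding attribute shows which one was chosen.
with open(path, "r") as f:
    implicit_encoding = f.encoding

print("open() default here:", implicit_encoding)
print("locale preference:", locale.getpreferredencoding(False))
```

With the explicit argument, the script no longer cares which operating system it runs on.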
4. Now to answer the question about the error:
Because the interpreter has already decoded the .py file with GBK, the variable s that it read has become the "灏忕敳" we now see! So what gets saved to disk at this point is really the garbled "灏忕敳". Shift-JIS has no code for these three characters, so Python naturally reports that encoding failed at position 0.
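The failure can be reproduced directly on the garbled string. This is a minimal sketch with illustrative variable names; it catches the exception instead of letting it end the program.

```python
garbled = "灏忕敳"  # what s became after the .py file was decoded as GBK

# GBK can encode these characters, so the write to gbk2 succeeds.
gbk_bytes = garbled.encode("gbk")

# Shift-JIS cannot represent the first character, so encoding raises.
try:
    garbled.encode("shift-jis")
    failure = None
except UnicodeEncodeError as err:
    failure = err
    print(err.encoding, "failed at position", err.start,
          "on", ascii(err.object[err.start]))
```

Unlike decoding for display, encoding has no "show a placeholder" fallback by default, which is why this case is an error rather than garbled text.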
Now let's look at the contents of the three files utf2, gbk2, and jis2:
utf2: 灏忕敳
gbk2: 小甲
jis2:
(Is that what you imagined?? Hehe ~~)
Questions:
1. I encoded with "utf-8" for storage and later decoded with the Linux default "utf-8", so why is the text garbled?
2. I encoded with "GBK" for storage and then decoded with the Linux default "utf-8". The encoding and decoding formats obviously differ, so why does it display correctly?
Explanation:
1. These two questions are really the same question. Careful readers probably already know where the problem lies; I have explained it above. At this point the variable s has already become "灏忕敳", so the text file utf2 naturally shows "灏忕敳".
2. And where do the three characters "灏忕敳" come from?
Step 1: 小甲 (unicode) --- encode with "utf-8" ---> e5b0 8fe7 94b2 (the bytes after utf-8 encoding)
Step 2: e5b0 8fe7 94b2 --- decode with "gbk" ---> "灏忕敳" (unicode) (garbled)
Step 3: "灏忕敳" --- encode with "gbk" ---> e5b0 8fe7 94b2 (the reverse of step 2)
Step 4: e5b0 8fe7 94b2 --- decode with "utf-8" ---> 小甲 (unicode)
I think the steps above are clear enough ~
Steps 3 and 4 run the process in reverse, which brings back the normal "小甲".
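The four steps can be sketched in a few lines of Python. The names step1 to step4 are only labels for this illustration.

```python
original = "小甲"

# Step 1: encode the Unicode string with UTF-8.
step1 = original.encode("utf-8")
print(step1.hex())

# Step 2: decode those bytes with the wrong codec (GBK), giving garbled text.
step2 = step1.decode("gbk")

# Step 3: encode the garbled text with GBK, reversing step 2.
step3 = step2.encode("gbk")

# Step 4: decode with UTF-8, reversing step 1.
step4 = step3.decode("utf-8")
print(step2, step4)
```

The bytes never change between steps 1 and 3; only the codec used to interpret them does.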
Once you understand this process of "encoding" and "decoding", more than half of your encoding problems are solved!
"Code Three":
    # coding=shift-jis
    import sys, locale

    s = "小甲"
    print(s)
    print(type(s))
    print(sys.getdefaultencoding())
    print(locale.getdefaultlocale(), "\n")

    # Encode the (garbled) string back to bytes, then try two decodings.
    a = s.encode("shift-jis")
    print(a)
    print(type(a))

    b = a.decode("utf-8")
    print(b)
    print(type(b))

    print(a.decode("gbk"))

    with open("utf3", "w", encoding="utf-8") as f:
        f.write(s)
    with open("gbk3", "w", encoding="gbk") as f:
        f.write(s)
    with open("jis3", "w", encoding="shift-jis") as f:
        f.write(s)
The overall structure of the code is still the same, with a little extra code in the middle to make the explanation easier ~
"Code Three" output:
    蟆冗抜
    <class 'str'>
    utf-8
    ('en_US', 'UTF-8')

    b'\xe5\xb0\x8f\xe7\x94\xb2'
    <class 'bytes'>
    小甲
    <class 'str'>
    灏忕敳
As you can see, our variable s has become "蟆冗抜" (yet another kind of garbling, this time caused by decoding with JIS).
So this time I encode "蟆冗抜" back with "shift-jis" and assign the resulting bytes to the variable a, then decode a with "utf-8" into b. Printing b shows the normal "小甲", which confirms my reasoning above!!
Now let's look at the contents of the three files utf3, gbk3, and jis3:
utf3: 蟆冗抜
gbk3: ???I
jis3: 小甲
(Oops ~~ what a mess again.)
Let me clarify: utf3 at least still shows real characters, and that is what "garbled text" means. The black blob in gbk3 is something else, a decoding error: the Linux default encoding cannot decode the gbk3 file, so what gets displayed is a jumble.
Question:
- Why is the utf3 file garbled, while the gbk3 file is an error??
Explanation:
- This is because the utf-8 and GBK encoding algorithms differ.
- The decoding errors we see most often are utf-8 ones, because utf-8 is a variable-length encoding: 1 byte for English characters, 2 bytes for Arabic, and 3 bytes for Chinese and Japanese.
- GBK uses a single byte for English (so it is also ASCII-compatible) and a fixed 2 bytes for Chinese. The overall range is 8140-FEFE, with the first byte between 81 and FE and the trailing byte between 40 and FE. So as long as the trailing byte is not below 40, GBK will blindly decode every 2 bytes into a Chinese character. Chinese encoded with utf-8, however, is generally 3 bytes per character. When the number of bytes consumed in decoding does not match the number produced in encoding, the result is naturally completely garbled.
(Thanks to "King of Exile" for this point.)
- utf-8, by contrast, is strictly defined: the high bit of a one-byte character must be 0, and the first byte of a three-byte character must start with 1110.
- (Link to material on the utf-8 encoding algorithm)
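The byte lengths and bit patterns described above can be checked with a short sketch. The sample characters and variable names are my own picks for illustration.

```python
samples = {"English": "a", "Arabic": "ب", "Chinese": "小"}

# UTF-8 is variable-length: 1, 2, and 3 bytes for these three samples.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)

# GBK: one byte for ASCII, two bytes for a Chinese character.
gbk_lengths = {name: len(samples[name].encode("gbk"))
               for name in ("English", "Chinese")}
print(gbk_lengths)

# Bit pattern of the first UTF-8 byte: 0xxxxxxx for a one-byte character,
# 1110xxxx for the lead byte of a three-byte character.
ascii_bits = format("a".encode("utf-8")[0], "08b")
chinese_bits = format("小".encode("utf-8")[0], "08b")
print(ascii_bits, chinese_bits)

# A lone lead byte is not valid UTF-8, so strict decoding raises an error.
try:
    b"\xe5".decode("utf-8")
    strict_error = None
except UnicodeDecodeError as err:
    strict_error = err
```

This strictness is why utf-8 tends to report an error where GBK would quietly produce wrong characters.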
That wraps up the code demonstrations ~~ my hands are sore from all this typing ~~~~(>_<)~~~~
Finally
Tips
1. The encoding of any file is determined by the editor you are using at the time!! Text edited on Windows sometimes shows up garbled and sometimes fine when a browser renders it, because many Windows text editors default to the same encoding as the operating system.
So before saving text, be sure to find out whether you are using utf-8 or GBK!!!
And when you use Python's open() function, the process's memory is exchanging data with the disk, and the encoding used in that exchange defaults to the operating system's default encoding (utf-8 on Linux, GBK on Windows).
2. Anyone learning Python has probably heard that the default encoding of Python 3 is utf-8. Yet some people say the default encoding of Python 3 is Unicode. Is anyone else as confused as I was as a beginner about the relationship between the two?
- Unicode is really a character set: a one-to-one mapping between characters and numbers. Because it is represented with 2 bytes per character (or 4 bytes, not discussed here), it takes more space, so it is generally used only as the in-memory representation.
- utf-8 was designed for transmitting and storing Unicode. Because it is variable-length, it saves a lot of storage space for English text, and it saves bandwidth in transmission as well, which makes it more "international" ~
So the two statements do not contradict each other. Inside a process, text in memory is represented as "Unicode". When the Python 3 interpreter reads a .py file from disk, it defaults to "utf-8". And when the process runs storage code such as open() or write() and needs to interact with the disk, it defaults to the operating system's default encoding.
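To make the distinction concrete, this sketch separates the Unicode number of each character from the bytes that a particular encoding produces. It uses "utf-16-be" as a stand-in for a fixed 2-byte form; that choice and the variable names are mine, for illustration only.

```python
text = "小甲"

# Unicode assigns each character a number (its code point).
code_points = [hex(ord(ch)) for ch in text]
print(code_points)

# UTF-8 and UTF-16 are two different ways of turning those numbers into bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")
print(len(utf8_bytes), len(utf16_bytes))

# English text costs one byte per character in UTF-8, two in UTF-16.
english = "hello"
english_sizes = (len(english.encode("utf-8")),
                 len(english.encode("utf-16-be")))
print(english_sizes)
```

The code points stay the same no matter which encoding is chosen; only the bytes on disk differ.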
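I don't know how to turn this into a "fancy" divider line ~~~
Typing, formatting, and organizing my thoughts took nearly 5 hours. If this article helped you or gave you some new insight into encoding, I hope you will give it a like.
Sending love ~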