First, some history. Fair warning: this article rambles, like the proverbial old lady's foot-binding cloth, long and smelly.
History of character encoding:
1. A computer can only process numbers, so text has to be converted to numbers before it can be processed. 8 bits = 1 byte, so the largest number a single byte can represent is 255.
2. Americans invented the computer. In English every character fits in a single byte, and ASCII (the American Standard Code for Information Interchange) is that one-byte encoding.
3. When Chinese users came to the computer they needed to represent Chinese characters, so the GB2312 encoding was defined, which uses two bytes per Chinese character. Other countries likewise created encodings for their own languages. There was no common standard, so text written in one encoding and read with a different one comes out garbled.
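The garbling described in item 3 can be reproduced in a few lines. This is a minimal sketch in Python 3 (not the Python 2 used later in this article); the sample character 中 and the choice of Latin-1 as the "wrong" encoding are assumptions made purely for illustration.

```python
# Encode a Chinese character with GB2312, then read the bytes back
# with an unrelated single-byte encoding (Latin-1, picked for illustration).
text = "中"
raw = text.encode("gb2312")        # two bytes for one Chinese character
garbled = raw.decode("latin-1")    # wrong codec: each byte becomes its own character
restored = raw.decode("gb2312")    # right codec: the original text comes back

print(len(raw), repr(garbled), restored)
```

The bytes themselves never change; only the codec used to interpret them decides whether the reader sees the original character or garbage.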
4. To unify the standards, Unicode appeared: a single character set that covers all languages.
   Comparison between Unicode and ASCII:
   1) The letter A: in ASCII it is decimal 65, binary 0100 0001.
      The Chinese character 中 is outside the ASCII range; its Unicode code point is decimal 20013, binary 01001110 00101101.
   2) To give the computer a uniform length, ASCII characters are padded with leading zeros, so A becomes 00000000 01000001.
   With that, the standards are unified.
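The numbers in the comparison above can be checked directly. This is a small Python 3 sketch; the helper name bits is made up here for illustration, and UTF-16 big-endian is used as a stand-in for the fixed two-byte Unicode form the text describes.

```python
def bits(ch, width):
    """Return the code point of ch as a zero-padded binary string (hypothetical helper)."""
    return format(ord(ch), "0{}b".format(width))

a_code = ord("A")             # code point of the letter A
zhong_code = ord("中")         # code point of the Chinese character
a_8 = bits("A", 8)            # one-byte view of A
a_16 = bits("A", 16)          # same value padded with leading zeros to two bytes
zhong_16 = bits("中", 16)      # two-byte view of the Chinese character

# Assumption: UTF-16 big-endian plays the role of the two-byte Unicode form.
a_two_bytes = "A".encode("utf-16-be")

print(a_code, zhong_code, a_8, a_16, zhong_16, a_two_bytes)
```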
5. With a unified standard the garbled-text problem is solved, but Unicode encodings are longer, and computing is mostly done in English. If the content is entirely English, a two-byte Unicode encoding needs twice the storage of ASCII, and transmission costs double as well.
   How to solve it?
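6. Make the encoding length variable. This is how UTF-8 came about.
   In UTF-8 an English letter takes 1 byte, a Chinese character usually takes 3 bytes, and rarer characters take 4 bytes (the original design allowed sequences of up to 6 bytes; the current standard stops at 4).
   This saves both storage space and bandwidth.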
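The size difference from items 5 and 6 is easy to measure. This is a Python 3 sketch under two assumptions: UTF-16 big-endian stands in for the fixed two-byte Unicode form, and U+20000 is picked as an example of a rarely used character.

```python
english = "hello"
ascii_size = len(english.encode("ascii"))          # one byte per letter
two_byte_size = len(english.encode("utf-16-be"))   # two bytes per letter
utf8_size = len(english.encode("utf-8"))           # back to one byte per letter

letter_len = len("A".encode("utf-8"))              # an English letter
chinese_len = len("中".encode("utf-8"))             # a common Chinese character
rare_len = len("\U00020000".encode("utf-8"))       # U+20000, a rarely used CJK character

print(ascii_size, two_byte_size, utf8_size, letter_len, chinese_len, rare_len)
```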
7. That raises a new question: in memory the computer works with Unicode, so how do we convert to and from UTF-8?
   When text needs to be processed it is loaded into memory, where it must be held as Unicode.
   When text is sent over the network or saved to a file, it is encoded as UTF-8 to save space.
   Hence the conversion in both directions.
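The two directions of that conversion can be sketched as a save-and-load round trip. This is a Python 3 illustration, not the author's code; the sample text and the temporary file are assumptions made for the example.

```python
import os
import tempfile

text = "abc 中文"                  # held in memory as Unicode (a Python 3 str)

# Saving: encode to UTF-8 bytes before writing to disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(text.encode("utf-8"))

# Loading: read the raw bytes and decode them back to Unicode.
with open(path, "rb") as f:
    stored = f.read()
loaded = stored.decode("utf-8")
os.remove(path)

print(len(text), len(stored), loaded == text)
```

The text has 6 characters but occupies 10 bytes on disk: one byte for each ASCII character and three for each Chinese character.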
Encoding conversion in Python 2 and Python 3 on Windows and Linux
Python 2:
On Windows:
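1. First check the default encoding:
    import sys
    sys.getdefaultencoding()
    # out: 'utf-8'
   (This is the value recorded here. A stock Python 2 normally reports 'ascii' on every platform, and the console's own encoding is a separate setting.)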
2. A string that is all English:
    s1 = "abc"     # type(s1): str
    s2 = u"abc"    # type(s2): unicode
   Meaning of u"": the string that follows is stored as a Unicode object.
    s1.encode("utf8")    # succeeds
    s2.encode("utf8")    # succeeds
3. When Chinese is present:
    s1 = "你好"     # GB2312-encoded bytes on Windows
    s2 = u"你好"
    s1.encode("utf8")    # error (UnicodeDecodeError)
    s2.encode("utf8")    # succeeds
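Cause of the error:
Text is meant to be held in memory as Unicode, but s1 arrives as GB2312 bytes rather than as a Unicode object (storing it as Unicode would waste space).
encode() converts a Unicode object into the encoding named by its argument, so Python 2 first has to decode s1 implicitly with the default codec, and that step fails on the Chinese bytes.
s2 is already a Unicode object, so it raises no error.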
Workaround:
First convert the GB2312 bytes into a Unicode object, then encode that object as UTF-8:
    s1.decode("gb2312").encode("utf8")    # succeeds; Chinese on Windows is "gb2312"
decode("XX") converts an object encoded as "XX" into a Unicode object.
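The failure and the fix can be reproduced under Python 3, where the Python 2 str corresponds to bytes and unicode corresponds to str. This is an emulation, not Python 2 itself: the explicit decode("ascii") below stands in for the implicit decoding that Python 2 performs, and ASCII is assumed as the default codec.

```python
s1 = "你好".encode("gb2312")   # stand-in for a Python 2 str typed in a GB2312 console
s2 = "你好"                    # stand-in for u"你好"

# Python 2's s1.encode("utf8") first decodes s1 with the default codec
# (assumed to be ASCII here), and that is the step that blows up.
try:
    s1.decode("ascii").encode("utf8")
    failed = False
except UnicodeDecodeError:
    failed = True

# The workaround: decode with the encoding the bytes really use, then encode.
utf8_bytes = s1.decode("gb2312").encode("utf8")

print(failed, utf8_bytes == s2.encode("utf8"))
```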
On Linux:
1. First check the default encoding:
    import sys
    sys.getdefaultencoding()
    # out: 'ascii'
2. A string that is all English:
    s1 = "abc"     # type(s1): str
    s2 = u"abc"    # type(s2): unicode
   Meaning of u"": the string that follows is stored as a Unicode object.
    s1.encode("utf8")    # succeeds
    s2.encode("utf8")    # succeeds
3. When Chinese is present:
    s1 = "你好"     # UTF-8 encoded bytes
    # Why not ASCII, when the default encoding on Linux is ASCII? Because ASCII
    # cannot represent Chinese, so the terminal must have supplied the text as UTF-8.
    s2 = u"你好"
    s1.encode("utf8")    # error (UnicodeDecodeError)
    s2.encode("utf8")    # succeeds
Workaround:
First convert the UTF-8 bytes into a Unicode object, then encode that object as UTF-8:
    s1.decode("utf8").encode("utf8")    # succeeds; Chinese on Linux is "utf-8"
This simply takes s1 there and back again, since it was UTF-8 to begin with.
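The Linux case can be emulated in Python 3 the same way, with bytes playing the role of the Python 2 str. This is a sketch under that assumption; the second half shows why decode() must name the encoding the bytes are actually in.

```python
s1 = "你好".encode("utf8")     # stand-in for a Python 2 str typed in a UTF-8 terminal

# Decode to Unicode, then encode again: the bytes come back unchanged.
round_trip = s1.decode("utf8").encode("utf8")

# Naming the wrong source encoding (GB2312 here) either raises an error
# or yields different characters; it never gives back the original text.
try:
    wrong = s1.decode("gb2312")
except UnicodeDecodeError:
    wrong = None

print(round_trip == s1, wrong != "你好")
```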
Python 3:
In Python 3 every str is a Unicode string and can be encoded directly to UTF-8.
On Windows:
1. A string that is all English:
    s1 = "abc"     # type(s1): str
    s2 = u"abc"    # type(s2): str (Python 3 has no separate unicode type)
   Meaning of u"": the string that follows is stored as Unicode. Python 3 accepts the prefix, but the string is Unicode either way.
    s1.encode("utf8")    # succeeds
    s2.encode("utf8")    # succeeds
2. When Chinese is present:
    s1 = "你好"     # Unicode, on Windows as well
    s2 = u"你好"    # the u"" is unnecessary; Python 3 treats the literal as Unicode without it
    s1.encode("utf8")    # succeeds
    s2.encode("utf8")    # succeeds
On Linux: the same as on Windows.
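A short Python 3 check of the behaviour described above. The variable names are illustrative only.

```python
import sys

s1 = "你好"
s2 = u"你好"                          # the u prefix is accepted but changes nothing

same_type = type(s1) is type(s2) is str
encoded = s1.encode("utf8")          # str -> bytes, no decode step needed
decoded = encoded.decode("utf8")     # bytes -> str
default = sys.getdefaultencoding()   # Python 3's default encoding

print(same_type, len(encoded), decoded == s1, default)
```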
Summary: a word about # -*- coding: utf-8 -*-
The biggest difference between Python 2 and Python 3:
In Python 2, when a file contains Chinese, this line must be added at the top of the file, and string literals need the u"" prefix.
What it does:
It tells Python that the file is encoded as UTF-8, so the source is interpreted with that encoding.
The conversion to Unicode then happens internally.
Why it is not needed in Python 3:
Python 3 reads source files as UTF-8 by default and treats every string literal as Unicode.
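The effect of the declaration can be observed in Python 3 by compiling source bytes directly. This is a hedged sketch: it uses GB2312 rather than UTF-8 as the declared encoding so that the difference from the default is visible, and the file name "<example>" is a placeholder.

```python
# Source bytes as they would sit on disk in a GB2312-encoded file.
declared = '# -*- coding: gb2312 -*-\ns = "你好"\n'.encode("gb2312")

namespace = {}
exec(compile(declared, "<example>", "exec"), namespace)
declared_ok = namespace["s"] == "你好"    # the declaration told Python how to read the bytes

# Without the declaration Python 3 assumes UTF-8, and these bytes are not valid UTF-8.
undeclared = 's = "你好"\n'.encode("gb2312")
try:
    compile(undeclared, "<example>", "exec")
    rejected = False
except (SyntaxError, UnicodeDecodeError):
    rejected = True

print(declared_ok, rejected)
```

A file that really is saved as UTF-8 needs no declaration in Python 3, which is why the line can usually be left out there.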