Python basics 1: Python basics
First, some history. Be warned that this article is long-winded, like the proverbial old lady's foot-binding cloth: smelly and long.
Encoding history:
1. A computer can only process numbers, so text must be converted to numbers before it
can be processed. 8 bits = 1 byte, so the largest integer a single byte can hold is 255.
2. Computers were invented in the United States, and one byte is enough to represent every character used in English.
ASCII (one byte per character) is the American standard encoding.
3. When computers came into use in China, Chinese characters had to be represented as well, so the
GB2312 encoding was created, which uses two bytes per Chinese character. Other countries likewise
created encodings for their own languages. With no common standard, text written in one
encoding and read in another turns into garbled characters.
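The garbling described in item 3 can be reproduced in a few lines. This is a minimal sketch in Python 3; the sample string 中文 and the choice of Latin-1 as the "wrong" encoding are assumptions made only for illustration.

```python
# A two-character Chinese string, encoded with GB2312.
text = "中文"
raw = text.encode("gb2312")
print(raw)       # the encoded bytes
print(len(raw))  # two bytes per Chinese character

# Reading the same bytes with the wrong codec does not fail here;
# it silently produces different characters (garbled text).
garbled = raw.decode("latin-1")
print(garbled)

# Reading them with the codec they were written in recovers the text.
restored = raw.decode("gb2312")
print(restored)
```

The bytes never change; only the table used to interpret them does, which is why the same file can look fine on one machine and garbled on another.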
4. To unify these standards, Unicode emerged, bringing all languages into a single character set.
Unicode and ASCII encoding comparison
1) The letter A: ASCII decimal 65, binary 0100 0001.
The Chinese character 中 is outside the ASCII range; its Unicode code point is decimal 20013, binary 01001110 00101101.
2) So that the computer can treat every character as the same length, A is padded with leading zeros, that is, 00000000 01000001,
which gives one unified standard.
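The numbers in this comparison can be checked directly. A small sketch in Python 3, using the built-in ord() and format(); the variable names are only illustrative.

```python
# Print each character's code point and its 16-bit binary form,
# split into two groups of eight bits.
for ch in ("A", "中"):
    code_point = ord(ch)
    bits = format(code_point, "016b")  # zero-padded to 16 bits
    print(ch, code_point, bits[:8], bits[8:])

a_bits = format(ord("A"), "016b")
zhong_bits = format(ord("中"), "016b")
```

The leading zeros on A are exactly the padding described in 2).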
5. With one standard the garbling problem is solved, but Unicode values are longer. Most computer text is English,
and if the content is entirely English, a two-byte Unicode encoding doubles the storage and transmission space compared with ASCII.
How can this problem be solved?
6. The answer is to make the encoded length variable, which is how UTF-8 came about.
In UTF-8 an English letter takes 1 byte, a Chinese character usually takes 3 bytes, and rare characters take 4 bytes (the original design allowed up to 6 bytes, but the current standard stops at 4).
This saves both storage and transmission space.
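The space saving can be measured. The sketch below is in Python 3 and uses UTF-16 (big-endian, no byte order mark) as a stand-in for the fixed two-byte Unicode form discussed above; that stand-in, the helper name encoded_sizes, and the sample characters are assumptions for illustration.

```python
def encoded_sizes(text):
    """Return (UTF-16 size, UTF-8 size) in bytes for the given text."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))

# Pure English text: UTF-8 needs half the space.
english = "hello world"
print(encoded_sizes(english))

# UTF-8 length per character: an ASCII letter, a common Chinese
# character, and a rare one (U+20000, outside the basic plane).
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ("A", "中", "\U00020000")}
print(utf8_lengths)
```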
7. This raises a new question: in memory the computer works with Unicode,
so how do we convert to and from UTF-8?
When text needs to be processed, it is loaded into memory, and the in-memory representation is Unicode.
When data needs to be sent over the network or saved to a file, it is encoded as UTF-8 to save space.
Hence the two are converted back and forth.
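The round trip between memory and storage can be sketched as follows in Python 3. The temporary file name demo.txt and the sample text are assumptions for illustration.

```python
import os
import tempfile

text = "Python 编码"  # a str: Unicode text in memory

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    # Going to disk: encode the Unicode text as UTF-8 bytes.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))
    # Coming back: what the file holds is bytes, not text.
    with open(path, "rb") as f:
        stored = f.read()

# Decode the bytes to get Unicode text in memory again.
loaded = stored.decode("utf-8")
print(len(text), len(stored), loaded == text)
```

The text has 9 characters but occupies 13 bytes on disk: 7 single-byte ASCII characters plus 2 Chinese characters at 3 bytes each.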
Python 2 and Python 3 encoding on Windows/Linux
Python 2:
In Windows:
1. First, let's look at the encoding Python reports on Windows.
import sys
sys.getdefaultencoding()
# Out: "UTF-8"
2. When every string is plain English:
s1 = "abc"   --> type(s1): str
s2 = u"abc"  --> type(s2): unicode
The u"" prefix means the string that follows is stored as a unicode object.
s1.encode("utf8")   succeeds
s2.encode("utf8")   succeeds
3. When Chinese characters appear:
s1 = "你好"   --> a str holding GB2312-encoded bytes on Windows
s2 = u"你好"
s1.encode("utf8")   raises an error
s2.encode("utf8")   succeeds
Cause of the error:
In memory, text is represented as Unicode.
s1, however, is not a unicode object: it holds encoded bytes (storing everything as Unicode would waste space), and
encode() converts a unicode object into the encoding named in its argument. Called on s1, Python 2 first has to decode the bytes implicitly with its default codec, and that step fails.
s2 is already a unicode object, so it does not raise an error.
Solution:
First decode the GB2312 bytes into a unicode object,
then encode that object as UTF-8:
s1.decode("gb2312").encode("utf8")   succeeds, because Chinese text in Windows is encoded as "gb2312"
The decode("xx") method converts a string encoded as "xx" into
a unicode object.
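The failure and the fix can be reproduced in Python 3, where bytes plays the role of the Python 2 str and str plays the role of the Python 2 unicode. This is a sketch rather than the original Python 2 session: the implicit decoding step that Python 2 performed is written out explicitly, and the Chinese string 你好 is used as the sample.

```python
# GB2312 bytes, standing in for a Python 2 str literal on Windows.
s1 = "你好".encode("gb2312")
# Unicode text, standing in for the Python 2 literal u"你好".
s2 = "你好"

# Python 2 had to decode s1 implicitly before it could encode it.
# Neither ASCII nor UTF-8 can decode these GB2312 bytes.
failed_codecs = []
for codec in ("ascii", "utf-8"):
    try:
        s1.decode(codec)
    except UnicodeDecodeError:
        failed_codecs.append(codec)
print(failed_codecs)

# The fix: decode with the encoding the bytes really use, then encode.
utf8_from_s1 = s1.decode("gb2312").encode("utf8")
utf8_from_s2 = s2.encode("utf8")
print(utf8_from_s1 == utf8_from_s2)
```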
In Linux:
1. First, let's look at the encoding Python reports on Linux.
import sys
sys.getdefaultencoding()
# Out: "ascii"
2. When every string is plain English:
s1 = "abc"   --> type(s1): str
s2 = u"abc"  --> type(s2): unicode
The u"" prefix means the string that follows is stored as a unicode object.
s1.encode("utf8")   succeeds
s2.encode("utf8")   succeeds
3. When Chinese characters appear:
s1 = "你好"   --> UTF-8 encoded. Why not ascii, the default reported in Linux? Because ASCII cannot represent Chinese characters,
so the bytes must already have been encoded as UTF-8.
s2 = u"你好"
s1.encode("utf8")   raises an error
s2.encode("utf8")   succeeds
Solution:
First decode the UTF-8 bytes into a unicode object,
then encode that object as UTF-8:
s1.decode("utf8").encode("utf8")   succeeds, because Chinese text in Linux is encoded as "UTF-8"
The result is equivalent to s1 itself: it has simply been converted back to UTF-8.
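The Windows and Linux fixes follow the same decode-then-encode pattern and differ only in the source encoding. A sketch in Python 3, with bytes standing in for the Python 2 str; the helper name to_utf8 is hypothetical and not part of any library.

```python
def to_utf8(raw, source_encoding):
    """Decode raw bytes from source_encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf8")

linux_bytes = "你好".encode("utf8")      # what a Linux terminal would hold
windows_bytes = "你好".encode("gb2312")  # what a Windows console would hold

# In Linux the round trip gives back exactly the same bytes.
print(to_utf8(linux_bytes, "utf8") == linux_bytes)

# In Windows the bytes change, and end up identical to the Linux ones.
print(to_utf8(windows_bytes, "gb2312") == linux_bytes)
```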
Python 3:
In Python 3 every str is a Unicode string, so it can be encoded to "UTF-8" directly.
In Windows:
1. When every string is plain English:
s1 = "abc"   --> type(s1): str
s2 = u"abc"  --> type(s2): str
The u"" prefix is still accepted (Python 3.3 and later), but it changes nothing: the string is Unicode either way.
s1.encode("utf8")   succeeds
s2.encode("utf8")   succeeds
2. When Chinese characters appear:
s1 = "你好"   --> a Unicode str, in Windows as well
s2 = u"你好"  --> there is no need to write this; even without the u"", Python 3 treats it as Unicode
s1.encode("utf8")   succeeds
s2.encode("utf8")   succeeds
In Linux: same as in Windows
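The Python 3 behaviour described above can be checked on any platform. A short sketch, using the Chinese string 你好 as the sample:

```python
s1 = "你好"
s2 = u"你好"  # the u prefix is accepted but makes no difference

# Both literals are plain str objects holding the same text.
same_type = type(s1) is type(s2) is str
print(same_type, s1 == s2)

# A str encodes directly; the result is a bytes object.
encoded = s1.encode("utf8")
print(type(encoded).__name__, len(encoded))

# bytes has decode() but no encode(), so the Python 2 mistake of
# encoding already-encoded data cannot happen by accident.
print(hasattr(encoded, "encode"))
```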
Conclusion: a word about # -*- coding: UTF-8 -*-
The biggest difference between Python 2 and 3:
Python 2: when a file contains Chinese characters, this line must be added at the top, and Chinese string literals must be written with u"".
Purpose:
It tells Python that the file is encoded in UTF-8, so Python interprets the file according to that encoding
and performs the Unicode conversion internally.
Why it need not be written in Python 3:
Python 3: all Python source files are interpreted as UTF-8 by default, and every str is Unicode.
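How Python 3 picks the encoding of a source file can be observed with the standard library's tokenize.detect_encoding(). The sketch below feeds it two in-memory "files"; the helper name declared_encoding and the sample sources are assumptions for illustration.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding

with_cookie = b"# -*- coding: gb2312 -*-\nname = 'abc'\n"
without_cookie = b"name = 'abc'\n"

print(declared_encoding(with_cookie))     # the declared encoding
print(declared_encoding(without_cookie))  # the UTF-8 default
```

The declaration line is still honoured in Python 3 when present; it is just no longer required for UTF-8 files.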