Character encoding
Computers can only process numbers, so text must be converted to numbers before it can be processed. Early computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). Representing a larger integer takes more bytes: two bytes can represent integers up to 65535, and four bytes up to 4294967295.
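The limits above follow from the formula 2**(8 * n) - 1 for n bytes. A minimal sketch, where the helper name max_unsigned is made up for illustration:

```python
# Largest unsigned integer that fits in n bytes: all 8*n bits set to 1.
# max_unsigned is a hypothetical helper name, not a standard function.
def max_unsigned(n_bytes):
    return 2 ** (8 * n_bytes) - 1

limits = {n: max_unsigned(n) for n in (1, 2, 4)}
for n, value in limits.items():
    print(n, "byte(s):", value)

# One byte with every bit set, shown in binary.
one_byte_binary = bin(limits[1])
print(one_byte_binary)
```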
Because computers were first developed in the United States, the earliest encoding covered only 128 characters (codes 0 to 127): English letters, digits, punctuation, and control characters. This is the ASCII encoding; for example, capital A is encoded as 65 and lowercase z as 122.
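The ASCII codes mentioned above can be checked with Python's built-in ord() and chr(). This is only a small illustration; the variable names are arbitrary:

```python
# ord() maps a character to its code; chr() maps a code back to a character.
code_upper_a = ord("A")
code_lower_z = ord("z")
print(code_upper_a, code_lower_z)
print(chr(65), chr(122))

# Encoding to ASCII yields one byte per character with the same values.
ascii_values = list("Az".encode("ascii"))
print(ascii_values)

# A Chinese character has no ASCII code, so encoding it as ASCII fails.
try:
    "中".encode("ascii")
    ascii_can_encode_chinese = True
except UnicodeEncodeError:
    ascii_can_encode_chinese = False
print(ascii_can_encode_chinese)
```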
Encoding Chinese text has long been a headache for programmers. The default encoding in Python 2 is ASCII, and the default encoding in Python 3 is UTF-8.
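One way to see the default for the interpreter you are running is sys.getdefaultencoding(). A short sketch, assuming it is run under Python 3:

```python
import sys

# Default encoding used for implicit str <-> bytes conversions.
# Python 3 reports 'utf-8'; Python 2 would report 'ascii'.
default_encoding = sys.getdefaultencoding()
print(sys.version_info.major, default_encoding)
```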
I. Encodings common to Python 2 and Python 3
1. UTF-8 encoding:
English: A 01000001 8 bits 1 byte
Chinese: 中 11100100 10111000 10101101 24 bits 3 bytes
2. GBK encoding:
English: A 01000001 8 bits 1 byte
Chinese: 中 11010110 11010000 16 bits 2 bytes
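The real bit patterns and byte counts for each encoding can be inspected directly. The sketch below assumes the Chinese example character is 中, and bits is a hypothetical helper written for this illustration:

```python
# Render a bytes object as space-separated 8-bit groups.
# bits is a hypothetical helper, not part of the standard library.
def bits(data):
    return " ".join(format(byte, "08b") for byte in data)

results = {}
for char in ("A", "中"):
    for encoding in ("utf-8", "gbk"):
        encoded = char.encode(encoding)
        results[(char, encoding)] = bits(encoded)
        print(char, encoding, bits(encoded), len(encoded), "byte(s)")
```

ASCII characters such as A take one byte in both encodings, while 中 takes three bytes in UTF-8 and two in GBK.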
The byte sequences produced by different encodings are not mutually intelligible: decoding bytes with the wrong encoding yields garbled text or an error.
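A small demonstration of what goes wrong when the encodings are mixed up. The sample text 中国 is chosen only for illustration; errors="replace" is used so the first decode cannot raise:

```python
text = "中国"
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# UTF-8 bytes read as GBK: no crash here, but the result is not the original text.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(repr(garbled))

# GBK bytes read as UTF-8: these bytes are not valid UTF-8, so decoding fails.
try:
    gbk_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("cannot decode:", exc)

# Decoding with the encoding that produced the bytes restores the text.
restored = utf8_bytes.decode("utf-8")
print(restored)
```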
Raw Unicode is not used for storage or transmission because it takes up too much space; files and network data use an encoding such as UTF-8, UTF-16, GBK, GB2312, or ASCII instead.
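To get a feel for the space difference, the sketch below compares encoded sizes for a short mixed string. It assumes that "raw Unicode" means a fixed four bytes per character, and uses UTF-32 as a stand-in for that; the sample string is arbitrary:

```python
text = "Alex 中国"  # 5 ASCII characters + 2 Chinese characters

# Byte count of the same text under several encodings.
# "utf-32-le" stands in for fixed-width storage (an assumption, see above);
# the "-le" variants are used so no byte order mark is added.
sizes = {
    encoding: len(text.encode(encoding))
    for encoding in ("utf-32-le", "utf-16-le", "utf-8", "gbk")
}
for encoding, size in sizes.items():
    print(encoding, size, "bytes")
```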
3. In Python 3, str holds Unicode text; there is also a separate bytes type:
(1) English
str:
Representation: s = 'Alex'
Encoding: Unicode (schematically 0101010101)
bytes:
Representation: s = b'Alex'
Encoding: UTF-8, GBK, etc. (schematically 00101010)
(2) Chinese
str:
Representation: s = '中国'
Encoding: Unicode (schematically 01010110)
bytes:
Representation: s = b'\xe4\xb8\xad\xe5\x9b\xbd' (the UTF-8 bytes)
Encoding: UTF-8, GBK, etc. (schematically 01001100)
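The relationship between the two types can be sketched with str.encode() and bytes.decode(). This assumes Python 3 and uses 中国 as the sample text; the variable names are arbitrary:

```python
s = "中国"                       # str: Unicode text, 2 characters
b_utf8 = s.encode("utf-8")       # bytes: 6 bytes in UTF-8
b_gbk = s.encode("gbk")          # bytes: 4 bytes in GBK
print(type(s).__name__, len(s))
print(type(b_utf8).__name__, b_utf8, len(b_utf8))
print(type(b_gbk).__name__, b_gbk, len(b_gbk))

# For pure ASCII text, the bytes literal and the encoded str are identical.
english_same = (b"Alex" == "Alex".encode("utf-8") == "Alex".encode("gbk"))
print(english_same)

# Converting between encodings goes through str: decode first, then encode.
gbk_to_utf8 = b_gbk.decode("gbk").encode("utf-8")
print(gbk_to_utf8 == b_utf8)
```

Text is handled as str inside the program and converted to bytes only at the boundary, when it is stored or transmitted.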