I. Premises
So what is encoding, anyway?

//ASCII
Remember the bottom line: all the data in your computer, whether it is text, pictures, video, or audio, is ultimately stored as binary, something like 01010101. The computer only understands binary numbers! So the goal is clear: how do we map each symbol to its own unique binary number? The Americans came up with the idea of using a high or low voltage level to stand for 1 or 0. Eight such levels taken as a group can express 256 different states, and each state stands for exactly one character, for example a ---> 01100001. English has only 26 letters, and even counting digits and some special characters, 128 states are enough. Each level is called a bit, and by convention 8 bits make up a byte, so the computer can store English text using 128 different byte values (0 to 127). This is ASCII encoding.

//Extended ANSI encoding
As just mentioned, a byte has eight bits, but at first the highest bit was unused and defaulted to 0. Later, so that computers could also represent Latin characters, that last bit was put to use as well, and the values 128 to 255 were assigned to a Latin character set. At that point, a single byte was completely full!

//GB2312
When computers crossed the sea to China, a problem arose: the computer did not know Chinese and so, of course, could not display it, and every state of a single byte was already taken. Clearly the imperialists had not given up on holding us back! So China, in the spirit of self-reliance, drew up a table of its own and boldly deleted the Latin characters that had been assigned to the extended eighth-bit range. A byte with a value of 127 or less keeps its original meaning, but two consecutive bytes greater than 127 together represent one Chinese character. The first byte (called the high byte) ranges from 0xA1 to 0xF7 and the second (the low byte) from 0xA1 to 0xFE, which makes room for about 7,000 Simplified Chinese characters and symbols. This scheme is called "GB2312". GB2312 is a Chinese extension to ASCII.
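The byte layouts described above can be checked with Python's built-in codecs. This is only a small sketch: the sample characters are arbitrary choices for illustration, and the variable names are made up here.

```python
# ASCII: every character fits in 7 bits, so the top bit of the byte is 0.
code = ord("a")
ascii_bits = format(code, "08b")
print(code, ascii_bits)        # 97 01100001

# GB2312: one Chinese character becomes two bytes, both greater than 127.
gb_bytes = "你".encode("gb2312")
high, low = gb_bytes           # iterating over bytes yields integers
print(hex(high), hex(low))     # 0xc4 0xe3

# ASCII characters keep their single-byte values inside GB2312 text,
# so one letter plus one Chinese character takes 1 + 2 bytes.
mixed = "a你".encode("gb2312")
print(len(mixed))              # 3
```

The high byte 0xC4 falls inside the 0xA1 to 0xF7 range and the low byte 0xE3 inside 0xA1 to 0xFE, matching the ranges given above.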
//GBK and GB18030 encodings
But there are far too many Chinese characters, and GB2312 was not enough. So the rule was relaxed: as long as the first byte is greater than 127, it marks the start of a Chinese character, regardless of whether the byte that follows falls in the extended range. The resulting expanded scheme is called the GBK standard. GBK contains everything in GB2312 and adds many new Chinese characters (including traditional characters) and symbols, for a total of more than 20,000.

//Unicode encoding
Many other countries developed their own encoding standards too, and none of them supported the others, which caused a lot of problems. So international standards bodies stepped in to unify encoding under a single standard: Unicode. Unicode originally represented each character with two bytes, giving 65,536 possible values, which seemed enough to cover every symbol in the world (oracle bone script included).

//UTF8
Unicode is already so capable, so why do we also need UTF8? People reasoned that for the English-speaking world a single byte is entirely enough. To store A, for example, 01000001 would do, but under Unicode's one-size-fits-all scheme it takes two bytes: 00000000 01000001. Far too wasteful! With this in mind, scientists in the United States came up with a stroke of genius: UTF8. UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that uses 1 to 4 bytes per symbol, with the length depending on the symbol. When a character is in the ASCII range it is expressed in one byte, so UTF-8 is compatible with ASCII. The obvious benefit is that although the data in memory is Unicode, saving it to disk or sending it over the network as UTF8 takes far less space than writing the Unicode form directly, at least for text that is mostly ASCII. This is also why UTF8 is our recommended encoding.
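To make the variable-length idea concrete, here is a minimal sketch using Python's built-in codecs. The sample characters are arbitrary picks, and "utf-16-be" is used here only as a stand-in for the fixed two-byte form described above; neither comes from the original text.

```python
# Sample characters are arbitrary picks for illustration.
samples = ["A", "ß", "你", "😀"]

# UTF-8 uses between 1 and 4 bytes depending on the code point.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, size in utf8_lengths.items():
    print(f"U+{ord(ch):04X}: {size} byte(s) in UTF-8")

# In the ASCII range, UTF-8 output is identical to ASCII output.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")

# A fixed two-byte form pads even plain ASCII with a zero byte.
two_byte_a = "A".encode("utf-16-be")
print(two_byte_a)  # b'\x00A'

# GBK keeps GB2312's byte values and also covers traditional characters.
gbk_matches_gb2312 = "你".encode("gbk") == "你".encode("gb2312")
try:
    "龍".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
print(len("龍".encode("gbk")), in_gb2312)  # 2 False
```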
The relationship between Unicode and UTF8, in a word: Unicode is the representation used in memory (it is a specification), while UTF is a scheme for how to save and transmit Unicode (an implementation). That is the difference between UTF and Unicode.
We just need to know that Unicode is the universal code, and that UTF-8 and GBK are both byte encodings that map to and from Unicode. So converting UTF-8 to GBK goes like this: first decode to Unicode, then encode to GBK. Converting any other encoding to UTF-8 works the same way: first decode to Unicode, then encode to UTF-8.
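The two-step conversion can be sketched as a small helper. The function name `transcode` and its parameters are hypothetical, introduced only for this illustration; the calls it wraps are the standard `bytes.decode` and `str.encode`.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes by passing through Unicode text (str)."""
    text = data.decode(source)    # bytes in the source encoding -> Unicode
    return text.encode(target)    # Unicode -> bytes in the target encoding


utf8_bytes = "你好".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
print(gbk_bytes)                  # b'\xc4\xe3\xba\xc3'

# Going back the other way also passes through Unicode.
round_trip = transcode(gbk_bytes, "gbk", "utf-8")
print(round_trip == utf8_bytes)   # True
```

There is no direct UTF-8 to GBK step here: Unicode text is always the intermediate form.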
A Python 3 string is Unicode: text is always Unicode and is represented by the str type, while binary data is represented by the bytes type.
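A quick way to see the two types side by side is shown below. This is only a sketch, and the variable names are made up for illustration.

```python
text = "你好"                  # str: a sequence of Unicode characters
data = text.encode("utf-8")    # bytes: the encoded binary form

print(type(text).__name__, len(text))   # str 2
print(type(data).__name__, len(data))   # bytes 6

# str and bytes do not mix implicitly; Python 3 raises TypeError.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                # False
```

The length differs because len counts characters for str and individual bytes for bytes.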
A couple of points that are easy to misunderstand:
(1) In py3 the default source file encoding is utf-8, so you can write Chinese directly without an encoding declaration in the file header.
(2) A string variable you declare is Unicode by default. Even if the file header declares the encoding as utf-8, the string itself is not utf-8, because str is Unicode by default.
import sys
print(sys.getdefaultencoding())  # print the system default encoding
# s = u"你好"  # the u prefix marks a Unicode string; the output is the same as without u, which shows that Python 3 strings are Unicode by default
s = "你好"  # this s is a Unicode string (str)
print(s.encode("utf-8"))  # encode converts s to utf-8 and returns a bytes object
print(s.encode("utf-8").decode("gbk"))  # decodes the utf-8 bytes as if they were gbk, which gives garbled text
print(s.encode("gbk"))
print(s.encode("utf-8").decode("utf-8").encode("gbk"))
print(s.encode())  # with no argument, the encoding defaults to utf-8
print(s.encode("utf-8").decode())
>>>utf-8
>>>b'\xe4\xbd\xa0\xe5\xa5\xbd'  # b'' denotes the bytes type
>>>浣犲ソ
>>>b'\xc4\xe3\xba\xc3'
>>>b'\xc4\xe3\xba\xc3'
>>>b'\xe4\xbd\xa0\xe5\xa5\xbd'
>>>你好
Python: character encoding and transcoding