1. Common encodings
ASCII: http://zh.wikipedia.org/zh-hans/ASCII
Unicode: http://zh.wikipedia.org/zh-hans/Unicode
UTF-8: http://zh.wikipedia.org/zh/utf-8
GBK: http://zh.wikipedia.org/zh/%e6%b1%89%e5%ad%97%e5%86%85%e7%a0%81%e6%89%a9%e5%b1%95%e8%a7%84%e8%8c%83
GB2312: http://zh.wikipedia.org/zh/gb_2312
2. The origins of the various encodings
1. Encoding: Inside a computer, all data is stored and processed as binary numbers (the hardware represents 1 and 0 with high and low voltage levels). Which binary number stands for which symbol is purely a matter of convention, and anyone can define their own mapping (this is called an encoding). To exchange data without confusion, however, everyone must use the same rules. In the United States, a standards body introduced ASCII, which fixes which binary number represents each of the symbols it covers.
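To make the idea of "everyone can define their own mapping" concrete, here is a minimal sketch. The two code tables are made up for illustration and do not correspond to any real standard; the point is only that the same bits decode differently under different rules.

```python
# Two hypothetical code tables: each maps a symbol to a 2-bit pattern.
table_a = {"A": "00", "B": "01", "C": "10"}
table_b = {"A": "10", "B": "00", "C": "01"}

def encode(text, table):
    """Replace each symbol with its bit pattern from the given table."""
    return "".join(table[ch] for ch in text)

def decode(bits, table):
    """Split the bit string into 2-bit groups and look each one up."""
    reverse = {pattern: symbol for symbol, pattern in table.items()}
    return "".join(reverse[bits[i:i + 2]] for i in range(0, len(bits), 2))

bits = encode("CAB", table_a)
print(bits)                   # 100001
print(decode(bits, table_a))  # CAB (same table: message survives)
print(decode(bits, table_b))  # ABC (different table: message is wrong)
```

The receiver using table_b gets valid-looking but wrong text, which is exactly the confusion a shared standard avoids.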
2. ASCII (American Standard Code for Information Interchange): Early computers were developed largely in the United States, and ASCII was drawn up there, so it was designed to represent modern American English. It uses 7 bits (128 codes) and covers the 26 basic Latin letters in upper and lower case, Arabic numerals, English punctuation marks, and some control characters.
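As a small illustration, the sketch below uses only Python built-ins (ord, format, str.encode) to show the number ASCII assigns to a few characters, and what happens when a character outside ASCII is encoded with it. The sample strings are arbitrary.

```python
# ord() returns the number assigned to a character; for these characters
# it is the ASCII code. format(..., "08b") shows it as 8 binary digits.
for ch in "Az0!":
    print(ch, ord(ch), format(ord(ch), "08b"))

# Encoding an English string as ASCII gives one byte per character.
ascii_bytes = "Hello".encode("ascii")
print(list(ascii_bytes))  # [72, 101, 108, 108, 111]

# Characters outside ASCII cannot be encoded with it.
try:
    "中文".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```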
3. GB2312: ASCII only meets the information-exchange needs of English speakers. To exchange information in Chinese, China had to develop its own encoding. GB2312 is one such encoding: it is the national standard Simplified Chinese character set of the People's Republic of China, with the full name "Chinese Character Coded Character Set for Information Interchange, Basic Set".
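A quick sketch of what this looks like in practice, assuming Python's built-in "gb2312" codec: each Chinese character becomes two bytes, while ASCII characters keep their single-byte form. The sample text is arbitrary.

```python
text = "中文"

# Each Chinese character is stored as two bytes in GB2312.
gb_bytes = text.encode("gb2312")
print(gb_bytes.hex())             # d6d0cec4
print(len(gb_bytes))              # 4

# ASCII characters are unchanged, so English text stays one byte each.
print("A".encode("gb2312"))       # b'A'

# Decoding with the same codec restores the original text.
print(gb_bytes.decode("gb2312"))  # 中文
```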
4. Unicode: There are more than 200 countries and regions in the world and dozens of commonly used languages, and many countries developed their own encoding standards, for example Shift_JIS in Japan and EUC-KR in South Korea. These national standards inevitably conflict with one another, because the same bytes mean different characters in different standards, so text that mixes several languages ends up displayed as garbled characters. Unicode was created to solve this problem: it brings the characters of all languages into a single character set, so a given code always means the same character.
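The conflict between national standards can be reproduced in a few lines. This is only a sketch: the Japanese sample text is arbitrary, and GBK is picked as the "wrong" decoder simply because it is another codec Python ships with. errors="replace" is used so that any byte sequence GBK cannot interpret is replaced rather than raising an exception.

```python
original = "こんにちは"

# Bytes written by a program that uses the Japanese standard.
raw = original.encode("shift_jis")

# A program that assumes a Chinese standard reads the same bytes.
garbled = raw.decode("gbk", errors="replace")
print(garbled)                   # unrelated characters, not the original text

# Reading the bytes with the encoding that wrote them works as expected.
print(raw.decode("shift_jis"))   # こんにちは
```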
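Unicode itself assigns every character a number (a code point). A common way to store these numbers, UTF-16, uses two bytes per character and four bytes for rarely used characters outside the Basic Multilingual Plane. Unicode is supported directly by modern operating systems and most programming languages.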
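The two-byte and four-byte cases can be checked directly in Python. The sketch below assumes UTF-16 (big-endian, without a byte order mark) as the two-byte form meant here; the three sample characters are arbitrary.

```python
# "\U0001F600" is an emoji outside the Basic Multilingual Plane.
samples = ["A", "中", "\U0001F600"]

utf16_lengths = {}
for ch in samples:
    code_point = ord(ch)            # the Unicode number of the character
    utf16 = ch.encode("utf-16-be")  # big-endian UTF-16, no byte order mark
    utf16_lengths[ch] = len(utf16)
    print(f"U+{code_point:04X}", len(utf16), "bytes")
# U+0041 2 bytes
# U+4E2D 2 bytes
# U+1F600 4 bytes
```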
Since Unicode already resolves these conflicts and meets the need to exchange information worldwide, why do we also need UTF-8? See item 5.
5. UTF-8 (8-bit Unicode Transformation Format): If the text is almost entirely English, storing every character in two bytes takes twice the space of ASCII, which is wasteful for both storage and transmission. To save space, Unicode text can be converted to the variable-length UTF-8 encoding. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code point (the original design allowed up to 6 bytes, but the current standard, RFC 3629, limits it to 4). Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters take 4 bytes. If the text you want to transmit contains many English characters, encoding it as UTF-8 saves space.
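The byte counts above are easy to verify. The following sketch uses arbitrary sample characters and an arbitrary English sentence, and compares UTF-8 with two-byte UTF-16 for the same text.

```python
# One character from each UTF-8 length class (1, 2, 3 and 4 bytes).
samples = ["A", "é", "中", "\U0001F600"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(list(utf8_lengths.values()))  # [1, 2, 3, 4]

# Mostly-English text is half the size in UTF-8 compared with UTF-16.
english = "The quick brown fox"
utf8_size = len(english.encode("utf-8"))
utf16_size = len(english.encode("utf-16-be"))
print(utf8_size, utf16_size)        # 19 38
```

Note that the saving depends on the text: for text that is mostly Chinese, UTF-8 needs 3 bytes per character where UTF-16 needs 2.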