I. What is character encoding
A computer needs electricity to work: what drives it is current at two voltage levels, high voltage standing for the binary digit 1 and low voltage for the binary digit 0. In other words, a computer understands nothing but numbers.
The purpose of programming is to make the computer do work, yet the product of programming is just a bunch of characters. That is to say, what we really accomplish when we program is: a bunch of characters drives a computer to work.
So a translation step is required:
characters --------(translation process)-------> numbers
This process is governed by a standard that says which character corresponds to which number, and that standard is called a character encoding.
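In Python, for instance, the built-in ord() and chr() expose this character-to-number mapping directly; a minimal sketch:

```python
# Each character maps to a number (its code point) under the encoding standard.
print(ord('x'))    # 120 -- the number behind the character 'x'
print(chr(120))    # 'x' -- and back from number to character
```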
II. Classification of character encodings
The computer was invented by Americans, and the earliest character encoding was ASCII, which only defines the correspondence between numbers and the English letters, digits, and some special characters. It uses at most 8 bits (one byte), i.e. 2**8 = 256, so ASCII can represent at most 256 symbols.
For the programming languages themselves, which are written in English, ASCII is enough. But the data being processed is another matter: different countries use different languages, so Japanese programmers add Japanese to their programs and Chinese programmers add Chinese.
A single byte cannot possibly represent Chinese (even an elementary school student knows more than 2000 characters). The only solution is to use more than 8 bits per character: the more bits, the more distinct values, and the more Chinese characters can be represented.
So China defined its own standard, the gb2312 encoding, which specifies the correspondence between numbers and characters, including Chinese characters.
Japan defined its own shift_jis encoding.
Korea defined its own EUC-KR encoding (incidentally, Koreans claim the computer was invented by them and demand the world standardize on the Korean encoding).
At this point a problem arises. Suppose Xiao Zhou, who claims mastery of 18 languages but modestly writes in only 8, produces a document. Under whichever single country's standard the document is saved, it will come out garbled, because each national standard so far only covers its own country's text, i.e. the correspondence between its own characters and numbers; text in any other language cannot be parsed and turns into mojibake.
So a world standard was urgently needed, one that could contain all the world's languages, and thus Unicode was born (the Koreans objected, but to no avail).
ASCII uses 1 byte (8 binary bits) to represent one character.
Unicode commonly uses 2 bytes (16 binary bits) to represent one character; rare characters need 4 bytes.
Examples:
- The letter x in ASCII is decimal 120, binary 0111 1000.
- The Chinese character 中 is beyond the ASCII range; its Unicode encoding is decimal 20013, binary 01001110 00101101.
- The letter x in Unicode is binary 0000 0000 0111 1000. So Unicode is compatible with ASCII, and also covers every nation's text; it is the world standard.
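These values are easy to verify in Python 3, where str is Unicode text; a quick sketch:

```python
# Code points and their binary forms from the example above.
print(ord('x'), bin(ord('x')))     # 120 0b1111000
print(ord('中'), bin(ord('中')))   # 20013 0b100111000101101
```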
At this point the mojibake problem disappears: every document we use can be stored in Unicode. But a new problem arises: if a document is entirely in English, Unicode takes twice the space of ASCII, which makes storage and transmission very inefficient.
In the spirit of saving space, a "variable-length" transformation of Unicode appeared: the UTF-8 encoding. UTF-8 encodes a Unicode character into 1 to 6 bytes depending on its numeric size: common English letters are encoded in 1 byte, a Chinese character usually takes 3 bytes, and only very rare characters are encoded into 4 to 6 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 encoding saves space:
| character | ASCII | Unicode | UTF-8 |
| --- | --- | --- | --- |
| A | 01000001 | 00000000 01000001 | 01000001 |
| 中 | x | 01001110 00101101 | 11100100 10111000 10101101 |
The table also reveals an added benefit of UTF-8: ASCII encoding can effectively be seen as a subset of UTF-8, so a large amount of legacy software that supports only ASCII can continue to work under UTF-8 encoding.
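The table's rows can be checked directly in Python 3; a small sketch:

```python
# 'A' is the same single byte under ASCII and UTF-8; '中' needs three bytes in UTF-8.
print('A'.encode('ascii'))        # b'A' (01000001)
print('A'.encode('utf-8'))        # b'A' -- identical: ASCII is a subset of UTF-8
print('中'.encode('utf-8'))       # b'\xe4\xb8\xad' (11100100 10111000 10101101)

# '中' cannot be represented in ASCII at all, hence the x in the table:
try:
    '中'.encode('ascii')
except UnicodeEncodeError as err:
    print(err)
```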
III. Character encoding conversion relationships
3.1 How a program runs
First, be clear: the computer's components cooperate by transmitting data in binary form. Inside the computer there is no text, only binary numbers. A running computer relies mainly on:
- CPU: fetches binary instructions from memory and executes them
- Memory: loads binary data from the hard disk for the CPU to operate on
- Hard disk: stores human-readable text in binary form on disk
Program files are just a special kind of file; reading a file's contents or running a program goes through the following steps (see the sketch after this list):
- Programmers develop a program, which is ultimately a bunch of human-defined text symbols, meaningful only to humans, saved in binary form to the hard disk
- When the program runs, the operating system finds the program code on the hard disk and reads its binary into memory
- The Python interpreter reads the binary from memory and interprets and executes it
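The first steps can be imitated by hand; a minimal sketch, assuming a UTF-8-encoded test.py exists in the current directory:

```python
# Read the program file's raw binary from the hard disk into memory.
with open('test.py', 'rb') as f:
    raw = f.read()              # bytes: exactly what is stored on disk
print(type(raw))                # <class 'bytes'>

# What the interpreter then does: decode those bytes into Unicode text.
source = raw.decode('utf-8')    # assumes the file was saved as UTF-8
print(type(source))             # <class 'str'> -- Unicode text in memory
```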
3.2 Ultimate Revelation
First we define a variable in the interpreter: name = 'lamb'. That variable is stored in memory (necessarily as binary), so an encoding is required; in memory the character encoding is fixed at Unicode.
But if we write name = 'lamb' into a file and save it to the hard disk (again necessarily as binary), an encoding is also required, and this one varies by country. If we choose GBK, the file is saved to the hard disk in GBK form.
When the program runs: hard disk binary (GBK) ---> memory binary (Unicode)
That is, every program must eventually be loaded into memory. On the hard disk, programs from different countries are saved in their own encoding formats, but once in memory, in order to be compatible with every nation's text (which is exactly why one computer can run any country's program), the encoding is uniformly fixed at Unicode. This is why memory is fixed at Unicode.
You might say: for universal compatibility, couldn't I use UTF-8 in memory instead? You could, and it would work perfectly well. The reason it isn't used is that Unicode is more efficient to process (Unicode here is a fixed 2 bytes per character, while UTF-8's variable length must be computed), at the cost of wasting space. It is a classic trade of space for time. Conversely, when saving to the hard disk or transmitting over the network, what matters is stability and efficiency: the smaller the data, the more reliable the transfer. So Unicode is converted to UTF-8 for storage and transmission, rather than kept as Unicode.
This is what decode() and encode() mean: GBK -> Unicode requires decode(), and Unicode -> GBK requires encode().
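In Python this is exactly the bytes.decode() / str.encode() pair; a short sketch:

```python
# GBK bytes from the hard disk -> Unicode in memory: decode()
gbk_bytes = b'\xd6\xd0'             # the character '中' as stored under GBK
text = gbk_bytes.decode('gbk')      # GBK -> Unicode
print(text)                         # 中

# Unicode in memory -> bytes for the hard disk or network: encode()
print(text.encode('gbk'))           # b'\xd6\xd0'
print(text.encode('utf-8'))         # b'\xe4\xb8\xad' -- same text, different bytes
```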
When you browse the web, the server converts dynamically generated Unicode content to UTF-8 before sending it to the browser.
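A hedged sketch of what the server side effectively does before the response goes out on the wire:

```python
# The dynamically generated page is Unicode text in memory...
html = '<h1>你好, world</h1>'
# ...and is encoded to UTF-8 bytes before transmission, typically announced
# to the browser with a header like: Content-Type: text/html; charset=utf-8
body = html.encode('utf-8')
print(body)
```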
A Python program is special in that it must be run by the Python interpreter, which effectively means the interpreter reads the program's Unicode from memory.
Thus the Python interpreter also has its own default encoding, viewable with sys.getdefaultencoding(); if the Python file does not specify the header # -*- coding: utf-8 -*-, the default is used.
Note that this encoding belongs to the Python interpreter itself, as a piece of software.
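That default can be inspected from within the interpreter; a one-line sketch:

```python
import sys

# The interpreter's own default encoding ('utf-8' on Python 3; Python 2.7 reports 'ascii').
print(sys.getdefaultencoding())
```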
3.3 Supplement
Setting programming aside: if we simply write a document on its own and save it to the hard disk, character encoding is still needed. The process is as follows.
First we edit the document. We obviously cannot manipulate high and low voltages ourselves to write binary to the hard disk, so we use software; software is a running program, so everything we type is handled by software running in memory. As long as you haven't clicked save, the data exists only in memory (some software auto-saves every few seconds); if the power fails at that moment, the data is simply gone. In other words, the text being edited actually lives in memory, in Unicode format.
Isn't Pycharm essentially the same as Word? They are both software that handles files.
However, if you change the file's save encoding to GBK, that is, use GBK as the hard disk encoding, then once that setting takes effect and the file is saved, it is written to the hard disk in GBK format.
Then close Pycharm and reopen it. If the default encoding used to open the file is UTF-8, the text is guaranteed to be garbled, because: hard disk (GBK) ---> memory (Unicode) <--- Pycharm (reading as UTF-8).
Change the read encoding back to GBK and the file displays correctly.
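The same mismatch can be reproduced in a few lines of Python; a minimal sketch:

```python
# Save '中文' to the hard disk as GBK...
with open('demo.txt', 'w', encoding='gbk') as f:
    f.write('中文')

# ...then read it back as UTF-8, as a misconfigured editor would.
try:
    with open('demo.txt', encoding='utf-8') as f:
        print(f.read())
except UnicodeDecodeError as err:
    print('undecodable as UTF-8:', err)

# Reading with the matching encoding works fine.
with open('demo.txt', encoding='gbk') as f:
    print(f.read())                 # 中文
```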
Summary
In fact, whether it is Word, Pycharm, or the Python interpreter, we can treat all of them as software that processes files.
Attention:
- The Python 2.7 interpreter's default encoding is ASCII
- The Python 3.5 interpreter's default encoding is UTF-8
- Whether it is the Python interpreter or any other file-handling software, it can only choose the character encoding used to access files on the hard disk; memory is fixed at Unicode
- The header of test.py, # -*- coding: utf-8 -*-, is what overrides the Python interpreter's encoding for that file
Process:
- The binary bytes of test.py are read from the hard disk and loaded into memory; at this stage the Python interpreter is acting as Word-class software, using its own encoding to decode the file's binary into Unicode in memory
- The Python interpreter then reads that Unicode from memory and interprets and executes it; whatever encodings the program's own code specifies (in its encode()/decode() calls) have nothing to do with the interpreter's encoding (a toy illustration follows)
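A toy illustration of the two phases, assuming a hypothetical one-line program: decoding the source is the interpreter's step, while the encode() call inside the program runs only afterwards and is independent of it.

```python
# Phase 1: the interpreter decodes the source file's bytes into Unicode text.
source_bytes = "print('中'.encode('gbk'))".encode('utf-8')  # pretend these bytes came off the disk
source_text = source_bytes.decode('utf-8')                  # the interpreter's decoding step

# Phase 2: execute the Unicode source; the .encode('gbk') inside it is the
# program's own business, unrelated to how the file itself was decoded.
exec(source_text)   # prints b'\xd6\xd0'
```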
Original article: "Python basics: character encoding (part 1)", http://www.cnblogs.com/linhaifeng/articles/5950339.html