Python basics: character encoding (I)

Last Update:2017-04-04 Source: Internet

Author: User


I. What is character encoding

A computer needs electricity to work: electricity drives the computer, and it is characterized by high and low voltage levels (high voltage is binary 1, low voltage is binary 0). In other words, a computer only understands numbers.

The purpose of programming is to make the computer do work, yet the product of programming is just a sequence of characters. What programming achieves, then, is a sequence of characters driving a computer.

So a translation step is required:

Characters --------(translation)-------> numbers

This step follows a standard that maps each character to a specific number. That standard is called a character encoding.
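As a small illustration of this character-to-number mapping, the sketch below uses Python's built-in ord() and chr(). The sample string and the variable names are only illustrative.

```python
# Map each character to its number and back again.
text = "abc"

codes = [ord(ch) for ch in text]            # character -> number
restored = "".join(chr(n) for n in codes)   # number -> character

print(codes)      # [97, 98, 99]
print(restored)   # abc
```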

II. Character encoding classification

Computers were invented in the United States, so the earliest character encoding was ASCII, which only defines the mapping between numbers and English letters, digits, and some special characters. ASCII itself defines 128 characters using 7 bits, and each character is stored in one byte (8 bits). One byte can represent at most 2**8 = 256 values, so a single-byte encoding can represent at most 256 symbols.

For the programming language itself, English is not a problem and ASCII is enough. But when processing data, different countries use different languages: Japanese programmers put Japanese text in their programs, and Chinese programmers put Chinese.

A single byte cannot represent Chinese characters: 256 values are far too few (even elementary school students know more than 2,000 Chinese characters). The only solution is to use more than 8 bits per character. The more bits, the more combinations, and the more Chinese characters can be represented.

So China defined its own standard, the GB2312 encoding, which specifies the mapping between numbers and characters, including Chinese ones.
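The arithmetic behind this can be checked directly. The sketch below compares how many values one and two bytes can hold, and uses Python's "gb2312" codec to show that a Chinese character takes two bytes in that encoding. The variable names are illustrative.

```python
# How many distinct values fit in one byte versus two bytes.
one_byte_values = 2 ** 8     # 256
two_byte_values = 2 ** 16    # 65536

# A Chinese character encoded with GB2312 occupies two bytes.
gb_bytes = "中".encode("gb2312")

print(one_byte_values, two_byte_values)
print(len(gb_bytes))         # 2
```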

Japan defined its own Shift_JIS encoding.

Korea defined its own EUC-KR encoding.

At this point a problem arises. Suppose Xiao Zhou, a classmate proficient in 18 languages, modestly writes one document in 8 languages. Whichever national standard is used, the document will come out garbled, because each of these standards only covers the mapping between numbers and the characters of its own country's script. If one national encoding is used, the text in the remaining languages is garbled when it is parsed.

So a world standard that can contain all the world's languages was urgently needed, and Unicode came into being.
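The garbling can be reproduced in a few lines. This is a minimal sketch: it encodes Chinese text with GB2312 and then decodes the same bytes with Shift_JIS, using errors="replace" so that undecodable bytes do not raise an exception. The sample text is illustrative.

```python
original = "中文"

# Save with the Chinese national standard...
raw = original.encode("gb2312")

# ...then read the same bytes with the Japanese standard.
wrong = raw.decode("shift_jis", errors="replace")

# Reading with the encoding that was used for saving recovers the text.
right = raw.decode("gb2312")

print(wrong == original)   # False: the text is garbled
print(right == original)   # True
```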

ASCII uses 1 byte (8 binary bits) to represent one character.

Unicode commonly uses 2 bytes (16 binary bits) to represent one character; uncommon characters need 4 bytes.

Examples:

The letter x in ASCII is decimal 120, binary 0111 1000.

The Chinese character 中 is beyond the ASCII range; its Unicode code is decimal 20013, binary 01001110 00101101.

The letter x in Unicode is binary 0000 0000 0111 1000, so Unicode is compatible with ASCII and with every nation's script. It is the world standard.
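These numbers can be verified with ord() and format(). The sketch below prints each code in decimal and as a 16-bit binary string; the helper dictionary name is illustrative.

```python
# Look up the code of each character and show it as 16 binary digits.
codes = {ch: ord(ch) for ch in ("x", "中")}

for ch, code in codes.items():
    print(ch, code, format(code, "016b"))

# x 120 0000000001111000
# 中 20013 0100111000101101
```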

With this the garbling problem disappears, as long as all documents use Unicode. But a new problem arises: if a document is entirely English, Unicode takes twice as much space as ASCII, which is wasteful for storage and transmission.

To save space, Unicode is converted to the variable-length UTF-8 encoding. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its number (the original design allowed up to 6 bytes, but the current standard limits it to 4). Common English letters are encoded in 1 byte, Chinese characters usually in 3 bytes, and only very uncommon characters in 4 bytes. If the text you want to transmit contains a large amount of English, UTF-8 saves space:

Character	ASCII	Unicode	UTF-8
A	01000001	00000000 01000001	01000001
中	(none)	01001110 00101101	11100100 10111000 10101101

The table above also shows an added benefit of UTF-8: ASCII can be seen as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work with UTF-8 encoded text.
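The rows of the table above can be reproduced with str.encode(). In this sketch, to_bits is a hypothetical helper written for the illustration, and "utf-16-be" stands in for the 2-byte Unicode column.

```python
def to_bits(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 binary digits."""
    return " ".join(format(b, "08b") for b in data)

a_utf8 = to_bits("A".encode("utf-8"))
zhong_utf16 = to_bits("中".encode("utf-16-be"))
zhong_utf8 = to_bits("中".encode("utf-8"))

print(a_utf8)        # 01000001
print(zhong_utf16)   # 01001110 00101101
print(zhong_utf8)    # 11100100 10111000 10101101

# ASCII text produces the same bytes under ASCII and UTF-8.
ascii_compatible = "A".encode("ascii") == "A".encode("utf-8")
print(ascii_compatible)   # True
```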

III. Character encoding conversion

3.1 How a program runs

First, be clear that the components of a computer work together and exchange data in binary form. Inside the computer there is no text; everything is a binary number. Running a program mainly relies on:

CPU: fetches binary instructions from memory and executes them

Memory: holds binary data loaded from the hard disk for the CPU to operate on

Hard disk: stores human-readable text in binary form

Ordinary files and program files are both just files. Reading a file's contents or running a program involves the following:

The programmer writes a program, which in the end is a sequence of text symbols that humans have defined and consider meaningful, and saves it to the hard disk in binary form
When the program runs, the operating system finds the location of the program code on the hard disk and reads the binary into memory
The Python interpreter reads the binary from memory, then interprets and executes it

3.2 Ultimate Revelation

First we define a variable in the terminal: name = 'lamb'. The variable is stored in memory (necessarily as binary), so an encoding is required: Unicode (the character encoding used in memory is fixed as Unicode).

But if we write name = 'lamb' into a file and save it to the hard disk (again necessarily as binary), an encoding is also needed, and this one depends on the country. If we use GBK, the file is saved to the hard disk in GBK form.

When the program runs: hard disk binary (GBK) ---> memory binary (Unicode)
That is, every program eventually has to be loaded into memory. Programs are saved to hard disks in different countries in different encodings, but in memory, in order to be compatible with all of them (this is why a computer can run a program from any country), Unicode is used uniformly. This is why the in-memory encoding is fixed as Unicode.

You may ask: UTF-8 is also compatible with all nations, so why not use it in memory? It would work, but a fixed-width Unicode form is more efficient to process than UTF-8 (in this model each Unicode character takes a fixed 2 bytes, whereas UTF-8 lengths vary and must be computed). Unicode wastes space, so this is trading space for time.

When data is stored to the hard disk or transmitted over a network, Unicode is converted to UTF-8, because transmission pursues stability and efficiency: the less data transmitted, the more reliable the transfer.
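The space difference can be made concrete by comparing encoded sizes. This sketch uses "utf-16-le" as a stand-in for a 2-byte-per-character Unicode form; it only compares encoded byte counts and says nothing about how a particular interpreter stores strings internally. The sample text is illustrative.

```python
english = "hello world"   # 11 characters

fixed_width_size = len(english.encode("utf-16-le"))   # 2 bytes per character
utf8_size = len(english.encode("utf-8"))              # 1 byte per character here

print(fixed_width_size)   # 22
print(utf8_size)          # 11
```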

GBK -> Unicode requires decode(), and Unicode -> GBK requires encode(); that is what these two operations mean.
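In Python 3 these two directions correspond to str.encode() and bytes.decode(). A minimal sketch, with illustrative variable names:

```python
text = "中"                        # str: Unicode text in memory

gbk_bytes = text.encode("gbk")     # Unicode -> GBK bytes
back = gbk_bytes.decode("gbk")     # GBK bytes -> Unicode

print(gbk_bytes)                   # b'\xd6\xd0'
print(back == text)                # True
```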

When you browse the Web, the server converts dynamically generated Unicode content to UTF-8 and then sends it to the browser.

A Python program is special: it cannot run by itself and must be invoked through the Python interpreter, which amounts to the Python interpreter reading the program's Unicode code from memory.

Thus the Python interpreter also has its own default encoding, which can be viewed with sys.getdefaultencoding(). If the Python file does not specify the header # -*- coding: utf-8 -*-, the default is used.

Note that this encoding belongs to the Python interpreter as a piece of software.
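The interpreter's default can be inspected as follows. The printed value depends on the interpreter version, so no particular output is assumed here.

```python
import sys

# Ask the running interpreter for its default string encoding.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```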

3.3 Supplement

Setting programming aside, when we simply write a file and save it to the hard disk, a character encoding is also needed. The process is as follows.

First, to edit a document we cannot control high and low voltages ourselves to write binary to the hard disk; we use software, and software is a running program. What you type is handled by software running in memory, so until you click Save, the data you are editing is still in memory (some software saves automatically every few seconds). If the power fails at that moment, the data is lost. In other words, the data being edited is held in memory, in Unicode format.

PyCharm is by nature the same as Word: both are software that handles files.

However, if you change the encoding used for saving, that is, set the hard disk encoding to GBK, and then save the file, the file is saved to the hard disk in GBK format.

Then close PyCharm and reopen the file. If the default encoding for opening files is UTF-8, the text must come out garbled, because the path is hard disk (GBK) ---> memory (Unicode), while PyCharm reads using UTF-8.

Change PyCharm to read in GBK mode and the text displays correctly.
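The same save-and-reopen mismatch can be sketched without an editor, using open() with an explicit encoding. The file name demo.txt and the sample text are hypothetical; a temporary directory keeps the example self-contained.

```python
import os
import tempfile

text = "中文"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")   # hypothetical file name

    # Save to disk in GBK.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Reopening with UTF-8 does not recover the text.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Reopening with GBK, the encoding used for saving, works.
    with open(path, encoding="gbk") as f:
        recovered = f.read()

print(utf8_failed)         # True
print(recovered == text)   # True
```

Here Python raises an error instead of showing garbled text, because these GBK bytes are not valid UTF-8; an editor would typically display substitute characters instead.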

Summary

In fact, Word, PyCharm, and the Python interpreter can all be regarded as software that processes files.

Note:

The Python 2.7 interpreter uses ASCII as its default encoding
The Python 3.5 interpreter uses UTF-8 as its default encoding
Whether it is the Python interpreter or any other text-related software, it can only specify the character encoding used when reading a file from or writing it to the hard disk; in memory Unicode is always used
The header # -*- coding: utf-8 -*- at the top of test.py changes the encoding the Python interpreter uses to read that file.

Process:

The binary bytes of test.py are read from the hard disk and loaded into memory. At this point the Python interpreter acts like a Word-style piece of software: it uses its own encoding to decode the file's binary into Unicode and store it in memory
The Python interpreter then reads the Unicode code from memory and interprets and executes it. The encodings specified by functions during execution have nothing to do with the Python interpreter's own encoding.
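The first step of this process can be imitated with the standard library. This sketch is not the interpreter's actual implementation; it uses tokenize.detect_encoding() to read the coding header from raw bytes and then decodes them. The in-memory bytes stand in for a test.py saved in GBK.

```python
import io
import tokenize

# Bytes as they would sit on disk for a test.py saved in GBK.
source_bytes = "# -*- coding: gbk -*-\nname = '中文'\n".encode("gbk")

# Read the coding header from the raw bytes.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# Decode the bytes into Unicode text, as happens before execution.
source_text = source_bytes.decode(encoding)

print(encoding)                # gbk
print("中文" in source_text)   # True
```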

The original address is: http://www.cnblogs.com/linhaifeng/articles/5950339.html

This article is an English version of an article originally published in Chinese on aliyun.com.
