Character encoding Based on Python

Source: Internet
Author: User

Character encoding Based on Python

1. What is character encoding?

If a computer wants to work, it must be powered on. That is to say, 'electric 'drives the computer to work. The characteristic of 'electric' is high voltage (high and low voltage, namely, binary number 1, low voltage, that is, binary number 0 ), that is to say, computers only recognize numbers.

The purpose of programming is to let the computer work, and the programming result is simply a bunch of characters, that is to say, what we ultimately want to achieve in programming is: a bunch of characters drive the computer to work.

Therefore, a process is required:

String -------- (translation process) -------> Number

This process is actually how a character corresponds to the standard of a specific number. This standard is called character encoding.

Ii. character encoding Classification

The computer was invented by Americans and its earliest character encoding was ASCII, which only stipulated the correspondence between English letters and numbers and some special characters and numbers. Up to 8 bits can be used for representation (one byte), that is, 2 ** 8 = 256. Therefore, the ASCII code can only represent up to 256 symbols.

Of course, we can use English in programming languages, and ASCII is enough. However, when processing data, different countries have different languages. Japanese will add Japanese to their programs, chinese will join Chinese.

To express Chinese, a single byte table is used to represent a man. It is impossible to complete the table (even the pupils know more than two thousand Chinese characters). There is only one solution, A byte is represented by a byte larger than 8-bit binary. The more digits it represents, the more changes it represents. In this way, as many Chinese characters as possible can be expressed.

Therefore, the Chinese have defined their own standard gb2312 encoding, and defined the correspondence between character-> number including Chinese characters.

The Japanese have defined their own Shift_JIS encoding.

Koreans have defined their own Euc-kr code (In addition, Koreans say they invented computers and demanded that the world use Korean code in a unified manner)

At this time, the problem arises. Zhou, who is proficient in 18 languages, writes a document in 8 languages modestly. according to the standards of which country, garbled characters will occur (because all the standards at the moment only stipulate the correspondence between characters in their country's text and numbers. If a country's encoding format is used, the text in other countries will be garbled during parsing)

Therefore, unicode came into being because of the urgent need for a world standard (which can contain all the languages of the world) (Korean people expressed their dissatisfaction and were useless)

Ascii represents a character in 1 byte (8-bit binary)

Unicode usually uses two bytes (16-bit binary) to represent one character. For uncommon characters, it must use four bytes.

Example:

The letter x, represented in ascii in decimal 120, binary 0111 1000

The Chinese characters are beyond the ASCII encoding range. The Unicode encoding is 20013 in decimal format and 01001110 in binary format.

The letter x represents binary 0000 0000 0111 1000 in unicode format. Therefore, unicode is compatible with ascii and universal. It is the world standard.

At this time, the garbled problem disappears. We use all the documents, but the new problem arises. If all our documents are in English, unicode will consume twice as much space as ascii, inefficient in storage and transmission

In the spirit of saving, there is a UTF-8 code that converts Unicode encoding to "Variable Length Encoding. UTF-8 encoding encodes a Unicode character into 1-6 bytes according to different numbers, common English letters are encoded into 1 byte, Chinese characters are usually 3 bytes, only uncommon characters are encoded into 4-6 bytes. If the text you want to transfer contains a large number of English characters, using UTF-8 encoding can save space:

Character ASCIIUnicodeUTF-8

A010000000000000 0100000101000001

Medium x01001110 0010110111100100 10111000 10101101

From the above table we can also find that UTF-8 encoding has an additional benefit, that is, ASCII encoding can actually be seen as part of UTF-8 encoding, so, A large number of legacy software that only supports ASCII encoding can continue working under UTF-8 encoding.

Character encoding Based on Python

Iii. character encoding conversion relationship

3.1 program running principle

First of all, it should be clear that all components of a computer work collaboratively, and data transmission is in binary form. In the computer's view, there is no text, and everything is binary.

Cpu: Extracts binary commands from memory for execution

Memory: Extracts binary data from the hard disk for cpu operation.

Hard Disk: stores the texts that humans know in binary format on disks.

Files and program files are special files. to read the file content or run the program

After a programmer develops a program, he finally writes a bunch of manually defined text symbols that humans think are meaningful and saved to the hard disk in binary format.

The program is running. The operating system finds the location where the program code is stored on the hard disk and reads the binary data to the memory.

The python interpreter reads binary data from the memory and implements interpretation.

3.2 ultimate Secrets

First, we define a memory variable in the terminal: name = 'lin Haifeng '. Then the memory variable is stored in the memory (which must be binary), so we need an encoding, is unicode (the character encoding in the memory is fixed to unicode)

However, if we write the file name = 'lin Haifeng 'and save it to the hard disk (which must be binary), we also need an encoding, which is related to various countries. Suppose we use gbk, then the file is saved to the hard disk in gbk format.

Program to run: Hard Disk binary (gbk) ---> memory binary (unicode)

That is to say, all programs will eventually be loaded into the memory, and the program will be stored in different countries on the hard disk in different encoding formats, however, in the memory, we use unicode in a unified and fixed manner to ensure compatibility with all nations (which is why computers can run programs in any country). That is why unicode is used for fixed memory, you may say that I can use UTF-8 for compatibility with other countries. Yes, it works properly. The reason why unicode is not necessarily more efficient than UTF-8 (uicode is fixed with 2 bytes encoding, UTF-8 needs to be calculated), But unicode is a waste of space. That's right. This is a way to use space for time, and it is stored on the hard disk or transmitted over the network, unicode must be converted to UTF-8. Because data transmission is stable and efficient, the smaller the data size, the more reliable the data transmission is. Therefore, unicode is converted to UTF-8 instead of unicode.

Gbk-> unicode requires decode (), unicode-> gbk requires encode ().

When browsing the Web page, the server will convert the dynamically generated Unicode content into a UTF-8 and then transmit it to the browser

The python program is special. To run the program that requires the python interpreter to call, it is equivalent to the python interpreter to read the unicode code of the program in the memory.

Therefore, the python interpreter also has a interpreter whose default encoding can be sys. getdefaultencoding (). If the header information is not specified in the python file #-*-coding: UTF-8-*-, use the default

Note that this encoding is the code of the python interpreter software.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.