First, a classification of character encodings:
The computer was invented by the Americans, so the earliest character encoding was ASCII, which only defined the mapping between numbers and English letters, digits, and a handful of special characters. It uses at most 8 bits (one byte), that is 2**8 = 256, so ASCII can represent at most 256 symbols.
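A minimal Python 3 sketch of this arithmetic (the outputs shown in the comments are illustrative):

```python
# One byte holds 8 bits, so at most 2 ** 8 = 256 distinct values.
print(2 ** 8)                      # 256

# Every ASCII character therefore fits in a single byte.
print(ord('A'))                    # 65 -- the number ASCII assigns to 'A'
print(chr(65))                     # 'A' -- and back again
print(len('A'.encode('ascii')))    # 1 -- one byte
```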
Of course, writing the program itself in English is no problem; ASCII is enough for that. But when processing data, different countries use different languages: Japanese programmers put Japanese in their programs, and Chinese programmers put Chinese in theirs.
To represent Chinese, a single byte is simply not enough (even an elementary school student knows more than 2000 Chinese characters). The only solution is to use more than 8 bits, i.e. more than one byte per character: the more bits, the more possible combinations, and the more Chinese characters can be represented.
So China set its own standard, GB2312, which specifies the mapping between numbers and characters, including Chinese characters.
Japan set its own Shift_JIS encoding.
Korea set its own EUC-KR encoding (Koreans also like to claim that the computer was invented by them and that the whole world should standardize on the Korean encoding).
At this point a problem arises. Xiao Zhou, who is proficient in 18 languages, modestly writes a document using 8 of them. No matter which country's standard is used to open this document, garbled text appears, because each of these standards only maps numbers to its own country's characters; if one country's encoding is used, the text in all the other languages comes out garbled when the document is parsed.
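A minimal Python 3 sketch of that garbling: the same bytes mean different characters under different national standards (the exact gibberish you see depends on the codecs, but it will not be the original text):

```python
# Encode Chinese text with China's GB2312 standard ...
data = "中文".encode("gb2312")      # b'\xd6\xd0\xce\xc4'

# ... then decode those very same bytes as Shift_JIS (Japan) or EUC-KR (Korea).
# The numbers now map to completely different characters, i.e. mojibake.
print(data.decode("shift_jis", errors="replace"))
print(data.decode("euc_kr", errors="replace"))
```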
So a world standard was urgently needed, one that could contain all the languages of the world, and Unicode was born (the Koreans objected, but to no avail).
ASCII uses 1 byte (8 bits of binary) to represent one character.
Unicode commonly uses 2 bytes (16 bits of binary) to represent one character; rare characters need 4 bytes.
Examples:
The letter x is decimal 120 in ASCII, binary 0111 1000.
The Chinese character 中 is beyond the ASCII range; its Unicode encoding is decimal 20013, binary 01001110 00101101.
In Unicode, the letter x is the binary 00000000 01111000, so Unicode is backward compatible with ASCII and at the same time covers every nation's characters; it is the world standard.
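In Python 3, where str is Unicode, these numbers are easy to verify (a minimal sketch):

```python
# ord() returns the Unicode code point; format(..., '016b') shows the 2-byte binary form.
print(ord('x'))                     # 120
print(format(ord('x'), '016b'))     # 0000000001111000
print(ord('中'))                    # 20013
print(format(ord('中'), '016b'))    # 0100111000101101
```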
At this point the garbled-text problem disappears as long as all our documents use Unicode. But a new problem appears: if a document is entirely in English, Unicode takes twice as much space as ASCII, which makes storage and transmission wasteful.
In the spirit of saving space, the variable-length encoding UTF-8 appeared, which converts Unicode to 1-6 bytes per character depending on the code point: common English letters are encoded in 1 byte, Chinese characters usually take 3 bytes, and only very rare characters are encoded in 4-6 bytes. If the text you want to transfer contains a large number of English characters, UTF-8 saves space:
| Character | ASCII | Unicode | UTF-8 |
|-----------|-------|---------|-------|
| A | 01000001 | 00000000 01000001 | 01000001 |
| 中 | X (not representable) | 01001110 00101101 | 11100100 10111000 10101101 |
The table above also shows an extra benefit of UTF-8: ASCII encoding can actually be seen as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work under UTF-8.
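The byte counts above are easy to confirm in Python 3 (a minimal sketch; utf-16-be stands in here for the fixed 2-byte Unicode form described earlier, with no byte-order mark):

```python
for ch in "A中":
    print(ch,
          len(ch.encode("utf-16-be")),   # fixed-width 2-byte Unicode form
          len(ch.encode("utf-8")))       # variable-length UTF-8 form
# A  -> 2 bytes as UTF-16, only 1 byte as UTF-8
# 中 -> 2 bytes as UTF-16, 3 bytes as UTF-8

# ASCII is a subset of UTF-8: pure-ASCII text produces identical bytes.
print("hello".encode("ascii") == "hello".encode("utf-8"))   # True
```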
Python character encoding (iii)