Python Basic 3 character encoding

Last Update:2017-07-20 Source: Internet

Author: User

The main contents of this section:

Basic concepts
ASCII and its extensions
Chinese character encoding
Unicode
Comments
Reference pages

Basic concepts

Computers store information as strings of binary 0s and 1s. To store characters such as English letters, Chinese characters, or punctuation, we must first convert them into binary 0/1 strings. To read the stored information back, we must convert the binary 0/1 strings into the original characters before they can be displayed.

Here are two basic concepts of character encoding:

Character set: the collection of all characters that a system supports. Characters include all the text and symbols involved, such as the scripts of each country's languages, punctuation, and graphic symbols.

Character encoding: a rule that maps the characters in a character set to binary 0/1 strings, such as how many bytes to use and what information each byte stores. For example, the character A can be mapped to binary 01000001 (that is, decimal 65, hex 0x41).

Encoded characters are suitable for hard disk storage, network transmission, and so on. To display them again, the 0/1 strings must be mapped back to the original characters by applying the rule in reverse; this process is called decoding.

In addition, when standards organizations develop a coding standard, they often define the set of characters and the encoding method together. The names we commonly call "character sets", such as ASCII, GBK, and GB2312, therefore carry the meaning of "character encoding" as well as "character set".
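The mapping described above can be checked directly in Python 3. This is a minimal sketch; the variable names are only illustrative.

```python
# Encoding maps a character to bytes; decoding reverses the mapping.
char = "A"
code_point = ord(char)               # the number assigned to the character
bits = format(code_point, "08b")     # the same number as 8 binary digits
encoded = char.encode("ascii")       # character -> bytes (encoding)
decoded = encoded.decode("ascii")    # bytes -> character (decoding)

print(code_point, hex(code_point), bits, encoded, decoded)
```

Running it prints the decimal value 65, the hex value 0x41, and the bit pattern 01000001 for the character A.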

ASCII and its extensions

In the binary 0/1 strings that a computer stores, each position (bit) has only two states, 0 and 1. A byte contains 8 bits and can therefore take 2^8 = 256 states, from 00000000 to 11111111. If each state corresponds to one character, one byte can represent 256 characters.

ASCII (American Standard Code for Information Interchange) is a character encoding developed in the United States in the 1960s (first published in 1963 and substantially revised in 1967). It defines 128 of these 256 values, specifying the mapping between English letters, other common characters, and bit patterns. It is a far-reaching coding system that is still in use today.

ASCII is mainly used to display modern English. 128 symbols are sufficient for English, but far from enough for languages with more characters. Because ASCII leaves the highest bit of each byte unused, some countries used the remaining values to add new special symbols, foreign letters, and graphic symbols. These schemes, called extended ASCII codes, can just about cover the languages of some Western European countries. In extended ASCII, the first 128 codes (0-127) represent the same symbols everywhere, but the last 128 codes (128-255) differ from one country to another.
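A short sketch of both points in Python 3: a non-English letter cannot be encoded as ASCII, and one byte value above 127 stands for different characters in different extended sets. The two codecs used here, latin-1 and cp1251, are chosen only as examples of Western European and Cyrillic extensions.

```python
# ASCII covers only the code values 0-127.
print("A".encode("ascii"))

ascii_error = None
try:
    "\u00e9".encode("ascii")        # e with acute accent, not in ASCII
except UnicodeEncodeError as exc:
    ascii_error = exc
    print("not encodable as ASCII:", exc.reason)

# The same byte above 127 means different things in different extended sets.
high_byte = bytes([0xE9])
western = high_byte.decode("latin-1")    # a Western European code page
cyrillic = high_byte.decode("cp1251")    # a Cyrillic code page
print(hex(ord(western)), hex(ord(cyrillic)))
```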

Chinese character encoding

ASCII encodes English characters well but cannot handle the large number of Chinese characters. To display Chinese, dedicated Chinese encoding standards were designed: GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains a total of 7,445 characters: 6,763 Chinese characters and 682 other symbols. In the Chinese character area, the high byte ranges from B0 to F7 and the low byte from A1 to FE, which gives 72*94 = 6,768 code positions. Five of these positions, D7FA-D7FE, are unused.
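The arithmetic behind these figures can be reproduced as follows. This is a small sketch using Python's built-in gb2312 codec; the character chosen is the one at the first position of the Chinese character area.

```python
# Byte ranges of the GB2312 Chinese character area.
high_bytes = range(0xB0, 0xF7 + 1)    # B0..F7
low_bytes = range(0xA1, 0xFE + 1)     # A1..FE

positions = len(high_bytes) * len(low_bytes)
unused = 5                            # D7FA..D7FE
hanzi_count = positions - unused

# The first Chinese character in the area sits at high byte B0, low byte A1.
first = "啊".encode("gb2312")

print(len(high_bytes), len(low_bytes), positions, hanzi_count, first)
```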

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21,003 characters. GB18030, published in 2000, is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters as well as the scripts of major minority languages such as Tibetan, Mongolian, and Uyghur. At the time of writing, PC platforms are required to support GB18030, while embedded products are not, so devices such as mobile phones and MP3 players generally support only GB2312.

From ASCII through GB2312 and GBK to GB18030, these encodings are backwards compatible: the same character always has the same encoding in each of them, and each later standard supports more characters. English and Chinese can be handled uniformly in these encodings. A Chinese character is distinguished from an ASCII character by the highest bit of its high byte, which is not 0. In programmers' terminology, GB2312, GBK, and GB18030 all belong to the double-byte character set (DBCS) family, although GB18030 also uses four-byte sequences for some characters.
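The compatibility claim can be checked with Python's built-in codecs. The sketch below assumes three sample characters picked for illustration: an ASCII letter, a Simplified Chinese character present in all three standards, and a Traditional Chinese character that GB2312 does not cover.

```python
encodings = ["gb2312", "gbk", "gb18030"]

# An ASCII letter and a common Chinese character encode identically in all three.
ascii_bytes = {enc: "A".encode(enc) for enc in encodings}
hanzi_bytes = {enc: "中".encode(enc) for enc in encodings}
print(ascii_bytes)
print(hanzi_bytes)

# The highest bit of the first (high) byte is set for the Chinese character.
high_bit_set = bool(hanzi_bytes["gbk"][0] & 0x80)

# A later standard supports more characters: this one is missing from GB2312.
try:
    "龍".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
gbk_bytes = "龍".encode("gbk")

print(high_bit_set, in_gb2312, len(gbk_bytes))
```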

The default internal code of some Chinese versions of Windows is still GBK, which can be upgraded to GB18030 with the GB18030 upgrade package. However, the characters that GB18030 adds over GBK are rarely needed by ordinary users, so GBK is still commonly used to refer to the internal code of Chinese Windows.

Unicode

ASCII cannot represent all the characters and symbols in the world, so a new encoding that can represent all of them is needed: Unicode.

Unicode (also called universal code or single code) is a character encoding used on computers. It was created to address the limitations of traditional character encoding schemes, and it assigns a uniform and unique number (code point) to every character in every language. Unicode was originally designed around 16 bits (2 bytes) per character, that is, 2**16 = 65536 values. Note that 2 bytes is a minimum: the code space has since grown beyond 65536, so some characters need more.
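As a rough illustration, the sketch below prints the code point of three sample characters and the number of bytes each takes in UTF-16, one encoding that uses 2 bytes as its minimum unit. The choice of UTF-16 and of the sample characters is an assumption made for this example.

```python
samples = ["A", "中", "\U0001F600"]   # a letter, a Chinese character, an emoji

code_points = [ord(ch) for ch in samples]
utf16_sizes = [len(ch.encode("utf-16-le")) for ch in samples]

for cp, size in zip(code_points, utf16_sizes):
    print(hex(cp), size, "bytes in UTF-16")
```

The last sample has a code point above 65535, so it cannot fit in 2 bytes and takes 4.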

UTF-8 is one specific implementation of Unicode. It is a variable-length encoding and is compatible with ASCII. Instead of using a minimum of 2 bytes, it classifies all characters and symbols by range: ASCII characters are stored in 1 byte, European characters in 2 bytes, East Asian characters in 3 bytes, and so on.
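The variable length is easy to observe in Python 3. This is a minimal sketch with one sample character per size class; the samples themselves are arbitrary.

```python
samples = ["A", "\u00e9", "中", "\U0001F600"]   # ASCII, European, East Asian, emoji

utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_sizes)

# ASCII compatibility: an ASCII character has the same byte in both encodings.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")
print(same_as_ascii)
```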

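The default source encoding is UTF-8 in Python 3 and ASCII in Python 2. Chinese characters in a Python 2 source file therefore cause an error unless the declaration # -*- coding: utf-8 -*- is added at the top of the file. The declaration is not needed in Python 3.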
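A sketch of a source file that carries the declaration is shown below. The declaration only takes effect on the first or second line of a file; in Python 3 it is optional and is included here only to show where it goes. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import sys

# In Python 3 the default string encoding is UTF-8.
default_encoding = sys.getdefaultencoding()

# Chinese characters can appear directly in the source.
greeting = "你好"

print(default_encoding, len(greeting), greeting.encode("utf-8"))
```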

Comments

Single-line comment: # commented content

Multi-line comment: """ commented content """
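Both forms are sketched below. Strictly speaking, a triple-quoted block is a string literal rather than a comment; it is commonly used as one, and as the first statement of a function it becomes that function's docstring. The function add is a hypothetical example.

```python
# This is a single-line comment; Python ignores everything after the '#'.
total = 1 + 2  # a comment can also follow code on the same line

"""
A triple-quoted string that is not assigned to anything
is often used as a multi-line comment.
"""


def add(a, b):
    """Return the sum of a and b."""
    return a + b


print(total, add(1, 2))
```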

Reference pages

http://noalgo.info/571.html?utm_source=tuicool&utm_medium=referral

http://python.jobbole.com/82107/

This article is an English version of an article originally written in Chinese on aliyun.com.
