Python-character encoding chapter

Last Update: 2017-07-23 Source: Internet

Author: User

Tags: coding standards

Chapter Content

What is character encoding?
Python default encoding
Decoding (decode) and encoding (encode)

Objective

Character encoding drives many newcomers to Python crazy, and I was one of them, so I set out to get to the bottom of the problem.

First, what is character encoding

First, we need to know that all data in a computer, whether text, pictures, video, or audio, is ultimately stored in binary (a sequence of 0s and 1s). The computer only knows numbers; it does not know that this is an "A" and that is a "B". Computers were first developed for English, so that is where encoding started. We also know that 1 byte = 8 bits. How many different values can 8 bits express? Each bit can be 0 or 1, so 8 bits give 2**8 (2 to the power of 8), or 256 combinations. In the early days computers were used mainly in the United States, and the digits, letters (upper and lower case), punctuation, spaces, and other basic symbols added up to only 128 characters, so 256 values were more than enough. One byte, in its various combinations, was therefore used to store English text. Once each character is assigned a number, recognizing the number is equivalent to recognizing the character, and that is how the computer came to support English. Now, to the point:
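The arithmetic above can be checked with a minimal Python 3 sketch. The variable names are illustrative only and do not come from the original article:

```python
# One byte is 8 bits; each bit is 0 or 1, so there are 2**8 combinations.
bits_per_byte = 8
combinations = 2 ** bits_per_byte
print(combinations)            # 256

# To the computer a character is just a number.
code = ord("A")                # the number that stands for "A"
print(code, bin(code))         # 65 0b1000001
print(chr(code))               # back from the number to the character: A
```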

　　Character encoding: Also called character set encoding. It maps the characters of a character set to objects in a specified set (for example bit patterns, sequences of natural numbers, octets, or electrical pulses), so that text can be stored in a computer and transmitted over a communication network. The goal is to let the computer "recognize" characters.

　　ASCII: In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and bit patterns. This is ASCII, and it is still in use today. It is where everything started, so it should not be hard to understand. ASCII uses a 7-bit binary number to represent a character, and 7 bits can represent 2**7 = 128 characters in total.
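As a rough illustration of the 7-bit limit, the following Python 3 sketch (variable names are my own) encodes an English word with the built-in "ascii" codec and shows that a Chinese character cannot be encoded with it:

```python
# ASCII uses 7 bits, so it defines 2**7 = 128 code points (0 to 127).
ascii_size = 2 ** 7

# Every ASCII character fits in a single byte whose value is below 128.
encoded = "Hello".encode("ascii")
all_below_128 = all(b < 128 for b in encoded)

# A character outside ASCII cannot be encoded with it.
try:
    "中".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(ascii_size, list(encoded), all_below_128, ascii_failed)
```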

As computers spread, English characters alone could not meet every country's needs, so countries began claiming the values after the ASCII range for their own characters, and all sorts of encodings appeared. A pointer to a detailed history of encodings is given at the end of this section.

　　ANSI: A family of character encodings created so that computers could support more languages. One byte in the range 0x00~0x7F represents one English character, which is plain ASCII (7 bits only). Characters outside that range are encoded with values from 0x80 to 0xFFFF, that is, an extended ASCII encoding that also uses the 8th bit. Within the ASCII range these encodings must stay consistent with ASCII. On Windows, "ANSI" really means the Windows code pages; for Simplified Chinese the encoding is GBK, which is code page 936.
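A small Python 3 sketch of the code page 936 point, assuming the codec names "cp936" and "gbk" that Python's standard library provides (the variable names are illustrative):

```python
# Python exposes Windows code page 936 under the codec name "cp936".
# It produces the same bytes as the "gbk" codec.
same_as_gbk = "中文".encode("cp936") == "中文".encode("gbk")

# Inside the ASCII range the bytes are identical to plain ASCII.
same_in_ascii_range = "A".encode("cp936") == "A".encode("ascii")

# Outside the ASCII range a Chinese character takes two bytes, each 0x80 or above.
chinese = "中".encode("cp936")
print(same_as_gbk, same_in_ascii_range, [hex(b) for b in chinese])
```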

　　Note: In the single-byte extensions, the values 128 to 255 are assigned to Latin characters, so the single byte is completely used up.
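The note does not name a specific encoding, so the sketch below assumes Latin-1 (ISO 8859-1) as one common example of a single-byte encoding that fills the values 128 to 255 with Latin characters:

```python
# In Latin-1, byte value 0xE9 (233) is the accented letter é.
e_acute = bytes([0xE9]).decode("latin-1")
print(e_acute)                      # é

# Latin-1 assigns a character to every one of the 256 byte values,
# so nothing is left over in a single byte.
all_bytes = bytes(range(256)).decode("latin-1")
print(len(all_bytes))               # 256
```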

　　GB2312: From the above we know that the computer could now "recognize" English but not Chinese, and the single byte was already used up. What to do? The Chinese designers wrote a new table and dropped the Latin characters that used the 8th bit. They specified that a byte below 128 keeps its original meaning, while two consecutive bytes above 127 together represent one Chinese character. In this way more than 6,000 Chinese characters were added. The first byte (high byte) ranges from 0xA1 to 0xF7, and the second byte (low byte) ranges from 0xA1 to 0xFE. GB2312 is in effect a Chinese extension of ASCII. Of course, 6,000-odd characters cannot cover every need in Chinese, so GBK and GB18030 followed; you can look those up on your own.
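The two-byte scheme can be observed with Python 3's built-in "gb2312" codec. This is only a sketch with illustrative variable names:

```python
# A character below 128 keeps its ASCII meaning and takes one byte.
ascii_bytes = "A".encode("gb2312")

# A Chinese character becomes two bytes, both above 127.
zhong = "中".encode("gb2312")
high, low = zhong               # iterating bytes yields integers

print(ascii_bytes, [hex(high), hex(low)])

# Both bytes fall inside the high-byte and low-byte ranges quoted above.
print(0xA1 <= high <= 0xF7, 0xA1 <= low <= 0xFE)
```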

　　Unicode: Because each country had its own encoding standard and the standards did not support one another, many problems arose. A universal code was therefore developed: Unicode, maintained by the Unicode Consortium in step with the ISO/IEC 10646 standard. Unicode was originally designed to use two bytes per character, which gives 65,536 possible values (the code space has since been extended beyond that range). This solves the problem of encodings not supporting one another.
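In Python 3 the Unicode number (code point) of a character can be inspected with ord(). The sketch below is illustrative; it also shows a character whose code point does not fit in two bytes, which is why modern Unicode is not limited to 16 bits:

```python
# Every character has one Unicode code point, whatever its language.
print(hex(ord("A")))      # 0x41
print(hex(ord("中")))     # 0x4e2d

# Two bytes cover code points up to 0xFFFF; some characters lie beyond that.
emoji = "\U0001F600"
needs_more_than_two_bytes = ord(emoji) > 0xFFFF
print(needs_more_than_two_bytes)
```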

　　UTF-8: UTF-8 is one way of encoding Unicode, so why is it needed? If every Unicode character takes 2 bytes, then characters that used to fit in 1 byte now need 2. Isn't that a waste of space? Our programs also contain far more English than Chinese, so UTF-8 was introduced to save space. Under UTF-8, an English character takes only one byte, and a Chinese character takes 3 bytes.
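The byte counts can be confirmed with a short Python 3 sketch (the sample strings are my own):

```python
english = "A".encode("utf-8")
chinese = "中".encode("utf-8")
print(len(english), english)   # 1 b'A'
print(len(chinese), chinese)   # 3 b'\xe4\xb8\xad'

# Mixed text: 7 ASCII characters (1 byte each) plus 2 Chinese characters (3 bytes each).
mixed = "Python 编码"
print(len(mixed), len(mixed.encode("utf-8")))   # 9 13
```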

Although UTF-8 has good international compatibility, Chinese text occupies 50% more database storage in UTF-8 than in GBK or Big5, so it is recommended mainly for users with special requirements for international compatibility. To put it simply: for sites with mostly Chinese content, GBK saves database space. For sites with mostly English content, UTF-8 is a good fit, because English characters take only one byte.

The above is a basic introduction. For the detailed history, see: Character coding development history
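The 50% figure follows from 3 bytes per Chinese character in UTF-8 versus 2 in GBK. A minimal Python 3 sketch, where encoded_size is a hypothetical helper written for this illustration:

```python
def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

chinese_text = "字符编码" * 100      # 400 Chinese characters
english_text = "encoding" * 100     # 800 ASCII characters

gbk_size = encoded_size(chinese_text, "gbk")       # 2 bytes per character
utf8_size = encoded_size(chinese_text, "utf-8")    # 3 bytes per character
print(gbk_size, utf8_size, utf8_size / gbk_size)   # 800 1200 1.5

# For pure ASCII text the two encodings take the same space.
print(encoded_size(english_text, "gbk"), encoded_size(english_text, "utf-8"))
```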

Second, the Python default encoding

First, a basic look at operating system default encodings. In a Chinese-language Windows environment, the default encoding of CMD is GBK: run the chcp command in CMD and it returns "Active code page: 936" (936 is GBK). The default terminal encoding on Linux is UTF-8.

When Python reads source code and checks its syntax, the strings in the source are first converted from the declared encoding to the Unicode type. After the syntax check passes, they are converted back to the original encoding.
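From inside Python 3, the encodings in effect for the current terminal and locale can be inspected roughly as follows. The printed values depend on the machine, so no particular output is assumed:

```python
import locale
import sys

# Encoding Python uses when writing text to the terminal.
# getattr guards against replaced stdout objects that lack the attribute.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding the operating system prefers for text, based on locale settings.
preferred_encoding = locale.getpreferredencoding()

print("stdout:", stdout_encoding)
print("locale:", preferred_encoding)
```

On a Chinese-language Windows CMD this would typically report a GBK/cp936 variant, and on Linux usually UTF-8, matching the description above.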

In the Python 2.x environment

The Python 2.x default encoding is ASCII, which can be checked as follows:

    import sys
    sys.getdefaultencoding()

But do not be naive and think that declaring UTF-8 as the encoding is enough to output Chinese. In the Windows CMD window the character encoding is GBK, so the output must be GBK. The UTF-8 Chinese you pass in is incompatible with GBK, so it still comes out garbled.
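The garbling can be simulated in Python 3 by decoding UTF-8 bytes as if they were GBK. This is only a sketch of what a GBK terminal effectively does; errors="replace" is used so that byte pairs GBK cannot map do not raise an exception:

```python
text = "中文"
utf8_bytes = text.encode("utf-8")

# A GBK terminal interprets the UTF-8 bytes as GBK, producing garbled text.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled == text)                      # False

# Decoding with the encoding that was actually used recovers the text.
print(utf8_bytes.decode("utf-8") == text)   # True

# Converting to GBK before output gives the bytes a GBK terminal expects.
gbk_bytes = text.encode("gbk")
print(gbk_bytes.decode("gbk") == text)      # True
```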

Workaround:

    # -*- coding: utf-8 -*-

    # Define a variable whose content is Chinese; the source encoding is UTF-8
    temp = "中文"

    # Decode: you must specify what the original encoding is
    temp_unicode = temp.decode("utf-8")

    # Encode: you must specify which encoding to convert to
    temp_gbk = temp_unicode.encode("gbk")

    # Output the GBK-encoded result
    print(temp_gbk)

    # -*- coding: utf-8 -*-

    # Define a variable whose content is Chinese; the source encoding is UTF-8
    temp = "中文"

    # Decode: you must specify what the original encoding is
    temp_unicode = temp.decode("utf-8")

    # Print the Unicode string directly
    print(temp_unicode)
    # The Windows terminal requires GBK; the Unicode string is converted to GBK automatically on output

Note: decode() and encode() are covered in a later section.

In the Python 3.x environment

　　Using the method above, you can see that the Python 3.x default encoding is UTF-8. Encoding handling improved greatly from Python 2.x to 3.x. Encoding problems in 2.x will definitely give you headaches, and 3.x is much easier. Although Unicode is available in both 2.x and 3.x, 2.x has two string types (unicode and str), and the default is str. That means if you want to change the encoding of a string, you must first decode it into Unicode and then encode it into the target encoding. In 3.x the default string type is Unicode. For example, strings no longer distinguish between "abc" and u"abc"; the string "abc" is Unicode by default.

[Figure: a diagram borrowed from teacher Alex, which applies only to Python 2.x]
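A short Python 3 sketch of these points (variable names are illustrative):

```python
import sys

print(sys.getdefaultencoding())     # utf-8 on Python 3

plain = "abc"
prefixed = u"abc"                   # the u prefix is accepted but changes nothing
print(type(plain) is str, plain == prefixed, type(prefixed) is str)

# Changing encoding needs no preliminary decode: str is already Unicode.
gbk_bytes = "中文".encode("gbk")
print(type(gbk_bytes) is bytes)
```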

More on Unicode

Python's popularity owes a great deal to its adoption of Unicode, which gives it friendly support for the characters used in different countries and regions.

Now think about it: we have said that the default in Python 2.x is ASCII, so what happens when our code contains characters that ASCII does not support? Let's look at a diagram first:

(Image source: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html)

As the diagram shows, all text that goes beyond the ASCII range is decoded into Unicode when it is read in, is held as Unicode in Python's memory while the program runs, and is encoded again on output. So Unicode, the universal code, is no joke: it is the transit point through which any encoding can be processed.

Note: In Python 2.x, a string written with a leading u, such as u'abc', is a string of the unicode type. In 3.x the prefix is no longer needed, because every string is already Unicode.
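The "transit point" idea can be sketched in Python 3 as a conversion that always passes through Unicode. The function name transcode is hypothetical and written only for this illustration:

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another via Unicode (hypothetical helper)."""
    text = data.decode(source_encoding)      # bytes -> Unicode str
    return text.encode(target_encoding)      # Unicode str -> bytes

utf8_bytes = "中文".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
print(gbk_bytes)                             # b'\xd6\xd0\xce\xc4'

# Going back through Unicode recovers the original UTF-8 bytes.
round_trip = transcode(gbk_bytes, "gbk", "utf-8")
print(round_trip == utf8_bytes)              # True
```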

Third, decode() and encode()

The two sections above should already have given us a feel for these two methods.

decode (decoding): Removes the original encoding and decodes the data into Unicode. In Python 3, decode changes the bytes type into the str type, so the str type has no decode method.

encode (encoding): Converts Unicode into another encoding. In Python 3, encode changes the str type into the bytes type, so the bytes type has no encode method.
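In Python 3 this division of labor can be checked directly. A minimal sketch:

```python
text = "中文"
data = text.encode("utf-8")      # str -> bytes

print(type(text).__name__, type(data).__name__)   # str bytes
print(hasattr(text, "decode"))   # False: str has no decode()
print(hasattr(data, "encode"))   # False: bytes has no encode()
print(data.decode("utf-8") == text)               # True: bytes -> str
```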

These two methods often cause errors, because people are unclear about which character encodings are compatible with one another and about which type of object each method applies to. So we need to sort out the compatibility issues. For example, UTF-8 and GBK both support Chinese but are incompatible with each other: one is an encoding derived from Unicode, and the other is a character set built by extending ASCII.

Two examples to make this clearer:
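The compatibility relationships can be probed with a Python 3 sketch. The sample characters are my own choice: 龍 is a traditional character that exists in GBK but not in GB2312.

```python
simplified = "中文"

# GBK extends GB2312, so GB2312 bytes can be decoded as GBK.
gb2312_bytes = simplified.encode("gb2312")
gbk_reads_gb2312 = gb2312_bytes.decode("gbk") == simplified

# The reverse is not guaranteed: 龍 can be encoded in GBK but not in GB2312.
try:
    "龍".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
in_gbk = len("龍".encode("gbk")) == 2

# UTF-8 and GBK are incompatible: GBK bytes are not valid UTF-8 here.
try:
    simplified.encode("gbk").decode("utf-8")
    utf8_accepts_gbk = True
except UnicodeDecodeError:
    utf8_accepts_gbk = False

print(gbk_reads_gb2312, in_gb2312, in_gbk, utf8_accepts_gbk)
```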

In Python 2.x:

    # -*- coding: utf-8 -*-
    # Specify the source encoding

    import sys
    # Import the module

    print(sys.getdefaultencoding())
    # Print the default encoding

    name = "Lyon"
    name_gb2312 = name.decode("utf-8").encode("gb2312")
    # First decode to Unicode (the original encoding must be specified), then encode to gb2312

    gb2312_to_gbk = name_gb2312.decode("gbk").encode("gbk")
    # First decode to Unicode, as above, then encode to GBK

    print(name)
    # Print name

    print(name_gb2312)
    # Print the name encoded as GB2312

    print(gb2312_to_gbk)
    # Print the name encoded as GBK

In Python 3.x:

    import sys
    # Import the module

    print(sys.getdefaultencoding())
    # Print the default encoding

    name = "Lyon"
    # name_gb2312 = name.decode("utf-8").encode("gb2312")  # Python 2
    name_gb2312 = name.encode("gb2312")
    # In Python 3 str is Unicode by default, so no decode is needed

    gb2312_to_unicode = name_gb2312.decode("gb2312")
    # Decode into Unicode

    gb2312_to_utf8 = name_gb2312.decode("gb2312").encode("utf-8")
    # Convert to utf-8

    print(name)
    # Print name

    print(name_gb2312)
    # Print the name encoded as GB2312

    print(gb2312_to_unicode)
    # Print the name as Unicode

    print(gb2312_to_utf8)
    # Print the name encoded as utf-8

That finally wraps this up. I hope readers will help point out any shortcomings and mistakes. Thanks!
