About encoding problems in Python 2

Last Update:2016-06-06 Source: Internet

Author: User



Could an expert explain, in detail and in plain terms, the relationship between Unicode, UTF-8, decode, and encode in Python 2?

My understanding of this area is not clear enough. I hope an expert can help. Thank you!

Reply content:

Python 2's treatment of strings is actually the closest to how text is really encoded as bytes. It is Python 3 where, if you hit an encoding error or a similar problem, tracking it down can drive you to despair ...

Let's start with what an encoding is. Any data in a computer is stored in binary. Storing a string of text as a picture would take too much space and be hard to parse, so the ASCII standard uses 7 bits to represent 128 characters, including the control characters. Seven bits are awkward for data alignment, so each character is simply stored in 8 bits with the highest bit set to 0, which is exactly one byte. This is basic ASCII encoding.
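As a small illustration of the 7-bit point, here is a sketch written for Python 3 (an assumption made so it runs on a current interpreter; the thread itself is about Python 2). It shows that every ASCII character has a code below 128, so the highest bit of its byte is 0.

```python
# Every ASCII character has a code below 128, so it fits in 7 bits
# and is stored in one byte whose highest bit is 0.
text = "Hi\n"
data = text.encode("ascii")

codes = list(data)                       # numeric value of each byte
bits = [format(b, "08b") for b in data]  # the same values as 8 binary digits

print(codes)  # [72, 105, 10]
print(bits)   # ['01001000', '01101001', '00001010']
```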

These 128 characters cover the common English symbols and the necessary control characters (such as newline and carriage return), but they are of no use to users of other languages, whose characters are simply different ...

First, the Western European languages pointed out that since a character takes one byte but only 7 bits are used, the 128 values with the highest bit set to 1 are still free. The main accented Latin letters were mapped to those values, still one byte per character. This set of codes is called Latin-1.

Later, most other countries with alphabetic scripts said: we do not use those Latin symbols, so we will remap the 128 extra values to our own letters. This produced many different code tables, the original code pages.
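To make the code page idea concrete, here is a sketch (Python 3 syntax, chosen as an assumption so it runs today) that decodes the same single byte under three different code pages. The codec names are the ones shipped with Python's standard library.

```python
# One byte with the highest bit set means a different letter
# in each single-byte code page.
raw = b"\xe9"

as_latin1 = raw.decode("latin-1")  # Western European: 'é'
as_cp1251 = raw.decode("cp1251")   # Cyrillic: 'й'
as_cp1253 = raw.decode("cp1253")   # Greek

# Bytes below 0x80 are plain ASCII in all three code pages.
ascii_same = {b"A".decode(cp) for cp in ("latin-1", "cp1251", "cp1253")}

print(as_latin1, as_cp1251, as_cp1253, ascii_same)
```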

But the countries that use ideographic scripts, led by China, Japan and Korea, could not do this. An alphabet has a few dozen symbols, while the commonly used Chinese characters alone number in the thousands. So code page 936 / GB2312 was created for Chinese. It uses two bytes per Chinese character and covers several thousand common characters. A byte whose highest bit is 0 is plain ASCII, so the encoding is fully ASCII-compatible; a byte whose highest bit is 1 must be read together with the following byte as one Chinese character. Then came GBK, which defines more characters, is compatible with GB2312, and is also a double-byte encoding.
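The byte layout described above can be checked with a short sketch (Python 3 syntax, an assumption for runnability). It encodes one ASCII letter and one Chinese character with the standard "gbk" codec and looks at the highest bit of each byte.

```python
# One ASCII letter followed by one Chinese character.
mixed = "A中"
gbk_bytes = mixed.encode("gbk")

# True where the highest bit of the byte is 1.
high_bits = [b >= 0x80 for b in gbk_bytes]

print(gbk_bytes)   # b'A\xd6\xd0'
print(high_bits)   # [False, True, True]

# GBK is compatible with GB2312 for the common characters.
same_in_gb2312 = mixed.encode("gb2312") == gbk_bytes
```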

However, two obstacles remained. First, Chinese has so many characters that, once rare characters are counted, two bytes are not enough. Second, under the GB encodings every double-byte sequence is interpreted as a Chinese character, so at most Chinese and English can be mixed, not several languages at once. This also hurts scenarios such as network transmission: the same double-byte data means one thing as GBK Chinese and something obviously different as Japanese or Korean, so the data must travel together with its encoding type, and with a little carelessness nobody knows which language it is.
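Here is a sketch of that ambiguity (Python 3 syntax, an assumption for runnability): the same four bytes are decoded as GBK, as EUC-KR and as Shift_JIS. The `errors="replace"` argument is only there so the demonstration cannot fail on a byte pair that a codec does not define.

```python
# Four bytes that spell a Chinese greeting in GBK.
raw = "你好".encode("gbk")

as_gbk = raw.decode("gbk")
as_euc_kr = raw.decode("euc_kr", errors="replace")       # read as Korean
as_shift_jis = raw.decode("shift_jis", errors="replace")  # read as Japanese

# Same bytes, three different texts: without the encoding name
# there is no way to tell which one was meant.
print(repr(raw), as_gbk, as_euc_kr, as_shift_jis)
```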

Therefore Unicode appeared: a multi-language character standard maintained by the Unicode Consortium and kept in step with ISO/IEC 10646 (it is not an ANSI standard). Unicode assigns each character a unique number, its code point, in the range U+0000 to U+10FFFF, which fits comfortably in 32 bits. Any symbol in any language has its own code point, so one set of codes can handle many different languages at the same time.
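A quick sketch of "one number per character", using Python 3 built-ins (an assumption; in Python 2 the counterparts are `ord` on a unicode character and `unichr`):

```python
import sys

# Every character maps to exactly one code point, and back.
code_point = ord("你")        # 0x4F60
character = chr(0x597D)       # '好'

# Escape notation is just another way to write the same code points.
same_text = "\u4f60\u597d" == "你好"

# Highest code point a Python 3 str can hold.
highest = sys.maxunicode

print(hex(code_point), character, same_text, hex(highest))
```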

Unicode itself only assigns numbers; it does not deal with transmission or storage. To meet that need, several transformation formats were defined, of which UTF-32, UTF-16 and UTF-8 are the most common. UTF-32 is a fixed 32 bits per character and stores the code point unchanged (although byte order has to be dealt with in transmission). UTF-16 is variable-length, 16 bits at minimum and 32 bits at maximum, which keeps it compatible with older 16-bit (UCS-2) data. UTF-8 is also variable-length, 8 bits at minimum and 32 bits at maximum, and its English part is fully ASCII-compatible. Because it saves space and is ASCII-compatible, UTF-8 became the mainstream choice at the lowest cost.
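The size differences between the three formats can be seen in a short sketch (Python 3 syntax, an assumption for runnability). The "-le" codec variants are used so that no byte order mark is added to the counts.

```python
# Bytes needed per character in each transformation format.
samples = ["A", "你", "\U0001F600"]  # ASCII letter, Chinese character, emoji

sizes = {
    ch: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    for ch in samples
}

for ch, (u8, u16, u32) in sizes.items():
    print(repr(ch), "utf-8:", u8, "utf-16:", u16, "utf-32:", u32)

# The ASCII part of UTF-8 is byte-for-byte identical to ASCII.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
```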

In Python 2, three areas are related to encoding:

The first is how the source code itself is decoded. The original Python interpreter parsed source code as ASCII when building the syntax tree. Because source files may contain strings in other languages, people often reached for sys.setdefaultencoding, but that interface changes the implicit str/unicode conversion rather than how the source is read, and it very easily causes problems. PEP 263 specifies that a specially formatted comment, # coding: xxx, on the first line of the file, or on the second line when the first line is a Unix shebang, tells the interpreter which character encoding to use when reading the source code.
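As a sketch of the PEP 263 comment at work, the snippet below builds a tiny GBK-encoded source file in memory and lets Python read the declaration. It is written for Python 3 (an assumption for runnability), where `tokenize.detect_encoding` exposes the lookup the interpreter performs. The variable name `greeting` and the file name `<demo>` are made up for illustration.

```python
import io
import tokenize

# A two-line source file, stored as GBK bytes, with a PEP 263 comment.
source = "# -*- coding: gbk -*-\ngreeting = '你好'\n".encode("gbk")

# Ask Python which encoding the file declares.
declared, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# Compiling the raw bytes honors the declaration as well.
namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)

print(declared, namespace["greeting"])
```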

The second is conversion between the built-in types. The str class in Python 2 does not store any encoding information; it simply holds the content as bytes. Iterating over a str "string" splits it directly into bytes. Once we need to handle individual characters that are not single-byte encoded, Python provides just one type for the job, the unicode class (note that CPython 2 stores it internally as fixed-width UCS-2 or UCS-4 code units depending on the build, not as UTF-8). So it is often necessary to convert between the two with the encode and decode methods. In principle, decode parses a str using the specified encoding and produces a unicode object, and encode stores a unicode object into a str using the specified encoding.
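The byte view versus the character view can be sketched as follows. The code is Python 3 (an assumption for runnability), where `bytes` plays the role of Python 2's str and `str` plays the role of unicode. One difference to keep in mind: iterating Python 3 bytes yields integers, while iterating a Python 2 str yields one-byte strings.

```python
text = "你好"                 # character view (unicode in Python 2)
data = text.encode("utf-8")   # byte view (str in Python 2)

char_count = len(text)        # counts characters
byte_count = len(data)        # counts bytes

first_char_bytes = list(data)[:3]  # the three bytes of the first character

# decode reverses encode when the same encoding is used.
round_trip = data.decode("utf-8") == text

print(char_count, byte_count, [hex(b) for b in first_char_bytes], round_trip)
```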

The third is input and output. In essence, Python 2's print writes the contents of a str to the output stream. If you print a unicode object, it is first encoded automatically into a str, using the encoding Python detected for the stream (sys.stdout.encoding, which is derived from the terminal or locale settings), and then written. When no encoding can be detected, for example when the locale is not set or the output is redirected to a pipe, Python 2 falls back to ASCII, so encoding Chinese text naturally fails. The workaround is to encode the text yourself into an encoding the output can accept. On Chinese-language Windows this is generally GBK; on Linux it is generally UTF-8.
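The workaround of encoding by hand might look like the sketch below. It is Python 3 code (an assumption for runnability), `write_text` is a hypothetical helper rather than a library function, and an `io.BytesIO` object stands in for a byte-oriented pipe or file.

```python
import io


def write_text(stream, text, encoding):
    """Hypothetical helper: encode explicitly, then write the raw bytes."""
    stream.write(text.encode(encoding))


pipe = io.BytesIO()  # stand-in for a pipe that only accepts bytes
write_text(pipe, "你好", "gbk")
written = pipe.getvalue()

# With ASCII as the target encoding, Chinese text cannot be encoded.
try:
    write_text(io.BytesIO(), "你好", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(written, ascii_failed)
```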

In Python 3, str is the old unicode type and bytes is similar to the old str. Source code is parsed as UTF-8 by default, and the default output encoding is usually UTF-8 as well.
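A minimal Python 3 sketch of that split: text and bytes are separate types, `encode()` defaults to UTF-8, and the two cannot be mixed without an explicit conversion.

```python
text = "你好"
data = text.encode()  # no argument: UTF-8 is the default in Python 3

text_is_str = isinstance(text, str)
data_is_bytes = isinstance(data, bytes)

# Python 3 refuses to combine the two types implicitly.
try:
    text + data
    mixing_rejected = False
except TypeError:
    mixing_rejected = True

print(data, text_is_str, data_is_bytes, mixing_rejected)
```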

ASCII and Unicode are character sets. UTF-8 is one encoding of the Unicode character set.

If you do not declare how a .py file is encoded, Python 2 decodes it as ASCII by default. So you need to declare the file's encoding.

Decode and encode

In [1]: a = '你好'
In [2]: a
Out[2]: '\xe4\xbd\xa0\xe5\xa5\xbd'
In [3]: b = a.decode('utf-8')
In [4]: b
Out[4]: u'\u4f60\u597d'
In [5]: type(b)
Out[5]: unicode
In [6]: type(a)
Out[6]: str
In [7]: c = b.encode('utf-8')
In [8]: c
Out[8]: '\xe4\xbd\xa0\xe5\xa5\xbd'
In [9]: c == a
Out[9]: True
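The session above uses the right codec on both sides. What happens with the wrong one can be sketched like this (Python 3 syntax, an assumption for runnability; the byte values are the ones from the session):

```python
data = b"\xe4\xbd\xa0\xe5\xa5\xbd"  # the UTF-8 bytes from the session

# Decoding with a codec that cannot represent these bytes fails.
try:
    data.decode("ascii")
    bad_position = None
except UnicodeDecodeError as exc:
    bad_position = exc.start  # index of the first offending byte

# errors="replace" avoids the exception but loses the text:
# each undecodable byte becomes U+FFFD.
lossy = data.decode("ascii", errors="replace")

correct = data.decode("utf-8")

print(bad_position, len(lossy), correct)
```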

Learn to search. See Liaoche's blog: Strings and encodings. There is a lot about this online.

Unicode is the code point, an abstract \uXXXX that represents a character. UTF-8 is a concrete form of Unicode that uses a number of bytes to represent the abstract code point \uXXXX. So UTF-8 is the actual byte string, and Unicode is abstract. You can encode abstract Unicode into UTF-8, and you can decode actual UTF-8 back into Unicode.

So much has been said already, and then Ruan ... Please search for "Python 2 Chinese garbled characters, explained clearly once and for all".

Just use Python 3. I have been using Python recently, and the 2.x encoding problems have made me question life; it is a real pit. http://nedbatchelder.com/text/unipain.html
This article is good ~



