What about encoding in Python?

Source: Internet
Author: User
Could someone explain in detail, and in plain terms, the relationship between unicode, UTF-8, decode, and encode in Python 2? I feel that my understanding in this area is not clear enough. I hope you can help me. Thank you!

Reply: Python 2's handling of encodings is actually the most down-to-earth. It is Python 3, on the other hand, that leaves you stuck when you run into problems such as a mislabeled encoding ...

First, a word about encoding itself. All data stored in a computer is binary, but storing text as arbitrary bit strings would waste space and be hard to parse. The ASCII standard therefore uses a 7-bit code to represent 128 characters and control symbols. Since 7 bits are awkward for data alignment, each character is simply stored in 8 bits, with the highest bit set to 0. A byte of this form is a basic ASCII code.
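To make the 7-bit point concrete, here is a minimal Python 3 sketch (the sample string is only an illustration) that prints the code and the 8-bit pattern of a few ASCII characters and checks that the highest bit is always 0:

```python
# Every ASCII character has a code below 128, so the top bit of its byte is 0.
text = "Hi\n"

codes = [ord(ch) for ch in text]
bits = [format(code, "08b") for code in codes]
print(codes)  # [72, 105, 10]
print(bits)   # ['01001000', '01101001', '00001010']

# 0x80 is the mask for the highest bit of a byte.
high_bits_clear = all((code & 0x80) == 0 for code in codes)
print(high_bits_clear)  # True

# Encoding to ASCII yields exactly one byte per character.
ascii_bytes = text.encode("ascii")
print(len(ascii_bytes))  # 3
```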

Although these 128 characters cover the common English symbols and the necessary control characters (such as line feed and carriage return), they are of no use to users of other languages, whose characters are different ...

The first to act were the European languages written in Latin-derived alphabets. Since ASCII uses only seven bits of each byte, 128 values are still free, so the main additional symbols of those languages were assigned to them. Text is still stored one byte per character, and the spare bit is put to use. This encoding is called latin-1.
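As a small illustration in Python 3 (the character chosen is an arbitrary example), an accented letter that ASCII cannot represent fits in a single latin-1 byte whose highest bit is 1, while plain ASCII text is unchanged:

```python
# "\u00e9" is the letter e with an acute accent: outside ASCII, inside latin-1.
ch = "\u00e9"

latin1_bytes = ch.encode("latin-1")
print(latin1_bytes)  # b'\xe9'

try:
    ch.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)  # False

# The lower half of latin-1 is identical to ASCII.
same_for_ascii = "abc".encode("latin-1") == "abc".encode("ascii")
print(same_for_ascii)  # True
```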

Later, countries whose languages use other alphabets had no need for the extra Latin symbols, so they replaced those 128 upper characters with the symbols of their own script. This produced many variants of the encoding, which is the origin of code pages.
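The effect of code pages can be sketched in Python 3 by decoding one and the same byte under several of them. The three code pages below (Western European, Cyrillic, Greek) are chosen here purely as examples:

```python
# One byte value, three code pages, three different characters.
byte = b"\xe9"
code_pages = ["latin-1", "cp1251", "cp1253"]

decoded = {name: byte.decode(name) for name in code_pages}
for name, ch in decoded.items():
    # Show the character together with its Unicode code point.
    print(name, repr(ch), "U+%04X" % ord(ch))
```

The byte itself carries no hint about which code page was intended, so the reader of the data has to know it from somewhere else.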

China, Japan, and Korea, however, could not get by this way with Chinese characters: a code page offers a few dozen free slots, but there are thousands of commonly used Chinese characters ... So for Chinese there is codepage 936 / GB2312, which uses two bytes per Chinese character and covers several thousand common characters. A byte whose highest bit is 0 remains fully compatible with ASCII, while a byte whose highest bit is 1 must be read together with the byte that follows it as one Chinese character. Later came GBK, which defines more characters, is compatible with GB2312, and is also a double-byte scheme.
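A short Python 3 sketch of the double-byte scheme, using the two-character greeting "\u4f60\u597d" as an arbitrary sample: each Chinese character becomes two GBK bytes, while ASCII characters keep their single byte.

```python
text = "\u4f60\u597d"  # two Chinese characters

gbk_bytes = text.encode("gbk")
print(gbk_bytes)       # b'\xc4\xe3\xba\xc3'
print(len(gbk_bytes))  # 4, i.e. two bytes per character

# The first byte of each pair has its highest bit set, marking a double-byte character.
lead_bytes_high = all(b >= 0x80 for b in gbk_bytes[0::2])
print(lead_bytes_high)  # True

# ASCII characters are still one byte each, so mixed text works.
mixed = "ok\u4f60".encode("gbk")
print(mixed)  # b'ok\xc4\xe3'

# GBK is compatible with GB2312: these characters get the same bytes in both.
gb2312_same = text.encode("gb2312") == gbk_bytes
print(gb2312_same)  # True
```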

This still leaves two problems. First, there are so many Chinese characters that two bytes are not enough once rare characters are included. Second, under a GB encoding every double-byte sequence is interpreted as a Chinese character, so at most English and Chinese can be mixed. Truly multilingual text is out of reach, and scenarios such as network transmission suffer too: the same two bytes mean one thing in GBK and something quite different in a Japanese or Korean encoding. The encoding type therefore has to travel with the data, and if it gets lost you cannot tell which language you are looking at.
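The ambiguity can be reproduced in Python 3. The sketch below takes the GBK bytes of a Chinese greeting and decodes them with a Korean and a Japanese double-byte codec as well. The codec names are Python's built-in ones, and `errors="replace"` is used only so that any byte pair that happens to be invalid in a codec does not abort the demonstration:

```python
# The same four bytes, read under three different double-byte encodings.
data = "\u4f60\u597d".encode("gbk")

readings = {
    enc: data.decode(enc, errors="replace")
    for enc in ("gbk", "euc_kr", "euc_jp")
}
for enc, text in readings.items():
    print(enc, ascii(text))  # ascii() keeps the output printable on any console
```

Only the GBK reading gives back the original text. The other two produce unrelated characters, which is exactly the garbling seen when the encoding label is lost.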

As a result, Unicode appeared: a single character set for text in all languages, maintained by the Unicode Consortium and kept in step with ISO/IEC 10646 (it is not an ANSI standard). Unicode assigns every symbol in every language its own number, called a code point, in the range U+0000 to U+10FFFF, which fits comfortably in a 32-bit integer. With one set of codes you can therefore process many languages at the same time.
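In Python 3 the number assigned to a character can be inspected with the built-ins `ord` and `chr`. A minimal sketch, with sample characters picked arbitrarily from different scripts:

```python
import sys

# A Latin letter, an accented letter, a Chinese character, and an emoji.
samples = ["A", "\u00e9", "\u4f60", "\U0001F600"]

code_points = [ord(ch) for ch in samples]
for cp in code_points:
    print("U+%04X" % cp)

# chr() goes the other way, from number to character.
round_trip = [chr(cp) for cp in code_points] == samples
print(round_trip)  # True

# The largest code point Python 3 accepts.
print(hex(sys.maxunicode))  # 0x10ffff
```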

Unicode itself only assigns numbers and is not concerned with transmission or storage. To meet those needs, several transformation formats were defined: UTF-32, UTF-16, and UTF-8. UTF-32 is a fixed 32-bit encoding of each character that maps directly to the Unicode code point without changing it (it does, of course, have to settle the byte order for transmission). UTF-16 is a variable-length scheme that uses at least 16 and at most 32 bits per character, which keeps it compatible with older 16-bit encodings. UTF-8 uses at least 8 and at most 32 bits per character, and its English part is fully compatible with ASCII. Because it saves space and is ASCII-compatible, UTF-8 is the cheapest to adopt and has become the mainstream.
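The size differences between the three formats can be checked directly in Python 3. In this sketch the sample characters are arbitrary, and the little-endian variants (`utf-16-le`, `utf-32-le`) are used so that no byte order mark is added to the counts:

```python
samples = ["A", "\u00e9", "\u4f60", "\U0001F600"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Number of bytes each character needs in each format.
sizes = {ch: tuple(len(ch.encode(enc)) for enc in encodings) for ch in samples}
for ch, counts in sizes.items():
    print("U+%04X" % ord(ch), counts)
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+4F60 (3, 2, 4)
# U+1F600 (4, 4, 4)

# Byte order matters for the multi-byte code units of UTF-16 and UTF-32.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")
print(little, big)  # b'A\x00' b'\x00A'
```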

In Python 2, encoding comes up in three places:

First, how the source code itself is read. Originally the Python interpreter parsed source code as ASCII when building the syntax tree. Since source code may contain strings in other languages, people often reach for sys.setdefaultencoding, but that function changes the default used for implicit str/unicode conversions rather than how source files are read, and it very easily causes all sorts of problems. PEP 263 instead specifies a specially formatted comment on the first or second line of the file (the second line is for files whose first line is a Unix shebang): # coding: xxx tells the interpreter which character encoding to use when reading the source code.
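The declaration comment can be observed from Python 3 with the standard library function `tokenize.detect_encoding`, which applies the PEP 263 rules to raw source bytes. The helper name `declared_encoding` and the two sample sources are made up for this sketch. Note that Python 3 falls back to UTF-8 when no declaration is present, whereas Python 2 falls back to ASCII:

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python's tokenizer detects for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Declaration on the second line, after a Unix shebang.
with_declaration = b"#!/usr/bin/env python\n# -*- coding: gbk -*-\nprint('hi')\n"
# No declaration at all.
without_declaration = b"print('hi')\n"

print(declared_encoding(with_declaration))     # gbk
print(declared_encoding(without_declaration))  # utf-8
```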

The second part is conversion between the built-in types. The str class in Python 2 is a type that stores no encoding information at all. It processes, compares, and computes on its content byte by byte, and iterating over a str yields one byte at a time. Once we need to handle individual characters that do not fit in a single byte, Python provides only one type for the job, the unicode class (note that in memory CPython 2 stores this type as fixed-width 2-byte or 4-byte code units, depending on how the interpreter was built, not as UTF-8). Conversion in both directions is therefore needed all the time, using the encode and decode methods. In principle, decode parses a str according to the specified encoding and returns a unicode object, and encode represents a unicode object in the specified encoding and stores the result in a str.
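The decode-then-encode pattern can be sketched in Python 3, where the Python 2 `str` corresponds to `bytes` and the Python 2 `unicode` corresponds to `str`. The helper `transcode` is a hypothetical name introduced here, not a library function:

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes with one encoding, then encode the text with another."""
    return data.decode(source_encoding).encode(target_encoding)


gbk_bytes = b"\xc4\xe3\xba\xc3"  # two Chinese characters in GBK
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(utf8_bytes)  # b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Iterating over bytes yields single byte values, not characters.
print(list(utf8_bytes))  # [228, 189, 160, 229, 165, 189]

# After decoding, the length is counted in characters.
text = utf8_bytes.decode("utf-8")
print(len(utf8_bytes), len(text))  # 6 2

# Decoding with the wrong codec fails instead of guessing.
try:
    utf8_bytes.decode("ascii")
    wrong_codec_ok = True
except UnicodeDecodeError:
    wrong_codec_ok = False
print(wrong_codec_ok)  # False
```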

The third point is input and output. In Python 2, print essentially writes the bytes of a str to the output pipe. If you print a unicode object, it is automatically encoded according to the encoding derived from the locale and then written as a str. However, when no such encoding is available (the case described here is Windows without a locale setting; the same happens when output is redirected to a pipe), Python 2 falls back to the default ascii encoding, so printing Chinese characters fails. The solution is to encode manually for the output in question and write bytes that it can accept. Generally, Chinese-language Windows uses GBK and Linux uses UTF-8.
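A minimal Python 3 sketch of "encode manually before output". An in-memory `io.BytesIO` stands in for the pipe or console so the bytes can be inspected, and `write_text` is a hypothetical helper, not a standard function:

```python
import io


def write_text(stream, text, encoding):
    """Encode text explicitly, then write the resulting bytes to a binary stream."""
    stream.write(text.encode(encoding))


text = "\u4f60\u597d\n"
pipe = io.BytesIO()  # stand-in for an output that accepts bytes

# With ASCII as the output encoding, Chinese text cannot be written.
try:
    write_text(pipe, text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True

# Choosing an encoding the output side understands makes the write succeed.
write_text(pipe, text, "gbk")
write_text(pipe, text, "utf-8")
print(pipe.getvalue())  # b'\xc4\xe3\xba\xc3\n\xe4\xbd\xa0\xe5\xa5\xbd\n'
```

Which encoding is right depends on what the receiving console or program expects, so it is an input to the helper rather than something it guesses.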

In Python 3, str is unicode and bytes is similar to the old str. Source code is parsed as UTF-8 by default, and the default output encoding is usually UTF-8 as well, depending on the environment.
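The Python 3 split between the two types can be seen in a few lines. This is only an illustrative sketch, and the value printed for `sys.stdout.encoding` depends on the environment it runs in:

```python
import sys

text = "\u4f60\u597d"        # str: a sequence of Unicode characters
data = text.encode("utf-8")  # bytes: a sequence of byte values

print(type(text).__name__, type(data).__name__)  # str bytes

# Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
print(mixing_allowed)  # False

# Default for str.encode() and bytes.decode() when no encoding is given.
default_encoding = sys.getdefaultencoding()
print(default_encoding)  # utf-8

# Encoding used when printing; varies with platform and locale.
print(sys.stdout.encoding)
```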

ASCII and Unicode are character sets. UTF-8 is an encoding method for the Unicode character set.
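To show how a Unicode number turns into UTF-8 bytes, here is a hand-written sketch of the bit layout in Python 3. The function `utf8_encode_code_point` is a hypothetical name for this illustration. It assumes a valid code point and does no range or surrogate checking, so real code should use the built-in `str.encode`:

```python
def utf8_encode_code_point(cp):
    """Encode one code point as UTF-8 by hand (sketch: no validation)."""
    if cp < 0x80:      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


encoded = utf8_encode_code_point(0x4F60)
print(encoded)  # b'\xe4\xbd\xa0'

# The hand-written version agrees with the built-in encoder.
print(encoded == "\u4f60".encode("utf-8"))  # True
```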



If you do not declare the encoding of a .py file, Python 2 decodes it as ASCII by default. A file that contains non-ASCII characters must therefore declare its encoding.

Decode and encode

In [1]: a = '你好'
In [2]: a
Out[2]: '\xe4\xbd\xa0\xe5\xa5\xbd'
In [3]: b = a.decode('utf-8')
In [4]: b
Out[4]: u'\u4f60\u597d'
In [5]: type(b)
Out[5]: unicode
In [6]: type(a)
Out[6]: str
In [7]: c = b.encode('utf-8')
In [8]: c
Out[8]: '\xe4\xbd\xa0\xe5\xa5\xbd'
In [9]: c == a
Out[9]: True

Get good at searching. See Liao Xuefeng's blog: strings and encodings. There are plenty of answers to this question on the Internet.

Unicode deals in code points: \uxxxx denotes an abstract character. UTF-8 is one encoding of Unicode, which uses a certain number of bytes to represent an abstract code point \uxxxx. So UTF-8 is the actual byte string, while Unicode is abstract. You can encode abstract unicode into UTF-8, and you can decode actual UTF-8 back into unicode. and ruan...

Please search for "Clear all Chinese Characters in python2 with garbled characters".

Look, just use Python 3. I have been using Python recently, and the 2.x encoding problem has become a real headache ... http://nedbatchelder.com/text/unipain.html
This article is good ~
