Huang Cong: To fix garbled Chinese text in Python, first understand the difference between "characters" and "bytes"

Source: Internet
Author: User

Reprinted from: http://hcsem.com/2095/

Let me look at the character problem. I am not entirely clear on the details of how Python handles encodings, but from a quick look the principle is similar to Perl's.

The most important thing is to distinguish between "characters" and "bytes". A "character" is abstract, while a "byte" is concrete.

For example, the character "中" is represented by the following bytes in different encodings:

GBK        Big5       UTF-8           UTF-16LE
\xD6\xD0   \xA4\xA4   \xE4\xB8\xAD    \x2D\x4E

The abstract character "中" is not "\xD6\xD0", "\xA4\xA4", or any other byte sequence. It should be understood as the character that "\xD6\xD0" refers to in GBK (the "signified", in linguistic terms), or the character that "\xE4\xB8\xAD" refers to in UTF-8, but never as the specific bytes themselves.
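The byte values above can be checked directly. The sketch below assumes Python 3 (the article itself targets Python 2.x), where `str` holds abstract characters and `bytes` holds concrete bytes. It encodes the character 中 (U+4E2D) with each codec and prints the result in hex; the variable names are illustrative only.

```python
# Python 3 sketch: one abstract character, four concrete byte sequences.
char = "中"

encodings = ["gbk", "big5", "utf-8", "utf-16-le"]
encoded = {name: char.encode(name) for name in encodings}

for name, raw in encoded.items():
    # bytes.hex() shows the raw byte values whether or not they are printable.
    print(name, raw.hex())
```

Decoding each byte sequence with its own codec gives back the same single character, which is the sense in which the character is independent of any particular bytes.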

The catch is that abstract characters must be stored and transmitted as data. In other words, to hold the character "中" inside a program you have to use some concrete bytes: "\xD6\xD0", "\xE4\xB8\xAD", or "\x2D\x4E" would all do. On Windows, Python uses UTF-16LE (?), which means the carrier encoding for its "characters" is UTF-16LE.

sys.setdefaultencoding(name)
Set the current default string encoding used by the Unicode implementation.

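That is what the documentation says. If I understand it correctly, the function changes the carrier encoding of "characters": after sys.setdefaultencoding('gbk'), the character "中" would be carried inside the program by "\xD6\xD0" rather than "\x2D\x4E". (Strictly speaking, the function only changes the codec Python 2 uses for implicit conversions between str and unicode; the internal representation of unicode objects is fixed when the interpreter is built.)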

What is the difference between str and unicode in Python 2.x? The names alone are easy to confuse. In practice, think of str as a "byte string" and unicode as a "character string" (the word "string" is conventionally translated into Chinese as "character string", which invites confusion here). Consider the following example:

# -*- coding: gb2312 -*-

s = "张三李四"
print len(s)  # => 8
u = s.decode('gbk')
print len(u)  # => 4

My script is encoded in GBK rather than UTF-8, so len(s) is 8, the actual number of bytes these four Chinese characters occupy, while len(u) is 4, the number of characters.
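The same count can be reproduced in Python 3, which is assumed here rather than the Python 2.x used in the article. In Python 3, `bytes` plays the role of Python 2's str and `str` that of unicode. The UTF-8 line is an addition for comparison and is not part of the original example.

```python
# Python 3 sketch: len() counts bytes on bytes objects and characters on str.
text = "张三李四"                   # four abstract characters
gbk_bytes = text.encode("gbk")      # GBK: two bytes per Chinese character
utf8_bytes = text.encode("utf-8")   # UTF-8: three bytes for each of these

print(len(text), len(gbk_bytes), len(utf8_bytes))
```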

What do encode and decode mean? Encoding converts meaning into symbols; decoding restores symbols to meaning. Here, encode should be understood as converting abstract characters into concrete bytes, and decode as restoring concrete bytes to abstract characters.
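A small Python 3 sketch (an assumption, since the article uses Python 2.x) shows why the codec must match in both directions. Decoding with the codec that produced the bytes restores the characters; decoding with a different one gives the garbled text the title refers to. `errors="replace"` is used only so that the mismatched decode cannot raise.

```python
# Python 3 sketch: round trip with the right codec versus the wrong one.
text = "中文"
raw = text.encode("utf-8")                      # characters -> bytes

restored = raw.decode("utf-8")                  # bytes -> characters, same codec
garbled = raw.decode("gbk", errors="replace")   # same bytes, wrong codec

print(restored == text)
print(garbled == text)
```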

The problem is that the str class and the unicode class both have both an encode and a decode method, a design I disagree with. If the division is between bytes and characters, encode should belong only to unicode and decode only to str, because "meaning" can only be converted into "symbols"; converting meaning into meaning again is meaningless.

Suppose we do this:

# -*- coding: gb2312 -*-

s = "张三李四"
u = s.decode('gbk')  # Fine: bytes are decoded into characters, symbols restored to meaning.
s2 = s.encode('gbk')
# Error (UnicodeDecodeError)! Bytes cannot be encoded into bytes again.
# It works only if s is pure ASCII, but then s2 is identical to s, so what is the point?
u2 = u.decode('gbk')
# Error again (UnicodeEncodeError)! It works only if u is pure ASCII,
# in which case u2 is identical to u and the operation is equally pointless.
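For comparison, Python 3 (assumed here; the article predates its wide adoption) adopts exactly the split argued for above: `str` has only encode and `bytes` has only decode. The sketch below checks this with `hasattr`; `error_name` is an illustrative variable.

```python
# Python 3 sketch: encode exists only on str, decode only on bytes.
text = "张三李四"
raw = text.encode("gbk")

print(hasattr(text, "encode"), hasattr(text, "decode"))  # str: encode only
print(hasattr(raw, "encode"), hasattr(raw, "decode"))    # bytes: decode only

error_name = None
try:
    raw.encode("gbk")  # bytes -> bytes makes no sense, and the method is absent
except AttributeError as exc:
    error_name = type(exc).__name__
```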

A word on how Perl handles this. I do not know whether Python's approach to encodings was taken directly from Perl or is simply common practice across languages (Ruby, though, does not work this way). Either way, the Python 2.x version is flawed.

Perl has only one string type, but it does in fact distinguish character strings from byte strings (with UTF-8 as the underlying carrier encoding). Unlike Python 2.x with its separate str and unicode, a Perl string carries a UTF-8 flag: when the flag is on, the string is a "character" string; when it is off, it is a "byte" string. Its encoding and decoding functions are:

$octets = encode(ENCODING, $string [, CHECK])

$string = decode(ENCODING, $octets [, CHECK])

$octets is a byte string and $string is a character string. That is, encode applies only to $string and decode only to $octets, unlike Python, where str and unicode each have both methods and one of the two is always useless. Larry Wall is a linguist, and the character/byte relationship he designed conforms fully to the linguistic theory of signifier and signified. GvR probably has no background in linguistics, so Python does not handle this as subtly.

Now for the encoding problem with file.write:

# -*- coding: gb2312 -*-

s = "张三李四"
u = s.decode('gbk')

f = open('text.txt', 'w')
f.write(u)  # Error!
f.write(u.encode('gbk'))  # Works: bytes are written
f.close()

The cause of the error is simple: you are trying to output "characters" rather than "bytes". As said above, "characters" are abstract, and you cannot write an abstract object to a file. Abstract characters are of course backed by concrete bytes underneath, but Python seems unwilling to let the bytes underlying unicode mix with I/O, so f.write(a_unicode) fails. If a_unicode contains only ASCII characters the call does succeed, but that is a shortcut, and a confusing one.
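In Python 3 (assumed here; the article's examples are Python 2.x) the same rule is enforced more explicitly. A text-mode file is given an encoding and performs the character → byte conversion itself, while a binary-mode file accepts only bytes. The sketch writes to a scratch file in a temporary directory; the path and variable names are hypothetical.

```python
# Python 3 sketch: characters must become bytes before reaching the disk.
import os
import tempfile

text = "张三李四"
path = os.path.join(tempfile.mkdtemp(), "text.txt")  # hypothetical scratch file

# Text mode: the file object encodes characters to bytes on write.
with open(path, "w", encoding="gbk") as f:
    f.write(text)

# Binary mode: read back the raw bytes that actually landed on disk.
with open(path, "rb") as f:
    on_disk = f.read()

# Binary mode refuses abstract characters outright.
try:
    with open(path, "wb") as f:
        f.write(text)
except TypeError:
    binary_rejects_str = True
else:
    binary_rejects_str = False

print(len(on_disk), binary_rejects_str)
```

Passing `encoding` explicitly avoids depending on the platform default, which differs between systems.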

What, then, does the u prefix mean? Simply that the byte → character conversion is performed automatically:

# -*- coding: gb2312 -*-

s_or_u1 = "张三李四"
print type(s_or_u1)  # => <type 'str'>

s_or_u2 = u"张三李四"
print type(s_or_u2)  # => <type 'unicode'>

u"张三李四" is equivalent to "张三李四".decode(a_enc), where a_enc is the gb2312 declared on the # coding line.

I have to say that this way of handling characters (whether or not it comes from Perl) is obscure. The distinction between characters and bytes is hard enough to grasp, and Python does not handle the details well on top of that. Perl handles it cleanly and is still hard to understand; Python, which is not clean, is worse. As an aside, here is a brief look at how Ruby handles characters, which is completely different from Perl:

Ruby does not distinguish between characters and bytes. Every string is a "byte string with an encoding attribute". Since there are no abstract characters, there is no byte → character conversion and no need for a decode method, so Ruby's String class has only encode. Without an abstract "character" concept, Ruby's encoding model should be easier to understand than Perl's or Python's.

Another benefit of having no "characters" is that multi-byte text needs no intermediate conversion. To process Chinese text in Perl when the source file is GBK-encoded, you must first convert it to UTF-8; Python must convert it to UTF-16 before processing. For massive texts this conversion costs real resources. Ruby skips it and processes GBK or any other encoding directly. This may also reflect the situation with Japanese: Shift-JIS (?) is a native encoding that is not ASCII-compatible, unlike GBK, so being able to process documents in the native encoding without conversion matters. If Perl's character/byte distinction is the academic, linguistic approach, Ruby's is the practical approach to multi-byte character processing.
