Python character encoding

Source: Internet
Author: User

In Python, a string can be written as "string" or as u"string". Why are there two ways to write a string, rather than only the first?

The type() function shows that the first is a str object and the second is a unicode object. What is the difference between the two? What do the commonly used encode() and decode() methods do? It is also said that Python uses two-byte encoding. What does this mean?

To answer the above questions, we must first clarify several concepts about encoding:

Character Set: The set of characters that can be recognized and processed. For example, ASCII defines a set of 128 characters, each of which fits in one byte, including English letters, digits, symbols, and some control characters. The ASCII character set is, of course, relatively small. The character set used by Python (Unicode) includes basically all the characters used in the world, such as Chinese, English, and Japanese characters, so basically all characters can be processed in Python.

Code Point: A computer cannot recognize characters directly (it can only work with binary values), so for a computer to process and store a character, the character must be mapped to a number, which can be expressed in binary. This number is called the code point of the character. Characters and code points are in one-to-one correspondence, and Unicode defines this mapping.

Encode: Unicode specifies the code point of each character but does not specify how a computer stores that code point. That is the job of encodings such as UTF-8, GBK, and UTF-16, each of which stores the character in a different way. For example, the code point of "中" is U+4E2D (U+ indicates Unicode, and 4E2D is the code point value in hexadecimal). Encoded with GBK, Big5, UTF-8, and UTF-16, the actual bytes are as follows:

GBK        Big5       UTF-8          UTF-16
\xD6\xD0   \xA4\xA4   \xE4\xB8\xAD   \x2D\x4E

(The UTF-16 bytes are shown in little-endian order, without a byte order mark.)

Decode: Turning the actual bytes back into the code points of the characters they represent. For example, decoding '\xD6\xD0' with GBK yields 4E2D (the code point of "中"). Decoding it with UTF-8 raises an error, because the bytes were not produced by UTF-8 encoding.
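The four concepts above can be checked in a few lines. This is a minimal sketch that assumes Python 3, where the text type is str and the byte-string type is bytes (in the Python 2 used by this article, the corresponding types are unicode and str). The variable names are illustrative only.

```python
# Assumption: Python 3 (str is text, bytes is encoded data).
ch = "中"

# Code point: the number Unicode assigns to the character.
code_point = ord(ch)
print(hex(code_point))

# Encode: the same character becomes different bytes under each encoding.
encoded = {name: ch.encode(name) for name in ("gbk", "big5", "utf-8", "utf-16-le")}
for name, raw in encoded.items():
    print(name, raw.hex())

# Decode: the bytes turn back into the character only with the matching encoding.
decoded = encoded["gbk"].decode("gbk")

try:
    encoded["gbk"].decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(decoded, utf8_failed)
```

The "utf-16-le" codec is used here because the plain "utf-16" codec prepends a byte order mark, which the table above does not show.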

The preceding section briefly introduces characters, encoding, and decoding. For more information, see [character encoding and decoding].

Python uses two kinds of string to separate a character from its binary representation. A unicode object represents the characters themselves and carries no information about how they are encoded into bytes. A str object represents the bytes of an encoded string. One unicode object can be encoded into several different str objects by using different encodings (such as UTF-8 and GBK), and each of those str objects is one binary representation of the string. Conversely, different str objects can decode to equal unicode objects (the strings are equal, although they are distinct objects in memory). Unicode also exists to solve the problems caused by having many different encodings, so we recommend storing strings uniformly as unicode objects.

Unicode does not specify any particular binary representation. The code points of most commonly used characters (those up to U+FFFF) fit in two bytes, which is presumably why Python is said to use two-byte encoding (I am not sure how to interpret this).

Because a str object does not record its encoding, it can only be treated as a byte string. When a str object is printed or decoded and its encoding is not known, Python can only fall back to the default encoding. If that does not match the actual encoding, garbled characters or errors result.

str object:
Although it is called a string, a str object is the binary representation of a string encoded in some specific encoding. It is really a sequence of bytes that stores binary information, so "byte string" is the more fitting name. For example:
>>> str = '你好'    # "你好" is encoded with the encoding set by the system; run the locale command to check it
>>> str
'\xe4\xbd\xa0\xe5\xa5\xbd'    # with a UTF-8 locale, "你好" is encoded as a six-byte string
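The relationship between one piece of text and its several byte strings can be sketched as follows. The example assumes Python 3, where bytes plays the role of the Python 2 str object described here; the names text, utf8_bytes, and gbk_bytes are illustrative.

```python
# Assumption: Python 3, where bytes plays the role of the Python 2 str object.
text = "你好"

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# A byte string holds only bytes; its length depends on the encoding used.
print(len(text), len(utf8_bytes), len(gbk_bytes))

# Different byte strings, one piece of text.
same_text = utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk")
print(utf8_bytes != gbk_bytes, same_text)
```

Nothing inside utf8_bytes or gbk_bytes says which encoding produced it, which is why the encoding has to be supplied again when decoding.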

unicode object:

A unicode object represents "characters". Because a computer cannot work with characters directly, code points stand in for them, as follows:

>>> u"你好"
u'\u4f60\u597d'

Code point 4F60 represents "你" and 597D represents "好". This is only a mapping between values and characters; no specific encoding is involved.
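The mapping between characters and code points can be inspected directly. A small sketch, assuming Python 3 (where str holds code points in the way the Python 2 unicode object does); the names are illustrative.

```python
# Assumption: Python 3, where str holds code points like the Python 2 unicode object.
text = "你好"

# ord() gives the code point of a character; chr() goes the other way.
code_points = [ord(ch) for ch in text]
print([f"U+{cp:04X}" for cp in code_points])

rebuilt = "".join(chr(cp) for cp in code_points)

# A \u escape names a code point directly and involves no byte encoding.
print(rebuilt == "\u4f60\u597d")
```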
Decoding a str object yields the code points of the string, that is, a unicode object. Encoding a unicode object yields its actual binary representation, that is, a str object. To convert a str object from one encoding to another, first decode it into a unicode object, then encode that unicode object into a str object in the target encoding. The following code converts a str object from UTF-8 to GBK:
>>> str = "你好"
>>> str
'\xe4\xbd\xa0\xe5\xa5\xbd'    # encoded with the OS encoding, UTF-8
>>> unicode = str.decode("utf8")    # decode the bytes into code point values
>>> unicode
u'\u4f60\u597d'
>>> str_gbk = unicode.encode("gbk")    # encode the code points as GBK
>>> str_gbk
'\xc4\xe3\xba\xc3'
>>> unicode.encode()    # with no encoding given, the default encoding is used (the same applies to decode); ASCII cannot encode Chinese characters, so an error occurs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> u = u"你好"    # the u prefix automatically converts the UTF-8 input into a unicode object
>>> u
u'\u4f60\u597d'
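The decode-then-encode route can be wrapped in a small function. This is a sketch under the assumption of Python 3 (bytes in, bytes out); transcode is a hypothetical helper name, not a standard library function.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from source to target by way of text (hypothetical helper)."""
    text = data.decode(source)    # bytes -> code points
    return text.encode(target)    # code points -> bytes


utf8_bytes = b"\xe4\xbd\xa0\xe5\xa5\xbd"    # "你好" in UTF-8
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
print(gbk_bytes)

# ASCII has no bytes for these characters, so encoding to it fails.
try:
    transcode(utf8_bytes, "utf-8", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```

Naming both encodings explicitly avoids any dependence on a default encoding, which is the source of the ASCII error in the session above.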

Writing to a file:
>>> file = open("test.txt", "a")
>>> file.write(str)
>>> file.write(str_gbk)
>>> file.write(unicode)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
The UTF-8 str object and the GBK str object are both written into test.txt successfully, because at the byte level each is just a sequence of bytes. Writing the unicode object, however, fails. Why? A str object holds the encoded binary form of the string, which can be written to a file directly. A unicode object holds the code points of the string, which are abstract values that stand for characters and cannot be written to a file as they are. Python therefore tries to encode the unicode object with the default encoding, ASCII, and write the result. Because ASCII cannot encode "你好", an error is raised.
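The same distinction shows up when writing files in Python 3, which this sketch assumes. Bytes go to a file opened in binary mode, and text goes to a file opened in text mode with an explicitly chosen encoding. The temporary path is a throwaway location picked for illustration.

```python
import os
import tempfile

text = "你好"
path = os.path.join(tempfile.mkdtemp(), "test.txt")    # throwaway location

# Binary mode accepts byte strings as they are, whatever their encoding.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))
    f.write(text.encode("gbk"))

with open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding performs the encode step on write.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "rb") as f:
    encoded_on_write = f.read()
print(raw, encoded_on_write)
```

Python 3 raises a TypeError if text is written to a binary-mode file, rather than falling back to ASCII as Python 2 does.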
Modifying the default encoding:
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("gbk")    # [2]
>>> del str    # remove the variable named str assigned earlier, so that the built-in str() is visible again
>>> str(unicode)    # encodes unicode ("你好") with the default GBK and returns a str object
'\xc4\xe3\xba\xc3'
>>> unicode.encode()    # encodes unicode ("你好") with the default GBK and returns a str object
'\xc4\xe3\xba\xc3'
>>> "你好".decode()    # the system uses UTF-8, so "你好" is a UTF-8 byte string; decoding it with the default GBK raises no error but gives the wrong result, so a byte string must be decoded with the encoding it was actually written in
u'\u6d63\u72b2\u30bd'
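The wrong result from the last step can be reproduced without touching any default. A minimal sketch, assuming Python 3, which does not provide sys.setdefaultencoding; the encoding is passed explicitly instead.

```python
# Assumption: Python 3; encodings are passed explicitly rather than set as a default.
utf8_bytes = "你好".encode("utf-8")

# Decoding with the wrong encoding may not raise; it can return wrong characters.
wrong = utf8_bytes.decode("gbk")
right = utf8_bytes.decode("utf-8")
print(repr(wrong), repr(right))
```

Because a mismatched decode can succeed silently, the absence of an exception does not show that the right encoding was used.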
 




References:
[1] http://www.newsmth.net/bbscon.php?bid=284&id=84741
[2] Using http://www.joelonsoftware.com/articles/unicode.html
[4] http://www.unicode.org/
