Python 2 encoding summary

Last Update:2016-01-14 Source: Internet

Author: User

The following are several encoding problems that come up often in Python 2, together with their explanations.

# -*- coding: utf-8 -*-

Python 2 assumes that source files are ASCII-encoded by default, but in practice source code often contains Chinese characters. So that programs containing Chinese characters do not raise errors, and in keeping with international convention, we generally declare the file encoding as UTF-8.

The declaration can be written in many ways, as long as a comment on the first or second line of the file matches the regular expression coding[:=]\s*([-\w.]+). The usual form is # -*- coding: utf-8 -*-.
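To see which declarations the pattern accepts, the sketch below applies it to a few candidate comment lines. The helper name declared_encoding and the sample lines are illustrative assumptions, not part of Python itself, and the interpreter's real check applies further rules (for example, the line must be a comment).

```python
import re

# Pattern quoted in the text above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(line):
    """Return the encoding name found in a source line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None


samples = [
    "# -*- coding: utf-8 -*-",        # Emacs style
    "# vim: set fileencoding=gbk :",  # Vim style
    "# coding=latin-1",               # bare form
    "# an ordinary comment",          # no declaration
]
for line in samples:
    print("%-32s -> %s" % (line, declared_encoding(line)))
```

The Vim-style line matches because the pattern only looks for the text coding followed by : or =, wherever it appears in the line.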

str = "你好"
print str

Running the preceding code produces the following error: SyntaxError: Non-ASCII character '\xe4' in file D:/TestPython/test/111.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details. The message says that the program contains non-ASCII characters. Once the UTF-8 declaration is added, the error no longer occurs.

# -*- coding: utf-8 -*-
str = "你好"
print str

Although no error is reported in the preceding statement, the output is garbled. Why? This is what we will talk about below.

Encode and decode

Before explaining encoding and decoding, let's talk about the relationship between Unicode and UTF-8. We recommend this blog to you.

A string is made up of characters, and characters are stored in computer hardware in binary form. That binary form is the encoding. If the mapping were made directly, as "string ↔ characters ↔ binary representation (encoding)", converting between different encodings would become more complex. An abstraction layer is therefore introduced: "string ↔ characters ↔ storage-independent representation ↔ binary representation (encoding)". Characters can then be represented in a storage-independent form, and a conversion between two encodings first goes to this abstraction layer and then to the target encoding. Here, Unicode is the storage-independent representation and UTF-8 is a binary representation.

A string in Python 2 has two forms: str and unicode. str can be understood as the binary encoded form described above, and unicode as the abstraction layer. encode means encoding, that is, converting from unicode to a binary encoding such as UTF-8 or gb2312. decode means decoding, that is, converting from a binary encoding to unicode. See the code below:

# -*- coding: utf-8 -*-

str1 = "你好"
print type(str1)
str2 = str1.decode("utf-8")
print type(str2)

str1 is of type str and is converted to type unicode by decode.

Now look at encode:

# -*- coding: utf-8 -*-
str1 = u"你好"
print type(str1)
str2 = str1.encode("utf-8")
print type(str2)

str1 is of type unicode and is converted to type str by encode.
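The role of unicode as the layer between encodings can be sketched as follows. This is a minimal illustration rather than code from the original article: the variable names are made up, and the text is written with \u escapes so that the snippet runs unchanged on Python 2 and Python 3.

```python
text = u"\u4f60\u597d"  # two Chinese characters, held as abstract text

# encode: abstract text -> bytes in one concrete encoding
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# The same two characters take a different number of bytes in each encoding.
print("characters: %d, UTF-8 bytes: %d, GBK bytes: %d"
      % (len(text), len(utf8_bytes), len(gbk_bytes)))

# Converting GBK bytes to UTF-8 bytes passes through the abstract layer:
# decode first, then encode.
converted = gbk_bytes.decode("gbk").encode("utf-8")
print(converted == utf8_bytes)
```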

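Let's return to the first question: why does the code print garbled characters? The file is UTF-8 encoded, so the str holds UTF-8 bytes, and print writes those bytes to the console unchanged. A console that uses a different encoding (for example, a Windows console using GBK) cannot display them correctly. So we need to convert first: decoding to unicode lets print encode the text in the console's own encoding.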

# -*- coding: utf-8 -*-
str = "你好"
str = str.decode("utf-8")
print str
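The mismatch can be reproduced without a real console. The sketch below assumes a console that expects GBK and shows what happens when it is handed UTF-8 bytes. The names are illustrative, and the \u escapes let it run on both Python 2 and Python 3.

```python
import sys

text = u"\u4f60\u597d"  # two Chinese characters
utf8_bytes = text.encode("utf-8")

# A console that assumes GBK reads the UTF-8 bytes with the wrong table,
# so the result is a different (garbled) string.
misread = utf8_bytes.decode("gbk", "replace")
print(misread == text)

# Printing text rather than bytes leaves the choice of encoding to Python,
# which uses the console's encoding when it is known.
console_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
for_console = text.encode(console_encoding, "replace")
```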

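When the data contains bytes or characters that cannot be converted, the 'ignore' error handler can be passed to encode or decode so that the conversion still completes, for example .encode('utf-8', 'ignore') or .decode('utf-8', 'ignore'). Note that 'ignore' silently drops whatever cannot be converted.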
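The effect of the error handler can be seen on a byte string with one invalid byte appended. This is a small sketch with made-up sample data. It also shows 'replace', another standard handler, for comparison.

```python
# Valid UTF-8 for two Chinese characters, followed by one invalid byte.
data = u"\u4f60\u597d".encode("utf-8") + b"\xff"

# The default (strict) handler raises on the invalid byte.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' drops the invalid byte; 'replace' marks it with U+FFFD.
ignored = data.decode("utf-8", "ignore")
replaced = data.decode("utf-8", "replace")

print(strict_failed, len(ignored), len(replaced))
```

Because 'ignore' hides the problem, 'replace' is often easier to debug: the marker character shows where data was lost.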

Using chardet to detect the encoding

Sometimes we do not know what encoding a string is in. For example, when scraping web pages, some are UTF-8 encoded and some are gb2312 encoded. How can we find out the encoding so that the string can be converted to unicode? Here we introduce chardet, a third-party library. It is used as follows:

# -*- coding: utf-8 -*-
import chardet
str = "xxxxx"
# detect() returns a dict; its 'encoding' entry is the guessed encoding name
str_type = chardet.detect(str)
code = str_type['encoding']

code holds the detected encoding of str. Some people report that the result of this method is not always accurate and that it is slow. In my own tests the speed was moderate, and I have not yet run into an inaccurate result. Use it as you see fit; I offer it here only as one approach. If anyone knows a better way, please let me know.
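When the set of possible encodings is small and known in advance, one alternative that needs no third-party library is to try each candidate in turn. The sketch below is not chardet and does no statistical detection. The function name decode_with_fallback and the candidate list are assumptions chosen for illustration.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gb2312")):
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing fit: keep what can be read and mark the rest with U+FFFD.
    return raw.decode(candidates[0], "replace"), None


page = u"\u4f60\u597d"  # sample text standing in for a fetched page
print(decode_with_fallback(page.encode("utf-8"))[1])
print(decode_with_fallback(page.encode("gb2312"))[1])
```

The order matters: UTF-8 rejects most byte sequences that are not UTF-8, so it is tried first, while a more permissive encoding placed first could accept the wrong data.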

import sys
reload(sys)
sys.setdefaultencoding('utf8')

I ran into some inexplicable encoding errors in the past. I found this solution on the Internet but did not understand how it worked. Today I came across a good blog post, which I recommend: http://blog.csdn.net/crazyhacking/article/details/39375535. The following content is quoted from that post:

The two string types in Python are unicode and str. Encoding is unicode -> str; decoding is the reverse, str -> unicode. The remaining problem is to determine when encoding or decoding is needed.

First, the "encoding declaration" at the top of the file, that is, the # -*- coding: ... -*- line. Python 2 treats script files as ASCII-encoded by default, and the encoding declaration is used to correct this when the file contains characters outside the ASCII range.

Next, the default encoding (the value reported by sys.getdefaultencoding()), which is used when the decoding method is not specified explicitly. For example, I have the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'  # note that s is of type str, not unicode
s.encode('gb18030')

This re-encodes s into gb18030, which is a unicode -> str conversion. Because s is itself of type str, Python first decodes s to unicode automatically and then encodes it as gb18030. Since the decoding is done automatically and no decoding method was specified, Python uses the default encoding to decode. In many cases the default encoding is ASCII, and if s is not ASCII-encoded an error occurs. In the example above, my default encoding is ascii while s is encoded the same way as the file, in utf8, so the error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

There are two ways to correct this error. The first is to state the encoding of s explicitly:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb18030')

The second is to change the default encoding to the file's encoding:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2.5 deletes sys.setdefaultencoding after initialization, so sys must be reloaded
sys.setdefaultencoding('utf-8')
str = '中文'
str.encode('gb18030')

After reading this, I changed my own code to print "<p>addr:", form["addr"].value.decode('gb2312').encode('utf-8'), and it ran successfully.
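The hidden decoding step can be written out explicitly. The sketch below spells out the ASCII decode that Python 2 performs implicitly when encode is called on a str. Writing it by hand, which is an approximation for illustration, also lets the snippet run on Python 3, where the implicit step no longer exists. The variable names are made up.

```python
# The bytes that a two-character Chinese str literal holds in a UTF-8 file.
s = u"\u4e2d\u6587".encode("utf-8")

# With an ASCII default encoding, s.encode('gb18030') on a Python 2 str
# behaves roughly like this explicit two-step conversion, and fails.
try:
    s.decode("ascii").encode("gb18030")
    implicit_error = None
except UnicodeDecodeError as exc:
    implicit_error = exc

# Naming the real encoding of s avoids the implicit ASCII step.
gb_bytes = s.decode("utf-8").encode("gb18030")

print(implicit_error)
```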

However, this method is awkward to use. It is better to keep control of the encoding ourselves and to state the encoding explicitly wherever we encode or decode.

Personal Summary

In practice, it is best to use a single representation throughout the code, such as unicode, so that encoding does not have to be considered there. Convert to a storage encoding (such as UTF-8 or GBK) only when displaying or outputting.
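One way to picture this advice is to decode at the input boundary, work only on unicode, and encode at the output boundary. The function below is a hypothetical example of that pattern. Its name, parameters, and the truncation task are assumptions chosen for illustration.

```python
def truncate(raw, limit, source_encoding, target_encoding="utf-8"):
    """Keep at most `limit` characters, converting only at the edges."""
    text = raw.decode(source_encoding)   # input boundary: bytes -> unicode
    text = text[:limit]                  # processing happens on unicode only
    return text.encode(target_encoding)  # output boundary: unicode -> bytes


# Four Chinese characters arriving as GBK bytes.
gbk_input = u"\u4e2d\u6587\u7f16\u7801".encode("gbk")
result = truncate(gbk_input, 2, "gbk")
print(len(result.decode("utf-8")))
```

Slicing the raw bytes instead would count bytes rather than characters and could cut a character in half, which is the kind of problem that working in unicode avoids.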

The above summarizes some of the problems I ran into recently while writing Python code. If anything here is wrong, please point it out. Thank you.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
