A detailed analysis of the differences between str and unicode in Python and how to use them

Last Update:2017-03-16 Source: Internet

Author: User

Tags file handling


When Python code handles Chinese text, reads files, or receives messages and the result comes out garbled (during string processing, file reading and writing, or printing), most people just try calling encode/decode until it works, without working out why the text was garbled in the first place. This article discusses how to deal with encoding problems properly.

Note: the discussion below applies to Python 2.x; it has not been tested under Python 3 (py3k).

The errors that occur most often during debugging

Error 1

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

Error 2

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

First of all

You need a general understanding of character sets and character encodings:

ASCII | Unicode | UTF-8 | and so on

See the notes on character encoding: ASCII, Unicode and UTF-8

str and unicode

Both str and unicode are subclasses of basestring

So there is one way to check whether a value is a string of either kind:

def is_str(s):
    return isinstance(s, basestring)
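Since basestring exists only in Python 2, the check above fails with a NameError elsewhere. The following is a minimal sketch of a version that runs on either interpreter; the name string_types is an assumption made for this illustration and is not part of the original article.

```python
# basestring exists only on Python 2; fall back to (str, bytes) elsewhere.
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str, bytes)   # Python 3: text strings and byte strings


def is_str(s):
    """Return True for any byte string or text string."""
    return isinstance(s, string_types)
```

On Python 2 this behaves exactly like the original one-liner.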

Converting between str and unicode

str -> decode('the_coding_of_str') -> unicode
unicode -> encode('the_coding_you_want') -> str
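The two conversion directions can be sketched as a round trip. This example uses \u escapes and b'' literals so that it runs unchanged on Python 2.7 and Python 3 (where the byte string type is called bytes and the text type is called str); the variable names are illustrative only.

```python
text = u'\u4e2d\u6587'             # unicode text, two characters

utf8_bytes = text.encode('utf-8')  # unicode -> byte string (UTF-8)
gbk_bytes = text.encode('gbk')     # same text, different bytes (GBK)

# Decoding with the codec that produced the bytes restores the text.
from_utf8 = utf8_bytes.decode('utf-8')
from_gbk = gbk_bytes.decode('gbk')
```

Both from_utf8 and from_gbk compare equal to text, even though the two byte strings differ.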

The difference

str is a byte string: the bytes produced by encoding unicode text.

How to declare it

>>> s = '中文'
>>> s = u'中文'.encode('utf-8')
>>> type('中文')
<type 'str'>

Length (returns the number of bytes)

>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(u'中文'.encode('utf-8'))
6

unicode is the real string, made up of characters

How to declare it

>>> s = u'中文'
>>> s = '中文'.decode('utf-8')
>>> s = unicode('中文', 'utf-8')
>>> type(u'中文')
<type 'unicode'>

Length (returns the number of characters), which is what the logic usually wants

>>> u'中文'
u'\u4e2d\u6587'
>>> len(u'中文')
2
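The practical consequence of the two lengths shows up when slicing. The sketch below, which runs on Python 2.7 and Python 3, counts the same text both ways and then shows that slicing the byte string can cut a character in half, while slicing the unicode text cannot.

```python
text = u'\u4e2d\u6587'
data = text.encode('utf-8')

char_count = len(text)   # counts characters
byte_count = len(data)   # counts bytes; each character takes 3 bytes in UTF-8

first_char = text[:1]    # slicing unicode keeps whole characters

# Slicing bytes at an arbitrary offset can split a character.
try:
    data[:4].decode('utf-8')
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
```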

Conclusion

Work out whether you are dealing with str or unicode, and use the matching method (str.decode / unicode.encode).

Here is how to check whether a value is unicode or str:

>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False

>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False

A simple rule: do not call encode on a str, and do not call decode on a unicode. (A str can in fact be encoded, as described at the end of this article, but to keep things simple it is not recommended.)

>>> '中文'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

>>> u'中文'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
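The first traceback reports a decode error even though the call was encode. That is because Python 2 first decodes the str with the default codec (ASCII) before encoding it. The sketch below spells out that hidden step; implicit_encode is a hypothetical helper written for this illustration, not a Python API, and it runs on both Python 2.7 and Python 3.

```python
def implicit_encode(byte_string, target, default_codec='ascii'):
    """Mimic what Python 2 does for str.encode(target):
    decode with the default codec first, then encode."""
    return byte_string.decode(default_codec).encode(target)


utf8_bytes = u'\u4e2d\u6587'.encode('utf-8')

try:
    implicit_encode(utf8_bytes, 'utf-8')
    failed = False
except UnicodeDecodeError:
    # The ASCII decode fails on the first non-ASCII byte,
    # before any encoding happens.
    failed = True

# Pure ASCII bytes survive the hidden decode, which is why the bug
# only shows up once non-ASCII data arrives.
ascii_result = implicit_encode(b'abc', 'utf-8')
```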

To convert between two encodings, use unicode as the intermediate form

# s is a str encoded in code_A
s.decode('code_A').encode('code_B')
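A concrete instance of this pattern, as a minimal sketch: the helper name transcode and its parameter names are assumptions for illustration. It runs on Python 2.7 and Python 3.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert a byte string from one encoding to another via unicode."""
    text = data.decode(source_encoding)   # bytes -> unicode
    return text.encode(target_encoding)   # unicode -> bytes


gbk_bytes = u'\u4e2d\u6587'.encode('gbk')
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')
```

Keeping the intermediate unicode step explicit means a wrong source_encoding fails at the decode, where the mistake is easiest to see.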

File handling, IDE and console

Think of the processing flow as a pool with an entrance and an exit.

At the entrance, convert everything to unicode; inside the pool, process everything as unicode; at the exit, convert to the target encoding. (There are exceptions, of course, where the processing logic has to work on a specific encoding.)

1. Read the file or other external input, which arrives in its own encoding
2. Decode it to unicode
3. Process it (internally, everything is unicode)
4. Encode it to the required target encoding
5. Write it to the target output (file or console)
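The flow above can be sketched with io.open, which is available in Python 2.6+ and Python 3 and does the decoding and encoding at the file boundary. The file names, the temporary directory, and the trivial processing step are all assumptions made so the example is self-contained.

```python
import io
import os
import tempfile

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'input_gbk.txt')
dst_path = os.path.join(workdir, 'output_utf8.txt')

# Set up a GBK-encoded input file for the demonstration.
with io.open(src_path, 'wb') as f:
    f.write(u'\u4e2d\u6587\n'.encode('gbk'))

# Entrance: io.open decodes while reading, so we get unicode text.
with io.open(src_path, 'r', encoding='gbk') as f:
    text = f.read()

# Pool: all processing happens on unicode.
processed = text.strip() + u'!'

# Exit: io.open encodes while writing.
with io.open(dst_path, 'w', encoding='utf-8') as f:
    f.write(processed)

# Read the raw bytes back to confirm what landed on disk.
with io.open(dst_path, 'rb') as f:
    written = f.read()
```

Passing the encoding to io.open keeps the decode and encode steps at the edges, so nothing in between needs to know which encodings are involved.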

Garbled output in an IDE or console happens because the encoding of the output does not match the encoding the IDE or console expects.

Convert the output to the matching encoding and it displays normally. On a console that expects UTF-8, for example:

>>> print u'中文'.encode('gbk')
????
>>> print u'中文'.encode('utf-8')
中文
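One way to avoid guessing the console encoding is to ask the output stream for it. The sketch below is one possible approach, not the article's method: encode_for_output is a hypothetical helper, the fallback to UTF-8 is an assumption, and the 'replace' error handler substitutes '?' for characters the target encoding cannot represent instead of raising.

```python
import sys


def encode_for_output(text, encoding=None):
    """Encode unicode text for an output stream.

    Uses the given encoding, else the one reported by sys.stdout,
    else UTF-8 (sys.stdout.encoding can be None when output is piped).
    """
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, 'replace')


as_utf8 = encode_for_output(u'\u4e2d\u6587', 'utf-8')
as_ascii = encode_for_output(u'\u4e2d\u6587', 'ascii')  # unencodable -> '?'
```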

Suggestions

Encoding conventions

Use one encoding throughout, so that no single link in the chain produces garbled text.

This covers the environment encoding, the IDE or text editor, file encodings, and database and table encodings.

Make sure the source file encoding is declared

This is important.

The default encoding of a .py file is ASCII. If the source file contains non-ASCII characters, the encoding must be declared at the top of the file.

The declaration must be on the first or second line of the file. Without it, any non-ASCII character produces an error like this:

  File "xxx.py", line 3
SyntaxError: Non-ASCII character '\xd6' in file c.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How to declare it

# -*- coding: utf-8 -*-

or

#coding=utf-8
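To check which encoding a source file declares, the first two lines can be matched against the pattern given in PEP 263. The sketch below is a simplified reading of that rule; declared_encoding is a hypothetical helper, and it ignores details such as byte order marks.

```python
import re

# Pattern from PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(source_lines):
    """Return the encoding declared in the first two lines, or None."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


emacs_style = declared_encoding(['# -*- coding: utf-8 -*-', 'import sys'])
short_style = declared_encoding(['#!/usr/bin/env python', '#coding=gb2312'])
too_late = declared_encoding(['import sys', 'x = 1', '# coding: gbk'])
```

The last call returns None because a declaration on the third line is not honoured.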

If the header declares coding=utf-8, then a = '中文' is encoded as UTF-8.

If the header declares coding=gb2312, then a = '中文' is encoded as GB2312 (which GBK extends).

So every source file in a project should declare the same encoding in its header, and the declared encoding must match the encoding the file is actually saved in (this depends on the editor).

String literals hard-coded in the source for use in processing should all be unicode.

This decouples them from the encoding of the source file itself, so they can be used the same way at every point in the flow.

if s == u'中文':  # rather than s == '中文'
    pass
# Note: make sure s has been converted to unicode by the time it gets here

After these steps, you only need to think about two things: unicode, and the one encoding you chose (usually UTF-8).
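Making sure a value is unicode before comparing it with a unicode literal can be done with a small normalising helper. This is a sketch under stated assumptions: to_unicode is a hypothetical name, and the UTF-8 default reflects the article's suggested project-wide encoding. On Python 2.7 the name bytes is an alias for str, so the same code runs on both interpreters.

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as unicode text, decoding byte strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


from_bytes = to_unicode(b'\xe4\xb8\xad\xe6\x96\x87')
from_text = to_unicode(u'\u4e2d\u6587')
from_gbk = to_unicode(b'\xd6\xd0\xce\xc4', 'gbk')
```

All three results compare equal to the same unicode literal, regardless of how the input arrived.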

Processing order

1. Decode early
2. Unicode everywhere
3. Encode late

Related modules and some methods

Getting and setting the system default encoding

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'

str.encode('other_coding')

In Python, a str in one encoding can be encoded directly into a str in another encoding.

# str_A is UTF-8
str_A.encode('gbk')

# what actually runs is
str_A.decode('sys_codec').encode('gbk')
# where sys_codec is the value of sys.getdefaultencoding() from the previous section

Getting and setting the system default encoding is related to str.encode here, but I rarely use either, mainly because the combination is complicated and hard to control. Explicitly decoding the input and explicitly encoding the output is easier.

chardet

Detects the encoding of a file; it is a separate download.

>>> import chardet
>>> f = open('test.txt', 'r')
>>> result = chardet.detect(f.read())
>>> result
{'confidence': 0.99, 'encoding': 'utf-8'}
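When chardet is not installed and only a few encodings are plausible, a much cruder fallback is to try each candidate in turn. The sketch below is not chardet's algorithm and gives no confidence score; guess_encoding and its default candidate list are assumptions for illustration. Order matters, because permissive codecs can decode bytes they were never meant for, so the strictest candidates go first.

```python
def guess_encoding(data, candidates=('ascii', 'utf-8', 'gbk')):
    """Return the first candidate encoding that decodes data, or None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


plain = guess_encoding(b'abc')
utf8 = guess_encoding(u'\u4e2d\u6587'.encode('utf-8'))
gbk = guess_encoding(u'\u4e2d\u6587'.encode('gbk'))
```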

Converting a \u escape string to the corresponding unicode string

>>> u'中'
u'\u4e2d'

>>> s = '\u4e2d'
>>> print s.decode('unicode_escape')
中

>>> a = '\\u4fee\\u6539\\u8282\\u70b9\\u72b6\\u6001\\u6210\\u529f'
>>> a.decode('unicode_escape')
u'\u4fee\u6539\u8282\u70b9\u72b6\u6001\u6210\u529f'
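The same conversion works in both directions. The sketch below runs on Python 2.7 and Python 3 by starting from a byte string that contains literal backslash-u sequences, such as text copied from a log file or a JSON dump; the variable names are illustrative.

```python
# Twelve ASCII bytes: backslash, 'u', four hex digits, twice over.
escaped = b'\\u4fee\\u6539'

# unicode_escape interprets the escape sequences as characters.
text = escaped.decode('unicode_escape')

# Encoding with the same codec turns the characters back into escapes.
round_trip = text.encode('unicode_escape')
```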

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
