A detailed analysis of the differences between str and unicode in Python, and how to use each

When Python code processes Chinese text (string handling, reading and writing files, printing) and the output comes out garbled, most people debug by sprinkling encode/decode calls until it works, without ever asking why the text was garbled in the first place. This article discusses how to deal with the encoding problem properly.

Note: the discussion below applies to Python 2.x; it has not been tested on Python 3.

The errors seen most often while debugging

Error 1

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

Error 2

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

First of all

You need a general understanding of character sets and character encodings:

ASCII | Unicode | UTF-8 | etc.

See the note "Character encoding notes: ASCII, Unicode and UTF-8".

str and unicode

Both str and unicode are subclasses of basestring.

So there is a simple way to check whether a value is a string:

def is_str(s):
    return isinstance(s, basestring)
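On Python 3 there is no basestring, so the check above fails there. A cross-version variant might look like this (a sketch; the function name is my own, not from the original):

```python
def is_stringlike(s):
    """Return True for any string-like value on Python 2 or 3."""
    try:
        string_types = (basestring,)   # Python 2: covers both str and unicode
    except NameError:
        string_types = (str, bytes)    # Python 3: no basestring, so test both
    return isinstance(s, string_types)
```
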

Converting between str and unicode

str  --decode('the_coding_of_str')-->  unicode  --encode('the_coding_you_want')-->  str
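The round trip can be exercised directly. These exact calls also behave the same on Python 3, where str plays the role of unicode and bytes the role of the Python 2 str:

```python
# unicode -> byte string: encode with the encoding you want
raw = u'中文'.encode('utf-8')
# the UTF-8 encoding of the two characters is six bytes
assert raw == b'\xe4\xb8\xad\xe6\x96\x87'

# byte string -> unicode: decode with the encoding the bytes are in
text = raw.decode('utf-8')
assert text == u'中文'
```
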


str is a byte string, made up of the bytes produced by encoding (encode) a unicode string.

How to declare one

>>> s = '中文'
>>> s = u'中文'.encode('utf-8')
>>> type('中文')
<type 'str'>

Its length is the number of bytes:

>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(u'中文'.encode('utf-8'))
6

unicode is the real string type, made up of characters.

How to declare one

>>> s = u'中文'
>>> s = '中文'.decode('utf-8')
>>> s = unicode('中文', 'utf-8')
>>> type(u'中文')
<type 'unicode'>

Its length is the number of characters, which is what you usually want in program logic:

>>> u'中文'
u'\u4e2d\u6587'
>>> len(u'中文')
2
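Putting the two length sections side by side makes the difference concrete. The GBK line is an extra illustration of my own, not from the original examples; it shows that the byte count also depends on which encoding you pick:

```python
text = u'中文'

# a unicode string counts characters
assert len(text) == 2

# a byte string counts bytes, which depends on the encoding:
assert len(text.encode('utf-8')) == 6   # 3 bytes per CJK character in UTF-8
assert len(text.encode('gbk')) == 4     # 2 bytes per character in GBK
```
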


Figure out whether you are dealing with a str or a unicode, and use the right method (str.decode / unicode.encode).

Here is how to check whether a value is unicode or str:

>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False
>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False

A simple rule: never call encode on a str, and never call decode on a unicode. (Strictly speaking a str can be encoded directly, see the end of this article, but to keep things simple it is not recommended.)

>>> '中文'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u'中文'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

To convert between two different encodings, use unicode as the intermediate form:

# s is a str encoded in code_A
s.decode('code_A').encode('code_B')
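A concrete instance of this rule, converting GBK bytes to UTF-8 bytes through unicode (the same calls run unchanged on Python 3 with bytes in place of str):

```python
# start with a byte string in GBK
gbk_bytes = u'中文'.encode('gbk')

# decode to unicode with the source encoding, then
# encode to the target encoding
utf8_bytes = gbk_bytes.decode('gbk').encode('utf-8')

assert utf8_bytes == u'中文'.encode('utf-8')
```
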

File handling, the IDE, and the console

Think of your program as a pool of water with an inlet and an outlet.

At the inlet, decode everything to unicode; inside the pool, work only with unicode; at the outlet, encode to the target encoding. (There are of course exceptions where the processing logic needs a specific encoding.)

Read a file or other external input in its own encoding, decode it to unicode (the single internal representation), process it, then encode the result to the desired target encoding and write it to the target output (a file or the console).
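The inlet/pool/outlet flow can be sketched as a small transcoding function. This is a sketch under assumptions: transcode_file is my own name, and io.open (available on Python 2.6+ and Python 3) stands in for plain open so the decode/encode happens at the boundary:

```python
import io

def transcode_file(src_path, src_enc, dst_path, dst_enc):
    # inlet: decode the external input to unicode as it is read
    with io.open(src_path, 'r', encoding=src_enc) as f:
        text = f.read()
    # ... the "pool": all processing happens on the unicode value ...
    # outlet: encode to the target encoding as it is written
    with io.open(dst_path, 'w', encoding=dst_enc) as f:
        f.write(text)
```
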

Errors in the IDE or console occur because the encoding of the output does not match the encoding the IDE or console expects.

Convert to a matching encoding on output and everything displays correctly:

>>> print u'中文'.encode('gbk')
????
>>> print u'中文'.encode('utf-8')
中文
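Why the console shows question marks or garbage: the bytes reach a console that interprets them with a different encoding. The mismatch can be simulated without a console by decoding UTF-8 bytes as if they were GBK (classic mojibake); the 'replace' error handler here is my own addition to keep the demonstration from raising:

```python
utf8_bytes = u'中文'.encode('utf-8')

# a GBK console pairs up the six UTF-8 bytes into the wrong characters
garbled = utf8_bytes.decode('gbk', 'replace')

# the result is readable characters, but not the original string
assert garbled != u'中文'
```
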


Encoding in your code

Unify encodings everywhere, so that no single step introduces garbled text:

environment encoding, IDE/text editor encoding, file encoding, database table encoding.

Declare the encoding of your source files

This is important.

The default encoding of a .py file is ASCII. If a source file uses non-ASCII characters, it must declare its encoding in the file header.

Without the declaration, non-ASCII input triggers the error below. The declaration must be on the first or second line of the file:

File "xxx.py", line 3
SyntaxError: Non-ASCII character '\xd6' in file c.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How to declare it

# -*- coding: utf-8 -*-    or    # coding=utf-8

If the header declares coding=utf-8, then a = '中文' is a str encoded as UTF-8.

If the header declares coding=gb2312, then a = '中文' is a str encoded as GBK.

So all source files in one project should share a single encoding, and the declared encoding must match the actual encoding of the source file (which depends on your editor).

Make every hard-coded string that takes part in processing a unicode literal.

This decouples the string's type from the encoding of the source file itself, so every place in the program can handle it uniformly.

if s == u'中文':  # not s == '中文'
    pass
# note: make sure s has been converted to unicode before this point
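To guarantee that s really is unicode at the comparison, a small helper can normalize byte strings first. This is a hedged sketch, not from the original; to_text is my own name, and it leans on bytes being an alias of str on Python 2:

```python
def to_text(s, encoding='utf-8'):
    """Return s as a unicode string, decoding byte strings if necessary."""
    if isinstance(s, bytes):      # Python 2 str / Python 3 bytes
        return s.decode(encoding)
    return s                      # already unicode (Python 3 str)
```
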

After the steps above, you only need to care about two things: unicode, and the encoding you chose for I/O (typically UTF-8).

Processing order

1. Decode early
2. Unicode everywhere
3. Encode late

Related modules and some methods

Getting and setting the system default encoding

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'

str.encode('other_coding')

In Python 2, a str in one encoding can be encoded directly into another encoding str:

# str_a is a utf-8 str; this call:
str_a.encode('gbk')
# actually performs:
str_a.decode(sys_codec).encode('gbk')

where sys_codec is the sys.getdefaultencoding() from the previous section.

This is why "getting and setting the system default encoding" is related to str.encode. I rarely use this myself: the implicit step makes things complex and hard to control. Explicitly decoding the input and explicitly encoding the output is much easier to reason about.
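The hidden step is easy to reproduce explicitly: decoding non-ASCII bytes with the default 'ascii' codec fails, which is exactly the UnicodeDecodeError shown at the beginning of this article:

```python
data = u'中文'.encode('utf-8')

# the implicit first half of str_a.encode('gbk'), made explicit:
try:
    data.decode('ascii')
    raised = False
except UnicodeDecodeError:
    raised = True

assert raised   # 0xe4 is not in range(128), so the ascii codec gives up
```
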


Detecting a file's encoding (requires the third-party chardet package)

>>> import chardet
>>> f = open('test.txt', 'r')
>>> result = chardet.detect(f.read())
>>> result
{'confidence': 0.99, 'encoding': 'utf-8'}
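If chardet is not available, a much weaker fallback is trial decoding. This sketch (guess_encoding is my own name, not a chardet API) only distinguishes valid UTF-8 from everything else; a real detector does statistical analysis across many encodings:

```python
def guess_encoding(data):
    """Naive detector: report 'utf-8' if the bytes decode cleanly, else None."""
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return None   # unknown; a real detector such as chardet is needed here
```
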

Converting a '\u'-escaped string to the corresponding unicode string

>>> u'中'
u'\u4e2d'
>>> s = '\u4e2d'
>>> print s.decode('unicode_escape')
中
>>> a = '\\u4fee\\u6539\\u8282\\u70b9\\u72b6\\u6001\\u6210\\u529f'
>>> a.decode('unicode_escape')
u'\u4fee\u6539\u8282\u70b9\u72b6\u6001\u6210\u529f'
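On Python 3 the same trick needs a byte string, because unicode_escape is a bytes-to-text codec there (on Python 2 the plain str already is a byte string, so this form works on both versions):

```python
# six literal characters \u4e2d, stored as bytes
escaped = b'\\u4e2d'

# unicode_escape interprets the \uXXXX notation and yields one character
assert escaped.decode('unicode_escape') == u'\u4e2d'
assert escaped.decode('unicode_escape') == u'中'
```
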