Python encoding: the difference between str and unicode

Source: Internet
Author: User

A good article on str and unicode.

It sorts out the encoding-related parts of Python.

Note: the following discussion applies to Python 2.x; Python 3 has not been tried yet.

Begin

When handling Chinese text in Python (reading files or messages, HTTP parameters, and so on), you often run the program and find garbled output in string processing, file reads and writes, or print.

Most people then sprinkle encode/decode calls around while debugging, without thinking through why the text is garbled.

These are the most common errors that show up during that kind of debugging:

Error 1

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

Error 2

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

First of all

You need a general understanding of character sets and character encodings:

ASCII | Unicode | UTF-8 | and so on

Further reading:

Character Encoding Notes: ASCII, Unicode and UTF-8

Taobao Search Technology Blog: Chinese encoding

str and unicode

Both str and unicode are subclasses of basestring.

So there is a simple way to check whether a value is a string:

    def is_str(s):
        return isinstance(s, basestring)

Converting between str and unicode

See the documentation for decode and encode.

    str     -> decode('the_coding_of_str')   -> unicode
    unicode -> encode('the_coding_you_want') -> str
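As a minimal sketch of that round trip, the snippet below decodes a UTF-8 byte string into text and encodes it back. The variable names are illustrative only, and the explicit b'' and u'' literals are an assumption made so that the same lines run on Python 2.7 and on Python 3 (where the byte type is called bytes and the text type is called str).

```python
# UTF-8 bytes for the two characters U+4E2D U+6587.
utf8_bytes = b'\xe4\xb8\xad\xe6\x96\x87'

# decode: bytes -> text. Name the encoding the bytes are actually in.
text = utf8_bytes.decode('utf-8')

# encode: text -> bytes. Name the encoding you want to produce.
round_trip = text.encode('utf-8')

print(repr(text))
print(repr(round_trip))
```

The direction is the thing to remember: decode always starts from bytes, and encode always starts from text.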

Differences

str is a byte string: the bytes produced by encoding unicode (encode).

How to create one:

    s = '中文'
    s = u'中文'.encode('utf-8')

    >>> type('中文')
    <type 'str'>

Length (returns the number of bytes):

    >>> u'中文'.encode('utf-8')
    '\xe4\xb8\xad\xe6\x96\x87'
    >>> len(u'中文'.encode('utf-8'))
    6

unicode is the true string, made up of characters.

How to create one:

    s = u'中文'
    s = '中文'.decode('utf-8')
    s = unicode('中文', 'utf-8')

    >>> type(u'中文')
    <type 'unicode'>

Length (returns the number of characters), which is what you logically expect:

    >>> u'中文'
    u'\u4e2d\u6587'
    >>> len(u'中文')
    2
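The byte/character distinction matters for more than len. The sketch below, with illustrative variable names and literals chosen to run on both Python 2.7 and Python 3, compares the two lengths and shows that slicing a byte string can cut a character in half, while slicing the text works per character for this example.

```python
text = u'\u4e2d\u6587'             # the two characters from the example
utf8_bytes = text.encode('utf-8')  # UTF-8 uses three bytes for each of them

char_count = len(text)        # counts characters
byte_count = len(utf8_bytes)  # counts bytes

# Slicing the byte string at byte 4 cuts the second character in half.
try:
    utf8_bytes[:4].decode('utf-8')
    cut_in_half = False
except UnicodeDecodeError:
    cut_in_half = True

# Slicing the text works per character here.
first_char = text[:1]

print(char_count, byte_count, cut_in_half)
```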

Conclusion

Work out whether you are dealing with str or unicode, and use the right method (str.decode / unicode.encode).

Here is how to check whether a value is unicode or str:

    >>> isinstance(u'中文', unicode)
    True
    >>> isinstance('中文', unicode)
    False
    >>> isinstance('中文', str)
    True
    >>> isinstance(u'中文', str)
    False

Simple rule: do not call encode on a str, and do not call decode on a unicode. (A str can in fact be encoded; see the str.encode note near the end. It is not recommended, to keep things simple.)

    >>> '中文'.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

    >>> u'中文'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Converting between encodings: use unicode as the intermediate form

    # s is a str encoded in code_A
    s.decode('code_A').encode('code_B')
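The same one-liner can be wrapped in a small helper. The name transcode and its parameters are hypothetical, chosen for this sketch; the literals are written so it runs on Python 2.7 and Python 3 alike.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by going through Unicode text."""
    return data.decode(source_encoding).encode(target_encoding)

# The characters U+4E2D U+6587 as GBK bytes.
gbk_bytes = b'\xd6\xd0\xce\xc4'
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')

print(repr(utf8_bytes))
```

There is no direct GBK-to-UTF-8 step: the bytes become text first, and only then become bytes in the other encoding.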

File handling, IDEs and the console

Think of the processing flow as a pool with one entrance and one exit.

At the entrance, convert everything to unicode; inside the pool, work only with unicode; at the exit, convert to the target encoding. (There are exceptions, of course, when the processing logic needs a specific encoding.)

    read the file
    external input encoding: decode to unicode
    process (internal encoding, unicode throughout)
    encode to the target encoding
    write to the target output (file or console)
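One way to sketch the entrance/pool/exit flow for files is with io.open, which exists in both Python 2.7 and Python 3 and does the decoding and encoding at the file boundary. The function name convert_file, the file names, and the strip() stand-in for real processing are all assumptions made for this example.

```python
import io
import os
import tempfile

def convert_file(src_path, src_encoding, dst_path, dst_encoding):
    # Entrance: io.open decodes the bytes on disk into Unicode text.
    with io.open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()
    # Pool: work on Unicode text only (strip() stands in for real processing).
    text = text.strip()
    # Exit: io.open encodes the text into the target encoding on write.
    with io.open(dst_path, 'w', encoding=dst_encoding) as dst:
        dst.write(text)

# Build a small GBK-encoded input file in a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'in_gbk.txt')
dst_path = os.path.join(workdir, 'out_utf8.txt')
with open(src_path, 'wb') as f:
    f.write(b'\xd6\xd0\xce\xc4\n')

convert_file(src_path, 'gbk', dst_path, 'utf-8')

with open(dst_path, 'rb') as f:
    output_bytes = f.read()
print(repr(output_bytes))
```

Only the two io.open calls mention an encoding; the code in between never needs to know where the text came from or where it is going.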

Errors in an IDE or console occur because the encoding used at print time does not match the encoding of the IDE or console itself.

Convert to the matching encoding and the output prints normally:

    >>> print u'中文'.encode('gbk')
    ????
    >>> print u'中文'.encode('utf-8')
    中文

Suggestions

Standardize the encoding

Use one encoding throughout, so that no single link in the chain produces garbled text: environment encoding, IDE / text editor, file encoding, database and table encoding.

Declare the source file encoding

This is important.

The default encoding of a .py file is ASCII. If the source file contains non-ASCII characters, an encoding declaration is needed at the top of the file.

Without a declaration, non-ASCII input causes an error. The declaration must be on the first or second line of the file.

    File "xxx.py", line 3
    SyntaxError: Non-ASCII character '\xd6' in file c.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How to declare it:

    # -*- coding: utf-8 -*-

or

    #coding=utf-8

If the header declares coding=utf-8, then a = '中文' is encoded in UTF-8.

If the header declares coding=gb2312, then a = '中文' is encoded in GB2312 (the same bytes as GBK for these characters).

So all source files in the same project should carry the same encoding declaration in the header, and the declared encoding must match the encoding the file is actually saved in (an editor setting).
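To see why the declaration and the saved encoding must agree, the sketch below produces the bytes an editor would store for the same two-character literal under UTF-8 and under GBK, then tries to read the GBK bytes as UTF-8. The variable names are illustrative, and the escapes are used so the snippet itself stays pure ASCII and runs on Python 2.7 and Python 3.

```python
text = u'\u4e2d\u6587'

# Bytes an editor would store for this literal under each file encoding.
saved_as_utf8 = text.encode('utf-8')
saved_as_gbk = text.encode('gbk')

# Reading GBK bytes under a UTF-8 declaration fails for this input.
try:
    saved_as_gbk.decode('utf-8')
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

print(repr(saved_as_utf8), repr(saved_as_gbk), mismatch_failed)
```

The first GBK byte is 0xd6, the same byte reported in the SyntaxError message above.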

Use unicode for all hard-coded strings in the source code

This isolates their type from the encoding of the source file itself, so they behave the same everywhere in the program and are easy to handle.

    if s == u'中文':  # rather than s == '中文'
        pass
    # Note: by the time s gets here, make sure it has been converted to unicode
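A small normalizing helper is one way to make sure a value is unicode before it is compared. The name to_unicode and the UTF-8 default are assumptions made for this sketch, not part of the article; bytes is used in the isinstance check because it is an alias of str on Python 2 and the byte type on Python 3.

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as Unicode text, decoding byte strings with `encoding`."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

TARGET = u'\u4e2d\u6587'

# Both a byte string and a text string compare equal after normalizing.
from_bytes = to_unicode(b'\xe4\xb8\xad\xe6\x96\x87') == TARGET
from_text = to_unicode(TARGET) == TARGET

print(from_bytes, from_text)
```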

After these steps, you only need to care about two things: unicode and the encoding you have chosen (usually utf-8).

Processing order

    1. Decode early
    2. Unicode everywhere
    3. Encode late

Related modules and methods

Getting and setting the system default encoding

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> reload(sys)
    <module 'sys' (built-in)>
    >>> sys.setdefaultencoding('utf-8')
    >>> sys.getdefaultencoding()
    'utf-8'

str.encode('other_coding')

In Python, you can call encode directly on a str in one encoding to get a str in another encoding.

    # str_A is utf-8
    str_A.encode('gbk')

    # what actually runs is
    str_A.decode('sys_codec').encode('gbk')
    # where sys_codec is the value of sys.getdefaultencoding() from the previous step
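The hidden decode step can be made visible by emulating it. The helper below is a hypothetical stand-in for what Python 2 does inside str.encode, with the default encoding passed as a parameter instead of being read from sys.getdefaultencoding(); it is a sketch of the behavior described above, not the interpreter's actual code, and it runs on Python 2.7 and Python 3.

```python
def implicit_encode(byte_string, target, default_encoding='ascii'):
    # Emulates the two hidden steps: decode with the default, then encode.
    return byte_string.decode(default_encoding).encode(target)

utf8_bytes = b'\xe4\xb8\xad\xe6\x96\x87'

# With the 'ascii' default, the hidden decode fails on byte 0xe4.
try:
    implicit_encode(utf8_bytes, 'gbk')
    ascii_default_failed = False
except UnicodeDecodeError:
    ascii_default_failed = True

# With the default set to the real encoding of the data, the call succeeds.
gbk_bytes = implicit_encode(utf8_bytes, 'gbk', default_encoding='utf-8')

print(ascii_default_failed, repr(gbk_bytes))
```

This is why calling encode on a str raises a UnicodeDecodeError: the failure comes from the decode that was never written down.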

"Getting and setting the system default encoding" is related to str.encode here, but I rarely use it, mainly because it feels complicated and hard to control. Decoding input explicitly and encoding output explicitly is easier (personal view).

chardet

Detects file encodings; it has to be downloaded and installed separately.

    >>> import chardet
    >>> f = open('test.txt', 'r')
    >>> result = chardet.detect(f.read())
    >>> result
    {'confidence': 0.99, 'encoding': 'utf-8'}

Converting a \u escape string to the corresponding unicode string

    >>> u'中'
    u'\u4e2d'
    >>> s = '\u4e2d'
    >>> print s.decode('unicode_escape')
    中
    >>> a = '\\u4fee\\u6539\\u8282\\u70b9\\u72b6\\u6001\\u6210\\u529f'
    >>> a.decode('unicode_escape')
    u'\u4fee\u6539\u8282\u70b9\u72b6\u6001\u6210\u529f'
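The unicode_escape codec also works in the other direction. The sketch below, with illustrative variable names and literals that run on Python 2.7 and Python 3, turns escaped ASCII bytes into text and back again.

```python
# Backslash-u escapes stored as plain ASCII bytes (12 bytes, no real characters yet).
escaped = b'\\u4fee\\u6539'

# decode('unicode_escape') interprets the escapes and returns text.
text = escaped.decode('unicode_escape')

# encode('unicode_escape') goes back to the escaped ASCII form.
escaped_again = text.encode('unicode_escape')

print(len(escaped), len(text))
```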

