Solving str and unicode problems in Python 2.x

Last Update:2017-01-19 Source: Internet

Author: User

Handling Chinese text in Python 2.x is a headache. Articles on the subject found online tend to be unevenly tested and contain small mistakes, so I intend to summarize the topic here.

I will keep studying this and will revise this post as I learn more.

This article assumes the reader has a basic knowledge of character encodings; it does not explain what UTF-8 is, what Unicode is, or how the two are related.
str and bytes

To begin with, let's leave unicode out of the picture entirely.

s = "人生苦短"  # "Life is short"

s is a str, and a str stores the raw bytes themselves. So which encoding are those bytes in?

If this line is typed into the interactive interpreter, the bytes of s are in the terminal's encoding, which is GBK for cmd on Chinese-language Windows.

If the code is saved to a file before it is executed, for example a file saved as UTF-8, then when the interpreter loads the program, s is initialized with UTF-8 encoded bytes.
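
As a rough illustration of why the encoding of those bytes matters, the sketch below encodes the same four characters two ways and compares the results. It uses explicit u"" literals and encode calls so that it should run on both Python 2.7 and Python 3; the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
text = u"人生苦短"  # "Life is short"

utf8_bytes = text.encode("utf-8")  # what s would hold in a file saved as UTF-8
gbk_bytes = text.encode("gbk")     # what s would hold when typed into a GBK terminal

print(len(utf8_bytes))          # 12: three bytes per character
print(len(gbk_bytes))           # 8: two bytes per character
print(utf8_bytes == gbk_bytes)  # False: same text, different bytes
```
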
unicode and str

Unicode is a character encoding standard; concrete encodings of Unicode text include UTF-8 and UTF-16, and the same text can also be stored in legacy encodings such as GBK.

On narrow builds, Python 2 stores each unit of a unicode object in two bytes internally (wide builds use four). The advantage of using unicode objects instead of str is that unicode makes cross-platform code easier.

You can define a unicode object in the following two ways:

s1 = u"人生苦短"
s2 = unicode("人生苦短", "utf-8")

encode and decode

Encoding and decoding in Python work like this: decode turns a str (bytes) into a unicode object, and encode turns a unicode object back into a str (bytes).

So we can write code like this:

# -*- coding: utf-8 -*-
su = "人生苦短"
# su is a UTF-8 byte string
u = su.decode("utf-8")
# su is decoded into a unicode object and assigned to u
sg = u.encode("gbk")
# u is encoded into a GBK byte string and assigned to sg
print sg
# print sg
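
The same round trip can be checked without printing to a terminal. This is a minimal sketch, written so that it should run on both Python 2.7 and Python 3; it builds the UTF-8 bytes explicitly rather than relying on how the file was saved, and round_trip is a name introduced only for this example.

```python
# -*- coding: utf-8 -*-
su = u"人生苦短".encode("utf-8")  # a UTF-8 byte string

u = su.decode("utf-8")  # bytes -> unicode
sg = u.encode("gbk")    # unicode -> GBK bytes

# Decoding the GBK bytes with the matching codec recovers the same text.
round_trip = sg.decode("gbk")
print(round_trip == u)  # True
print(sg == su)         # False: the two byte strings differ
```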

But the reality is more complicated than this. Look at the following code:

s = "人生苦短"
s.encode('gbk')

See! A str can also be encoded. (In fact, unicode objects can also be decoded, but that is rarely meaningful.)

Why is that? When you call encode on a str, Python first decodes the str to unicode using the default encoding, and then encodes that unicode object with the encoding you specified.

This leads to the cause of most errors when handling Chinese in Python 2.x: Python's default encoding, defaultencoding, is ASCII.

Look at this example:

# -*- coding: utf-8 -*-
s = "人生苦短"
s.encode('gbk')

The code above raises an error: UnicodeDecodeError: 'ascii' codec can't decode byte ...

Because defaultencoding has not been set, Python is actually doing something like this:

# -*- coding: utf-8 -*-
s = "人生苦短"
s.decode('ascii').encode('gbk')
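
The failing step can be reproduced in isolation. The sketch below spells out the implicit ASCII decode by hand, so it should behave the same on Python 2.7 and Python 3; the failed flag is only there to make the outcome easy to check.

```python
# -*- coding: utf-8 -*-
s = u"人生苦短".encode("utf-8")  # UTF-8 bytes, like a str literal in a UTF-8 file

try:
    # The explicit form of what s.encode('gbk') does on Python 2
    s.decode("ascii").encode("gbk")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)  # 'ascii' codec can't decode byte ...

print(failed)  # True
```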

Setting defaultencoding

The code for setting defaultencoding is as follows:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

If you encode or decode in Python without specifying an encoding, Python uses defaultencoding.

For example, as in the previous section, when a str is re-encoded into another format, defaultencoding is used for the implicit decode step.

s.encode("utf-8") is equivalent to s.decode(defaultencoding).encode("utf-8")

Likewise, if you create a unicode object from a str without specifying the str's encoding, the program also uses defaultencoding.

u = unicode("人生苦短") is equivalent to u = unicode("人生苦短", defaultencoding)

The default defaultencoding, ASCII, is the cause of many errors, so setting defaultencoding early is a good habit.
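
For comparison, here is a sketch of a different technique: naming the encoding at every conversion instead of relying on defaultencoding. The to_unicode helper is hypothetical and not part of any library, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import sys

# 'ascii' on a stock Python 2, 'utf-8' on Python 3
default = sys.getdefaultencoding()
print(default)


def to_unicode(data, encoding="utf-8"):
    """Hypothetical helper: decode byte strings with an explicit encoding."""
    if isinstance(data, bytes):  # bytes is an alias of str on Python 2
        return data.decode(encoding)
    return data                  # already a unicode object


raw = u"人生苦短".encode("utf-8")
print(to_unicode(raw) == u"人生苦短")          # True
print(to_unicode(u"人生苦短") == u"人生苦短")  # True
```
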
The role of the file header encoding declaration

Thanks to this blog post for what it taught me about the Python file header.

The header line # -*- coding: utf-8 -*- appears to have three roles:

If your code contains Chinese comments, you need this declaration.
More advanced editors, such as my Emacs, use the header declaration to decide which encoding to save the code file in.
The interpreter uses the header declaration to decode literals such as u"人生苦短" when initializing unicode objects (so the header declaration must match the encoding the file is actually saved in).
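
The third role can be sketched without writing files to disk by compiling source bytes that carry their own declaration. This is an illustration under stated assumptions: the source text, the "<demo>" file name, and the namespace dict are all made up for the example, and the code targets Python 3 while aiming to stay valid on Python 2.7.

```python
# -*- coding: utf-8 -*-
# Source code stored as GBK bytes, with a header that declares GBK.
source = u'# -*- coding: gbk -*-\nvalue = u"人生苦短"\n'.encode("gbk")

namespace = {}
# compile() reads the coding declaration and decodes the literal with it.
exec(compile(source, "<demo>", "exec"), namespace)

print(namespace["value"] == u"人生苦短")  # True
```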

About the requests library

requests is a very useful Python HTTP client library that is often used when writing crawlers and when testing server responses.

A request returns a Response object once the server has answered, and the raw bytes of the HTTP response are stored in its content attribute.

However, if you access the other attribute, text, you get a unicode object, and this is where garbled text often appears.

The Response object decodes the bytes into unicode using yet another attribute, encoding, and the value of encoding is guessed by the Response itself.

From the official documentation:

text
Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using chardet.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

So either use content (the raw bytes) directly, or remember to set encoding correctly. For example, if I fetch a GBK-encoded page, I need the following to get the correct unicode:

import requests
url = "http://xxx.xxx.xxx"
response = requests.get(url)
response.encoding = 'gbk'

print response.text
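
What a wrong guess looks like can be mimicked with plain bytes and no network access. In this sketch, content is only a stand-in variable for response.content, not a real Response, and ISO-8859-1 is used as the wrong guess because that is the RFC 2616 default for text responses with no declared charset. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
content = u"人生苦短".encode("gbk")  # stand-in for response.content

wrong = content.decode("iso-8859-1")  # wrong guess: eight garbled characters
right = content.decode("gbk")         # correct encoding: the original text

print(len(wrong))               # 8
print(right == u"人生苦短")     # True
print(wrong == right)           # False

# The raw bytes are still intact, so the garbled text can be repaired.
repaired = wrong.encode("iso-8859-1").decode("gbk")
print(repaired == right)        # True
```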

Know not only the principles, but also how to apply them!

If I had written this post in my early blogging days, I would certainly have included an example like this:

If the file is encoded as GBK, the file header is # -*- coding: utf-8 -*-, and the default encoding is set to xxx, then the result of the following program will be ...

This is like learning C and showing off one's skill with code full of operator precedence, associativity, and pointer tricks.

In fact, none of that is practical. Who writes such code in real work? What I want to cover here is a practical approach to handling Chinese in Python.

Basic settings

Set defaultencoding explicitly. (The default is ASCII.)

Save the code file in the same encoding as the # coding: xxx declaration in the file header.

For Chinese text, use unicode inside the program, not str.

About printing

When you print a str, the bytes are sent directly to the shell. If the encoding of those bytes differs from the shell's encoding, the output is garbled.

When you print a unicode object, the system automatically encodes it into the shell's encoding, so the output is not garbled.

Unify the inside and outside of the program

To ensure that only unicode is used inside the program, whenever you read a byte stream from outside, convert it to unicode right away, and have all the code that follows work with unicode, not str.

with open("test") as f:
    for i in f:
        # decode the UTF-8 byte stream that was read
        u = i.decode('utf-8')
        # ...

If you think of the data flow between the inside and outside of the program as a channel, then rather than opening the channel as a byte stream and decoding after each read, you can open the channel directly as unicode.

# use codecs to open a unicode channel
import codecs
file = codecs.open("test", "r", "utf-8")
for i in file:
    print type(i)
    # the type of i is unicode
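
The two approaches can be compared side by side. The sketch below writes a small temporary file first so that it is self-contained; the temporary path and the list names are assumptions made for the example, and the code should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "test")

# Write one line of UTF-8 text to read back.
with codecs.open(path, "w", "utf-8") as f:
    f.write(u"人生苦短\n")

# Approach 1: read bytes, then decode each line by hand.
with open(path, "rb") as f:
    decoded = [line.decode("utf-8") for line in f]

# Approach 2: open a unicode channel and let codecs decode.
with codecs.open(path, "r", "utf-8") as f:
    direct = [line for line in f]

shutil.rmtree(tmp_dir)

print(decoded == direct)  # True: both give the same unicode lines
```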

So the key to handling Chinese encoding problems in Python is to understand clearly what you are doing: what encoding you are reading, what encoding the file declares, what encoding it is actually saved in, how a str is converted to unicode, and how a str in one encoding is converted to another. Do not muddle these questions together, and take the initiative to keep everything consistent.
