Methods of processing str and Unicode in Python

Last Update:2017-11-09 Source: Internet

Author: User

2015/03/25 · Basic Knowledge · 3 Reviews · Python

Source: liuaiqi627's Blog

Handling Chinese text in Python 2.x is a headache. Articles about it on the web are of uneven quality and some contain mistakes, so I want to write a summary here.

I will keep learning and will keep revising this post.

This article assumes the reader already has basic knowledge of character encodings and does not introduce it again, including what UTF-8 is, what Unicode is, and how the two are related.

str and byte strings

First, let's leave Unicode aside entirely.

Python

s = "人生苦短"  # Chinese for "life is short"

s is a string, and what it actually stores is a sequence of bytes. So what encoding are those bytes in?

If the code is typed into the interactive interpreter, s is in the encoding of the interpreter's console, which is GBK for cmd on Chinese-language Windows.

If the code is saved to a file and then executed, for example saved as UTF-8, then s holds UTF-8 encoded bytes when the interpreter loads the program.
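To make the point concrete, here is a minimal sketch (not from the original post) showing that the same four characters become different bytes depending on the encoding. It starts from a unicode literal only so that it runs the same way on Python 2 and Python 3; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
text = u"人生苦短"  # Chinese for "life is short"

utf8_bytes = text.encode("utf-8")  # the bytes a UTF-8 source file would contain
gbk_bytes = text.encode("gbk")     # the bytes a GBK console would pass in

print(len(utf8_bytes))          # 12: three bytes per character
print(len(gbk_bytes))           # 8: two bytes per character
print(utf8_bytes == gbk_bytes)  # False
```

So a str that looks identical on screen can hold different bytes, depending on where the code came from.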

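Unicode and str

We know that Unicode is a character standard, and that text is stored as bytes through a concrete encoding such as UTF-8 or UTF-16. GBK is a separate Chinese encoding that Python can convert to and from Unicode.

Python 2 stores each code unit of a unicode object internally in two or four bytes, depending on how the interpreter was built. The advantage of using unicode objects instead of str is that unicode makes cross-platform work easier.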

You can define a Unicode in the following two ways:

Python

s1 = u"人生苦短"
s2 = unicode("人生苦短", "utf-8")

Encode and Decode

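Encoding and decoding in Python work like this: decode turns a str (bytes) into a unicode object, and encode turns a unicode object back into a str (bytes).

So we can write code like this: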

Python

# -*- coding: utf-8 -*-
su = "人生苦短"          # su is a byte string in UTF-8 format
u = su.decode("utf-8")  # su is decoded into a unicode object, assigned to u
sg = u.encode("gbk")    # u is encoded into a byte string in GBK format, assigned to sg
print sg                # print sg
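The decode-then-encode pattern above can be wrapped in a small helper. This is only a sketch: transcode is a hypothetical name, not a standard library function, and the code is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by going through a unicode object."""
    text = data.decode(source_encoding)  # bytes -> unicode
    return text.encode(target_encoding)  # unicode -> bytes


utf8_bytes = u"人生苦短".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
print(gbk_bytes.decode("gbk") == u"人生苦短")  # True
```

Both encodings are named explicitly, so nothing depends on the interpreter's default.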

But reality is more complicated than this, as the following code shows:

Python

s = "人生苦短"
s.encode('gbk')

Look! A str can be encoded too. (In fact, unicode objects can also be decoded, but that is rarely meaningful.)

Why is this possible? Encoding always starts from unicode, so when you call encode on a str, Python first decodes it to unicode with the default encoding and then encodes that unicode object with the encoding you asked for.

This leads to the cause of most errors when handling Chinese in Python 2.x: Python's default encoding, defaultencoding, is ASCII.

Look at this example:

Python

# -*- coding: utf-8 -*-
s = "人生苦短"
s.encode('gbk')

The code above raises an error: UnicodeDecodeError: 'ascii' codec can't decode byte ...

Because you did not set defaultencoding, what actually happens is this:

Python

# -*- coding: utf-8 -*-
s = "人生苦短"
s.decode('ascii').encode('gbk')
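The failing step can be reproduced on its own by doing the ASCII decode explicitly. The sketch below is an illustration rather than the author's code; it builds the UTF-8 bytes from a unicode literal so that it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
utf8_bytes = u"人生苦短".encode("utf-8")

try:
    # This is the step Python 2 performs implicitly before encoding a str.
    utf8_bytes.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)
    print(message)

print(failed)  # True
```

The error comes from the hidden decode, not from the GBK encode that the code asked for.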

Set defaultencoding

The code for setting defaultencoding is as follows:

Python

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

If you encode or decode in Python without specifying an encoding, Python uses defaultencoding.

For example, in the previous section, where a str is encoded into another format, the hidden decode step uses defaultencoding.

Python

s.encode("utf-8") is equivalent to s.decode(defaultencoding).encode("utf-8")

If you create a unicode object from a str without specifying the encoding of that str, the program also uses defaultencoding.

Python

u = unicode("人生苦短") is equivalent to u = unicode("人生苦短", defaultencoding)

The default value of defaultencoding, ASCII, is the cause of many errors, so setting defaultencoding early is a good habit.

What the encoding declaration in the file header does

Thanks to this blog post for its explanation of the Python file header.

The top-of-file line # -*- coding: utf-8 -*- currently appears to have three effects.

The declaration is required if the code contains Chinese comments.
A more advanced editor, such as my Emacs, uses the header declaration as the encoding in which it saves the code file.
The program uses the header declaration to decode unicode literals such as u"人生苦短" into unicode objects (so the header declaration and the encoding the file is saved in must match).

About the Requests library

Requests is a useful Python HTTP client library that is often used when writing crawlers and testing server response data.

After accessing the server, a request returns a Response object, which stores the raw bytes of the HTTP response in its content attribute.

However, if you access another attribute, text, it returns a unicode object, and this is where garbled text often appears.

That is because the Response object decodes the bytes into unicode using another attribute, encoding, and the value of encoding is in fact guessed by Requests itself.

Official documents:

text
Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using chardet.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

So either use content (the raw bytes) directly, or remember to set encoding correctly. For example, if I fetch a GBK-encoded page, I need the following to get the correct unicode.

Python

import requests

url = "http://xxx.xxx.xxx"
response = requests.get(url)
response.encoding = 'gbk'
print response.text
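What goes wrong when the guess is off can be shown without any network access. In this sketch the variable raw stands in for response.content, and ISO-8859-1 plays the part of a wrong guess (as far as I know, it is what Requests falls back to for text responses that declare no charset). It runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
raw = u"中文网页".encode("gbk")  # stands in for response.content of a GBK page

wrong = raw.decode("iso-8859-1")  # a wrong guess does not fail, it silently garbles
right = raw.decode("gbk")         # same effect as setting response.encoding = 'gbk'

print(wrong == u"中文网页")  # False
print(right == u"中文网页")  # True
```

The wrong decode raises no error at all, which is why this kind of bug usually shows up only as garbled output.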

Know not only the principles, but also how to apply them!

If I had written this post earlier in my career, I would certainly have included an example like this:

If the current file is saved as GBK, the file header says # -*- coding: utf-8 -*-, and the default encoding is set to xxx, then what will the following program output ...

This is like the code people write when learning C, mixing operator precedence, associativity, and pointers just to show off their skill.

In fact none of this is practical. Who would write such code in real work? What I want to talk about here is practical ways of handling Chinese in Python.

Basic settings

Proactively set defaultencoding. (The default is ASCII.)

Keep the encoding the code file is saved in consistent with the # coding: xxx declaration in the file header.

If you are dealing with Chinese, use unicode rather than str inside the program wherever possible.
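One way to follow the last rule is to convert at the boundary with a small helper. This is a sketch only: to_unicode is a hypothetical name, and the UTF-8 default is an assumption about the input, not something Python supplies.

```python
# -*- coding: utf-8 -*-
def to_unicode(value, encoding="utf-8"):
    """Return value as unicode text, decoding only when it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


print(to_unicode(u"人生苦短".encode("gbk"), "gbk") == u"人生苦短")  # True
print(to_unicode(u"人生苦短") == u"人生苦短")                      # True
```

Because the encoding is always passed explicitly, the result does not depend on defaultencoding.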

About printing

When you print a str, you are actually sending its bytes directly to the shell. If the encoding of those bytes differs from the encoding of the shell, the output is garbled.

When you print unicode, the system automatically encodes it into the shell's encoding, so the output is not garbled.
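The encoding step that print performs for unicode can be made visible by doing it by hand. In this sketch encode_for_shell is a hypothetical helper; unlike print, it replaces characters the shell encoding cannot represent with "?" instead of raising an error.

```python
# -*- coding: utf-8 -*-
import sys


def encode_for_shell(text, shell_encoding=None):
    """Encode unicode text for a shell, replacing unencodable characters."""
    encoding = shell_encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace")


print(encode_for_shell(u"人生苦短", "gbk") == u"人生苦短".encode("gbk"))  # True
print(encode_for_shell(u"人生苦短", "ascii") == b"????")                  # True
```

When shell_encoding is omitted, the helper falls back to sys.stdout.encoding, which may be unset when output is piped.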

Keep the inside and the outside of the program consistent

If you want to guarantee that only unicode is used inside the program, you must convert byte streams to unicode as soon as they are read from outside, and work with unicode rather than str in all later code.

Python

with open("test") as f:
    for i in f:
        # decode the UTF-8 byte stream that was read in
        u = i.decode('utf-8')
        ...

If the data streams that connect the inside and outside of the program are likened to channels, then rather than opening a channel as a byte stream and decoding after reading, it is better to open the channel as unicode directly.

Python

import codecs

# use codecs to open a unicode channel directly
file = codecs.open("test", "r", "utf-8")
for i in file:
    print type(i)  # the type of i is unicode
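The same idea can also be written with io.open, which takes an encoding argument on both Python 2 and Python 3. This swaps in io.open for the codecs.open call used above, and the temporary file exists only to make the sketch self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Write a small UTF-8 file so the example has something to read.
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"人生苦短\n")

# Reading through io.open yields unicode lines; no manual decode is needed.
with io.open(path, "r", encoding="utf-8") as f:
    lines = [line for line in f]

os.remove(path)
print(lines == [u"人生苦短\n"])  # True
```

The encoding is stated once, where the file is opened, and the rest of the program sees only unicode.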

So the key to handling Chinese encodings in Python is to be clear about what you are doing: what encoding the bytes you read are in, how a str is converted to unicode, and how a str is converted from one encoding to another. Beyond that, do not make the problem more confusing than it is; take the initiative to keep things consistent yourself.
