Methods of processing str and Unicode in Python


2015/03/25 · Basic Knowledge · Python

Source: liuaiqi627's Blog

Handling Chinese text in Python 2.x is a headache. The articles on this topic around the net are uneven in quality and some of them contain mistakes, so I intend to write a summary here.

I will keep learning and will revise this post over time.

This article assumes the reader already has the basic background on character encodings, including what UTF-8 is, what Unicode is, and how the two relate. None of that is introduced again here.

str and byte strings

First, let's set Unicode aside entirely.

Python
s = "人生苦短"  # "Life is short"

s is a str, and a str itself stores a sequence of bytes. So what encoding are these bytes in?

If this code is typed into the interactive interpreter, the bytes in s use the interpreter's encoding, which is GBK for the Windows cmd console.

If the code is saved to a file and then executed, for example saved as UTF-8, then s is initialized with UTF-8 encoded bytes when the interpreter loads the program.
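
To make the point concrete, here is a minimal sketch showing that the same text turns into different bytes depending on the encoding. It is written to run on both Python 2 and Python 3 (an assumption of this example; the article itself targets Python 2, where str holds bytes, while in Python 3 that role is played by bytes). The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# The same four Chinese characters, encoded two different ways.
text = u"人生苦短"

utf8_bytes = text.encode("utf-8")  # what a str holds if the file is saved as UTF-8
gbk_bytes = text.encode("gbk")     # what a str holds when typed into a GBK console

print(len(utf8_bytes))  # 12: three bytes per character in UTF-8
print(len(gbk_bytes))   # 8: two bytes per character in GBK
print(utf8_bytes == gbk_bytes)  # False: same text, different bytes
```

So a str on its own does not tell you which encoding its bytes are in; you have to know where it came from.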

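unicode and str

We know that Unicode is a character coding standard, and its concrete encoding forms include UTF-8, UTF-16, and so on. GBK, by contrast, is a separate Chinese encoding and not an encoding form of Unicode.

Internally, Python 2 stores each unit of a unicode object in two or four bytes, depending on whether it is a narrow (UCS-2) or wide (UCS-4) build. The advantage of using unicode objects instead of str is that unicode does not depend on any particular byte encoding, which makes cross-platform code easier.

You can define a unicode object in either of the following two ways: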

Python
s1 = u"人生苦短"
s2 = unicode("人生苦短", "utf-8")
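
As a rough check that the two definitions give the same object, the sketch below decodes raw UTF-8 bytes and compares the result with a u"" literal. It uses bytes.decode instead of the Python 2 only unicode() constructor so that it also runs on Python 3; that substitution is my assumption, not something from the original post.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes of the four characters, written out explicitly.
raw = b"\xe4\xba\xba\xe7\x94\x9f\xe8\x8b\xa6\xe7\x9f\xad"

s1 = u"人生苦短"          # defined with a unicode literal
s2 = raw.decode("utf-8")  # defined by decoding bytes with a known encoding

print(s1 == s2)  # True
print(len(raw))  # 12 bytes
print(len(s1))   # 4 characters
```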

Encode and Decode

Encoding and decoding in Python work like this:

str (bytes) --decode--> unicode --encode--> str (bytes)

So we can write code like this:

Python
# -*- coding: utf-8 -*-
su = "人生苦短"          # su is a byte string in UTF-8
u = su.decode("utf-8")   # su is decoded into a unicode object, assigned to u
sg = u.encode("gbk")     # u is encoded into a byte string in GBK, assigned to sg
print sg                 # print sg
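
The same round trip can be sketched in a form that should run on both Python 2 and Python 3. The only change I have assumed is starting from an explicit bytes literal, so the example does not depend on how the file was saved.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes standing in for the str "su" from the example above.
su = b"\xe4\xba\xba\xe7\x94\x9f\xe8\x8b\xa6\xe7\x9f\xad"

u = su.decode("utf-8")  # bytes -> unicode
sg = u.encode("gbk")    # unicode -> bytes in GBK

print(len(su))  # 12
print(len(sg))  # 8
print(sg.decode("gbk") == u)  # True: decoding with the right encoding recovers the text
```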

But reality is more complicated than this, as the following code shows:

Python
s = "人生苦短"
s.encode('gbk')

See! A str can also be encoded. (In fact, a unicode object can also be decoded, though that is rarely meaningful.)

Why is this possible? Follow the arrows of the encoding process above and you can guess the principle: when you call encode on a str, Python first decodes it to unicode using the default encoding, and then encodes that unicode object with the encoding you asked for.

This brings us to the cause of most errors when handling Chinese in Python 2.x: Python's default encoding, defaultencoding, is ASCII.

See this example:

Python
# -*- coding: utf-8 -*-
s = "人生苦短"
s.encode('gbk')

The above code raises an error: UnicodeDecodeError: 'ascii' codec can't decode byte ...

Because you did not set defaultencoding, what it actually does is this:

Python
# -*- coding: utf-8 -*-
s = "人生苦短"
s.decode('ascii').encode('gbk')
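
The failing step can be reproduced on its own. The sketch below performs the ASCII decode explicitly, so it behaves the same on Python 2 and Python 3; the flag name failed is only for illustration.

```python
# UTF-8 bytes of four Chinese characters; none of them is valid ASCII.
s = b"\xe4\xba\xba\xe7\x94\x9f\xe8\x8b\xa6\xe7\x9f\xad"

try:
    s.decode("ascii").encode("gbk")  # the implicit step, written out
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)  # the message names the 'ascii' codec and the offending byte

print(failed)  # True
```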

Setting defaultencoding

The code for setting defaultencoding is as follows:

Python
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

If you encode or decode in Python without specifying an encoding, Python uses defaultencoding.

For example, in the previous section, when a str is encoded into another format, defaultencoding is used for the implicit decode.

Python
s.encode("utf-8")  # is equivalent to: s.decode(defaultencoding).encode("utf-8")

If you create a unicode object from a str without specifying the encoding of that str, the program also uses defaultencoding.

Python
u = unicode("人生苦短")  # is equivalent to: u = unicode("人生苦短", defaultencoding)

The default value of defaultencoding, ASCII, is the cause of many errors, so setting defaultencoding early is a good habit.
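
To see which default your interpreter is actually using, you can ask sys directly. The sketch below also shows the explicit form of the conversion, which does not depend on the default at all. It runs on both Python 2 and Python 3; on Python 3 the reported default is already utf-8.

```python
import sys

# 'ascii' on Python 2 unless it has been changed; 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Naming the source encoding explicitly avoids relying on the default.
raw = b"\xe4\xba\xba\xe7\x94\x9f\xe8\x8b\xa6\xe7\x9f\xad"  # UTF-8 bytes
gbk = raw.decode("utf-8").encode("gbk")
print(len(gbk))  # 8
```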

What the encoding declaration in the file header does

Thanks to this blog post for its explanation of the Python file header.

The line # -*- coding: utf-8 -*- at the top of the file currently appears to have three effects.

    1. The declaration is required if the code contains Chinese comments.
    2. More advanced editors, such as my Emacs, use the header declaration as the encoding in which to save the code file.
    3. The program uses the header declaration to decode and initialize unicode objects such as u"人生苦短" (so the header declaration and the encoding in which the code file is saved must match).
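
As a small illustration of the interpreter reading the header, the standard library can report which encoding a piece of source declares. Note that this sketch is Python 3 only, since tokenize.detect_encoding does not exist in Python 2, and the two-line source below is made up for the example.

```python
import io
import tokenize

# A made-up source file, as bytes, whose header declares GBK.
source = b"# -*- coding: gbk -*-\ns = 'x'\n"

# detect_encoding reads at most the first two lines, looking for the declaration.
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # gbk
```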

About the Requests library

Requests is a useful Python HTTP client library that is often used when writing crawlers and testing server responses.

After a request is sent to the server, Requests returns a Response object, which stores the raw bytes of the HTTP response in its content attribute.

However, if you access the other attribute, text, it returns a unicode object, and garbled text often shows up here.

This is because the Response object decodes the bytes into unicode using yet another attribute, encoding, and the value of encoding is actually guessed by Requests itself.

From the official documentation:

text
Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using chardet.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

So either use content (the raw bytes) directly, or remember to set encoding correctly. For example, if I fetch a GBK-encoded page, I need the following to get the correct unicode:

Python
import requests
url = "http://xxx.xxx.xxx"
response = requests.get(url)
response.encoding = 'gbk'
print response.text
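
What goes wrong can be imitated without any network access. In the sketch below, content stands in for response.content of a GBK page, and the two decode calls stand in for what response.text would give with a wrong and a right encoding. ISO-8859-1 is used as the wrong guess because, as far as I know, that is what Requests falls back to for text responses whose headers name no charset; treat that detail as an assumption.

```python
# -*- coding: utf-8 -*-
# Stand-in for response.content: the raw bytes of a GBK-encoded page.
content = u"人生苦短".encode("gbk")

wrong = content.decode("iso-8859-1")  # decoding succeeds, but the text is garbled
right = content.decode("gbk")         # the equivalent of setting encoding = 'gbk'

print(len(wrong))  # 8: every byte became its own character
print(len(right))  # 4
print(right == u"人生苦短")  # True
```

Note that the wrong decode raises no error, which is why this kind of garbling is easy to miss.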

Not just the principles, but also the practice!

If I had written this post in my early blogging days, I would certainly have included an example like this:

If the file is actually saved as GBK, the file header says # -*- coding: utf-8 -*-, and the default encoding is set to xxx, then what is the result of the following program ...

This is like learning C and using all kinds of operator precedence, associativity, and pointer tricks to show off one's skill.

In fact, none of this is practical. Who would write such code in real work? Here I want to talk about practical methods for handling Chinese in Python.

Basic settings

Proactively set defaultencoding. (The default is ASCII.)

Keep the encoding in which the code file is saved consistent with the # coding: xxx declaration in the file header.

If the text is Chinese, use unicode rather than str inside the program as much as possible.

About printing

When you print a str, you are actually sending its byte stream directly to the shell. If the encoding of those bytes differs from the encoding of the shell, the output is garbled.

When you print a unicode object, the system automatically encodes it into the shell's encoding, so the output is not garbled.
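
Roughly speaking, printing a unicode object amounts to encoding it with the encoding of standard output. The sketch below does that step by hand. The fallback to utf-8 when no encoding is reported (for example when output is piped) and the "replace" error handler are my own assumptions for the example, not the interpreter's exact behavior.

```python
# -*- coding: utf-8 -*-
import sys

# The shell's encoding as seen by Python; it may be missing when output is redirected.
shell_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"

text = u"人生苦短"
# Encode for the shell by hand; "replace" substitutes characters the shell cannot show.
encoded = text.encode(shell_encoding, "replace")

print(shell_encoding)
print(repr(encoded))
```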

Unify the inside and outside of the program

If you want to ensure that only unicode is used inside the program, you must convert byte streams to unicode when reading from outside, and then process unicode rather than str in the rest of the code.

Python
with open("test") as f:
    for i in f:
        # decode the UTF-8 byte stream that was read in
        u = i.decode('utf-8')
        # ...

If the data flowing between the program and the outside world is likened to a channel, then rather than opening the channel as a byte stream and decoding after reading, it is better to open the channel as unicode directly.

Python
import codecs

# Use codecs to open a unicode channel directly
file = codecs.open("test", "r", "utf-8")
for i in file:
    print type(i)  # the type of i is unicode
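
A self-contained variant of the same idea is sketched below. It uses io.open rather than codecs.open (my substitution; both decode while reading) because io.open behaves the same on Python 2 and Python 3, and it writes its own temporary file so that it can be run as is. The file name test.txt is arbitrary.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Create a small UTF-8 file to read back.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"人生苦短\n")

# Open the channel as unicode: every line arrives already decoded.
with io.open(path, "r", encoding="utf-8") as f:
    lines = [line for line in f]

print(lines == [u"人生苦短\n"])  # True
```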

So the key to Python's Chinese encoding problems is to be clear about what you are doing: what encoding the data you read is in, what encoding the bytes of a str are in, how a str is converted to unicode, and how a str is converted from one encoding to another. Above all, do not let the problem get muddled. You have to take the initiative and maintain consistency yourself.
