Methods of processing str and Unicode in Python
2015/03/25 · Basic Knowledge · 3 Reviews · Python
share to: Source: liuaiqi627 's Blog
It is a headache to deal with Chinese in python2.x. Write this article on the net, the measurement is not homogeneous, and will be a bit wrong, so here intends to summarize an article.
I will also learn in the future, and constantly revise this blog.
This assumes that the reader already has the basic knowledge associated with coding, and this article is no longer introduced again, including what is utf-8, what is Unicode, and what is the relationship between them.
STR and byte code
First, we don't talk about Unicode at all.
Python
1 |
s = "Life is too Short" |
S is a string, which itself stores a byte code. So what format is this bytecode?
If this code is entered on the interpreter, the format of this s is the interpreter's encoding format, which is GBK for Windows CMD.
If the segment code is executed after it is saved, such as storage as utf-8, then the interpreter will initialize the S to utf-8 encoding when it is loaded into the program.
Unicode and Str
We know that Unicode is a coding standard, and the specific implementation criteria may be UTF-8,UTF-16,GBK ...
Python uses two bytes internally to store a Unicode, and the advantage of using Unicode objects instead of STR is that Unicode facilitates cross-platform.
You can define a Unicode in the following two ways:
Python
12 |
S1 = u"Life is short" s2 = Unicode("Life is short", "Utf-8") |
Encode and Decode
The encoding decoding in Python is this:
So we can write the code like this:
Python
123456789 |
#-*-Coding:utf-8-*-su = "Life is too Short" #: Su is a byte string in utf-8 formatu = s. Decode("Utf-8") #: S is decoded to Unicode object, assigned to usg = u. Encode("GBK") #: U is encoded as a byte string in GBK format, assigned to SGprint sg # Print SG |
But the facts are more complicated than this, like the following code:
Python
12 |
s = "Life is too Short" s. Encode(' GBK ') |
See! STR can also encode, (in fact, Unicode objects can decode, but not very significant)
Why is this possible? Look at the arrow of the coding process, you can think of the principle that when you encode STR, you will first decode yourself to Unicode with the default encoding, and then encode the Unicode encoding for you.
This leads to the python2.x in the processing of Chinese, most of the reasons for the error: Python's default encoding, Defaultencoding is ASCII
See this example:
Python
123 |
#-*-Coding:utf-8-*- s = "Life is too Short" s. Encode(' GBK ') |
The above code will error message: unicodedecodeerror: ' ASCII ' codec can ' t decode byte ...
Because you didn't specify defaultencoding, so it's actually doing something like this:
Python
123 |
#-*-Coding:utf-8-*- s = "Life is too Short" s. Decode(' ASCII '). Encode(' GBK ') |
Set defaultencoding
The code for setting defaultencoding is as follows:
Python
12 |
Reload(sys) sys. Setdefaultencoding(' utf-8 ') |
If you encode and decode in Python without specifying the encoding, Python uses defaultencoding.
For example, in the previous section, where STR is encoded in another format, defaultencoding is used.
Python
1 |
S. Encode("Utf-8") is equivalent to s. Decode(defaultencoding). Encode("Utf-8") |
If you use STR to create a Unicode object, the program will also use defaultencoding if you do not specify the encoding format for this str.
Python
1 |
U = Unicode("Life is too Short") is equivalent to u = Unicode("Life is too Short" ,defaultencoding) |
The default defaultcoding:ascii is the cause of many errors, so setting up defaultencoding early is a good habit.
The file header declares the function of the encoding.
This is thanks to this blog about the Python file header part of the knowledge of the explanation.
Top: #-*-Coding:utf-8-*-currently appears to have three effects.
- This declaration is required if there is a Chinese comment in the code
- A more advanced editor, such as my Emacs, will format this as a code file based on the header declaration.
- The program will initialize the "Life is too short" by the head Declaration, decoding the Unicode object, (so the header declaration and the code are stored in the same format)
About the Requests library
Requests is a useful Python HTTP client library that is often used when writing crawlers and testing server response data.
The Request object will return an response object after accessing the server, which will save the returned HTTP response bytecode to the content property.
However, if you access another property text, it will return a Unicode object, garbled problem will often be sent here.
Because the response object encodes the bytecode into Unicode through another property encoding, this encoding property is responses to guess.
Official documents:
Text
Content of the response, in Unicode.
If response.encoding is None, encoding'll be guessed using Chardet.
The encoding of the response content is determined based solely in HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-http knowledge to make a better guess at the encoding, you should set r.encoding Appropri Ately before accessing.
So either you use the content (bytecode) directly or remember to set the encoding correctly, for example, if I get a GBK encoded page, I need the following methods to get the correct Unicode.
Python
123456 |
Import requests URL = "http://xxx.xxx.xxx" response = requests. Get(URL) response. Encoding = ' gbk ' Print response. Text |
Not only to the principle, but also to use the method!
If it was early I wrote a blog, then I would certainly write this example:
If the current file encoding is GBK, then the file header is: #-*-Coding:utf-8-*-, and then set the default encoding to XXX, then the results of the following program will be ...
This is similar to when learning C, with a variety of priorities, binding, pointers to show their level of code.
Actually these are not practical at all, who will write such code in the real work? I'm here to talk about practical python methods for handling Chinese.
Basic settings
Proactively set defaultencoding. (the default is ASCII)
The save format of the code file is consistent with the # Coding:xxx of the file header
If it is in Chinese, use Unicode as much as possible inside the program without STR
About printing
When you print str, you are actually sending the byte stream directly to the shell. If your byte-stream encoding format is not the same as the code format of the shell, it will be garbled.
While you are printing Unicode, the system will automatically encode it as the shell encoding format, there will be no garbled.
internal and external procedures to unify
If you want to ensure that only Unicode is used inside the program, you must convert the byte stream to Unicode when reading from the outside, and then process Unicode in the later code instead of Str.
Python
12345 |
with open("test") as F: For i in F: # decode the read-in utf-8 byte stream u = i. Decode(' utf-8 ') .. . . |
If the data stream inside and outside of the connection program is likened to the channel, then instead of the channel into a byte stream, read into the decoding, it is better to directly open the channel to Unicode.
Python
12345 |
# Use codecs to open Unicode channels directlyfile = codecs. Open("test", "R", "Utf-8") for i in file: print type(i) # The type of I is Unicode |
So the key to Python's problem with Chinese coding is that you need to be clear about what you're doing, what format you're going to read, what format these bytes are, how str is converted to Unicode, and how one encoding from STR to another is done. What's more, you can't confuse the problem with your own initiative to maintain a unity.
Methods of processing str and Unicode in Python