Solutions to unicode and str problems in Python 2.x
Processing Chinese in python2.x is a headache, so I am summarizing what I know in this article. My tests have not been exhaustive, so some details may be slightly wrong.
I will keep revising this post in the future.
Here I assume the reader already has basic knowledge of encodings; this article will not explain again what UTF-8 is, what unicode is, or how they relate.
str and byte strings
First, let us not talk about unicode at all.
s = "人生苦短"  # "Life is short" in Chinese
s is a str, and a str stores a byte string. So what encoding are those bytes in?
If this code is typed at the interactive interpreter, s is in the interpreter's encoding; for Windows cmd that is gbk.
If the code is saved to a file before execution, for example saved as utf-8, then s is initialized to utf-8-encoded bytes when the interpreter loads the program.
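To make this concrete, the same four Chinese characters become different byte strings under different encodings. A small sketch (written in Python 3, where bytes plays the role of Python 2's str; both codecs are in the standard library):

```python
# -*- coding: utf-8 -*-
text = u"人生苦短"  # "Life is short" in Chinese, 4 characters

utf8_bytes = text.encode("utf-8")  # CJK characters take 3 bytes each in utf-8
gbk_bytes = text.encode("gbk")     # and 2 bytes each in gbk

print(len(utf8_bytes))  # 12
print(len(gbk_bytes))   # 8
```

The text is one thing; the bytes depend entirely on which encoding produced them.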
Unicode and str
We know that unicode is a character standard; its concrete byte encodings include UTF-8, UTF-16, and so on (gbk, by contrast, is a separate, pre-unicode Chinese encoding).
Internally, Python 2 stores one unicode character in two bytes (on the common narrow builds). The advantage of using unicode objects instead of str is that unicode is convenient to move across platforms and encodings.
You can define a unicode in the following two ways:
s1 = u"人生苦短"
s2 = unicode("人生苦短", "utf-8")
Encode and decode
Encoding and decoding in python work as follows: decode converts a str (a byte string) into a unicode object, and encode converts a unicode object back into a str.
Therefore, we can write the following code:
# -*- coding: utf-8 -*-
su = "人生苦短"         # su is a utf-8 byte string ("Life is short")
u = su.decode("utf-8")  # decode su into a unicode object and assign to u
sg = u.encode("gbk")    # encode u into a gbk-format byte string and assign to sg
print sg
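The same chain is lossless as long as every step names the right codec. A Python 3 rendering of the same steps (bytes ↔ str there, instead of Python 2's str ↔ unicode):

```python
# -*- coding: utf-8 -*-
su = u"人生苦短".encode("utf-8")  # a utf-8 byte string, like py2's str
u = su.decode("utf-8")            # bytes -> text (py2: str -> unicode)
sg = u.encode("gbk")              # text -> gbk bytes (py2: unicode -> str)

assert sg.decode("gbk") == u      # the text survives the round trip
```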
But the fact is more complex than this, for example, look at the following code:
s = "人生苦短"
s.encode('gbk')
Look! A str can also be encoded. (In fact, unicode objects can also be decoded, though that is rarely useful.)
Why is that? Following the arrows of the encoding process, you can guess the mechanism: when a str is encoded, Python first decodes it into unicode using the default encoding, and only then encodes that unicode with the encoding you specified.
This is the cause of most errors when processing Chinese characters in python2.x: python's default encoding, defaultencoding, is ascii.
Let's look at this example:
# -*- coding: utf-8 -*-
s = "人生苦短"
s.encode('gbk')
The code above raises an error: UnicodeDecodeError: 'ascii' codec can't decode byte ......
Because no encoding was specified for the implicit decode, defaultencoding is used, so the code is actually doing the following:
# -*- coding: utf-8 -*-
s = "人生苦短"
s.decode('ascii').encode('gbk')
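That implicit ascii decode can be reproduced explicitly (again in Python 3 terms, with bytes standing in for Python 2's str):

```python
# -*- coding: utf-8 -*-
utf8_bytes = u"人生苦短".encode("utf-8")  # every byte here is >= 0x80

try:
    # the step Python 2 inserts implicitly when defaultencoding is ascii
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as e:
    print("decode failed: %s" % e)
```

ascii only covers bytes 0x00-0x7F, so any Chinese text is guaranteed to trip it.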
Set defaultencoding
The code for setting defaultencoding is as follows (the reload is needed because site.py deletes setdefaultencoding from sys during startup):
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
If you do not specify the encoding method when encoding and decoding in python, python uses defaultencoding.
For example, when the str from the earlier example is encoded into another format, defaultencoding is used:
s.encode("utf-8") is equivalent to s.decode(defaultencoding).encode("utf-8")
Likewise, if you create a unicode object from a str without stating the str's encoding, defaultencoding is used:
u = unicode("人生苦短") is equivalent to u = unicode("人生苦短", defaultencoding)
defaultencoding being ascii is the cause of many errors, so setting defaultencoding early is a good habit.
The role of the encoding declaration in the file header
I would like to thank this blog for its explanation of the python file header.
At the top of the file: # -*- coding: utf-8 -*-. As far as I can tell, it serves three purposes.
It is required whenever the code contains Chinese (or any other non-ascii) comments or literals.
A reasonably advanced editor (such as my emacs) reads the header declaration and saves the file in that format.
When the interpreter initializes a unicode literal such as u"...", it decodes the source bytes using the encoding declared in the header (so the declaration must match the format the code is actually saved in).
About the requests Library
requests is a very practical Python HTTP client library, often used to write crawlers and to test server responses.
A request made with it returns a Response object once the server replies; this object stores the raw bytes of the HTTP response in its content attribute.
However, if you access the text attribute instead, you get a unicode object back, and this is where garbled characters often appear.
That is because the Response object uses its encoding attribute to decode the bytes in content into unicode, and encoding is guessed by requests.
Official documentation:
text
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
So either use content (the raw bytes) directly, or remember to set encoding correctly. For example, if the page I fetched is gbk-encoded, the following is needed to obtain the correct unicode:
import requests

url = "http://xxx.xxx.xxx"
response = requests.get(url)
response.encoding = 'gbk'
print response.text
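To see why a wrong encoding guess garbles the text, here is a self-contained sketch that needs no network: the bytes are hypothetical stand-ins for response.content, and latin-1 mimics the RFC 2616 fallback (Python 3 syntax):

```python
# -*- coding: utf-8 -*-
# hypothetical bytes standing in for response.content from a gbk page
content = u"人生苦短".encode("gbk")

wrong = content.decode("latin-1")  # an RFC-2616-style fallback guess: mojibake
right = content.decode("gbk")      # the page's real encoding: correct text

print(right)           # 人生苦短
print(wrong == right)  # False
```

The bytes never change; only the decoding choice decides whether you see text or garbage.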
Knowing the principles matters, but so does knowing how to apply them!
In my early blogging days I would have written an example like this:
if the current file is saved as gbk, the file header says # -*- coding: utf-8 -*-, and the default encoding is set to xxx, then the result of the following program will be ......
That is like when I was learning C and showed off with operator precedence, associativity, and pointer tricks.
None of it is practical; who writes code like that in real work? What I want to cover here are practical python methods for handling Chinese.
Basic settings
Set defaultencoding. (The default value is ascii)
The format in which the code file is saved must match the # coding: xxx declaration in its header.
When handling Chinese, use unicode objects inside the program as much as possible, not str.
About Printing
When you print a str, the byte stream is sent directly to the shell; if the byte stream's encoding differs from the shell's encoding, you get garbled output.
When you print a unicode object, Python automatically encodes it into the shell's encoding, so no garbling occurs.
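The mismatch can be simulated without a shell: take bytes encoded for a gbk terminal and interpret them the way a utf-8 terminal would (a Python 3 sketch; errors="replace" mimics a terminal showing placeholder glyphs):

```python
# -*- coding: utf-8 -*-
s = u"人生苦短"
gbk_stream = s.encode("gbk")  # what printing a gbk str sends to the shell

# a utf-8 shell interpreting gbk bytes cannot recover the original text
shown = gbk_stream.decode("utf-8", errors="replace")
print(shown == s)  # False: this difference is the mojibake on screen
```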
Unify the internal representation and convert at the boundaries
If you want the program to use only unicode internally, then whenever bytes come in from outside (a byte stream read from a file, for example), convert them to unicode immediately, and let all subsequent code work with unicode rather than str.
with open("test") as f:
    for i in f:
        # decode the utf-8 byte stream just read
        u = i.decode('utf-8')
        ......
If you picture the data streams entering and leaving the program as channels, then rather than opening a byte channel and decoding everything you read, it is better to open the channel as unicode in the first place.
# use codecs to open a unicode channel directly
import codecs
file = codecs.open("test", "r", "utf-8")
for i in file:
    print type(i)  # i is unicode
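A portable alternative (assuming only the stdlib io module, available since Python 2.6 and the default in Python 3) does the same job as codecs.open:

```python
# -*- coding: utf-8 -*-
import io

# write a small utf-8 file, then read it back as unicode directly
with io.open("test", "w", encoding="utf-8") as f:
    f.write(u"人生苦短\n")

with io.open("test", "r", encoding="utf-8") as f:
    for line in f:
        print(type(line))  # unicode in Python 2, str in Python 3
```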
In short, the key to handling Chinese encodings in python is to be clear about what you are doing: what format the bytes you read are in, what encoding they are declared as, how a str is converted to unicode, and how a str in one encoding is converted to another encoding. And do not let things get muddled: take the initiative to maintain one consistent representation.