This article describes how to deal with the default encoding problem of python2.x. Handling Chinese characters in python2.x is a headache, so I am summarizing what I have learned here. My tests are not exhaustive, so some details may be slightly off.
I will also continue to modify this blog in the future.
This article assumes the reader already has basic knowledge of character encodings; it will not explain again what UTF-8 is, what unicode is, or how they relate to each other.
str and byte strings
To start, let's leave unicode out of the picture entirely.
S = "Life is short"
s is a str, and a str stores raw bytes. So what encoding are those bytes in?
If this line is typed into the interactive interpreter, s holds bytes in the interpreter's encoding; for the Windows cmd console that is gbk.
If the code is saved to a file before execution, for example saved as UTF-8, then s will hold UTF-8 bytes when the interpreter loads the program.
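A quick way to see this for yourself is to inspect the bytes. This is a minimal sketch and assumes the file really is saved as UTF-8 (人生苦短 is the original Chinese for "Life is short"):

# -*- coding: UTF-8 -*-
s = "人生苦短"         # "Life is short"
print type(s)          # <type 'str'>
print repr(s)          # '\xe4\xba\xba\xe7\x94\x9f\xe8\x8b\xa6\xe7\x9f\xad' -- the UTF-8 bytes

If the same file were saved as gbk instead (with the header changed to match), repr(s) would show gbk bytes.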
Unicode and str
We know unicode is a character standard; concrete byte encodings of it include UTF-8, UTF-16, gbk ......
Internally, python uses two bytes to store one unicode character. The advantage of using unicode objects instead of str is that unicode is convenient to work with across platforms.
You can define a unicode in the following two ways:
S1 = u "life bitter short" s2 = unicode ("life bitter short", "UTF-8 ")
Encode and decode
Encoding and decoding in python work as follows:
# -*- coding: UTF-8 -*-
su = "人生苦短"          # su is a UTF-8 byte string
u = su.decode("UTF-8")   # su is decoded into a unicode object and assigned to u
sg = u.encode("gbk")     # u is encoded into a gbk byte string and assigned to sg
print sg
But reality is more complicated than that. For example, look at the following code:
S = "Life is short" s. encode ('gbk ')
Look! A str can also be encoded. (In fact, unicode objects can also be decoded, but that is of little practical use.)
Why is this allowed? If you think about the encoding process, you can guess the principle: when a str is encoded, python first decodes it into unicode using the default encoding, and then encodes that unicode with the encoding you specified.
This is the cause of most errors when processing Chinese in python2.x: python's default encoding, defaultencoding, is ascii.
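You can check the default encoding yourself; a minimal sketch:

import sys
print sys.getdefaultencoding()   # 'ascii' on a stock python2.x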
Let's look at this example:
# -*- coding: UTF-8 -*-
s = "人生苦短"
s.encode('gbk')
The code above raises an error: UnicodeDecodeError: 'ascii' codec can't decode byte ......
Because no encoding is specified for the implicit decode, defaultencoding is used, and what actually runs is:
# -*- coding: UTF-8 -*-
s = "人生苦短"
s.decode('ascii').encode('gbk')
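The error disappears once you tell python the real encoding of s yourself instead of letting it fall back to ascii; a minimal sketch, assuming the file is saved as UTF-8:

# -*- coding: UTF-8 -*-
s = "人生苦短"
sg = s.decode('utf-8').encode('gbk')   # decode with the real encoding, then encode
print repr(sg)                         # the gbk bytes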
Set defaultencoding
The code for setting defaultencoding is as follows:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
If you do not specify the encoding method when encoding and decoding in python, python uses defaultencoding.
For example, when the str from the previous example is encoded into another format, defaultencoding is used:
s.encode("UTF-8") is equivalent to s.decode(defaultencoding).encode("UTF-8")
Likewise, if you create a unicode object from a str without stating the str's encoding, defaultencoding is used:
u = unicode("人生苦短") is equivalent to u = unicode("人生苦短", defaultencoding)
defaultencoding being ascii is the cause of many errors, so setting defaultencoding early is a good habit.
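To make the equivalence concrete, here is a minimal sketch (it assumes the file is saved as UTF-8 and defaultencoding is still ascii):

# -*- coding: UTF-8 -*-
u1 = unicode("人生苦短", "utf-8")    # works: the str's encoding is given explicitly
try:
    u2 = unicode("人生苦短")         # falls back to defaultencoding (ascii) and fails
except UnicodeDecodeError as e:
    print e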
The role of the file header encoding declaration
I would like to thank this blog for its explanation of the python file header.
The declaration at the top of the file, # -*- coding: UTF-8 -*-, currently seems to serve three purposes:
1. If the code contains Chinese comments, this declaration is required.
2. A reasonably advanced editor (such as my emacs) uses the header declaration to decide the encoding of the code file.
3. The interpreter uses the header declaration to decode literals such as u"人生苦短" into unicode objects when they are initialized, so the header declaration must match the encoding the file is actually saved in (see the sketch below).
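A minimal sketch of the third point, assuming the file really is saved as UTF-8:

# -*- coding: UTF-8 -*-
u = u"人生苦短"
print repr(u)    # u'\u4eba\u751f\u82e6\u77ed' -- decoded using the declared UTF-8

If the file were actually saved as gbk while still declaring UTF-8, the literal would be decoded with the wrong codec and u would come out garbled.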
About the requests Library
Requests is a very practical Python HTTP client library, often used to write crawlers and to test responses from a server.
After a request reaches the server, requests returns a Response object, which stores the raw bytes of the HTTP response in its content attribute.
However, if you access the text attribute instead, you get back a unicode object, and this is where garbled characters often appear.
That is because the Response object uses its encoding attribute to decode the bytes into unicode, and encoding is only a guess made by requests.
Official documentation:
text
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
So either use content (the raw bytes) directly, or remember to set encoding correctly. For example, if I have fetched a gbk-encoded webpage, the following is needed to obtain the correct unicode:
import requests
url = "http://xxx.xxx.xxx"
response = requests.get(url)
response.encoding = 'gbk'
print response.text
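If you do not know the page's encoding in advance, requests also exposes the chardet guess as apparent_encoding; adopting it explicitly is one option (a sketch, with the url elided as in the example above):

import requests
response = requests.get("http://xxx.xxx.xxx")
response.encoding = response.apparent_encoding   # use the chardet guess explicitly
print response.text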
Not only the principle, but also how to use it!
If this were my early blogging days, I would write an example like this:
If the current file is saved in gbk, the file header is # -*- coding: UTF-8 -*-, and the default encoding is set to xxx, then what will the following program output? ......
That is just like when I was learning c and showed off with operator precedence, associativity and pointer tricks to demonstrate my level.
None of that is practical; who writes such code in real work? What I want to talk about here are practical ways of handling Chinese in python.
Basic settings
1. Set defaultencoding (the default value is ascii).
2. Save the code file in the same encoding as the # coding: xxx declaration in the file header.
3. When the text is Chinese, use unicode rather than str inside the program as much as possible.
A minimal file preamble that follows all three rules is sketched below.
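The sketch below is just one example; choosing utf-8 is an assumption, not a requirement:

# -*- coding: UTF-8 -*-            # 2. this file must actually be saved as UTF-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')    # 1. change the default from ascii

u = u"人生苦短"                     # 3. keep Chinese text as unicode inside the program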
About printing
When you print a str, the bytes are sent to the shell as they are; if the byte string's encoding differs from the shell's encoding, the output is garbled.
When you print a unicode object, python automatically encodes it to the shell's encoding, so nothing gets garbled.
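A minimal sketch of the difference (it assumes the file is saved as UTF-8; what you actually see depends on your shell):

# -*- coding: UTF-8 -*-
import sys

u = u"人生苦短"
print sys.stdout.encoding    # the shell encoding python detected (None when output is piped)
print u                      # encoded to the shell encoding automatically
print u.encode('utf-8')      # raw bytes: correct only if the shell is UTF-8 as well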
Unify the inside and the outside of the program
If you want the program to use only unicode internally, then whatever bytes come in from the outside, such as a byte stream read from a file, should be converted to unicode at the moment they are read, and the rest of the code should work with unicode rather than str.
With open ("test") as f: for I in f: # decode the read UTF-8 byte stream u = I. decode ('utf-8 ')....
If you think of the data flowing into and out of the program as a channel, then rather than opening the channel as a byte stream and decoding what you read, it is better to open the channel as unicode directly.
# Use codecs to open a unicode channel directly
import codecs
file = codecs.open("test", "r", "UTF-8")
for i in file:
    print type(i)   # i is unicode
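The same idea applies on the way out; a minimal sketch of writing unicode through a codecs channel (the file name is just an example):

# -*- coding: UTF-8 -*-
import codecs

out = codecs.open("out.txt", "w", "UTF-8")   # the channel encodes unicode to UTF-8 bytes
out.write(u"人生苦短\n")
out.close()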
So the key to handling Chinese encodings in python is to know exactly what you are doing: what format you are reading, what encoding those bytes are declared to be, how a str is turned into unicode, and how a str in one encoding is turned into another. Beyond that, do not let things get tangled; take the initiative to keep everything unified.
The major difference between python 3 and python 2 is that python 3 itself uses unicode by default.
Strings no longer distinguish "abc" from u"abc"; "abc" is unicode by default and no longer stands for bytes in the local encoding.
Because of this internal unicode representation, similar to c# and java, there is no longer any need for language-level workarounds such as sys.setdefaultencoding.
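A minimal python 3 sketch of the new split between text and bytes:

# python 3
s = "人生苦短"            # str is unicode text, no u prefix needed
b = s.encode('utf-8')     # bytes are produced explicitly when you need them
print(type(s), type(b))   # <class 'str'> <class 'bytes'>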
This is also why python 3's code and package management break compatibility with 2.x; the 2.x extension packages have to adapt to the change.
Another question is how, in such a language environment, unicode gets output in a local encoding such as gbk.
To recap: 1. If you do not specify an encoding when encoding or decoding in python, python uses defaultencoding, and the defaultencoding of python2.x is ascii.
This is why so much python code fails with "UnicodeDecodeError: 'ascii' codec can't decode byte ...".
2. The header # coding: UTF-8 serves the purposes listed earlier: it is required when the code contains Chinese comments; a reasonably advanced editor (such as my emacs) uses it to decide the file's encoding; and the interpreter uses it to decode literals such as u"人生苦短" into unicode objects (so the declaration must match the encoding the file is actually saved in).
Since python 2.7 setdefaultencoding is supposedly no longer used, so is there any difference between the two?
The two do different things.
1. # coding:utf-8 defines the source code encoding; without it the source file cannot contain Chinese string literals. See PEP 0263 -- Defining Python Source Code Encodings, https://www.python.org/dev/peps/pep-0263/
2. sys.setdefaultencoding() sets the default string encoding used for implicit conversions (sys.getdefaultencoding() returns the current value).
Answer: as for outputting a local encoding such as gbk, the usual practice is to convert to the local encoding only at the point where the output is serialized.
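A minimal python 3 sketch of that habit (the file name and the gbk target are just examples):

# python 3
s = "人生苦短"                                  # keep text as unicode inside the program
with open("out.txt", "w", encoding="gbk") as f:
    f.write(s)                                  # convert to gbk only when serializing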
That is all there is to the python2.x default encoding problem and its solution.