I recently used Python to read a document containing Chinese characters, which produced garbled text and two errors. I searched online for answers and solved the problem with help from other users. This article summarizes how to solve Chinese character encoding problems in Python, for anyone who needs a reference.
Preface
Recently, a project required me to read a txt file containing Chinese characters and save it after processing. The file had previously been base64-encoded, which made all Chinese characters come out garbled when read. After the project team dropped base64, two errors occurred in succession:
UnicodeEncodeError: 'ascii' codec can't encode characters in position ...: ordinal not in range(128)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x...
If you are not familiar with ASCII, Unicode, and UTF-8, read the previous article about strings and encoding first.
You must understand the following three concepts:
ASCII can only represent digits, English letters, and some special characters, not Chinese characters.
Unicode and UTF-8 can both represent Chinese characters. Unicode is commonly stored in a fixed-length form (for example UCS-2 or UTF-32), while UTF-8 is a variable-length encoding.
In memory, text is generally held as Unicode, while files on disk are generally stored as UTF-8, because UTF-8 saves storage space.
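The three concepts above can be checked directly. The following is a minimal sketch, assuming Python 3 (where str holds Unicode text and bytes holds encoded data, unlike the Python 2 sessions in this article); the variable names are only illustrative.

```python
# Assumption: Python 3, where str is Unicode text and bytes is encoded data.
text = "\u6c49\u5b57"  # two Chinese characters, written as escapes

# ASCII has no code for Chinese characters, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 is variable length: 1 byte for "a", 3 bytes for each of these characters.
utf8_sizes = [len(ch.encode("utf-8")) for ch in "a" + text]

# UTF-32 is a fixed-length encoding: 4 bytes for every character.
utf32_sizes = [len(ch.encode("utf-32-be")) for ch in "a" + text]

print(ascii_ok, utf8_sizes, utf32_sizes)
```

For mostly-ASCII text the UTF-8 form is the smaller of the two, which is the space saving mentioned above.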
So what is the default Python encoding?
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
Python 2 uses ASCII by default. The sys.setdefaultencoding('utf-8') function sets the default Python encoding.
In Python, you can use encode and decode to change the data encoding, for example:
>>> u'汉字'
u'\u6c49\u5b57'
>>> u'汉字'.encode('utf-8')
'\xe6\xb1\x89\xe5\xad\x97'
>>> u'汉字'.encode('utf-8').decode('utf-8')
u'\u6c49\u5b57'
We can use these two methods to convert data from one encoding to another.
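As a rough illustration of the round trip, and of what happens when bytes are decoded with the wrong codec, here is a short sketch. It assumes Python 3, so the text value is a str and the encoded value is bytes; the variable names are made up for the example.

```python
# Assumption: Python 3 semantics (str = text, bytes = encoded data).
text = "\u6c49\u5b57"

# encode: text -> bytes; decode: bytes -> text.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a codec that does not match the data fails,
# which is how a "codec can't decode byte" error arises.
gbk_bytes = text.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# Converting between encodings means decoding first, then encoding again.
transcoded = gbk_bytes.decode("gbk").encode("utf-8")

print(round_trip == text, wrong_codec_failed, transcoded == utf8_bytes)
```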
So what is the str type in Python?
>>> import binascii
>>> '汉字'
'\xba\xba\xd7\xd6'
>>> type('汉字')
<type 'str'>
>>> print binascii.b2a_hex('汉字')
babad7d6
>>> print binascii.b2a_hex(u'汉字')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> print binascii.b2a_hex(u'汉字'.encode('utf-8'))
e6b189e5ad97
>>> print binascii.b2a_hex(u'汉字'.encode('gbk'))
babad7d6
binascii converts binary data into an ASCII representation. The session above shows that '汉字' is of the str type and its bytes are babad7d6, whereas u'汉字' cannot be converted to ASCII, which is why the first error is reported. The solution is to call .encode('utf-8') to convert it to the str type. Because my command line uses the Windows default GBK encoding, u'汉字'.encode('gbk') produces the same output as '汉字'.
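The same hex dumps can be reproduced under Python 3, with the caveat that the failure mode differs: this sketch assumes Python 3, where binascii.b2a_hex rejects text with a TypeError instead of attempting an implicit ASCII encode.

```python
import binascii

# Assumption: Python 3, so text must be encoded explicitly before hexlifying.
text = "\u6c49\u5b57"

# b2a_hex returns bytes; decode to get a printable hex string.
utf8_hex = binascii.b2a_hex(text.encode("utf-8")).decode("ascii")
gbk_hex = binascii.b2a_hex(text.encode("gbk")).decode("ascii")

# Passing text directly is rejected, because b2a_hex only accepts bytes-like objects.
try:
    binascii.b2a_hex(text)
    accepted_text = True
except TypeError:
    accepted_text = False

print(utf8_hex, gbk_hex, accepted_text)
```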
To sum up, a Python str is actually a sequence of bytes, that is, Unicode text in encoded form. The default encoding of Python is ASCII, so converting non-ASCII data to ASCII raises an error. Keep the following rules in mind:
unicode => encode('encoding') => str
str => decode('encoding') => unicode
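One way to apply these two rules consistently is to wrap them in small helpers. The sketch below is only an illustration and assumes Python 3 naming, where the article's unicode type is str and its str type is bytes; to_bytes and to_text are hypothetical helper names, not functions from any library.

```python
# Assumption: Python 3 (text is str, encoded data is bytes).
# to_bytes and to_text are hypothetical helpers written for this example.

def to_bytes(value, encoding="utf-8"):
    """Encode text to bytes; return bytes unchanged."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


def to_text(value, encoding="utf-8"):
    """Decode bytes to text; return text unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


encoded = to_bytes("\u6c49\u5b57")
decoded = to_text(encoded)
print(encoded, decoded == "\u6c49\u5b57")
```

Calling either helper on a value that is already in the target form is a no-op, which avoids encoding or decoding the same data twice.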
Another simple method is to set the default encoding at the top of the script, which saves a lot of trouble:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
The second problem occurs when the file is read. UTF-8 files come in two forms: with a BOM and without a BOM. The difference is that a file with a BOM has an extra header that a file without a BOM lacks, and this caused an error when I read the file as UTF-8. When reading a file, I first tried to check whether a BOM exists and skip it, but that attempt failed.
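For reference, checking for and skipping a UTF-8 BOM can be sketched as follows. This is not the attempt described above, only one possible approach under Python 3; strip_utf8_bom is a hypothetical helper, while codecs.BOM_UTF8 and the utf-8-sig codec come from the standard library.

```python
import codecs


def strip_utf8_bom(raw):
    """Hypothetical helper: drop a leading UTF-8 BOM from raw bytes, if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


payload = "\u6c49\u5b57".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload

# Manual approach: remove the header, then decode as plain UTF-8.
manual = strip_utf8_bom(with_bom).decode("utf-8")

# The utf-8-sig codec skips the BOM when it is present and is harmless when it is not.
via_codec = with_bom.decode("utf-8-sig")
no_bom = payload.decode("utf-8-sig")

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
kept = with_bom.decode("utf-8")

print(manual == via_codec == no_bom, kept.startswith("\ufeff"))
```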
Searching Google helped once again. The approach is to use the codecs library to read the file (my guess is that this library inspects the file header).
import codecs
codecs.open(file_name, "r", encoding='utf-8', errors='ignore')
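To see what errors='ignore' does, here is a small self-contained sketch. It assumes Python 3 and swaps in the built-in open, which accepts the same encoding and errors parameters as codecs.open; the temporary file and its deliberately invalid byte are made up for the demonstration.

```python
import os
import tempfile

# Assumption: Python 3, using built-in open in place of codecs.open.
# The sample bytes contain one invalid UTF-8 byte (0xff) between two characters.
raw = b"\xe6\xb1\x89\xff\xe5\xad\x97"

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(raw)

    # errors='ignore' silently drops bytes that cannot be decoded.
    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
        lenient = handle.read()

    # The default strict mode raises UnicodeDecodeError on the same file.
    try:
        with open(path, "r", encoding="utf-8") as handle:
            handle.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True
finally:
    os.remove(path)

print(lenient == "\u6c49\u5b57", strict_failed)
```

Note that ignoring errors discards the undecodable bytes, so it hides damaged data instead of repairing it.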
To deal with encoding problems, you must understand how ASCII, Unicode, and UTF-8 work.