Solving Chinese character encoding problems in Python: UnicodeDecodeError


Last Update:2017-05-14 Source: Internet

Author: User


I recently used Python to read a document containing Chinese characters, which produced garbled text and two errors. I searched the internet for answers and solved the problem with help from other users. This article summarizes how to solve Chinese character encoding problems in Python, for anyone who needs a reference.

Preface

Recently, due to project requirements, I needed to read a txt file containing Chinese characters and save the file after processing it. The document had previously been base64-encoded, which garbled every Chinese character on reading. After the project team dropped base64, two errors occurred in succession:

UnicodeEncodeError: 'ascii' codec can't encode characters in position ...: ordinal not in range(128)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x...

If you are not familiar with ASCII, Unicode, and UTF-8, read the previous article about strings and encoding first.

You must understand the following three concepts:

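ASCII can only represent digits, English letters, and some special characters; it cannot represent Chinese characters.
Unicode and UTF-8 can both represent Chinese characters. Unicode assigns every character a code point, usually stored in a fixed-length form, while UTF-8 is a variable-length encoding of those code points (1 byte for an ASCII character, 3 bytes for most Chinese characters).
In memory, text is generally held as unicode, while files on disk are generally stored as UTF-8, because UTF-8 saves storage space.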
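The size difference between a fixed-length and a variable-length form can be checked directly. The following is a minimal sketch, written so that it should run unchanged on Python 2.7 and Python 3; the sample strings and the use of UTF-32 as the fixed-length form are assumptions chosen for illustration, with `\u6c49\u5b57` being the two characters 汉字.

```python
# Sketch: compare how many bytes the same text needs in different encodings.
# The sample strings are illustrative and not taken from the article's project.
samples = {
    "ascii_text": u"abc",
    "chinese_text": u"\u6c49\u5b57",
}

sizes = {}
for name, text in samples.items():
    sizes[name] = {
        "chars": len(text),
        # UTF-8 is variable length: 1 byte per ASCII char, 3 per Chinese char here.
        "utf-8": len(text.encode("utf-8")),
        # UTF-32 (big-endian, no BOM) always uses 4 bytes per character.
        "utf-32": len(text.encode("utf-32-be")),
    }
    print("%s: %r" % (name, sizes[name]))
```

For mostly-ASCII text the UTF-8 form is much smaller than the fixed-length one, which is the storage saving the third point refers to.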

So what is Python's default encoding?

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'

Python 2 uses ascii as its default encoding. Calling sys.setdefaultencoding('utf-8') changes the default; reload(sys) is needed first because the function is removed from the sys module at interpreter startup.

In Python, you can use encode and decode to convert data between encodings, for example:

>>> u'汉字'
u'\u6c49\u5b57'
>>> u'汉字'.encode('utf-8')
'\xe6\xb1\x89\xe5\xad\x97'
>>> u'汉字'.encode('utf-8').decode('utf-8')
u'\u6c49\u5b57'

These two methods convert between unicode and encoded str data.
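Both errors from the preface can be reproduced with encode and decode alone. The sketch below is an illustration rather than the project's actual code: it encodes text with a codec that cannot represent it, then decodes bytes with a codec they were not written in. It is written to run on both Python 2.7 and Python 3.

```python
text = u"\u6c49\u5b57"            # unicode text (the characters from the examples)
gbk_bytes = text.encode("gbk")    # the same text as GBK-encoded bytes

errors_seen = []

# ASCII has no representation for Chinese characters, so encoding fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    errors_seen.append(type(exc).__name__)

# GBK bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    gbk_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    errors_seen.append(type(exc).__name__)

print(errors_seen)
```

In both cases the fix is the same: use the codec that matches the data.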

So what is the str type in Python?

>>> import binascii
>>> '汉字'
'\xba\xba\xd7\xd6'
>>> type('汉字')
<type 'str'>
>>> print binascii.b2a_hex('汉字')
babad7d6
>>> print binascii.b2a_hex(u'汉字')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> print binascii.b2a_hex(u'汉字'.encode('utf-8'))
e6b189e5ad97
>>> print binascii.b2a_hex(u'汉字'.encode('gbk'))
babad7d6

binascii converts binary data into an ASCII (hexadecimal) representation. In the session above, '汉字' is of type str and its bytes are babad7d6, while u'汉字' is of type unicode and cannot be converted to ascii implicitly, which is what raises the first error. The solution is to call .encode('utf-8') to convert it to str first. Because my command line uses the Windows default GBK encoding, u'汉字'.encode('gbk') produces the same output as '汉字'.

To sum up, a Python 2 str is actually a sequence of bytes, that is, unicode text in some encoded form. Python's default encoding is ascii, so implicitly converting non-ASCII data through ascii raises an error. Keep the following rules in mind:

unicode => encode('encoding') => str
str => decode('encoding') => unicode
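Taken together, the two rules mean that bytes in one encoding are converted to another by passing through unicode. The following is a minimal sketch of that idea; `transcode` is a hypothetical helper name introduced for illustration, and the code is written to run on both Python 2.7 and Python 3 (where the byte type is called bytes rather than str).

```python
import binascii


def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by decoding them to unicode text first (hypothetical helper)."""
    text = data.decode(source_encoding)   # bytes => decode => unicode
    return text.encode(target_encoding)   # unicode => encode => bytes


gbk_bytes = b"\xba\xba\xd7\xd6"           # the GBK bytes seen in the session above
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(binascii.b2a_hex(utf8_bytes))
```

The hex output matches the e6b189e5ad97 value from the session above, apart from the b'' prefix that Python 3 adds when printing bytes.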

Another simple approach, which saves a lot of trouble, is to set the default encoding at the top of the script (Python 2 only):

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

The second problem occurs when the file is read. UTF-8 files come in two forms: with a BOM and without a BOM. The difference is that a file with a BOM has an extra header (the bytes EF BB BF) that a file without a BOM lacks, and this led to an error when I read the file as UTF-8. My first attempt was to check whether the BOM was present and skip the header when reading, but that attempt failed.
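For reference, the BOM check described here can be sketched with the standard codecs module. This is not the author's failed attempt, which the article does not show; `strip_utf8_bom` is a hypothetical helper, and the sample data is built in memory for illustration. The standard library also offers the utf-8-sig codec, which removes a leading BOM when one is present.

```python
import codecs

payload = u"\u6c49\u5b57".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload     # BOM_UTF8 is the bytes EF BB BF


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


plain = with_bom.decode("utf-8")         # BOM survives as the character U+FEFF
via_sig = with_bom.decode("utf-8-sig")   # the codec drops the BOM
manual = strip_utf8_bom(with_bom).decode("utf-8")
print("%r %r %r" % (plain, via_sig, manual))
```

Note that decoding with plain utf-8 does not fail on a BOM; it leaves an invisible U+FEFF at the start of the text, which can cause trouble later when the text is compared or written out.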

Searching Google helped once again. The approach is to read the file with the codecs library (my guess is that this library inspects the file header).

import codecs
codecs.open(file_name, "r", encoding='utf-8', errors='ignore')

For encoding problems, you must understand how ASCII, Unicode, and UTF-8 work.
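To see what errors='ignore' actually does in this call, the sketch below writes a throwaway file containing one byte that is not valid UTF-8 and reads it back twice. The temporary file and its contents are assumptions made for illustration; the code is written to run on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Valid UTF-8 for two Chinese characters with one stray byte (0xba) between them.
raw = u"\u6c49".encode("utf-8") + b"\xba" + u"\u5b57".encode("utf-8")

fd, file_name = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(file_name, "wb") as out:
        out.write(raw)

    # Default (strict) error handling raises on the invalid byte.
    strict_failed = False
    try:
        with codecs.open(file_name, "r", encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError:
        strict_failed = True

    # errors='ignore' silently drops bytes that cannot be decoded.
    with codecs.open(file_name, "r", encoding="utf-8", errors="ignore") as f:
        content = f.read()
finally:
    os.remove(file_name)

print("%s %r" % (strict_failed, content))
```

The read succeeds because undecodable bytes are discarded, not because they are repaired, so any data stored in those bytes is lost. That trade-off is worth checking before relying on this option.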


This article is an English version of an article originally written in Chinese on aliyun.com.
