Solving Chinese character encoding problems in Python: UnicodeDecodeError

Source: Internet
Author: User
Recently I used Python to read a document containing Chinese characters, which produced garbled text and two errors. I searched the internet for answers and, with help from other users, solved the problem. This article summarizes how to solve Chinese character encoding problems in Python, for anyone who needs a reference.

Preface

Recently, due to project requirements, I needed to read a txt file containing Chinese characters and save it again after processing. The file had previously been encoded with base64, which garbled all the Chinese characters on reading. After the project team dropped base64, two errors occurred one after the other:

UnicodeEncodeError: 'ascii' codec can't encode characters in position ...: ordinal not in range(128)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x...

If you are not familiar with ASCII, Unicode, and UTF-8, read the previous article about strings and encoding first.

You must understand the following three concepts:

  1. ASCII can only represent digits, English letters, and some special characters; it cannot represent Chinese characters.

  2. Unicode and UTF-8 can both represent Chinese characters. Unicode text is usually handled in a fixed-length form, while UTF-8 is a variable-length encoding.

  3. Text in memory is generally stored as unicode, while files on disk are generally stored as UTF-8, because UTF-8 saves storage space.
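The three points above can be checked with a short sketch. It uses only escape sequences and explicit encode calls, so it should behave the same on Python 2.7 and Python 3. UTF-32 is brought in here purely as an example of a fixed-length encoding; the article itself does not use it.

```python
# Byte lengths of the same text under different encodings.
text_ascii = u'A'
text_chinese = u'\u6c49'  # a single Chinese character, U+6C49

# UTF-8 is variable length: 1 byte for an ASCII letter, 3 for this character.
len_a_utf8 = len(text_ascii.encode('utf-8'))
len_han_utf8 = len(text_chinese.encode('utf-8'))

# UTF-32 (assumed here as the fixed-length example) uses 4 bytes for both.
len_a_utf32 = len(text_ascii.encode('utf-32-be'))
len_han_utf32 = len(text_chinese.encode('utf-32-be'))

# ASCII cannot represent the Chinese character at all.
try:
    text_chinese.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print((len_a_utf8, len_han_utf8, len_a_utf32, len_han_utf32, ascii_ok))
```

The failed ASCII encode raises the same exception type as the first error quoted in the preface.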

So what is the default python encoding?

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'

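Python 2 uses ASCII by default. The sys.setdefaultencoding('utf-8') function sets Python's default encoding; reload(sys) has to be called first because the function is removed from the sys module at startup.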

In python, you can use encode and decode to change the data encoding, for example:

>>> u'汉字'
u'\u6c49\u5b57'
>>> u'汉字'.encode('utf-8')
'\xe6\xb1\x89\xe5\xad\x97'
>>> u'汉字'.encode('utf-8').decode('utf-8')
u'\u6c49\u5b57'

We can use these two methods to convert data from one encoding to another.

So what is the str type in python?

>>> import binascii
>>> '汉字'
'\xba\xba\xd7\xd6'
>>> type('汉字')
<type 'str'>
>>> print binascii.b2a_hex('汉字')
babad7d6
>>> print binascii.b2a_hex(u'汉字')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> print binascii.b2a_hex(u'汉字'.encode('utf-8'))
e6b189e5ad97
>>> print binascii.b2a_hex(u'汉字'.encode('gbk'))
babad7d6

binascii converts binary data into an ASCII representation. The session above shows that '汉字' is of type str, with the bytes babad7d6, whereas u'汉字' cannot be converted to ASCII, which produces the first error. The fix is to call .encode('utf-8') to convert it to str first. Because my Windows command line uses GBK as its default encoding, u'汉字'.encode('gbk') gives the same output as '汉字'.
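As a rough illustration of the same point, the sketch below encodes the text explicitly before handing it to binascii.b2a_hex, so no implicit ASCII conversion takes place. The characters are written as escape sequences so that the result does not depend on the console encoding; it should run on both Python 2.7 and Python 3.

```python
import binascii

text = u'\u6c49\u5b57'  # the two characters used in the session above

# b2a_hex works on bytes, so encode the text explicitly before passing it in.
utf8_hex = binascii.b2a_hex(text.encode('utf-8'))
gbk_hex = binascii.b2a_hex(text.encode('gbk'))

# Decode the hex digits to text so they print the same way on Python 2 and 3.
print(utf8_hex.decode('ascii'))  # e6b189e5ad97
print(gbk_hex.decode('ascii'))   # babad7d6
```

The same two characters give different byte sequences under UTF-8 and GBK, which is why the encoding always has to be named when converting.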

To sum up, Python 2's str is actually a sequence of bytes, that is, an encoded form of unicode text. Python's default encoding is ASCII, so any implicit conversion of non-ASCII data through ASCII raises an error. Keep the following rules in mind:

  1. unicode => encode('encoding') => str

  2. str => decode('encoding') => unicode
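The two rules can be combined to move bytes from one encoding to another. The helper name transcode below is hypothetical and only serves the illustration. The sketch also shows, under the assumption that the input bytes are GBK, how decoding with the wrong codec triggers the second error from the preface.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string (hypothetical helper, not from the article)."""
    text = data.decode(source_encoding)   # rule 2: str => unicode
    return text.encode(target_encoding)   # rule 1: unicode => str


gbk_bytes = b'\xba\xba\xd7\xd6'           # GBK bytes from the session above
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')

# Decoding GBK bytes as UTF-8 fails, because 0xba is not a valid start byte.
try:
    gbk_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```

Passing through unicode in the middle is the key step: bytes in one encoding are never converted directly into bytes in another.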

Another simple method is to set the default encoding at the top of the file, which saves a lot of trouble:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

The second problem occurred when reading the file. UTF-8 files come in two forms: with a BOM and without a BOM. The difference is that a file with a BOM has an extra header compared with one without, which led to an error when I read the file as UTF-8. I first tried to check whether a BOM was present and skip it when reading, but that attempt failed.
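For reference, here is one way the BOM check could be written. This is a sketch rather than the author's original attempt, which is not shown in the article. strip_utf8_bom is a hypothetical helper; codecs.BOM_UTF8 and the 'utf-8-sig' codec are part of the standard library.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = u'\u6c49\u5b57'.encode('utf-8')
with_bom = codecs.BOM_UTF8 + payload

# Manual approach: detect and skip the three BOM bytes, then decode.
manual = strip_utf8_bom(with_bom).decode('utf-8')

# Codec approach: 'utf-8-sig' skips a BOM if present and accepts data without one.
via_codec_bom = with_bom.decode('utf-8-sig')
via_codec_plain = payload.decode('utf-8-sig')

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
kept = with_bom.decode('utf-8')
```

Note that in this sketch the plain 'utf-8' codec does not raise on a BOM; it simply leaves an extra U+FEFF at the start of the text, so a decode error on a real file may point to some other stray bytes.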

In the end I turned to Google for help. The approach is to read the file with the codecs library (my guess is that this library detects the file header).

import codecs
codecs.open(file_name, "r", encoding='utf-8', errors='ignore')
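To see what errors='ignore' does in practice, the sketch below writes a temporary file with one stray byte in the middle of valid UTF-8 and reads it back. The temporary file and its contents are assumptions made for the demonstration.

```python
import codecs
import os
import tempfile

# Write valid UTF-8 with one stray byte (0xba) between the two characters.
fd, file_name = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as raw:
    raw.write(u'\u6c49'.encode('utf-8') + b'\xba' + u'\u5b57'.encode('utf-8'))

try:
    # errors='ignore' silently drops bytes that cannot be decoded.
    with codecs.open(file_name, 'r', encoding='utf-8', errors='ignore') as f:
        content = f.read()
finally:
    os.remove(file_name)
```

The read succeeds because the undecodable byte is discarded, not because it is repaired, so this option can hide data loss. With the default errors='strict' the same read would raise UnicodeDecodeError.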

To deal with encoding problems, you must understand how ASCII, Unicode, and UTF-8 work.
