Python Character Encoding Problems: A Summary to Smash Mojibake Once and for All!

Source: Internet
Author: User

Character encoding problems come up constantly in Python, especially when working with web page source code (and above all in crawlers):

UnicodeDecodeError: 'XXX' codec can't decode bytes in position 12-15: illegal multibyte ...


All of the issues below will be illustrated with the Chinese character 哈 ("ha"). Its various encodings are as follows:

1. Unicode code point: U+54C8 (UTF-16LE bytes: 0xC8 0x54)

2. UTF-8: 0xE5 0x93 0x88

3. GBK: 0xB9 0xFE


Beyond these there are encodings like GB2312, Big5, and so on. Some pages contain traditional characters; the home page of www.google.com.hk, for example, is Big5-encoded. Spare a thought for coders in Hong Kong who have to handle Simplified Chinese at the same time :)

Chinese encoding has always been a big headache in Python, because Python cannot intelligently detect the encoding of a byte stream — though in fairness, other languages struggle with this too.

The character encoding of a page is usually declared in the HTML header, for example:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
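As a quick aside, sniffing that declaration out of raw page bytes can be sketched with a regular expression (a minimal sketch, not a robust HTML parser; the variable names and fallback are my own assumptions):

```python
import re

# Raw bytes of a page header (assumed example data).
html = b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>'

# Look for a charset=... declaration; fall back to UTF-8 if none is found.
m = re.search(rb'charset=["\']?([\w-]+)', html)
encoding = m.group(1).decode("ascii") if m else "utf-8"
print(encoding)  # gb2312
```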


That declaration is not our focus here, though. Much more often the problem is: we know a string is GBK-encoded, and yet getting it to print correctly turns out not to be easy...

First, a word on Unicode in Python. It generally refers to a unicode object; the unicode object for "哈哈", for example, is u'\u54c8\u54c8'.

A str, by contrast, is a byte array. The byte array represents the stored form of a unicode object after encoding with some codec (such as UTF-8, GBK, cp936, or GB2312). Here it is only a byte stream with no other meaning; to make the byte stream display something meaningful, you must decode it with the correct encoding.


For example (note: this is under Windows):

s = u'哈哈'

s_utf8 = s.encode('utf-8')

print s_utf8

>>> 鍝堝搱

Tragedy...

s_utf8 is actually '\xe5\x93\x88\xe5\x93\x88'.

The following code, however, displays just fine:

s_gbk = s.encode('gbk')  # s_gbk is '\xb9\xfe\xb9\xfe'

print s_gbk

>>> 哈哈  # correct now

The reason: the print statement simply passes its bytes through to the operating system, and the OS renders the byte stream according to the system encoding (on Chinese Windows, GBK/cp936). That explains why the UTF-8 form of "哈哈" prints as "鍝堝搱": the bytes '\xe5\x93\x88\xe5\x93\x88', interpreted as GBK, display as "鍝堝搱".

To repeat: str records a byte array, which is just some encoded storage format. What it looks like when printed or written to a file depends entirely on which encoding is used to decode it.
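In Python 3 terms (where str plays the role of Python 2's unicode and bytes the role of str), the mojibake above can be reproduced directly — a sketch of decoding UTF-8 bytes with the wrong codec:

```python
s = "哈哈"

# Encode to UTF-8 bytes (same byte values as the Python 2 example).
utf8_bytes = s.encode("utf-8")
assert utf8_bytes == b"\xe5\x93\x88\xe5\x93\x88"

# Decoding those UTF-8 bytes as GBK yields the garbled text a GBK console shows.
mojibake = utf8_bytes.decode("gbk")
print(mojibake)  # 鍝堝搱
```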


One more note on print: when a unicode object is passed to print, it is first converted internally to the default encoding (this is only my own conjecture).


Conversion between str and unicode objects is done through encode and decode. Concrete usage follows (again, note, under Windows):

s = '哈哈'

print s.decode('gbk').encode('utf-8')

>>> 鍝堝搱

And vice versa; interested readers can experiment with other conversions.
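The same round trip can be sketched in Python 3 (there is no u''/str split there, so the decode/encode pair is written out explicitly on bytes):

```python
s = "哈哈"

# 哈 is 0xB9 0xFE in GBK, so "哈哈" encodes to four bytes.
gbk_bytes = s.encode("gbk")
assert gbk_bytes == b"\xb9\xfe\xb9\xfe"

# Transcode GBK -> unicode text -> UTF-8.
utf8_bytes = gbk_bytes.decode("gbk").encode("utf-8")
assert utf8_bytes == b"\xe5\x93\x88\xe5\x93\x88"
```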


Sometimes, encoding an s (a GBK str) directly to UTF-8 throws an exception — yet after running code like the following:

import sys

reload(sys)

sys.setdefaultencoding('gbk')

the conversion succeeds. Why is that?

In Python 2, str and unicode both take part in encoding and decoding. If a str is encoded directly into another encoding, the str is first implicitly decoded into unicode using the default encoding — and since the default encoding is generally ASCII, the first conversion in the sample code above raises an error.

Once the current default encoding is set to 'gbk', the error disappears.

As for reload(sys): sys.setdefaultencoding is deleted from the sys module after Python initialization (since Python 2.5), so we have to reload sys to get it back.

This is generally not recommended — reload is a function best avoided.
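Python 3 removed both the implicit decode and sys.setdefaultencoding, but the underlying failure is easy to reproduce there: decoding GBK bytes with the ASCII codec — which is effectively what Python 2's implicit str-to-unicode step did — raises the familiar exception. A sketch:

```python
gbk_bytes = "哈哈".encode("gbk")  # b'\xb9\xfe\xb9\xfe'

try:
    # Python 2's implicit str -> unicode step used the ASCII default encoding;
    # doing it explicitly shows why it blew up on Chinese bytes.
    gbk_bytes.decode("ascii")
    raised = False
except UnicodeDecodeError as e:
    raised = True
    print(e)  # 'ascii' codec can't decode byte 0xb9 in position 0: ...

assert raised
```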


You may also hit this issue when manipulating files saved in different encoding formats.

Create a file test.txt in ANSI format with the content:

abc中文

Then read it with Python:

# coding=gbk

print open("test.txt").read()

Result: abc中文

Now change the file format to UTF-8:

Result: abc涓枃 — garbled, obviously. Here we need to decode:

# coding=gbk

print open("test.txt").read().decode("utf-8")

Result: abc中文

The test.txt above was edited with EditPlus. But when I edited it as UTF-8 with the Notepad that ships with Windows, running the script raised an error:

Traceback (most recent call last):

  File "chinesetest.py", line 3, in <module>

    print open("test.txt").read().decode("utf-8")

UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, Notepad among them, inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the start of any file it saves as UTF-8.


So we need to strip these bytes when reading. Python's codecs module defines a constant for them:

# coding=gbk

import codecs

data = open("test.txt").read()

if data[:3] == codecs.BOM_UTF8:

    data = data[3:]

print data.decode("utf-8")

Result: abc中文
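In Python 3 there is a ready-made codec for this: 'utf-8-sig' strips the BOM automatically on read, so no manual slicing is needed. A sketch using a temporary file (the file is created here just for illustration):

```python
import codecs
import os
import tempfile

# Write a UTF-8 file with a BOM, as Windows Notepad would.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "abc中文".encode("utf-8"))

# The utf-8-sig codec removes a leading BOM if present.
with open(path, encoding="utf-8-sig") as f:
    content = f.read()
print(content)  # abc中文

os.remove(path)
```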


Finally, sometimes the encoding is right but the text contains illegal characters — for instance, an error at the source that produced the string introduced bad values — and then you run into the exception again.

Full-width spaces, for example, are often implemented in several incompatible ways, such as \xa3\xa0 or \xa4\x57. These characters all look like full-width spaces, but they are not "legal" ones — the true full-width space is \xa1\xa1 — so an exception occurs during transcoding.

I once hit exactly such an illegal-space problem while processing Sina Weibo data; it prevented the data from being parsed correctly.


The workaround:

Decode the fetched string strtxt with the error handler set to ignore, so that illegal characters are skipped. The same approach handles the analogous problem for GBK and other encodings:

strtest = strtxt.decode('utf-8', 'ignore')

return strtest

The default errors argument is 'strict', which raises an exception when an illegal character is encountered;

set to 'ignore', illegal characters are skipped;

set to 'replace', illegal characters are replaced with a placeholder ('?' when encoding, U+FFFD when decoding);

set to 'xmlcharrefreplace' (an encoding-side handler), XML character references are substituted.
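The error handlers above can be compared side by side in Python 3 (a sketch; the stray 0xFF byte stands in for an "illegal character" in otherwise valid UTF-8):

```python
# Valid UTF-8 for 哈哈 with one illegal byte (0xff) spliced into the middle.
bad = b"\xe5\x93\x88\xff\xe5\x93\x88"

try:
    bad.decode("utf-8")  # 'strict' is the default: raises on the bad byte
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

print(bad.decode("utf-8", "ignore"))   # 哈哈  (illegal byte dropped)
print(bad.decode("utf-8", "replace"))  # 哈�哈 (replaced with U+FFFD)

# xmlcharrefreplace applies when encoding to a codec that lacks the character.
print("哈".encode("ascii", "xmlcharrefreplace"))  # b'&#21704;'
```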


More to be added as I run into other cases...


