A Summary of Python Character Encoding Problems: Smashing Mojibake for Good!

Source: Internet
Author: User

Character encoding problems come up all the time in Python, especially when working with web page source code (above all in crawlers):

UnicodeDecodeError: 'XXX' codec can't decode bytes in position 12-15: illegal multibyte ...


The rest of this article uses the Chinese character 'ha' (哈) as an example to illustrate the problems. The various encodings of 哈 are as follows:

1. Unicode: U+54C8 (the UTF-16LE bytes are 0xC8 0x54)

2. UTF-8: 0xE5 0x93 0x88

3. GBK: 0xB9 0xFE


Beyond these there are encodings such as GB2312, Big5, and so on, used for example by pages containing Traditional Chinese characters; the www.google.com.hk homepage, for instance, uses Big5.

For programmers who have to handle Simplified Chinese and Hong Kong/Taiwan (Traditional Chinese) text at the same time, it is even more depressing :)

Chinese encoding has always been a big problem in Python, because Python cannot intelligently identify the encoding of a byte stream; in fact, other languages find this very hard to do as well.

The character encoding is usually declared in the HTML header, for example:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />


Of course, that is not the focus here. Much more often we already know that a string is, say, GBK-encoded, and yet printing it correctly with print and friends is still not easy...

First, Unicode in Python generally refers to a unicode object. For example, the unicode object for 'haha' (哈哈) is u'\u54c8\u54c8'.

A str, by contrast, is a byte array; the byte array represents the stored form of a unicode object after encoding (with UTF-8, GBK, cp936, GB2312, and so on). By itself it is only a byte stream with no further meaning; to make the byte stream display meaningful content, you must decode it with the correct encoding.
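A quick way to see the difference between the two types is a minimal sketch like the following (the u'哈哈' literal assumes a source file or console whose encoding matches the declared one):

# -*- coding: utf-8 -*-
# Minimal sketch: a unicode object vs. its encoded str (byte) forms.
s = u'哈哈'
print type(s)                    # <type 'unicode'>
print repr(s)                    # u'\u54c8\u54c8'
print type(s.encode('utf-8'))    # <type 'str'>
print repr(s.encode('utf-8'))    # '\xe5\x93\x88\xe5\x93\x88'
print repr(s.encode('gbk'))      # '\xb9\xfe\xb9\xfe'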


For example (note: the following is run under Windows):

s = u'哈哈'
s_utf8 = s.encode('utf-8')
print s_utf8
>>> 鍝堝搱

Tragedy...

s_utf8 is actually '\xe5\x93\x88\xe5\x93\x88'

whereas the following code displays correctly:

s_gbk = s.encode('gbk')   # s_gbk is '\xb9\xfe\xb9\xfe'
print s_gbk
>>> 哈哈   # correct now

This is because print hands the content over to the operating system (the console), and the operating system decodes the incoming byte stream using the system's encoding. That explains why the UTF-8 form of the string 'haha' is displayed as '鍝堝搱': the bytes '\xe5\x93\x88\xe5\x93\x88', interpreted as GB2312/GBK, display as '鍝堝搱'.

To repeat: a str records a byte array, that is, some encoded storage format. What it looks like when written to a file or printed to the screen depends entirely on which encoding the consumer uses to decode it.
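This byte-level reinterpretation can be seen directly; here is a minimal sketch (the 'replace' argument is just a safety net in case some byte pair is unassigned in GBK):

# -*- coding: utf-8 -*-
# The UTF-8 bytes of "haha" reinterpreted as GBK -- the same thing a GBK console does.
utf8_bytes = '\xe5\x93\x88\xe5\x93\x88'
print utf8_bytes.decode('gbk', 'replace')   # on a GBK console this shows mojibake, not 哈哈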


A little more about print: when you pass a unicode object to print, it is converted internally, i.e. encoded to the default (console) encoding (this is only my guess).
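That guess can be checked with a small sketch (the exact encoding name depends on your console, and sys.stdout.encoding is None when output is redirected to a file or pipe):

# -*- coding: utf-8 -*-
# print encodes a unicode object with the console encoding before writing it.
import sys
print sys.stdout.encoding        # e.g. 'cp936' on a Chinese Windows console
s = u'\u54c8\u54c8'              # the unicode object for "haha"
print s                          # printed correctly: encoded to the console encoding automatically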


Conversion between str and unicode objects is done through encode and decode, used as follows (to stress again, this is under Windows):

s = '哈哈'
print s.decode('gbk').encode('utf-8')
>>> 鍝堝搱

And vice versa; interested readers can experiment with other conversions, for example the sketch below.
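A sketch of the reverse direction, starting from UTF-8 bytes instead of GBK bytes:

# -*- coding: utf-8 -*-
# Reverse direction: UTF-8 bytes -> unicode -> GBK bytes.
s_utf8 = '\xe5\x93\x88\xe5\x93\x88'           # "haha" stored as UTF-8
s_gbk = s_utf8.decode('utf-8').encode('gbk')  # unicode is the intermediate step
print repr(s_gbk)                              # '\xb9\xfe\xb9\xfe'
print s_gbk                                    # 哈哈 on a GBK console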


Sometimes encoding s (a GBK string) directly to UTF-8 throws an exception, but after calling code like the following:

import sys
reload(sys)
sys.setdefaultencoding('gbk')

the conversion then succeeds. Why?

When Python encodes or decodes between str and unicode, if a str is encoded directly into another encoding, Python first decodes the str into unicode using the default encoding. The default encoding is usually ASCII, which is why the first conversion in the sample code above raises an error. Once the current default encoding is set to 'gbk', the error no longer occurs.
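The implicit decode can be made visible with a minimal sketch (run before calling sys.setdefaultencoding):

# Encoding a str goes through unicode first, using the default codec (normally ASCII).
import sys
print sys.getdefaultencoding()    # 'ascii'
s = '\xb9\xfe\xb9\xfe'            # "haha" as GBK bytes
try:
    s.encode('utf-8')             # really: s.decode('ascii').encode('utf-8')
except UnicodeDecodeError as e:
    print e                       # 'ascii' codec can't decode byte 0xb9 in position 0 ...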

As for reload(sys): sys.setdefaultencoding is removed after Python finishes initializing (as of Python 2.5), so the module has to be reloaded to get it back. This is generally not recommended; reload is a function best avoided.


You may also run into this problem when working with files saved in different encoding formats.

Create a file test.txt, saved in ANSI format, with the content:

abc中文


Then read it with Python:

# coding=gbk
print open("test.txt").read()

Result: abc中文

Now change the file format to UTF-8 and run the same code:

Result: abc followed by garbled characters. Clearly, the content needs to be decoded:


# coding=gbk
import codecs
print open("test.txt").read().decode("utf-8")

Result: abc中文

I edited the test.txt above with EditPlus. But when I edit and save it in UTF-8 format with the Notepad editor that ships with Windows, running the code raises an error:

Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, i.e. the BOM) at the start of the file when saving it in UTF-8.


So we need to strip these bytes when reading the file; Python's codecs module defines a constant for them:

# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")

Result: abc中文
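As a side note, on Python 2.6+ the 'utf-8-sig' codec strips a leading BOM by itself, so the manual check above can be avoided (a minimal sketch):

# coding=gbk
import codecs
# The utf-8-sig codec removes a leading BOM automatically when reading.
print codecs.open("test.txt", "r", encoding="utf-8-sig").read()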


Finally, sometimes the encoding is correct but the text contains illegal characters. For example, an error when the source string was generated introduced bad byte values, which then trigger an exception during conversion.

For instance, full-width spaces often have several different representations, such as \xa3\xa0 or \xa4\x57. These characters look like full-width spaces, but they are not "legitimate" full-width spaces; the true full-width space (in GBK) is \xa1\xa1, so an exception occurs during transcoding.

When I previously processed Sina Weibo data, this illegal-space problem kept the data from being parsed correctly.


The workaround: when decoding the obtained string strTxt, specify 'ignore' so that illegal characters are skipped. Of course, for GBK-encoded text the approach to the same problem is similar:

strTest = strTxt.decode('utf-8', 'ignore')
return strTest

The default value of this errors parameter is 'strict', which raises an exception when an illegal character is encountered; 'ignore' skips illegal characters; 'replace' substitutes a placeholder for them (? when encoding, U+FFFD when decoding); and 'xmlcharrefreplace' (used with encode) substitutes XML character references.
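A minimal sketch of these modes (the trailing \xff is just a deliberately invalid byte added for illustration):

# Deliberately append an invalid byte to the UTF-8 bytes of "haha".
s = '\xe5\x93\x88\xe5\x93\x88\xff'
# s.decode('utf-8')                                # 'strict' (default): raises UnicodeDecodeError
print repr(s.decode('utf-8', 'ignore'))            # u'\u54c8\u54c8' -- bad byte dropped
print repr(s.decode('utf-8', 'replace'))           # u'\u54c8\u54c8\ufffd' -- replaced with U+FFFD
print u'\u54c8\u54c8'.encode('ascii', 'xmlcharrefreplace')   # '&#21704;&#21704;' -- encode-side XML references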


I will summarize other issues as I run into them...

