This character-encoding problem comes up often in Python, especially when working with web page source (for example, in crawlers):
UnicodeDecodeError: 'xxx' codec can't decode bytes in position 12-15: illegal multibyte sequence
The following uses the Chinese character 哈 ("ha") to illustrate the whole problem. Its various encodings are:
1. Unicode code point: U+54C8 (UTF-16LE bytes: 0xC8 0x54)
2. UTF-8: 0xE5 0x93 0x88
3. GBK: 0xB9 0xFE
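These values can be checked directly. A minimal sketch in Python 3 syntax (where str is Unicode and bytes plays the role of Python 2's str), assuming a GBK codec is available:

```python
# Verify the encodings of the character 哈 (U+54C8).
ch = "哈"

print(hex(ord(ch)))              # 0x54c8 -- the Unicode code point
print(ch.encode("utf-8").hex())  # e59388 -- the UTF-8 bytes
print(ch.encode("gbk").hex())    # b9fe   -- the GBK bytes
```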
There are also GB2312, Big5, and so on. Some pages contain traditional characters; for example, the www.google.com.hk home page is Big5-encoded. For programmers who have to handle Simplified and Traditional Chinese at the same time, it is all the more frustrating :)
Chinese encoding has long been a headache in Python, which cannot intelligently identify an encoding on its own; in fact, this is hard to do in any language.
The character encoding is usually declared in the HTML header, for example:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
Of course, that is not the focus here. More often, we already know that a string is GBK-encoded, yet getting print to display it correctly is still not straightforward...
First, "Unicode" in Python generally refers to a unicode object; for example, the unicode object for '哈哈' is u'\u54c8\u54c8'.
A str, by contrast, is a byte array: it is the stored form of a unicode object after encoding (UTF-8, GBK, cp936, GB2312, and so on). It is only a byte stream with no other meaning; to make the byte stream display something meaningful, you must decode it with the correct encoding.
For example (note: this is under Windows):
s = u'哈哈'
s_utf8 = s.encode('utf-8')
print s_utf8
>>> 鍝堝搱
A tragedy...
s_utf8 is actually '\xe5\x93\x88\xe5\x93\x88'
But the following code displays it fine:
s_gbk = s.encode('gbk')  # s_gbk is '\xb9\xfe\xb9\xfe'
print s_gbk
>>> 哈哈  # normal now
The reason is that print simply hands the byte stream to the operating system, which renders it according to the system encoding. That explains why the UTF-8 string "哈哈" prints as "鍝堝搱": the bytes '\xe5\x93\x88\xe5\x93\x88' are interpreted as GB2312/GBK (the default on Chinese Windows), and displayed as "鍝堝搱".
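The mojibake above can be reproduced explicitly. A Python 3 sketch of the same mistake, encoding to UTF-8 and then (wrongly) decoding the bytes as GBK:

```python
s = "哈哈"
utf8_bytes = s.encode("utf-8")      # b'\xe5\x93\x88\xe5\x93\x88'

garbled = utf8_bytes.decode("gbk")  # interpret UTF-8 bytes as GBK: mojibake
print(garbled)                      # 鍝堝搱

correct = utf8_bytes.decode("utf-8")  # decode with the matching encoding
print(correct)                        # 哈哈
```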
To repeat: a str stores a byte array, which is just some encoded storage format; what comes out when it is written to a file or printed depends entirely on which encoding is used to decode it.
A bit more on print: when a unicode object is passed to print, it is internally converted to the default encoding first (this is only my guess).
Conversion between str and unicode objects is done with encode and decode. In detail (again, under Windows):
s = '哈哈'
print s.decode('gbk').encode('utf-8')
>>> 鍝堝搱
And vice versa; if you are interested, you can experiment with other conversions.
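Transcoding always goes bytes → text → bytes: decode with the source encoding first, then encode with the target encoding. A Python 3 sketch of converting GBK bytes to UTF-8 bytes:

```python
gbk_bytes = "哈哈".encode("gbk")    # b'\xb9\xfe\xb9\xfe', as if read from a GBK source

text = gbk_bytes.decode("gbk")     # step 1: decode to a Unicode string
utf8_bytes = text.encode("utf-8")  # step 2: encode to the target encoding

print(utf8_bytes)                  # b'\xe5\x93\x88\xe5\x93\x88'
```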
Sometimes we find that encoding a GBK str s directly to UTF-8 throws an exception, but after running code like the following:
import sys
reload(sys)
sys.setdefaultencoding('gbk')
the conversion succeeds. Why is that?
In Python 2, converting between str and unicode always goes through encoding and decoding. If you encode a str directly, Python first decodes the str into unicode using the default encoding, which is usually ASCII, so the first conversion in the example above raises an error.
Once the default encoding is set to 'gbk', there is no error.
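For comparison, Python 3 removed the implicit ASCII decode entirely: bytes and text are separate types, so the same mistake fails immediately instead of half-working. A small sketch of that behavior:

```python
gbk_bytes = "哈哈".encode("gbk")  # a raw byte string, like a Python 2 str

try:
    gbk_bytes + "!"  # bytes + str: Python 3 never decodes implicitly
except TypeError as e:
    print("TypeError:", e)
```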
As for reload(sys): the sys.setdefaultencoding method is removed after Python 2.5 finishes initialization, so we need to reload the module to get it back.
This is generally not recommended; reload is a function that should be avoided.
You may also run into this problem when handling files saved in different encodings.
Create a file test.txt in ANSI format with the content:
abc中文
Then read it with Python:
# coding=gbk
print open("test.txt").read()
Result: abc中文
Now change the file's format to UTF-8:
Result: abc followed by mojibake. Clearly we need to decode here:
# coding=gbk
print open("test.txt").read().decode("utf-8")
Result: abc中文
I edited the test.txt above with EditPlus. But when I saved it in UTF-8 format with the Notepad that ships with Windows, running the script raised an error:
Traceback (most recent call last):
  File "chinesetest.py", line 3, in <module>
    print open("test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the UTF-8 BOM) at the start of a file when saving it as UTF-8.
So we need to strip these bytes when reading. Python's codecs module defines a constant for exactly this:
# coding=gbk
import codecs
data = open("test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: abc中文
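The same BOM-stripping logic in Python 3, reading raw bytes and checking codecs.BOM_UTF8 (the file path and contents here are just for the demonstration; Python 3 also offers the 'utf-8-sig' codec, which strips the BOM automatically):

```python
import codecs
import os
import tempfile

# Write a UTF-8 file with a BOM, the way Notepad does.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "abc中文".encode("utf-8"))

# Read the raw bytes and strip the BOM if present.
data = open(path, "rb").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print(data.decode("utf-8"))  # abc中文

# Equivalent: the utf-8-sig codec strips the BOM for you.
print(open(path, encoding="utf-8-sig").read())  # abc中文
```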
Finally, sometimes the encoding is right, but the text contains illegal characters: for example, the source that produced the string had errors and introduced bad values, and then decoding raises an exception again.
For instance, full-width spaces are often encoded in several different ways, such as \xa3\xa0 or \xa4\x57. These characters all look like full-width spaces, but they are not "legal" ones: the true full-width space is \xa1\xa1, so an exception occurs during transcoding.
I once ran into this illegal-space problem while processing Sina Weibo data; it prevented the data from being parsed correctly.
The workaround:
When decoding the fetched string strtxt, pass 'ignore' so that illegal characters are skipped.
(For GBK and other encodings, the approach to the same problem is similar.)
strtest = strtxt.decode('utf-8', 'ignore')
return strtest
The default value of this argument is 'strict', which raises an exception when an illegal character is encountered;
set to 'ignore', illegal characters are skipped;
set to 'replace', illegal characters are replaced with a placeholder (? or the U+FFFD replacement character);
set to 'xmlcharrefreplace', XML character references are substituted (this handler applies when encoding).
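The error handlers can be compared side by side. A Python 3 sketch using deliberately invalid UTF-8 bytes (the \xa3\xa0 "fake space" mentioned above):

```python
bad = b"abc\xa3\xa0def"  # \xa3\xa0 is not valid UTF-8

try:
    bad.decode("utf-8")  # 'strict' (the default) raises on the bad bytes
except UnicodeDecodeError as e:
    print("strict:", e)

print(bad.decode("utf-8", "ignore"))   # abcdef -- bad bytes dropped
print(bad.decode("utf-8", "replace"))  # bad bytes become U+FFFD marks
```

Note that 'xmlcharrefreplace' is an encode-time handler: for example, "哈".encode('ascii', 'xmlcharrefreplace') gives b'&#21704;'.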
Other cases will be summarized here as I run into them...