In Python, character strings are represented internally as unicode. When converting between encodings, unicode therefore usually serves as the intermediate form: first decode the string from its source encoding into unicode, then encode the unicode into the target encoding.
decode converts a string in some other encoding to unicode. For example, str1.decode('gb2312') converts the gb2312-encoded string str1 to unicode.
encode converts unicode to a string in some other encoding. For example, str2.encode('gb2312') converts the unicode string str2 to gb2312.
So when transcoding, you must first know which encoding the str is in, decode it into unicode, and then encode it into the target encoding.
(A string literal in the source code has the same encoding as the source file itself!)
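The decode-then-encode round trip described above can be sketched as follows. This is a minimal illustration written for Python 3, where bytes plays the role of Python 2's str and str plays the role of unicode. The helper name transcode and the sample text are assumptions made for this example, not part of the original test.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes from source_encoding into unicode, then encode to target_encoding."""
    text = data.decode(source_encoding)    # bytes -> unicode
    return text.encode(target_encoding)    # unicode -> bytes

# Hypothetical sample: two Chinese characters, written with escapes so that
# the example does not depend on the encoding of this file.
gb_bytes = u"\u4f60\u597d".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")
print(repr(utf8_bytes))
```

The important point is that the source encoding must be known and named explicitly; the function never guesses it.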
Test:
My source file in Eclipse is saved as UTF-8. Then I wrote code like this:
s = "hello"  # a non-ASCII (Chinese) literal in the original test
s = s.decode('gb2312').encode('utf-8')
print s
Error:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 2-3: illegal multibyte sequence
Cause: my file is UTF-8 encoded, so the bytes of the string are UTF-8 and cannot be decoded to unicode as gb2312.
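The failure above can be reproduced in isolation. The following is a small sketch for Python 3 (bytes in place of Python 2's str); the sample text is an assumption, chosen only because it contains non-ASCII characters.

```python
# UTF-8 bytes, as a string literal in a UTF-8 source file would contain.
utf8_bytes = u"\u4f60\u597d".encode("utf-8")

try:
    utf8_bytes.decode("gb2312")          # wrong codec for these bytes
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("decode failed:", exc.reason)

text = utf8_bytes.decode("utf-8")        # the matching codec succeeds
```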
Therefore, the correct statement should be:
s = "hello"
print s
s = s.decode('utf-8').encode('utf-8')  # decode as UTF-8, then encode as UTF-8
print s
Haha, I found that the printed output is garbled. The only explanation is that my Eclipse console uses GB2312!
See:
How do I obtain the default encoding of the system?
#!/usr/bin/env python
# coding=utf-8
import sys
print sys.getdefaultencoding()
This program prints ascii on Windows XP, and I found that it prints ascii on my Linux machine as well. So it is no surprise that printing produces garbled characters, because my text is actually UTF-8 encoded.
In some IDEs, string output is always garbled or even raises errors. In fact, the cause is that the IDE's output console cannot display the encoding of the string, not the program itself. (Yes: my Eclipse console uses gb2312, so when my file is saved as UTF-8, the printed output is garbled!)
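Since the console, not the program, decides how printed bytes are displayed, it can help to look at the console and locale encodings next to the default encoding. This is a minimal sketch using standard-library calls only; the variable names are chosen for this example, and the values printed depend entirely on the machine it runs on.

```python
import locale
import sys

default_encoding = sys.getdefaultencoding()                 # Python's default string encoding
console_encoding = getattr(sys.stdout, "encoding", None)    # may be None when output is redirected
locale_encoding = locale.getpreferredencoding()             # encoding suggested by the OS locale

print("default :", default_encoding)
print("console :", console_encoding or "unknown")
print("locale  :", locale_encoding)
```

If the console value differs from the encoding of the bytes being printed, garbled output like that described above is the expected result.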
1. The command for reading the file must be:
myfile = codecs.open("c.html", "r", "UTF-8")
If I read it as gb2312 instead, an error is raised.
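The effect of the encoding argument can be checked on a throwaway file. This sketch uses io.open rather than codecs.open (a substitution made here because io.open takes an encoding argument in the same way and behaves the same on Python 2 and 3); the file name and content are hypothetical.

```python
import io
import os
import tempfile

# Write a small UTF-8 file to read back (hypothetical sample content).
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"<p>\u4f60\u597d</p>\n")

# Reading with the matching encoding returns unicode text.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Reading the same bytes as gb2312 fails while decoding.
try:
    with io.open(path, "r", encoding="gb2312") as f:
        f.read()
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```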
Tip: to check the encoding of a string, just try to decode it. If decoding with gb2312 raises no error, the string is gb2312.
If decoding with UTF-8 raises no error, it is UTF-8.
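The trial-decode tip can be written as a small helper. This is a sketch only: the function name candidate_encodings and the default candidate list are assumptions made for this example.

```python
def candidate_encodings(data, encodings=("utf-8", "gb2312")):
    """Return the encodings from `encodings` that decode `data` without error."""
    matches = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches

print(candidate_encodings(u"\u4f60\u597d".encode("utf-8")))
print(candidate_encodings(u"\u4f60\u597d".encode("gb2312")))
print(candidate_encodings(b"hello"))
```

Note that a successful decode is only a hint. Pure ASCII input such as b"hello" decodes under both codecs, so the check can return more than one candidate.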
Here is the problem. See:
myfile = codecs.open("c.html", "r", "UTF-8")
str = myfile.read()
content = str.replace("\n", "")
content = content.encode('utf-8')
print content
No error is reported.
Let's look at it again:
myfile = codecs.open("c.html", "r", "UTF-8")
str = myfile.read()  # displays Chinese characters
content = str.replace("\n", "")
content = content.encode('gb2312')  # encode with gb2312
print content
Error: UnicodeEncodeError: 'gb2312' codec can't encode character u'\u2014' in position 12628
Let's look at it again, this time with d.html:
myfile = codecs.open("d.html", "r", "UTF-8")
str = myfile.read()  # displays Chinese characters
content = str.replace("\n", "")
content = content.encode('gb2312')  # encode with gb2312
print content
No problem.
myfile = codecs.open("d.html", "r", "UTF-8")
str = myfile.read()  # displays Chinese characters
content = str.replace("\n", "")
content = content.encode('utf-8')
print content
No problem.
Conclusion: c.html contains some special characters that can be encoded in UTF-8 but not in gb2312!
d.html has no such special characters. This explains why some files never run into the problems we expected!
So I feel that a file must be opened and read as UTF-8 to get a unicode value, and then encoded with UTF-8, because encoding to gb2312 raises an error!
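The difference between the two pages can be reproduced with a single character. This sketch uses u'\u2014', the character named in the error message above; the surrounding letters are made up for the example.

```python
text = u"a\u2014b"                            # contains U+2014, as in the error above

utf8_bytes = text.encode("utf-8")             # UTF-8 can encode any unicode character

try:
    text.encode("gb2312")                     # strict mode (the default) raises
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

ignored = text.encode("gb2312", "ignore")     # drops the unencodable character
replaced = text.encode("gb2312", "replace")   # substitutes "?" for it
```

The 'ignore' and 'replace' error handlers avoid the exception, but both lose the original character, so they only suit output that is meant for display.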
Next:
I read my regular expression and found that decoding it with gb2312 raises an error. So it must be UTF-8 encoded!
regex3 = regex3.decode('utf-8')
print type(regex3)  # returns the unicode type!
print regex3  # strangely, it prints as normal Chinese
Solution:
1. Process everything as unicode
That is, I used regex3 = regex3.decode('utf-8') to turn the regular expression into unicode. print type(content) shows that the content is unicode as well. The result still does not work!
Is it because of the encoding of my Linux terminal? I checked.
locale shows that it is a GBK terminal. That is, only GBK-encoded output is displayed as Chinese!
So I encoded the regular expression into gb2312:
regex3 = regex3.decode('utf-8').encode('gb2312')
Now the Chinese characters are displayed!
OK. I converted my content to GB2312 as well:
content = content.encode('gb2312', 'ignore')
print content now prints Chinese too.
I thought there should be no problem at this point. But the program hung again on the regular expression match. Dizzy!
Let's test another, well-behaved file: after the change, it did not hang and the match succeeded!
So I think some content in this file must conflict with the regular expression match. That is the cause!
Continue tracking:
The following situations occur:
myfile = codecs.open("01.htm", "r", "UTF-8", "ignore")
str = myfile.read()
content = str.replace("\n", "")
print type(content)  # the type is unicode
regex3 = 'class=wpcpsCSS>([^<]+)(?:.*?wpcppb_CSS>([0-9]+)</span>)?.*?(?:.*?(Disabled))?.*?([0-9]+) answers.*?([0-9]+) views.*?(?:<div class=wpcptfCSS>.*?user\?userid=([0-9]+).*?>(.*?)</a></div>.*?)?(?:user\?userid=([0-9]+)")?class="wpfitCSS[^"]+">([^<]+).*?class=wpcptsCSS>([^<]+).*?([0-9.]{9,}\*).*?class=wpcpdCSS>(.*?)</div><div class=wpcpfCSS>'
content = content.encode('utf-8')
p = re.compile(regex3)
results = p.findall(content)
This works without any problem. But if I use content = content.encode('gb2312') instead, the program hangs!
This shows that the encoding of my content differs from the encoding of my regular expression!
Now I convert my regular expression to gb2312 as well and test again. This time the results come back, and I print them:
results = p.findall(content)
for ele in results:
    print ele[0], ele[1], ele[2], ele[3], ele[4], ele[5], ele[6], ele[7], ele[8], ele[9], ele[10]
There is no problem in Eclipse (gb2312 by default)!
So I think: if the content is GBK, the regular expression must also be GBK. That is, the two encodings must be consistent! Otherwise, the program hangs!
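The consistency rule above can be seen with a much smaller pattern. This is a sketch for Python 3 using byte patterns in place of Python 2's str. The keyword and the page content are hypothetical stand-ins, not taken from the original pages.

```python
import re

word = u"\u56de\u7b54"                      # hypothetical Chinese keyword in the pattern
page = u"12 " + word                        # hypothetical page content

# Pattern stored as UTF-8 bytes.
pattern_utf8 = re.compile(b"([0-9]+) " + word.encode("utf-8"))

# Same encoding on both sides: the match succeeds.
same = pattern_utf8.findall(page.encode("utf-8"))

# Pattern in UTF-8, content in gb2312: the bytes differ, so nothing matches.
mixed = pattern_utf8.findall(page.encode("gb2312"))

# Both sides decoded to unicode: the original encodings no longer matter.
both_unicode = re.findall(u"([0-9]+) " + word, page)
```

With a short pattern like this, a mismatch simply returns no results. A long pattern with many .*? groups, like the one above, may spend a very long time backtracking before giving up, which would be consistent with the hang described here, though that is an inference rather than something the original test confirms.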
Now let me handle it like this:
use unicode everywhere.
myfile = codecs.open("right.html", "r")
str = myfile.read()
content = str.replace("\n", "")
content = content.decode('utf-8', 'ignore')  # decode from UTF-8 into unicode
The regular expression is handled the same way:
regex3 = regex3.decode('utf-8', 'ignore')  # decode from UTF-8 into unicode
OK. Now try again!
Conclusion:
1. Open a file
myfile = codecs.open("right.html", "r")
You do not need to set its encoding here!
2. Set the encoding
str = myfile.read()
content = str.replace("\n", "")
content = content.decode('utf-8', 'ignore')  # decode from UTF-8 into unicode
3. Regular expression
regex3 = regex3.decode('utf-8', 'ignore')  # the regular expression is also decoded from UTF-8 into unicode
Then you can run:
p = re.compile(regex3)
results = p.findall(content)
The regular expression match now works!
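Put together, the steps above can be sketched end to end. This version is written for Python 3 and uses io.open in binary mode in place of codecs.open without an encoding. The file content and the short pattern are hypothetical and stand in for the real page and regex3.

```python
import io
import os
import re
import tempfile

# Hypothetical sample page saved as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "right.html")
with io.open(path, "wb") as f:
    f.write(u"<span>12 \u56de\u7b54</span>\n<span>7 \u56de\u7b54</span>\n".encode("utf-8"))

# 1. Open the file and read raw bytes, with no encoding set.
with io.open(path, "rb") as f:
    raw = f.read()

# 2. Decode from UTF-8 into unicode, then strip the newlines.
content = raw.decode("utf-8", "ignore").replace(u"\n", u"")

# 3. Keep the pattern as unicode too (here it is already a unicode literal).
regex = u"<span>([0-9]+) \u56de\u7b54</span>"

results = re.compile(regex).findall(content)
print(results)
```

Once both the content and the pattern are unicode, the result can be encoded to whatever the console expects at the moment of printing.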
Author: lonely travelers