Requirement: convert UTF-8 encoded files into GBK encoded files.
The implementation is as follows:
import codecs

def ReadFile(filepath, encoding="utf-8"):
    # Decode the file's bytes with the given encoding and return a Unicode string.
    with codecs.open(filepath, "r", encoding) as f:
        return f.read()

def WriteFile(filepath, u, encoding="gbk"):
    # Encode the Unicode string u with the given encoding and write it out.
    with codecs.open(filepath, "w", encoding) as f:
        f.write(u)

def UTF8_2_GBK(src, dst):
    content = ReadFile(src, encoding="utf-8")
    WriteFile(dst, content, encoding="gbk")
Code explanation:
The second parameter of ReadFile specifies that the file is read as UTF-8. The content it returns is a Unicode string, which WriteFile then writes to the destination file encoded as GBK.
This satisfies the requirement.
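The same decode-then-encode flow can be seen in memory, without touching any files. The following is a minimal sketch, assuming Python 3 (the coding declaration also lets it run on Python 2.7); the sample string and the variable names are only illustrations and are not part of the original code.

```python
# -*- coding: utf-8 -*-
# Sample text (an assumption for illustration); both characters exist in GBK.
text = u"中文"

utf8_bytes = text.encode("utf-8")      # what a UTF-8 file stores on disk
decoded = utf8_bytes.decode("utf-8")   # what ReadFile returns: a Unicode string
gbk_bytes = decoded.encode("gbk")      # what WriteFile stores on disk

print(len(utf8_bytes), len(gbk_bytes))
```

Each of the two characters takes three bytes in UTF-8 and two bytes in GBK, so the converted file is usually smaller for Chinese text.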
However, if the file to be converted contains characters that are not in the GBK character set, an error similar to the following occurs:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 4813: illegal multibyte sequence
This error message means that the Unicode character u'\xa0' (the no-break space) cannot be encoded in GBK.
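Before converting a large file, it can help to locate the characters that would trigger this error. The following is a small sketch, assuming Python 3; the helper name find_unencodable and the sample string are hypothetical and not part of the original code.

```python
def find_unencodable(text, encoding="gbk"):
    """Return (index, character) pairs that `encoding` cannot represent."""
    problems = []
    for index, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, ch))
    return problems

# Hypothetical sample: a no-break space hides between the colon and the digits.
sample = u"price:\xa0100"
print(find_unencodable(sample))
```

Checking one character at a time is slow for very large inputs, but it reports every offending position rather than only the first one.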
To understand why, we need to look at the relationship between GB2312, GBK, and GB18030.
GB2312: 6,763 Chinese characters
GBK: 21,003 Chinese characters
GB18030-2000: 27,533 Chinese characters
GB18030-2005: 70,244 Chinese characters
Therefore, GBK is a superset of GB2312, and GB18030 is a superset of GBK.
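The superset relationship can be checked with Python's built-in codecs. This is only an illustrative sketch, assuming Python 3; the helper name supported_by and the three sample characters are assumptions chosen for the demonstration.

```python
# -*- coding: utf-8 -*-
CODECS = ("gb2312", "gbk", "gb18030")

def supported_by(ch):
    """List which of the three codecs can encode the character ch."""
    names = []
    for name in CODECS:
        try:
            ch.encode(name)
            names.append(name)
        except UnicodeEncodeError:
            pass
    return names

# A simplified character, a traditional character, and the no-break space.
for ch in (u"中", u"們", u"\xa0"):
    print(repr(ch), supported_by(ch))
```

The simplified character is accepted by all three codecs, the traditional character only from GBK upward, and the no-break space only by GB18030.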
With this relationship clear, we can refine the code:
def UTF8_2_GBK(src, dst):
    content = ReadFile(src, encoding="utf-8")
    # GB18030 is a superset of GBK, so characters such as u'\xa0' can be encoded.
    WriteFile(dst, content, encoding="gb18030")
After this change, the code runs without errors.
This is because the GB18030 character set contains a character corresponding to u'\xa0'.
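To confirm that nothing is lost, the conversion can be tried end to end on a temporary file. The following is a sketch under stated assumptions: it targets Python 3, uses io.open instead of codecs.open, and the function name utf8_to_gb18030, the file names, and the sample text are all hypothetical.

```python
import io
import os
import shutil
import tempfile

def utf8_to_gb18030(src, dst):
    # newline="" keeps line endings exactly as they appear in the source file.
    with io.open(src, "r", encoding="utf-8", newline="") as f:
        content = f.read()
    with io.open(dst, "w", encoding="gb18030", newline="") as f:
        f.write(content)

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.txt")
dst = os.path.join(workdir, "out.txt")

# Sample text containing the character that GBK could not encode.
original = u"a\xa0b\r\n"
with io.open(src, "wb") as f:
    f.write(original.encode("utf-8"))

utf8_to_gb18030(src, dst)

with io.open(dst, "rb") as f:
    round_tripped = f.read().decode("gb18030")

shutil.rmtree(workdir)  # remove the temporary files
print(round_tripped == original)
```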
In addition, there is another approach,
which requires modifying the WriteFile function:
def WriteFile(filepath, u, encoding="gbk"):
    # No encoding is passed to codecs.open; the string is encoded by hand instead.
    with codecs.open(filepath, "w") as f:
        f.write(u.encode(encoding, errors="ignore"))
Here, we encode the Unicode string as GBK ourselves. Note the second argument of the encode function: passing "ignore" means that characters that cannot be encoded are skipped during encoding. The same option is available for decode.
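The effect of the errors argument is easy to see on a short string. This is a minimal sketch, assuming Python 3; the "replace" handler is shown only for comparison and is not used in the original code.

```python
# Sample string with a no-break space, which GBK cannot represent.
text = u"a\xa0b"

ignored = text.encode("gbk", errors="ignore")    # the character is dropped
replaced = text.encode("gbk", errors="replace")  # the character becomes "?"

print(ignored, replaced)
```

With "ignore" the output silently loses data, so this approach is only appropriate when the dropped characters do not matter.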
When we run it, the file is successfully converted from UTF-8 to ANSI format. However, an extra blank line appears after every line in the generated file.
This happens because the file is opened in text mode, where line endings may be translated on write. To avoid this, open the file in binary mode, as in the following modified code:
def WriteFile(filepath, u, encoding="gbk"):
    # "wb" writes the encoded bytes unchanged, with no newline translation.
    with codecs.open(filepath, "wb") as f:
        f.write(u.encode(encoding, errors="ignore"))
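One plausible explanation for the blank lines, assuming the script runs on Windows and the source file uses "\r\n" line endings, is that text mode expands every "\n" to "\r\n" a second time. The sketch below only simulates that translation with a hypothetical helper, text_mode_write, so it behaves the same on any platform.

```python
def text_mode_write(data):
    """Mimic what a Windows text-mode file does to bytes on write."""
    return data.replace(b"\n", b"\r\n")

# Content read through codecs.open keeps its original "\r\n" line endings.
content = u"line1\r\nline2\r\n".encode("gbk")

as_text_mode = text_mode_write(content)  # "\r\n" turns into "\r\r\n"
as_binary_mode = content                 # "wb" writes the bytes untouched

print(repr(as_text_mode))
print(repr(as_binary_mode))
```

Many editors display the stray "\r" as an additional line break, which would account for the blank line after every row.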