Converting UTF-8 files to GBK format in Python

Source: Internet
Author: User
Tags: readfile

Requirement: convert a UTF-8 encoded file into a GBK encoded file.

The implementation code is as follows:

import codecs

def ReadFile(filepath, encoding="utf-8"):
    # Decode the file with the given encoding and return its content as Unicode.
    with codecs.open(filepath, "r", encoding) as f:
        return f.read()

def WriteFile(filepath, u, encoding="gbk"):
    # Encode the Unicode string u with the given encoding and write it out.
    with codecs.open(filepath, "w", encoding) as f:
        f.write(u)

def UTF8_2_GBK(src, dst):
    content = ReadFile(src, encoding="utf-8")
    WriteFile(dst, content, encoding="gbk")

Code Explanation:

The second parameter of ReadFile specifies that the file is read as UTF-8, so the returned content is a Unicode string. WriteFile then encodes that string as GBK when writing it to the destination file.

This meets the requirement.
However, if the file to be converted contains characters that are not in the GBK character set, an error similar to the following is raised:

UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 4813: illegal multibyte

The error message means that the Unicode character u'\xa0' (the no-break space) cannot be encoded as GBK.
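
The failure can be reproduced without any file. The following is a minimal sketch in Python 3 syntax (the original code targets Python 2, so the u prefix is dropped here); the helper name can_encode is hypothetical and exists only for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

nbsp = "\xa0"  # U+00A0 NO-BREAK SPACE, the character from the error message

print(can_encode("中文", "gbk"))  # ordinary Chinese text encodes fine
print(can_encode(nbsp, "gbk"))    # the no-break space has no GBK code
```

A check like this can be run on the content before writing, to find out whether a stricter target encoding is going to fail.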

To understand this, we need to clarify the relationship between GB2312, GBK and GB18030.

GB2312: 6,763 Chinese characters
GBK: 21,003 Chinese characters
GB18030-2000: 27,533 Chinese characters
GB18030-2005: 70,244 Chinese characters

Therefore, GBK is a superset of GB2312, and GB18030 is a superset of GBK.
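
The superset relationship can be observed through Python's built-in codecs. This is a small sketch in Python 3 syntax; the helper name encodable_in and the sample characters are chosen here for illustration and are not part of the original code.

```python
def encodable_in(ch):
    """List which of the three codecs can encode the character ch."""
    result = []
    for name in ("gb2312", "gbk", "gb18030"):
        try:
            ch.encode(name)
            result.append(name)
        except UnicodeEncodeError:
            pass
    return result

print(encodable_in("中"))    # simplified character, already in GB2312
print(encodable_in("龍"))    # traditional character, added in GBK
print(encodable_in("\xa0"))  # no-break space, only encodable as GB18030

# A character shared by all three keeps the same bytes in each encoding.
print("中".encode("gb2312") == "中".encode("gbk") == "中".encode("gb18030"))
```

Because shared characters keep the same bytes, switching the output encoding to a superset only changes how the extra characters are stored.
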
With this relationship clear, we can refine the code:

def UTF8_2_GBK(src, dst):
    content = ReadFile(src, encoding="utf-8")
    WriteFile(dst, content, encoding="gb18030")

After this change, the code runs without errors and the conversion completes normally.

This is because the GB18030 character set includes a code for u'\xa0'.
In addition, there is another approach, which requires modifying the WriteFile function:
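
For readers on Python 3, the same conversion can be sketched with the built-in open function instead of codecs.open. The function name convert_file, its parameters and the temporary file names are assumptions made for this example, not part of the original code.

```python
import os
import tempfile

def convert_file(src, dst, src_encoding="utf-8", dst_encoding="gb18030"):
    """Re-encode the text file src into dst (hypothetical helper)."""
    # newline="" disables newline translation, so line endings are preserved.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        content = f.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as f:
        f.write(content)

# Usage with throwaway files: the sample text contains a no-break space.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.txt")
dst = os.path.join(tmpdir, "out.txt")
with open(src, "w", encoding="utf-8", newline="") as f:
    f.write("价格\xa0100\r\n")

convert_file(src, dst)
with open(dst, "r", encoding="gb18030", newline="") as f:
    print(repr(f.read()))
```

Passing dst_encoding="gbk" to the same helper raises UnicodeEncodeError for this sample text, which matches the failure described earlier.
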

def WriteFile(filepath, u, encoding="gbk"):
    # No encoding is passed to codecs.open; the string is encoded by hand (Python 2).
    with codecs.open(filepath, "w") as f:
        f.write(u.encode(encoding, errors="ignore"))

Here we encode the Unicode string into GBK ourselves. Note the second argument of the encode function: passing "ignore" means that characters that cannot be encoded are skipped instead of raising an error. The decode function accepts the same argument.
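
The effect of the errors argument can be seen on a short string. This is a minimal sketch in Python 3 syntax; the "replace" handler is shown only for comparison and is not used in the original code.

```python
text = "a\xa0b"  # the middle character cannot be encoded as GBK

ignored = text.encode("gbk", errors="ignore")    # unencodable character is dropped
replaced = text.encode("gbk", errors="replace")  # unencodable character becomes "?"

print(ignored)   # b'ab'
print(replaced)  # b'a?b'

# decode() accepts the same argument; here the invalid byte 0xff is skipped.
print(b"a\xffb".decode("gbk", errors="ignore"))
```

Note that "ignore" silently loses data, so the converted file may differ from the source wherever such characters occurred.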

When we run this version, the UTF-8 file is successfully converted to ANSI (GBK) format. However, an extra blank line appears after every line in the generated file.

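This happens because codecs.open, when called without an encoding, opens the file in text mode, and on Windows text mode translates every "\n" into "\r\n", so a "\r\n" already present in the content becomes "\r\r\n". To avoid this, open the file in binary mode, as in the following modified code: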

def WriteFile(filepath, u, encoding="gbk"):
    # "wb" writes the encoded bytes as-is, with no newline translation.
    with codecs.open(filepath, "wb") as f:
        f.write(u.encode(encoding, errors="ignore"))
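
The blank-line effect can be reproduced on any platform. The sketch below uses Python 3 and forces the Windows-style newline translation with newline="\r\n"; the sample text and temporary file name are made up for the example.

```python
import os
import tempfile

content = "line1\r\nline2\r\n"  # text that already uses Windows line endings
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Text mode with "\n" -> "\r\n" translation, as on Windows:
# every "\r\n" in the content ends up as "\r\r\n".
with open(path, "w", encoding="gbk", newline="\r\n") as f:
    f.write(content)
with open(path, "rb") as f:
    translated = f.read()

# Binary mode writes the encoded bytes untouched.
with open(path, "wb") as f:
    f.write(content.encode("gbk", errors="ignore"))
with open(path, "rb") as f:
    untouched = f.read()

print(translated)  # b'line1\r\r\nline2\r\r\n'
print(untouched)   # b'line1\r\nline2\r\n'
```

Many editors display the stray "\r" as an additional line break, which is why the output appears to have a blank line after every row.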
