Encoding Conversion: UTF-8 with BOM to GBK

Source: Internet
Author: User

Recently, for an experiment, I needed text files encoded as GBK or GB2312, but the source data came in a mix of encodings: some files are GBK and some are UTF-8. Converting them directly with a tool is awkward, and converting by hand is not practical, so I decided to write a program that reads a UTF-8 file, changes the encoding, and writes the result to another file saved as GBK.

I read many examples on the Internet and checked the Python API documentation. Most of the examples are incomplete because they do not take the BOM into account. Some people also report that they tried codecs.EncodedFile(...) many times and that converting a file with a BOM to GBK still failed.

Let's take a look at my experiment process.

1. UTF-8 without BOM to GBK

First, use UltraEdit to create a text file and save it in the UTF-8 without BOM format. I call it "u8.txt". Its content is "try this", followed by a line break, with the characters separated by spaces so that they are easy to locate in the output.

Second, write code for testing:

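# Python 2 code: str holds raw bytes and print is a statement
import codecs

f = open("u8.txt", "r")
data = f.read()
print repr(data)                           # raw UTF-8 bytes read from the file
print data.decode("utf-8").encode("gbk")   # decode to unicode, then encode as GBK

# NOTE: the final escape ('\xa') is truncated in the source and needs its second hex digit to run
s = '\xe8\xbf\x99\xe4\xb8\xa'
print s.decode("utf-8").encode("gbk")
f.close()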

Run the command to view the result:

The output shows that the first line is the UTF-8 encoding of the 4 characters, with each character taking 3 bytes, followed by a line break. The second line is the text printed after it has been re-encoded as GBK, which shows that the conversion works. Note that the string s holds the byte encoding of the last two characters, which I copied by hand. After s is decoded and then encoded as GBK, it also prints correctly.

Therefore, converting BOM-free UTF-8 to GBK is easy.
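
The same decode-then-encode step can be sketched in Python 3, where bytes and text are separate types. This is only an illustration under stated assumptions, not the article's code: the function name utf8_to_gbk is hypothetical, and the sample bytes are the first complete character of s above plus a line break.

```python
def utf8_to_gbk(raw: bytes) -> bytes:
    """Decode BOM-free UTF-8 bytes to text, then encode that text as GBK."""
    return raw.decode("utf-8").encode("gbk")


# Assumption: sample input is the first complete character of s in the
# article (3 bytes in UTF-8) followed by a line break.
sample = b"\xe8\xbf\x99\n"
converted = utf8_to_gbk(sample)

print(repr(sample))     # the UTF-8 bytes
print(repr(converted))  # the GBK bytes: 2 bytes for the character, 1 for the line break
```

Decoding first is what makes the conversion well defined: the bytes become Unicode text, and only then is the text written out in the target encoding.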

2. UTF-8 with BOM to GBK

First, again use UltraEdit to create a new text file, but this time save it in the UTF-8 format with a BOM, as a separate .txt file. The content is "the old child does not believe it can not be converted", again with spaces between the characters so that they are easy to locate. The result is:

An error occurred! The message says that GBK cannot encode the character at position 0. Looking at the byte sequence that was printed, the first six bytes run together, even though the characters in our string are clearly separated by spaces. What is going on? Checking Wikipedia, we find that:

So "\xef\xbb\xbf" is the UTF-8 BOM! Is this BOM what makes the conversion fail? To find out, copy the bytes that follow the BOM into the string s and test again. The code is as follows:

s = '\xe8\x80\x81\xe5\xad\x90\xe4\xb8\x8d\xe4\xbf\xa1\xe8\xbd\xac\xe4\xb8\x8d\xe8\xbf\x87\xe6\x9d\xa5\n'

print s.decode("utf-8").encode("gbk")

This time the conversion succeeds. In other words, as long as the BOM is removed before converting, the conversion works.
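
The failure and the fix can be reproduced side by side. The following is a minimal Python 3 sketch, not the author's code; the variable names are hypothetical, and the input is the UTF-8 BOM followed by the first two characters of s.

```python
import codecs

# Assumption: BOM + the first two characters (6 bytes) of s from the article.
with_bom = codecs.BOM_UTF8 + b"\xe8\x80\x81\xe5\xad\x90"

# Decoding as plain UTF-8 keeps the BOM as the character U+FEFF at position 0.
text = with_bom.decode("utf-8")

try:
    text.encode("gbk")
    failed = False
except UnicodeEncodeError:
    # Python's gbk codec has no mapping for U+FEFF, so encoding stops here.
    failed = True

# Dropping the three BOM bytes first lets the same conversion go through.
stripped = with_bom[len(codecs.BOM_UTF8):]
gbk_bytes = stripped.decode("utf-8").encode("gbk")

print(failed)
print(repr(gbk_bytes))
```

Python also ships a "utf-8-sig" codec that skips a leading BOM while decoding, which is an alternative to slicing the bytes by hand.
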
Based on the experiments above, I wrote the following code for testing:

Code description: BOM here refers to the first three bytes read from the file. codecs.BOM_UTF8 is the byte encoding of the UTF-8 BOM mentioned earlier. The if statement checks whether the file is in the UTF-8 with BOM format. If it is, we skip those first three bytes so that the BOM is left out, then convert the rest, and the output finally comes out correctly.

The line marked "NOTE" in the code prints the byte encoding. It shows that the characters have been converted to GBK, which uses 2 bytes per character here.
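
A minimal Python 3 sketch of the approach described above is shown here. It is an illustration, not the author's original listing: the function name convert_utf8_file_to_gbk and the file names in the usage lines are hypothetical.

```python
import codecs


def convert_utf8_file_to_gbk(src_path: str, dst_path: str) -> bytes:
    """Read a UTF-8 file (with or without BOM) and write it back out as GBK."""
    # Read raw bytes so the BOM, if present, is visible at the start.
    with open(src_path, "rb") as src:
        data = src.read()

    # If the file starts with the 3-byte UTF-8 BOM, skip it.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]

    gbk_data = data.decode("utf-8").encode("gbk")

    # NOTE: inspect the converted bytes; these characters take 2 bytes each.
    print(repr(gbk_data))

    with open(dst_path, "wb") as dst:
        dst.write(gbk_data)
    return gbk_data


if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    with open("u8_bom_demo.txt", "wb") as f:
        f.write(codecs.BOM_UTF8 + b"\xe8\x80\x81 \xe5\xad\x90\n")
    convert_utf8_file_to_gbk("u8_bom_demo.txt", "gbk_demo.txt")
```

Opening both files in binary mode keeps the example explicit about which bytes are read and written; characters that GBK cannot represent would still raise UnicodeEncodeError.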

Conclusion: only UTF-8 is covered here. UTF-16 works on the same principle, except that its BOM is only two bytes long, as the BOM table above shows. To sum up, converting UTF-8 to GBK is easy as long as the characters have counterparts in both encodings. The BOM is what makes the problem look troublesome, but once the principle is understood it is not complicated.
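
To illustrate the UTF-16 remark, here is a hedged Python 3 sketch that checks for a UTF-8 or UTF-16 BOM before converting. The names BOMS and to_gbk are hypothetical, and treating BOM-free input as UTF-8 is an assumption; UTF-32 is not handled, and its little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# (BOM bytes, codec used for the bytes that follow the BOM)
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # 3 bytes: EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # 2 bytes: FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # 2 bytes: FE FF
]


def to_gbk(raw: bytes, default: str = "utf-8") -> bytes:
    """Strip a recognized BOM, decode with the matching codec, encode as GBK."""
    for bom, codec_name in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(codec_name).encode("gbk")
    # Assumption: input without a BOM is in the default encoding.
    return raw.decode(default).encode("gbk")


text = "\u8001\u5b50"  # the first two characters of s in the article
print(repr(to_gbk(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))))
print(repr(to_gbk(codecs.BOM_UTF8 + text.encode("utf-8"))))
```

Both calls should print the same GBK bytes, since only the source encoding and the BOM length differ.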
