Problems with UTF-8 file headers

Source: Internet
Author: User

When reading and writing files in UTF-8 format, especially TXT files, I often run into garbled text caused by the UTF-8 file header. I ran into this problem again recently, so I am writing down how I handle it. If you know a better way, you are welcome to leave a message.

The file header of a UTF-8 file (the byte order mark, or BOM) consists of three bytes, EF BB BF in hexadecimal, so you need to remove this header when reading the file as UTF-8. When you don't know whether the file to be read is in GBK or UTF-8 format, you can check for this header. The check works as follows:

1. Read the first three bytes from the file stream into a byte[3] array;
2. Use Integer.toHexString(b[i] & 0xff) to convert each of the three bytes in the byte[3] array into a hexadecimal string;
3. Compare the string obtained from the three bytes with the UTF-8 header efbbbf to determine whether the file is in UTF-8 format.
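
The three steps above can be sketched in Java roughly as follows. This is a minimal illustration, not a definitive implementation: the class name BomDetector and its method names are made up for this example. Note that Integer.toHexString drops a leading zero (0x0a becomes "a"), so the sketch pads each byte to two digits before comparing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class; the names are assumptions for illustration.
class BomDetector {
    // Step 2: convert the first `length` bytes into a lowercase hex string.
    static String headerHex(byte[] header, int length) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < length; i++) {
            String h = Integer.toHexString(header[i] & 0xff);
            if (h.length() == 1) {
                hex.append('0'); // pad single-digit values such as "a" to "0a"
            }
            hex.append(h);
        }
        return hex.toString();
    }

    // Step 3: compare against the UTF-8 header efbbbf.
    static boolean isUtf8Bom(byte[] header, int length) {
        return length == 3 && "efbbbf".equals(headerHex(header, length));
    }

    // Step 1: read up to three bytes from the stream, then run the check.
    static boolean startsWithUtf8Bom(InputStream in) throws IOException {
        byte[] header = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = in.read(header, n, 3 - n);
            if (r < 0) {
                break; // the stream ended before three bytes were available
            }
            n += r;
        }
        return isUtf8Bom(header, n);
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i', '!', '!'};
        System.out.println(startsWithUtf8Bom(new ByteArrayInputStream(withBom)));    // true
        System.out.println(startsWithUtf8Bom(new ByteArrayInputStream(withoutBom))); // false
    }
}
```

Two caveats apply to this sketch. The check consumes the first three bytes of the stream, so if the header is absent those bytes are file content and have to be kept or re-read. Also, a UTF-8 file is not required to carry this header, so a missing header does not by itself prove that the file is GBK.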

You need to watch for the file header when reading UTF-8 files, and also when writing them: if the header is missing from a file you output, garbled characters may appear when it is opened in Notepad. To write EF BB BF as the file header, do the following:

1. Build the bytes EF, BB, and BF. Taking EF as an example:

byte b0 = Byte.decode("0xe").byteValue(); // get the byte value of hexadecimal E
b0 = (byte) (b0 << 4); // shift the value of E left by 4 bits, giving 0xE0
byte b1 = Byte.decode("0xf").byteValue(); // get the byte value of hexadecimal F
byte ef = (byte) (b0 | b1); // OR the shifted E with F to get 0xEF

2. Write the resulting EF, BB, and BF bytes to the file, in that order, as the first, second, and third bytes of the UTF-8 file.
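
As a shorter alternative to building each byte from two hex digits, the header can be written with plain byte casts. The following is a sketch under that assumption; the class name BomWriter and its methods are hypothetical and exist only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are assumptions for illustration.
class BomWriter {
    // The three header bytes EF BB BF, written as casts from int literals.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns the header followed by the UTF-8 encoding of the text.
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] result = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, result, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, result, UTF8_BOM.length, body.length);
        return result;
    }

    // Writes the header and then the text to any output stream,
    // for example a FileOutputStream opened on the target file.
    static void writeWithBom(OutputStream out, String text) throws IOException {
        out.write(withBom(text));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeWithBom(buffer, "hello");
        System.out.println(buffer.size()); // 8: three header bytes plus five text bytes
    }
}
```

The header has to be written exactly once, before any content, and the content itself must be encoded as UTF-8 for the header to be truthful.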

I did some experiments. I used C to read a UTF-8 TXT file in binary mode, and in the debugger I could see the UTF-8 file header: the first three bytes read from the file are not file content but the UTF-8 mark. A TXT file that I created on Windows was GBK by default. When I read that file from C, the first three bytes were not an encoding mark; the very first byte was already the content of the TXT file.
