Size of text files and binary files, Unicode character encoding

Source: Internet
Author: User

Today, while reading a program written by someone else, I was unclear about one thing: how to distinguish binary files from text files when using CFile in an MFC program.

First, let's talk about the differences between binary files and text files:

I found an article on the Internet that is quite basic and easy to understand, and I'd like to share it with you:

 

 

Now I understand the relationship between a text file and a binary file.

We can use a binary editor to view a text file.

In the editor, the left side shows the bytes in hexadecimal notation, and the right side shows the same bytes as text (ASCII).
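As a rough illustration of what a binary editor shows, the following sketch prints bytes in a hex column next to an ASCII column. The function name `hex_dump` and the sample string are hypothetical, chosen only for this example.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as: offset, hex column, printable-ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the text column lines up on short rows.
        lines.append(f"{offset:08X}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


# A "text file" is just bytes; here the line ending CR LF shows up as 0D 0A.
print(hex_dump("Hello, text file!\r\n".encode("ascii")))
```

The same function works on any file content read with `open(path, "rb")`, which is the point: a text file and a binary file differ only in how the bytes are interpreted.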

The program contains the following statement:

At first, I didn't understand why it had to write the two bytes "FF FE". Then I looked it up online and found that this is related to the encoding method.

UTF byte order and BOM

UTF-8 uses a single byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting UTF-16 text you must first know the byte order of each code unit. For example, the Unicode code of "奎" is 594E, and the Unicode code of "乙" is 4E59. If we receive the UTF-16 byte stream "59 4E", is it "奎" or "乙"?
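The ambiguity can be reproduced in a few lines. This is a minimal sketch using Python's standard `utf-16-be` and `utf-16-le` codecs; the variable names are only for illustration.

```python
raw = bytes([0x59, 0x4E])  # the two bytes as they arrive on the wire

# Read big-endian: the first byte is the high-order byte, giving 0x594E.
as_big = raw.decode("utf-16-be")
# Read little-endian: the first byte is the low-order byte, giving 0x4E59.
as_little = raw.decode("utf-16-le")

print(f"big-endian:    U+{ord(as_big):04X}")     # U+594E
print(f"little-endian: U+{ord(as_little):04X}")  # U+4E59
```

Both decodes succeed without error, so the bytes alone cannot tell the receiver which character was meant.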

The Unicode specification recommends marking byte order with a BOM. Here BOM does not mean "bill of materials" but Byte Order Mark. The idea is a little clever: UCS contains a character named "ZERO WIDTH NO-BREAK SPACE" whose code is FEFF, while FFFE is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.

This way, if the receiver receives FE FF, the byte stream is big-endian; if it receives FF FE, the byte stream is little-endian. For this reason, the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can still be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so if the receiver gets a byte stream starting with EF BB BF, it knows the stream is UTF-8.
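The byte sequences mentioned above all come from encoding the single character U+FEFF. The sketch below checks this against the constants in Python's standard `codecs` module; the helper `hex_bytes` is a hypothetical name used only here.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


bom = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

print("UTF-16 big-endian:   ", hex_bytes(bom.encode("utf-16-be")))  # FE FF
print("UTF-16 little-endian:", hex_bytes(bom.encode("utf-16-le")))  # FF FE
print("UTF-8:               ", hex_bytes(bom.encode("utf-8")))      # EF BB BF

# The standard library exposes the same byte sequences as constants.
assert bom.encode("utf-16-be") == codecs.BOM_UTF16_BE
assert bom.encode("utf-16-le") == codecs.BOM_UTF16_LE
assert bom.encode("utf-8") == codecs.BOM_UTF8
```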

Windows uses BOM to mark the encoding of text files.
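This is presumably why a program would write the two bytes FF FE at the start of a file before the text itself. The following is a sketch of that idea in Python rather than MFC; the function name `write_utf16le_with_bom` and the file name `demo.txt` are assumptions for illustration, not taken from the original program.

```python
import codecs
import os
import tempfile


def write_utf16le_with_bom(path: str, text: str) -> None:
    """Write text as UTF-16 little-endian, preceded by the FF FE marker."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE)       # the two bytes FF FE
        f.write(text.encode("utf-16-le"))  # this codec adds no BOM itself


path = os.path.join(tempfile.mkdtemp(), "demo.txt")
write_utf16le_with_bom(path, "A")

with open(path, "rb") as f:
    content = f.read()
print(content)  # b'\xff\xfeA\x00'
```

Opening the file in binary mode matters here: the bytes are written exactly as given, with no newline or encoding translation applied.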

--------------------

After solving these problems, let's test the encoding, reading, and writing of text files.

Take Notepad in Windows as an example (other programs read text files on a similar principle, although some add their own detection heuristics).

By default, Notepad can store and read text files in four encodings:

ANSI, Unicode, Unicode-big-Endian and UTF-8.

First, let's talk about ANSI. This is the encoding selected for the Windows operating system in the Region and Language settings (that is, the system's default encoding). On a Traditional Chinese system it is Big5, and on a Simplified Chinese system it is GBK.
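Because "ANSI" depends on the system, the same character is stored as different bytes on different machines. A minimal sketch, using Python's built-in `gbk` and `big5` codecs and a sample character chosen only for illustration:

```python
ch = "中"  # sample character present in both GBK and Big5

gbk_bytes = ch.encode("gbk")    # what a Simplified Chinese system would store
big5_bytes = ch.encode("big5")  # what a Traditional Chinese system would store

print("GBK: ", " ".join(f"{b:02X}" for b in gbk_bytes))
print("Big5:", " ".join(f"{b:02X}" for b in big5_bytes))

# Each byte sequence only round-trips through its own encoding.
assert gbk_bytes.decode("gbk") == ch
assert big5_bytes.decode("big5") == ch
```

An ANSI file carries no marker saying which of these encodings was used, so it reads correctly only on a system configured the same way as the one that wrote it.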

I believe you already have some understanding of the Unicode and UTF-8 formats (the former is, of course, UTF-16).

What does Unicode big endian mean? It is almost the same as Unicode, except that it stores the high-order byte first, while Unicode stores the low-order byte first.

The excerpt above already explains this, but here it is again.

For example, the character "A" is stored as follows in each format:

UTF-16 big-endian: 00 41

UTF-16 little-endian: 41 00

UTF-32 big-endian: 00 00 00 41

UTF-32 little-endian: 41 00 00 00
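These byte layouts can be checked directly. A short sketch using Python's standard codec names; the dictionary `layouts` is just a convenience for this example.

```python
layouts = {}
for name in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    encoded = "A".encode(name)  # the explicit -be/-le codecs add no BOM
    layouts[name] = " ".join(f"{b:02X}" for b in encoded)
    print(f"{name}: {layouts[name]}")

# utf-16-be: 00 41
# utf-16-le: 41 00
# utf-32-be: 00 00 00 41
# utf-32-le: 41 00 00 00
```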

All right, let's think about it. Text files are stored on the hard disk as bytes. If a program does not know the encoding of a text file, it cannot read the file correctly, and the user sees garbled characters even though the program considers everything to be normal.
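The "everything looks normal to the program" part is easy to demonstrate. In this sketch the sample text and the choice of `latin-1` as the wrong encoding are assumptions for illustration; `latin-1` is used because it maps every byte to some character and therefore never raises an error.

```python
original = "编码"                # sample text
data = original.encode("utf-8")  # bytes as stored on disk

wrong = data.decode("latin-1")   # wrong guess: no exception, just garbage
right = data.decode("utf-8")     # correct guess

print(repr(wrong))  # six unrelated characters, one per byte
print(repr(right))
```

No exception is raised in the wrong case, which is exactly why the file needs some marker, or outside knowledge, of its encoding.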

According to the BOM rules, if a byte stream begins with one of the following byte sequences, it indicates the encoding of the text file:

UTF-8: EF BB BF

UTF-16 little-endian: FF FE

UTF-16 big-endian: FE FF

UTF-32 little-endian: FF FE 00 00

UTF-32 big-endian: 00 00 FE FF

If the file does not start with any of these, the program reads the data as ANSI, that is, in the system's default encoding.
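The rule above can be sketched as a small detection routine. This is not Notepad's actual algorithm, only an illustration of the BOM check with a fallback. The names `detect_encoding` and `decode_text` are hypothetical, and the default fallback `cp1252` merely stands in for whatever the system's ANSI encoding happens to be.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_encoding(data, fallback="cp1252"):
    """Return (encoding name, BOM length) based on the leading bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return fallback, 0  # no BOM: assume the "ANSI" fallback


def decode_text(data, fallback="cp1252"):
    """Decode bytes using the detected encoding, skipping the BOM."""
    name, skip = detect_encoding(data, fallback)
    return data[skip:].decode(name)


print(detect_encoding(b"\xff\xfeA\x00"))  # ('utf-16-le', 2)
print(decode_text(b"\xef\xbb\xbfhi"))     # hi
```

One known limitation of this ordering: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be misread as UTF-32 little-endian, though such files are rare in practice.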

 

For a more detailed introduction to Chinese encodings, see: http://www.cnblogs.com/xkfz007/articles/2566434.html
