File encoding and read/write pitfalls under Windows

Source: Internet
Author: User

I recently wrote a small tool that searches the contents of every text file under a given path and replaces a specified string. Here are a few places where it is easy to go wrong.
1. fread stores the file's bytes in a buffer. Before searching that buffer for a string such as "People's Republic", you first have to determine the file's encoding. The target string must be converted to that same encoding, and only then can string.find be used to locate it.
How do you determine the encoding? Windows text files generally use one of two character sets, Unicode and GBK. The tool looks at the first three bytes of the file: if they are "EF BB BF", the file is treated as UTF-8; otherwise it is assumed to be GBK.
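As a rough sketch of that search step: the names below (kNeedleUtf8, kNeedleGbk, has_utf8_bom, find_target) are hypothetical and not from the original tool, and the search target is assumed to be the two characters "中国", written out by hand in both encodings rather than converted at run time.

```cpp
#include <cstddef>
#include <string>

// Hypothetical search target "中国", pre-encoded in each character set.
const std::string kNeedleUtf8 = "\xE4\xB8\xAD\xE5\x9B\xBD";  // UTF-8 bytes
const std::string kNeedleGbk  = "\xD6\xD0\xB9\xFA";          // GBK bytes

// True if the buffer starts with the UTF-8 BOM (EF BB BF).
bool has_utf8_bom(const std::string& buf) {
    return buf.size() >= 3 && buf.compare(0, 3, "\xEF\xBB\xBF") == 0;
}

// Picks the needle that matches the guessed encoding, then searches.
// Returns the byte offset of the match, or std::string::npos.
std::size_t find_target(const std::string& buf) {
    const std::string& needle = has_utf8_bom(buf) ? kNeedleUtf8 : kNeedleGbk;
    return buf.find(needle);
}
```

A real tool would convert an arbitrary target string at run time, for example through the Windows code-page conversion functions; that part is left out here.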

Off-topic: this check is admittedly crude, because Windows can also store files as UTF-16. When you save a file in Notepad, you can choose "Unicode" or "Unicode big endian" as the encoding. Notepad records the encoding by writing a BOM (byte order mark) at the start of the file:
UTF-8 BOM: EF BB BF
UTF-16 (little endian) BOM: FF FE
UTF-16 big endian BOM: FE FF
UTF-16 is often loosely called UCS (strictly, UCS-2 is its fixed-width predecessor), and on Windows "Unicode encoding" usually means UTF-16. You will also see the term "wide characters", because UTF-16 stores text in two-byte units; even ASCII characters take two bytes.
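The three BOMs listed above can be checked with a few byte comparisons. This is a minimal sketch; the Bom enum and detect_bom function are illustrative names, not part of the original tool.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a buffer and reports which BOM, if any, is present.
Bom detect_bom(const std::string& buf) {
    // Read bytes as unsigned so comparisons with 0xEF etc. behave as expected.
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(buf[i]); };
    if (buf.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::Utf8;
    if (buf.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::Utf16LE;
    if (buf.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;  // no BOM: the encoding has to be guessed another way
}
```

Bom::None only means that no marker was found; it does not prove the file is GBK.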

Some people will then ask whether deciding the encoding from the first three bytes alone can go wrong. Of course it can. For example, create a new text file in Notepad, type the two characters "联通" (Unicom), save it, and open it again: the content is garbled.
Why? Because Windows Notepad identifies the file as UTF-8. Someone will ask: isn't a UTF-8 file supposed to have a BOM that identifies it? In fact UTF-8 does not require a BOM. A UTF-8 file with a BOM can be converted to UTF-8 without a BOM (you can try this in Notepad++), so Notepad cannot rely on the BOM alone to decide whether a file is UTF-8. It also runs a simple analysis of the bytes in the file, and that analysis is what produces the wrong guess here.
First, the structure of UTF-8. UTF-8 uses 1 to 3 bytes to encode each character in the range U+0000 to U+FFFF (characters beyond U+FFFF take 4 bytes), and the number of bytes is strictly determined by the Unicode code point. The table below shows the correspondence:
Unicode code point range    UTF-8 byte structure
U+0000-U+007F               0xxxxxxx
U+0080-U+07FF               110xxxxx 10xxxxxx
U+0800-U+FFFF               1110xxxx 10xxxxxx 10xxxxxx
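To make the table concrete, here is a small sketch that encodes one code point by following those three rows. The function name encode_utf8_bmp is made up for illustration; it deliberately covers only U+0000 to U+FFFF, returns an empty string for anything larger, and does not check for surrogate code points.

```cpp
#include <string>

// Encodes a code point in U+0000..U+FFFF as UTF-8.
// Returns an empty string for code points above U+FFFF (not handled here).
std::string encode_utf8_bmp(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, "联" is U+8054, which falls in the third row and encodes to the three bytes E8 81 94.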

The GBK encoding of "联通" (Unicom) is "C1 AA CD A8".
C1 AA in binary: 110 00001, 10 101010
CD A8 in binary: 110 01101, 10 101000
Note the leading bits 110 and 10 of each pair: they match the two-byte UTF-8 structure exactly, so the file is mistaken for UTF-8 without a BOM. You can open the file in Notepad++ to see which encoding is detected.
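The misdetection can be reproduced with a deliberately naive structural check. This sketch only tests the leading bit patterns of each byte; it is not claimed to be Notepad's actual algorithm, and looks_like_utf8 is a hypothetical name. A strict UTF-8 validator would reject these bytes, because C0 and C1 never occur in valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Naive check: accepts a buffer if every byte fits the 1-, 2- or 3-byte
// UTF-8 bit patterns. It does not reject overlong forms such as a C1 lead byte.
bool looks_like_utf8(const std::string& buf) {
    std::size_t i = 0;
    while (i < buf.size()) {
        unsigned char b = static_cast<unsigned char>(buf[i]);
        std::size_t extra;                           // continuation bytes expected
        if ((b & 0x80) == 0x00)      extra = 0;      // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) extra = 1;      // 110xxxxx
        else if ((b & 0xF0) == 0xE0) extra = 2;      // 1110xxxx
        else return false;                           // not a valid lead byte here
        if (i + extra >= buf.size()) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(buf[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // must be 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

With this check, the GBK bytes C1 AA CD A8 pass as UTF-8, while most other GBK text (for example D6 D0 B9 FA) fails on its second byte and is correctly left as GBK.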


2. Windows uses \r\n for line breaks, that is, 0D 0A, so a binary view of a file shows many 0D 0A pairs. If fopen opens the file with just the "r" flag (text mode), the 0D bytes are not returned: each 0D 0A pair is read as a single 0A. To read them you must use "rb". Likewise, when writing you must use "wb" so that the 0D bytes you write go into the file exactly as given.
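A minimal sketch of binary-mode reading and writing follows. The helper names write_binary and read_binary are illustrative, and error handling is kept to the bare minimum.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Writes data byte-for-byte; "wb" disables newline translation on Windows.
bool write_binary(const char* path, const std::string& data) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::size_t written = std::fwrite(data.data(), 1, data.size(), f);
    bool closed = (std::fclose(f) == 0);
    return closed && written == data.size();
}

// Reads the whole file byte-for-byte; "rb" keeps every 0D in the result.
// Returns an empty string if the file cannot be opened.
std::string read_binary(const char* path) {
    std::string out;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return out;
    char chunk[4096];
    std::size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, f)) > 0) {
        out.append(chunk, n);
    }
    std::fclose(f);
    return out;
}
```

Writing "a\r\nb\r\n" with write_binary and reading it back with read_binary returns the same six bytes, including both 0D bytes.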


Original article: http://www.seanyxie.com/windows%E4%B8%8B%E6%96%87%E4%BB%B6%E7%BC%96%E7%A0%81%E5%92%8C%E8%AF%BB%E5%86%99%E6%B3%A8%E6%84%8F/
