Encoding when a .NET program reads text files

Source: Internet
Author: User

Text encoding is a very annoying problem in programming. Learning more about encoding helps you avoid detours when writing programs.

When a .NET application reads text files, we seldom pay attention to file encoding, because in most cases .NET identifies it for us. The simplest and most commonly used statement for reading a text file is System.IO.File.ReadAllText(filePath), which uses UTF-8 by default. As long as the file path is passed in, the text is usually read correctly. If garbled characters occasionally appear, passing in the default encoding, System.Text.Encoding.Default, usually solves the problem. As a result, .NET programmers (including me) seldom care about encoding.

System.Text.Encoding is the encoding base class in the .NET base class library. All encoding types, such as UTF8Encoding and UnicodeEncoding, inherit from it. Default is a property of the Encoding class that represents the operating system's current ANSI code page. Its value varies with the language, culture, and region of the operating system. On my computer, Default is a System.Text.DBCSCodePageEncoding. A single statement shows the type: in IronPython, print System.Text.Encoding.Default; in C#, Console.WriteLine(System.Text.Encoding.Default).

I could not find a definition of System.Text.DBCSCodePageEncoding on MSDN, but from a Google search I learned that it is an encoding for Asian languages. In that case, it should recognize text encoded as GB2312 or GBK, and it should be powerless against Unicode or other encodings. However, that does not seem to be the case. The following is a test:

Open Notepad, enter "中国" (China), and save the text three times with different encodings: as UTF-8 to utf8.txt, as ANSI to ansi.txt, and as Unicode to unicode.txt. Now we can read the contents of these three files with System.IO.File. The following is the IronPython code:
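If Notepad is not at hand, the three test files can be produced programmatically. The following is a minimal sketch in plain Python 3 (not IronPython), assuming that "ANSI" means the GBK code page, as on a Simplified Chinese system, and that the "Unicode" option means UTF-16 little-endian with a BOM. The function name write_samples and its parameters are hypothetical.

```python
import codecs
import os

TEXT = "\u4e2d\u56fd"  # the two-character word for "China"

def write_samples(directory, ansi_codec="gbk"):
    """Write TEXT as utf8.txt, unicode.txt and ansi.txt in `directory`.

    Assumption: this mimics what Notepad's UTF-8, Unicode and ANSI
    save options produce; `ansi_codec` stands in for the system code page.
    """
    samples = {
        "utf8.txt": codecs.BOM_UTF8 + TEXT.encode("utf-8"),
        "unicode.txt": codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"),
        "ansi.txt": TEXT.encode(ansi_codec),  # ANSI files carry no BOM
    }
    paths = {}
    for name, data in samples.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as f:  # binary mode: write the bytes untouched
            f.write(data)
        paths[name] = path
    return paths
```

Calling write_samples with an existing directory returns a dictionary that maps each file name to the path that was written.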

    import System

    basedir = r'd://'
    # No encoding is passed, so ReadAllText falls back to UTF-8.
    print System.IO.File.ReadAllText(basedir + 'utf8.txt')
    print '-' * 50
    print System.IO.File.ReadAllText(basedir + 'unicode.txt')
    print '-' * 50
    print System.IO.File.ReadAllText(basedir + 'ansi.txt')

The running result is:

By default, System.IO.File.ReadAllText decodes files as UTF-8, so it correctly decodes the file saved as UTF-8. Garbled characters are expected for the ANSI file. However, it also correctly reads the file saved as Unicode, which is puzzling.

Now let's continue the experiment. This time, when reading the files, we explicitly pass in the encoding System.Text.Encoding.Default. The code is:

    import System

    basedir = r'd://'
    # The system ANSI code page is passed explicitly for every file.
    print System.IO.File.ReadAllText(basedir + 'utf8.txt', System.Text.Encoding.Default)
    print '-' * 50
    print System.IO.File.ReadAllText(basedir + 'unicode.txt', System.Text.Encoding.Default)
    print '-' * 50
    print System.IO.File.ReadAllText(basedir + 'ansi.txt', System.Text.Encoding.Default)

The running result is:

The program correctly decodes all three files. In the program above, System.Text.Encoding.Default is a System.Text.DBCSCodePageEncoding. Of course it can decode ANSI, but why can it also decode UTF-8 and Unicode?

The reason is actually very simple: when Notepad saves a file, it adds some extra information. When a file is saved as UTF-8, Notepad prepends the bytes 0xEF, 0xBB, and 0xBF to the text, that is, the BOM (byte order mark). When a file is saved as Unicode (UTF-16 little-endian), it prepends 0xFF and 0xFE.
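These marker bytes are available as constants in Python's standard codecs module, which makes them easy to inspect. The following is a small sketch in plain Python 3 that builds the byte sequences such files would contain. The helper hex_bytes is a hypothetical name introduced only for display.

```python
import codecs

TEXT = "\u4e2d\u56fd"  # the two-character word for "China"

def hex_bytes(data):
    """Format bytes as space-separated hex pairs, e.g. 'ef bb bf'."""
    return " ".join("%02x" % b for b in data)

# Contents of a UTF-8 file saved with a BOM: marker first, then the text.
utf8_file = codecs.BOM_UTF8 + TEXT.encode("utf-8")
# Contents of a "Unicode" (UTF-16 little-endian) file saved with a BOM.
utf16_file = codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le")

print(hex_bytes(utf8_file))   # ef bb bf e4 b8 ad e5 9b bd
print(hex_bytes(utf16_file))  # ff fe 2d 4e fd 56
```

The first three bytes of the UTF-8 file and the first two bytes of the UTF-16 file are the BOM. Everything after them is the encoded text itself.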


When a .NET program reads text, it first checks the start of the file for a marker such as 0xEF 0xBB 0xBF. If one is present, the text is decoded according to the encoding that the marker indicates. If there is no such header, the text is decoded using the encoding that was passed in. The following is a verification:
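The check-then-fall-back logic described here can be sketched in a few lines. This is an illustration in plain Python 3 under stated assumptions, not the actual .NET implementation: it only looks for the UTF-8 and UTF-16 markers, it replaces undecodable bytes instead of raising an error, and the name decode_with_bom_check is hypothetical.

```python
import codecs

# Markers this sketch recognizes, paired with the codec each one implies.
# Assumption: UTF-32 and other markers are left out for brevity.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom_check(data, fallback="utf-8"):
    """Decode bytes, letting a leading BOM override the caller's encoding."""
    for bom, codec in KNOWN_BOMS:
        if data.startswith(bom):
            # A marker was found: skip it and trust the encoding it indicates.
            return data[len(bom):].decode(codec, errors="replace")
    # No marker: the encoding supplied by the caller is all there is to go on.
    return data.decode(fallback, errors="replace")
```

With this sketch, bytes that start with a BOM decode correctly even when fallback is "gbk", which matches what the second experiment showed. Bytes without a BOM decode correctly only when fallback happens to be the right encoding.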

Remove the file header (the BOM) from the text files:

Use the default encoding method to read text:

    import System

    basedir = r'd://'
    print System.IO.File.ReadAllText(basedir + 'utf8.txt', System.Text.Encoding.Default)
    print System.IO.File.ReadAllText(basedir + 'unicode.txt', System.Text.Encoding.Default)

Running result:

Without a file header, DBCSCodePageEncoding is used to decode the UTF-8 and Unicode text, and the results are garbled. Modify the program:

    import System

    basedir = r'd://'
    # No encoding is specified, so UTF-8 is used by default.
    print System.IO.File.ReadAllText(basedir + 'utf8.txt')
    # The file has no BOM, so the UTF-16 encoding must be passed explicitly.
    print System.IO.File.ReadAllText(basedir + 'unicode.txt', System.Text.Encoding.Unicode)

Result:

The result is correct.
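The last two experiments can be reproduced on raw bytes without any files. The following is a sketch in plain Python 3, assuming GBK stands in for the system default code page. The helper strip_bom is a hypothetical name.

```python
import codecs

TEXT = "\u4e2d\u56fd"  # the two-character word for "China"

def strip_bom(data):
    """Return data without a leading UTF-8 or UTF-16 BOM, if one is present."""
    for bom in (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if data.startswith(bom):
            return data[len(bom):]
    return data

# File contents after the header has been removed.
utf8_bare = strip_bom(codecs.BOM_UTF8 + TEXT.encode("utf-8"))
utf16_bare = strip_bom(codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le"))

# Decoding with the wrong code page (assumed GBK) gives garbled text.
garbled_utf8 = utf8_bare.decode("gbk", errors="replace")
garbled_utf16 = utf16_bare.decode("gbk", errors="replace")

# Decoding with the matching encoding recovers the original text.
fixed_utf8 = utf8_bare.decode("utf-8")
fixed_utf16 = utf16_bare.decode("utf-16-le")

print(fixed_utf8 == TEXT, fixed_utf16 == TEXT)      # True True
print(garbled_utf8 == TEXT, garbled_utf16 == TEXT)  # False False
```

Once the BOM is gone, nothing in the bytes tells the reader which encoding was used, so the caller has to supply the right one.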
