Encoding when a .NET program reads text files

Last Update:2018-12-04 Source: Internet

Author: User


Text encoding is an annoying problem in programming. Understanding it better helps you avoid detours when writing programs.

When a .NET application reads text files, we seldom pay attention to file encoding, because in most cases the framework identifies the encoding by itself. The simplest and most commonly used statement for reading a text file is System.IO.File.ReadAllText(filePath), which uses UTF-8 by default. As long as the file path is passed in, the text content is usually read correctly. If garbled characters occasionally appear, passing System.Text.Encoding.Default as the encoding usually solves the problem. As a result, .NET programmers (including me) seldom care about encoding.

System.Text.Encoding is the encoding base class in the .NET base class library. All encoding types, such as UTF8Encoding and UnicodeEncoding, inherit from it. Default is a static property of the Encoding class that represents the operating system's current ANSI code page (this describes the .NET Framework; on .NET Core and later, Default is UTF-8). Its value varies with the language, culture, and region settings of the operating system. On my computer, Default is a System.Text.DBCSCodePageEncoding. A simple statement reveals the type: in IronPython, print System.Text.Encoding.Default; in C#, Console.WriteLine(System.Text.Encoding.Default).

I did not find a definition of System.Text.DBCSCodePageEncoding on MSDN, but I learned from Google that it is an encoding that represents Asian languages. If so, it should be able to recognize text encoded as GB2312 or GBK, and it should be powerless against Unicode or other encodings. However, this does not seem to be the case. The following is a test:

Open Notepad, enter the Chinese word for "China", and save it three times with different encodings: the UTF-8 version as utf8.txt, the ANSI version as ansi.txt, and the Unicode version as unicode.txt. Now we can read the contents of these three files through System.IO.File. The following is the IronPython code:

    import System
    basedir = r'd://'
    print System.IO.File.ReadAllText(basedir + 'utf8.txt')
    print '-' * 50
    print System.IO.File.ReadAllText(basedir + 'unicode.txt')
    print '-' * 50
    print System.IO.File.ReadAllText(basedir + 'ansi.txt')

The running result is:

By default, System.IO.File.ReadAllText uses UTF-8 to decode files, so it correctly decodes the file stored as UTF-8. For the ANSI file, garbled characters are expected. However, it also reads the file stored as Unicode correctly, which is surprising.
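The garbling of the ANSI file can be reproduced outside .NET. The following is a minimal sketch in CPython 3 (not the IronPython used above). It assumes that the ANSI code page on the test machine is GBK and that the sample text is 中国; the article implies both but states neither outright.

```python
# Sample text: the two characters U+4E2D U+56FD (assumed sample, see lead-in).
text = "\u4e2d\u56fd"

# Bytes an ANSI save would contain, assuming the ANSI code page is GBK.
ansi_bytes = text.encode("gbk")
# Bytes a UTF-8 save would contain (without the BOM).
utf8_bytes = text.encode("utf-8")

# GBK bytes are not valid UTF-8. With errors="replace" the decoder
# substitutes U+FFFD for invalid bytes instead of raising an exception.
garbled = ansi_bytes.decode("utf-8", errors="replace")

print(ascii(garbled))                       # contains \ufffd replacement characters
print(utf8_bytes.decode("utf-8") == text)   # True
```

The replacement character U+FFFD in the output is what typically shows up as garbled text when ANSI bytes are decoded as UTF-8.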

Now let's continue the experiment. This time we explicitly pass the encoding System.Text.Encoding.Default when reading the files. The code is:

    import System
    basedir = r'd://'
    print System.IO.File.ReadAllText(basedir + 'utf8.txt', System.Text.Encoding.Default)
    print '-' * 50
    print System.IO.File.ReadAllText(basedir + 'unicode.txt', System.Text.Encoding.Default)
    print '-' * 50
    print System.IO.File.ReadAllText(basedir + 'ansi.txt', System.Text.Encoding.Default)

The running result is:

The program reads all three files correctly. In the program above, System.Text.Encoding.Default is of type System.Text.DBCSCodePageEncoding. That it can decode ANSI is no surprise, but why can it decode UTF-8 and Unicode?

The reason is actually very simple: when Notepad saves a file, it adds some extra information. When a file is saved as UTF-8, Notepad prepends the bytes 0xEF, 0xBB, 0xBF to the text, that is, the byte order mark (BOM). When a file is saved as Unicode (UTF-16 little-endian), it prepends 0xFF, 0xFE.
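To make the header bytes concrete, here is a small CPython 3 sketch that builds the byte sequences such a save would plausibly produce. The sample text 中国 is an assumption. For little-endian UTF-16, which is what Notepad's "Unicode" option writes, the BOM appears on disk as 0xFF 0xFE.

```python
import codecs

text = "\u4e2d\u56fd"  # assumed sample text

# The "utf-8-sig" codec prepends the UTF-8 BOM (EF BB BF) when encoding.
utf8_with_bom = text.encode("utf-8-sig")

# For UTF-16 little-endian, prepend the BOM explicitly (FF FE).
utf16_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(utf8_with_bom.hex())    # efbbbfe4b8ade59bbd
print(utf16_with_bom.hex())   # fffe2d4efd56
```

The first three bytes of the UTF-8 output and the first two bytes of the UTF-16 output are the BOM. Everything after them is the encoded text itself.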

When a .NET program reads text, it first checks the beginning of the file for an identifier such as 0xEF 0xBB 0xBF. If one is found, the text is decoded according to the encoding that the identifier indicates. If there is no such header, the text is decoded with the encoding that was passed in. The following is a verification:
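The check described above can be sketched in a few lines of CPython 3. This is an illustration of the idea, not the actual .NET implementation: the function name decode_with_bom_check and the table of recognised BOMs are hypothetical, and only three BOMs are handled.

```python
import codecs

# Hypothetical subset of recognised headers: (BOM bytes, codec name).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom_check(data, fallback):
    """Decode bytes: a recognised BOM wins, otherwise use the fallback codec."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the BOM and decode with the encoding it indicates.
            return data[len(bom):].decode(name)
    # No header found: decode with the caller-supplied encoding.
    return data.decode(fallback, errors="replace")

text = "\u4e2d\u56fd"  # assumed sample text
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

print(decode_with_bom_check(with_bom, "gbk") == text)      # True
print(decode_with_bom_check(without_bom, "gbk") == text)   # False
```

With the BOM present, the fallback encoding is never consulted, which is why passing a GBK-style default still yields correct text. Without the BOM, the fallback is used and the UTF-8 bytes are misread.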

Remove the file headers (the BOMs) from utf8.txt and unicode.txt:

Then read the text with the default encoding:

    import System
    basedir = r'd://'
    print System.IO.File.ReadAllText(basedir + 'utf8.txt', System.Text.Encoding.Default)
    print System.IO.File.ReadAllText(basedir + 'unicode.txt', System.Text.Encoding.Default)

Running result:

Without the file header, DBCSCodePageEncoding is used to decode the UTF-8 and Unicode text, and the results are garbled. Modify the program:

    import System
    basedir = r'd://'
    # No encoding is specified, so UTF-8 is used by default.
    print System.IO.File.ReadAllText(basedir + 'utf8.txt')
    print System.IO.File.ReadAllText(basedir + 'unicode.txt', System.Text.Encoding.Unicode)

Result:

The result is correct.
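The last two experiments can be mirrored on in-memory bytes in CPython 3. This sketch again assumes that the default (ANSI) encoding is GBK and that the sample text is 中国. It stands in for the BOM-less files and is not a reproduction of the .NET calls.

```python
text = "\u4e2d\u56fd"  # assumed sample text

# File contents after the BOM has been stripped.
utf8_no_bom = text.encode("utf-8")
utf16_no_bom = text.encode("utf-16-le")

# Decoding with the assumed ANSI code page (GBK) misreads both.
wrong_utf8 = utf8_no_bom.decode("gbk", errors="replace")
wrong_utf16 = utf16_no_bom.decode("gbk", errors="replace")

# Decoding with the encoding that actually produced the bytes works.
right_utf8 = utf8_no_bom.decode("utf-8")
right_utf16 = utf16_no_bom.decode("utf-16-le")

print(wrong_utf8 == text, wrong_utf16 == text)   # False False
print(right_utf8 == text, right_utf16 == text)   # True True
```

Once the header is gone, the bytes carry no hint about their encoding, so the caller has to supply the right one.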

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
