C# Reading text files (unknown encoding)

Last Update: 2018-12-06 Source: Internet

Author: User


Figure 1 shows an example file named test.txt, which was saved by VC. Its encoding is unknown. It contains only two strings (one per row), as shown below:

Figure 1

Each row is laid out in a fixed format: the first six bytes hold the person's name (a string), and the last two bytes hold the age (an integer). For example, the first row, "Xiong Wenwen 28", means that Xiong Wenwen is 28 years old. Note: the "?" in the second row is neither a Chinese question mark nor an English question mark. Windows displays "?" only because it has no corresponding character. The two bytes behind this "?" are C6 32.

Figure 2

Judging from the hexadecimal bytes, the text is probably not Unicode encoded, because there is no BOM at the start of the file.
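The BOM check mentioned above can be sketched as follows. This is only an illustration: the class name BomSniffer and the method name DetectBom are made up for this example, and the sample bytes are the second row from Figure 2 rather than the real start of test.txt. A UTF-32 LE BOM (FF FE 00 00) also starts with FF FE and is not distinguished here.

```csharp
using System;

static class BomSniffer
{
    // Returns the Unicode encoding suggested by a BOM at the start of the
    // buffer, or "none" when the buffer does not start with a known BOM.
    public static string DetectBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8";
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16 LE";
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "UTF-16 BE";
        return "none";
    }

    static void Main()
    {
        // Sample bytes: the second row shown in Figure 2 (an assumption for
        // illustration, not the actual first bytes of the file).
        byte[] head = { 0xD5, 0xC5, 0xC8, 0xFD, 0xC6, 0x32, 0x32, 0x30 };
        Console.WriteLine(DetectBom(head)); // prints "none"
    }
}
```

A missing BOM only rules out BOM-marked Unicode files. It does not say which legacy code page the file uses, which is why the encoding still has to be guessed.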

Now we need to write a C# program that reads the text file above and extracts the name and age from each row.

The code is as follows:

    // Requires: using System; using System.IO;
    public void ReadFileData(string fileName)
    {
        // System.Text.Encoding encode = System.Text.Encoding.Default;
        System.Text.Encoding encode = System.Text.Encoding.GetEncoding("gb2312");
        // Encoding.Default is the encoding of the operating system. On a Chinese system it is
        // generally the same as Encoding.GetEncoding("gb2312"), but on other systems it differs.
        // Naming "gb2312" explicitly is therefore more reliable, so the data can also be read
        // on English or other operating systems.
        using (StreamReader sr = new StreamReader(fileName, encode))
        {
            int nRow = 0;           // row number
            string sLineBuf = null; // row data buffer
            string sName = "";      // person's name
            int nAge = 0;           // age
            while ((sLineBuf = sr.ReadLine()) != null)
            {
                nRow++;
                if (sLineBuf == "")
                    continue; // skip the current row if it is an empty string
                sName = GetSubString(sLineBuf, 0, 6);       // the name (first 6 bytes)
                string sAge = GetSubString(sLineBuf, 6, 2); // the last two bytes
                nAge = Convert.ToInt16(sAge);               // the age
            }
        }
    }

Because the string Substring() method extracts by number of characters rather than by number of bytes, I wrote a function, GetSubString, that extracts a substring by number of bytes:

    /// <summary>
    /// Extracts a substring from a string. Both the start position and the
    /// length are measured in bytes rather than characters.
    /// </summary>
    /// <param name="sStr">source string</param>
    /// <param name="nStart">start position, in bytes</param>
    /// <param name="nByte">length, in bytes</param>
    /// <returns>the extracted string</returns>
    public static string GetSubString(string sStr, int nStart, int nByte)
    {
        string tStr = "";
        System.Text.Encoding gb = System.Text.Encoding.GetEncoding("gb2312");
        byte[] sBytes = gb.GetBytes(sStr); // convert to a byte array
        if (sBytes.Length == 0)
            return tStr;
        if (nStart >= sBytes.Length)
            return tStr;
        byte[] tBytes = new byte[nByte];
        int i = nStart;
        int j = 0;
        while (i < sBytes.Length && j < nByte)
        {
            tBytes[j++] = sBytes[i++];
        }
        // Convert back to a string. Only the j bytes actually copied are decoded,
        // so a short source does not produce trailing NUL characters.
        tStr = gb.GetString(tBytes, 0, j);
        return tStr;
    }

With the program above written, start reading the data. The output is as follows:

Person Name: Age

Xiong Wenwen: 28

Zhang San?2: 0

The first row of data is fine, but the second row is wrong. The name should have been "Zhang San?" (the bytes D5 C5 C8 FD are Zhang San in GB2312) and the age 20, but the program reads the name as "Zhang San?2" and the age as 0, which is obviously incorrect. Why does this happen?

Debugging shows that the bytes C# ends up with for that row are D5 C5 C8 FD 3F 32 30, whereas the file contains D5 C5 C8 FD C6 32 32 30 (see Figure 2). The original C6 32 has become 3F (3F is the English question mark), so one byte has been lost. The first six bytes taken are D5 C5 C8 FD 3F 32, which correspond to the string "Zhang San?2". Only one byte, 30, is left after that, and its character is "0". So the final result is the name "Zhang San?2" and the age 0.

Comparing the two rows: the first row does not contain C6 32, so C# reads it correctly. The second row contains C6 32, and when C# reads this pair it is automatically converted to 3F. Pairs other than C6 32, such as BD 32, behave the same way. Why? I think it is because GB2312 has no such code: both bytes of a two-byte GB2312 character lie in the range A1 to FE, and 32 does not. The pair therefore cannot be recognized and is automatically replaced with the English "?", so of course it cannot be read or displayed correctly.
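The silent replacement with "?" can be turned into a detectable error by asking the decoder to throw instead. The following is a minimal sketch, not the program used above: the class name StrictDecodeDemo and the method IsValid are made up for this example. It assumes .NET Core 3.0 or later, where the gb2312 code page has to be registered first. On .NET Framework the code page is built in, and CodePagesEncodingProvider is only available through the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;

static class StrictDecodeDemo
{
    // Returns true when every byte sequence in data can be decoded with the
    // named encoding, and false when the decoder hits an invalid sequence.
    public static bool IsValid(byte[] data, string encodingName)
    {
        // Exception fallbacks make the decoder throw instead of emitting '?'.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as gb2312 must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The second row from Figure 2, containing the pair C6 32.
        byte[] row2 = { 0xD5, 0xC5, 0xC8, 0xFD, 0xC6, 0x32, 0x32, 0x30 };
        Console.WriteLine(IsValid(row2, "gb2312"));
    }
}
```

If C6 32 is rejected as described above, the check is expected to report the second row as invalid, which tells the program that the row needs byte-level handling rather than plain string decoding.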

For text data like the above (the encoding is unknown, and the content mixes Chinese and English characters, or even characters that are neither), how can the data be read in C# without errors?
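One possible direction, offered only as a sketch, is to slice the fields out of the raw bytes before any decoding, so that an unrecognized pair such as C6 32 cannot shift the byte positions. The names FixedWidthReader, SplitRows and ParseAge are made up for this example. It assumes that rows end with LF or CR LF, that the age is stored as ASCII digits, and that the program runs on .NET Core 3.0 or later (on .NET Framework the RegisterProvider line is not needed).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

static class FixedWidthReader
{
    const int NameBytes = 6; // width of the name field, as described in the article
    const int AgeBytes = 2;  // width of the age field, as described in the article

    // Splits raw file content into rows at LF, dropping a CR before the LF
    // and skipping empty rows.
    public static List<byte[]> SplitRows(byte[] data)
    {
        var rows = new List<byte[]>();
        int start = 0;
        for (int i = 0; i <= data.Length; i++)
        {
            if (i == data.Length || data[i] == 0x0A)
            {
                int end = i;
                if (end > start && data[end - 1] == 0x0D) end--;
                if (end > start)
                {
                    byte[] row = new byte[end - start];
                    Array.Copy(data, start, row, 0, row.Length);
                    rows.Add(row);
                }
                start = i + 1;
            }
        }
        return rows;
    }

    // Parses the age from count bytes of ASCII digits starting at offset.
    public static int ParseAge(byte[] row, int offset, int count)
    {
        string digits = Encoding.ASCII.GetString(row, offset, count);
        return int.Parse(digits.Trim(), CultureInfo.InvariantCulture);
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: FixedWidthReader <file>");
            return;
        }
        // Assumption: .NET Core / .NET 5+, where gb2312 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb = Encoding.GetEncoding("gb2312");

        foreach (byte[] row in SplitRows(File.ReadAllBytes(args[0])))
        {
            if (row.Length < NameBytes + AgeBytes)
                continue; // too short to hold both fields
            // The name may still show '?' for bytes gb2312 cannot decode,
            // but the age is no longer affected by them.
            string name = gb.GetString(row, 0, NameBytes);
            int age = ParseAge(row, NameBytes, AgeBytes);
            Console.WriteLine("{0}: {1}", name, age);
        }
    }
}
```

With this layout the age of the second row is taken from the bytes 32 30 directly. The name still cannot be displayed correctly until the real encoding of the file is known, so this addresses the misaligned fields rather than the unknown encoding itself.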

This article is an English version of an article originally written in Chinese on aliyun.com.
