C#: Reading text files (unknown encoding)

Source: Internet
Author: User

The following figure shows the example file, test.txt, which was saved by VC. Its encoding is unknown. It contains only two lines, as shown below:

Figure 1

Each line follows a fixed format. The first six bytes hold the person's name (a string), and the last two bytes hold the age (an integer). For example, the first line, "Xiong Wenwen 28", means that Xiong Wenwen is 28 years old. Note that the "?" in the second line is neither a Chinese nor an English question mark; it is what Windows displays when it has no corresponding character. Here the "?" corresponds to the two bytes C6 32.

Figure 2

Judging from the hexadecimal bytes, the text is probably not Unicode-encoded, because there is no BOM at the start of the file.
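The BOM check described above can be sketched as follows. This is a minimal illustration, not part of the original program: the class name BomProbe and the method DetectBom are hypothetical, and the byte patterns are the standard BOM signatures for UTF-8, UTF-16 and UTF-32.

```csharp
using System;

static class BomProbe
{
    // Returns the name of the encoding whose BOM starts the buffer,
    // or an empty string when no BOM is present.
    public static string DetectBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8";
        // UTF-32 LE is tested before UTF-16 LE because both start with FF FE.
        // FF FE 00 00 is ambiguous in principle; this sketch treats it as UTF-32 LE.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return "UTF-32 LE";
        if (head.Length >= 4 && head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return "UTF-32 BE";
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16 LE";
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "UTF-16 BE";
        return "";
    }
}
```

For example, BomProbe.DetectBom(System.IO.File.ReadAllBytes("test.txt")) would return an empty string for a file that starts with D5 C5. A missing BOM only makes a Unicode encoding less likely; it does not rule out, say, UTF-8 saved without a BOM.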

Now we need to write a C# program that reads the text file above and extracts the name and age from each line.

The code is as follows:

using System;
using System.IO;
using System.Text;

public void ReadFileData(string fileName)
{
    // Encoding.Default is the operating system's default encoding. On a Chinese system it is
    // usually Encoding.GetEncoding("gb2312"), but other systems differ.
    // Naming gb2312 explicitly is therefore more reliable: the data is read the same way
    // on English or other operating systems.
    // Encoding encode = Encoding.Default;
    Encoding encode = Encoding.GetEncoding("gb2312");

    using (StreamReader sr = new StreamReader(fileName, encode))
    {
        int nRow = 0;           // row number
        string sLineBuf = null; // line buffer
        string sName = "";      // person's name
        int nAge = 0;           // age

        while ((sLineBuf = sr.ReadLine()) != null)
        {
            nRow++;
            if (sLineBuf == "")
                continue; // skip empty lines

            sName = GetSubstring(sLineBuf, 0, 6);       // name: the first 6 bytes
            string sAge = GetSubstring(sLineBuf, 6, 2); // age: the last 2 bytes
            nAge = Convert.ToInt16(sAge);               // parse the age
        }
    }
}

Because the string's Substring() method extracts a substring by character count rather than by byte count, I wrote a function, GetSubstring, that extracts substrings by byte count.
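The difference between character count and byte count can be illustrated with a short sketch. The class name CharVsByte and the method Measure are hypothetical. The Encoding.RegisterProvider call assumes .NET Core or .NET 5+, where code-page encodings such as gb2312 must be registered first; on .NET Framework that line is not needed.

```csharp
using System;
using System.Text;

static class CharVsByte
{
    // Returns { character count, gb2312 byte count } for the given string.
    public static int[] Measure(string s)
    {
        // Assumption: running on .NET Core / .NET 5+, where gb2312 is not available by default.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb = Encoding.GetEncoding("gb2312");
        return new[] { s.Length, gb.GetByteCount(s) };
    }
}
```

For the string "\u5F20\u4E0920" (two Chinese characters followed by "20"), Measure returns 4 characters but 6 bytes, because each Chinese character occupies two bytes in gb2312. This is why Substring(0, 6) cannot be used to take "the first six bytes".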

/// <summary>
/// Extracts a substring from a string, with both the start position and the length
/// measured in bytes rather than characters.
/// </summary>
/// <param name="sStr">Source string</param>
/// <param name="nStart">Start position, in bytes</param>
/// <param name="nByte">Length, in bytes</param>
/// <returns>The extracted substring</returns>
public static string GetSubstring(string sStr, int nStart, int nByte)
{
    string tStr = "";
    System.Text.Encoding gb2312 = System.Text.Encoding.GetEncoding("gb2312");
    byte[] sBytes = gb2312.GetBytes(sStr); // convert to a byte array

    if (sBytes.Length == 0)
        return tStr;
    if (nStart > sBytes.Length)
        return tStr;

    byte[] tBytes = new byte[nByte];
    int i = nStart;
    int j = 0;
    while (i < sBytes.Length && j < nByte)
    {
        tBytes[j++] = sBytes[i++];
    }

    // Decode only the j bytes actually copied, so that a short source
    // does not leave trailing '\0' characters in the result.
    tStr = gb2312.GetString(tBytes, 0, j); // convert back to a string
    return tStr;
}

With the program written, I ran it to read the data. The output was:

Name: Age

Xiong Wenwen: 28

Zhang San?2: 0

The first line is read correctly, but the second is wrong. The name should be "Zhang San?" (张三 followed by the unknown character) and the age should be 20, yet the program reports the name "Zhang San?2" and the age 0. Why?

Debugging shows that the bytes of the line as read by C# are D5 C5 C8 FD 3F 32 30, whereas the file actually contains D5 C5 C8 FD C6 32 32 30 (see Figure 2). The original C6 32 has become 3F (the byte for the English question mark), so the line is one byte short. The first six bytes, D5 C5 C8 FD 3F 32, correspond to the string "Zhang San?2", and only one byte, 30, is left over, which is the character "0". The final result is therefore the name "Zhang San?2" and the age 0.

The comparison shows that the first line does not contain C6 32, so C# reads it correctly. The second line does contain C6 32, and when C# reads such a character it automatically converts it to 3F. The same happens if C6 32 is replaced with a similar pair such as BD 32. Why? I think this byte pair is not a valid code in gb2312, so it cannot be recognized. The character is automatically converted to the English "?" and, of course, cannot be read or displayed correctly.
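One way to make this silent substitution visible is to ask the decoder to throw instead of inserting "?". The sketch below is an illustration only: StrictDecode and TryDecode are hypothetical names, and the RegisterProvider call assumes .NET Core or .NET 5+. It uses the Encoding.GetEncoding overload that takes explicit encoder and decoder fallbacks.

```csharp
using System;
using System.Text;

static class StrictDecode
{
    // Tries to decode bytes as gb2312. Returns false, instead of silently
    // substituting '?', when the bytes contain a sequence the decoder cannot map.
    public static bool TryDecode(byte[] bytes, out string text)
    {
        // Assumption: running on .NET Core / .NET 5+, where gb2312 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding strict = Encoding.GetEncoding(
            "gb2312",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            text = strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = "";
            return false;
        }
    }
}
```

Called on the bytes of the second line (D5 C5 C8 FD C6 32 32 30), this should report failure, which signals that the line cannot be safely turned into a string and sliced afterwards.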

Given text data like the above (the encoding is unknown, and the text contains Chinese and English characters and possibly others), how can it be read in C# without errors?
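One possible direction, offered as a sketch rather than a definitive answer, is to slice the fields while the data is still raw bytes and to decode each field only afterwards. An unmappable pair such as C6 32 then cannot shift the field boundaries. The names FixedWidthReader, Person and Parse are hypothetical. The sketch assumes lines end in LF or CR LF, that every record is at least 8 bytes (6 for the name, 2 for the age), that the age is written as ASCII digits, and that it runs on .NET Core or .NET 5+.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class FixedWidthReader
{
    // Hypothetical record type: raw name bytes, best-effort decoded name, and age.
    public sealed class Person
    {
        public byte[] NameBytes = new byte[0];
        public string Name = "";
        public int Age;
    }

    public static List<Person> Parse(byte[] data)
    {
        // Assumption: running on .NET Core / .NET 5+, where gb2312 must be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb = Encoding.GetEncoding("gb2312");
        var result = new List<Person>();
        int start = 0;
        for (int i = 0; i <= data.Length; i++)
        {
            bool atEnd = i == data.Length;
            if (!atEnd && data[i] != (byte)'\n')
                continue;                          // still inside the current line
            int len = i - start;
            if (len > 0 && data[start + len - 1] == (byte)'\r')
                len--;                             // drop the CR of a CR LF pair
            if (len >= 8)                          // shorter (for example empty) lines are skipped
            {
                var p = new Person();
                p.NameBytes = new byte[6];
                Array.Copy(data, start, p.NameBytes, 0, 6);
                // The decoded name may still contain '?' for unmappable bytes,
                // but the raw bytes are kept and the field boundaries are unaffected.
                p.Name = gb.GetString(data, start, 6);
                // Throws FormatException if the two bytes are not ASCII digits.
                p.Age = int.Parse(Encoding.ASCII.GetString(data, start + 6, 2));
                result.Add(p);
            }
            start = i + 1;
        }
        return result;
    }
}
```

A call such as FixedWidthReader.Parse(System.IO.File.ReadAllBytes("test.txt")) would then yield the age 20 for the second line, with the bytes C6 32 preserved in NameBytes for whatever handling the unknown character needs.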
