Use C# to Read Binary Files

Source: Internet
Author: User

It would be nice to think that every file has been converted to XML, but that is not the case. A large number of file formats are still neither XML nor even ASCII. Binary files are still sent over networks, stored on disk, and passed between applications, and for these purposes they are more efficient than text files.


In C and C++, reading binary files is easy. Apart from carriage return and line feed handling, every file read in C/C++ is a binary file. In fact, C/C++ only knows about binary files, and about how to make a binary file behave like a text file. As the languages we use have become more abstract, the newest ones can no longer read such files directly and easily. These languages want to process the data stream in their own way.

Problem
In many areas of computing, C and C++ programs still store and read data directly as data structures. In C and C++, it is very simple to read and write a file according to the layout of a structure in memory. In C, you only need to call fwrite() with a pointer to your data, the size of each item, the number of items, and a file handle. The data is then written straight into the file in binary form.

Once the data has been written this way, reading the file is easy as long as you know the correct data structure. You only need to call fread() with a pointer to a buffer, the size of each item, the number of items to read, and a file handle. fread() does everything else, and the data is back in memory. No parsing and no object model are involved; the file is read directly into memory.

In C and C++, the two biggest problems are structure alignment and byte swapping. Data alignment means that the compiler sometimes leaves unused bytes between the fields of a structure, because a processor that accesses a misaligned field is no longer working in its optimal state: the access takes more time (often said to be about twice as long) and more instructions. The compiler therefore optimizes for execution speed by skipping those bytes or rearranging the data. Byte swapping, on the other hand, means reordering the bytes of a value, because different processors store the bytes of a multi-byte value in different orders.

Data Alignment
Because processors can handle more information at a time (within one clock cycle), they prefer the data they process to be laid out in a definite way. Most Intel processors want a 32-bit integer to start at an address that is divisible by 4. An integer stored at any other address is slower to access on Intel processors, and on some other architectures the access fails outright. The compiler knows this, so when it meets data that could cause the problem, it has three options.

First, it can add unused padding bytes to the data so that the integer's start address is divisible by 4. This is the most common practice. Second, it can reorder the fields so that the integers fall on 4-byte boundaries. This causes other interesting problems, so it is rarely done. Third, it can leave the integers off the 4-byte boundaries and generate code that copies them to a properly aligned location before they are used. This costs some extra time, but it is useful when the data must be kept compact.
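The first option, padding, can be observed from C# as well. The following is a minimal sketch, not part of the original listings: `PaddedSample` and `AlignmentDemo` are hypothetical names, and the sizes mentioned assume the default packing used on common desktop platforms.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical record: one byte followed by a 32-bit integer.
[StructLayout(LayoutKind.Sequential)]
public struct PaddedSample
{
    public byte Flag;
    public int Value;
}

public static class AlignmentDemo
{
    // Size of the struct as the marshaler lays it out, padding included.
    public static int MarshaledSize() => Marshal.SizeOf<PaddedSample>();

    // Byte offset of Value from the start of the struct.
    public static int ValueOffset() =>
        (int)Marshal.OffsetOf<PaddedSample>(nameof(PaddedSample.Value));
}
```

With default packing, the marshaler is expected to place Value at offset 4 and report a total size of 8 bytes, even though the fields themselves hold only 5 bytes of data. The other 3 bytes are padding.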

All of these are compiler details, so you do not need to worry about them too much. If the program that writes the data and the program that reads it are built with the same compiler and the same settings, there is no problem: the compiler treats the same data in the same way. However, when files move across platforms, it becomes very important to lay out all the data correctly so that the information can be converted. Some programmers also know how to tell the compiler not to align their data.

Byte Swapping (Big Endian and Little Endian)

Big endian and little endian are two different ways of storing integers in a computer. Because an integer occupies more than one byte, the question is whether the most significant byte is read and written first. The least significant byte is the one that changes most often: if you keep adding 1 to an integer, the least significant byte changes 256 times for every single change of the next byte.

Different processors store integers in different ways. Intel processors generally store integers in little-endian order; in other words, the least significant byte is read and written first. Most other processors have traditionally used big-endian order. Therefore, when binary files are read and written on different platforms, you may have to reorder the bytes to get the correct value.

The problem is especially visible on UNIX, because UNIX runs on a variety of processors, such as those in Sun and HP machines, the IBM PowerPC, and Intel chips. Moving data from one processor to another means that the byte order of the variables must be flipped to match the order the new processor expects.
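To make the two byte orders concrete, here is a small sketch that interprets the same four bytes both ways. It assumes a runtime that provides System.Buffers.Binary.BinaryPrimitives (.NET Core 2.1 or later); `EndianDemo` is a hypothetical name used only for illustration.

```csharp
using System;
using System.Buffers.Binary;

public static class EndianDemo
{
    // True when the machine running this code stores the low byte first.
    public static bool HostIsLittleEndian => BitConverter.IsLittleEndian;

    // Interpret four bytes with the most significant byte first.
    public static uint ReadBigEndian(byte[] bytes) =>
        BinaryPrimitives.ReadUInt32BigEndian(bytes);

    // Interpret the same four bytes with the least significant byte first.
    public static uint ReadLittleEndian(byte[] bytes) =>
        BinaryPrimitives.ReadUInt32LittleEndian(bytes);

    // Reverse the byte order of a value that was read the wrong way round.
    public static uint Swap(uint value) =>
        BinaryPrimitives.ReverseEndianness(value);
}
```

For the bytes 0x12 0x34 0x56 0x78, ReadBigEndian yields 0x12345678 and ReadLittleEndian yields 0x78563412. Swap converts one result into the other.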

Use C# to Process Binary Files
Processing binary files in C# brings two new challenges. The first is that all .NET languages are strongly typed, so you have to convert the byte stream in the file into the data types you want. The second is that some data types are more complex than they appear and need some kind of conversion.
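For simple cases, the conversion from bytes to typed values can be done one field at a time with the framework's BinaryWriter and BinaryReader classes. This is an alternative to the structure-based approach this article goes on to describe, not the article's own method. The sketch below uses a MemoryStream in place of a real file, and `ReaderDemo` is a hypothetical name.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderDemo
{
    // Writes a 16-bit and a 32-bit value, then reads them back in the same order.
    public static (ushort Version, uint Length) RoundTrip(ushort version, uint length)
    {
        using (var stream = new MemoryStream())
        {
            // leaveOpen keeps the stream usable after the writer is disposed.
            using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
            {
                writer.Write(version);
                writer.Write(length);
            }

            stream.Position = 0; // rewind before reading
            using (var reader = new BinaryReader(stream))
            {
                // Each Read call names the type explicitly.
                ushort readVersion = reader.ReadUInt16();
                uint readLength = reader.ReadUInt32();
                return (readVersion, readLength);
            }
        }
    }
}
```

The reader must request the fields in exactly the order and with exactly the types that the writer used, because nothing in the stream describes its own layout.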

Type Breaking
Because the .NET languages, including C#, are strongly typed, you cannot simply read a run of bytes from the file and drop it into a data structure. When you do need to break the type rules, you have to do it in two steps: first read the required bytes into a byte array, then copy them from start to end into the data structure.
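The two steps described above might look like the following sketch. It is not the code from Listing A. `RecordHeader` and `RawConverter` are hypothetical names, and the copy is done by pinning the array and calling Marshal.PtrToStructure, which is one of several ways to do it.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical fixed-size header made of two 32-bit fields.
[StructLayout(LayoutKind.Sequential)]
public struct RecordHeader
{
    public uint Magic;
    public uint Length;
}

public static class RawConverter
{
    // Copies the leading bytes of the buffer into a new structure of type T.
    public static T BytesToStruct<T>(byte[] buffer) where T : struct
    {
        if (buffer.Length < Marshal.SizeOf<T>())
            throw new ArgumentException("Buffer is smaller than the structure.");

        // Pin the array so the garbage collector cannot move it during the copy.
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            return Marshal.PtrToStructure<T>(handle.AddrOfPinnedObject());
        }
        finally
        {
            handle.Free();
        }
    }

    // Copies a structure into a new byte array of exactly its marshaled size.
    public static byte[] StructToBytes<T>(T value) where T : struct
    {
        byte[] buffer = new byte[Marshal.SizeOf<T>()];
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            Marshal.StructureToPtr(value, handle.AddrOfPinnedObject(), false);
        }
        finally
        {
            handle.Free();
        }
        return buffer;
    }
}
```

In a real program the byte array would come from the file, for example from a FileStream read of Marshal.SizeOf<RecordHeader>() bytes. The resulting field values depend on the byte order of the machine, which is discussed under Endian Flipping.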

Search the Usenet archives (Usenet is a worldwide newsgroup network) and you will find, in the microsoft.public.dotnet hierarchy of groups, several routines that convert any object into a series of bytes and back into an object. They are given in Listing A.

Complex data types
In C++, you know what is an object, what is an array, and what is neither. In C#, things are not as simple as they look. A string is an object, and so is an array. C# has no true arrays in the C sense, and many objects have no fixed size. As a result, some complex data types do not fit fixed-size binary data.

Fortunately, .NET provides a way around this. You can tell C# how you want strings and other array-like types to be handled by using the MarshalAs attribute. The following example applies it to a string in C#. The attribute must be placed immediately before the field it controls:

[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 50)]
Set SizeConst to the length of the string you want to read from, or store in, the binary file. This fixes the maximum length of the string.
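In context, the attribute sits on a string field inside a structure. The following sketch shows one possible arrangement. `NamedRecord` and `StringFieldDemo` are hypothetical names, CharSet.Ansi is an assumption about the file's text encoding, and the round trip goes through unmanaged memory only to show that the fixed-size field survives marshaling.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical record holding an ID and a fixed-size, 50-character name field.
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Ansi)]
public struct NamedRecord
{
    public uint Id;

    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 50)]
    public string Name;
}

public static class StringFieldDemo
{
    // Marshals the record to unmanaged memory and back again.
    public static NamedRecord RoundTrip(NamedRecord record)
    {
        IntPtr memory = Marshal.AllocHGlobal(Marshal.SizeOf<NamedRecord>());
        try
        {
            Marshal.StructureToPtr(record, memory, false);
            return Marshal.PtrToStructure<NamedRecord>(memory);
        }
        finally
        {
            Marshal.FreeHGlobal(memory);
        }
    }
}
```

Note that with ByValTStr the marshaler reserves one of the SizeConst characters for the terminating null, so a field declared with SizeConst = 50 holds at most 49 characters of text.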

Solve the Earlier Problems

Now you know how the problems introduced by .NET are solved. As you will see below, the earlier binary file problems are just as easy to solve.

Pack
You do not have to change compiler settings to control how data is laid out. You only need the StructLayout attribute to arrange or pack the data as you wish, which is useful when different structures need different packing. It is a bit like packing a car: you decide whether to squeeze every item in as tightly as possible or simply to load things loosely, as long as you can get them out again. The StructLayout attribute is used as follows:

[StructLayout(LayoutKind.Sequential, Pack = 1)]

With this setting, the data ignores alignment boundaries and is packed as tightly as possible. The attribute must match the layout of the data you read from the binary file (that is, the layout used when writing the file must be the same as the one used when reading it).
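The effect of Pack = 1 can be checked with a small sketch. `PackedSample` and `PackDemo` are hypothetical names, and the 5-byte record is an invented example rather than a real file format.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical 5-byte on-disk record: one byte followed by a 32-bit integer.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct PackedSample
{
    public byte Flag;
    public uint Value;
}

public static class PackDemo
{
    // Marshaled size with packing set to 1: no padding is inserted.
    public static int PackedSize() => Marshal.SizeOf<PackedSample>();

    // Value follows Flag immediately instead of being moved to offset 4.
    public static int ValueOffset() =>
        (int)Marshal.OffsetOf<PackedSample>(nameof(PackedSample.Value));

    // Reads one packed record from the start of a byte array.
    public static PackedSample FromBytes(byte[] buffer)
    {
        if (buffer.Length < Marshal.SizeOf<PackedSample>())
            throw new ArgumentException("Buffer is smaller than the record.");

        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            return Marshal.PtrToStructure<PackedSample>(handle.AddrOfPinnedObject());
        }
        finally
        {
            handle.Free();
        }
    }
}
```

Here the marshaled size is 5 bytes and Value sits at offset 1, so a 5-byte record written without padding by another program can be read directly.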

You may find that even after you add this attribute to your data, the problem is not completely solved. In some cases you may have to go through tedious, lengthy experiments. The cause is that different computers and compilers handle data differently at the binary level. You must be especially careful when processing binary data, particularly across platforms. .NET is a good tool for working with binary files written by other programs, but it is not a perfect one.

Endian Flipping
One of the classic problems in reading and writing binary files is that some computers store the least significant byte first (Intel, for example), while others store the most significant byte first. In C and C++, you have to handle this by hand, flipping one field at a time. One advantage of the .NET Framework, however, is that code can access type metadata at run time and use that information to fix the byte order of every field in the data automatically. The source code in Listing B shows how this is done.

Once you know the type of the object, you can get each of its fields, examine it, and determine whether it is a 16-bit or a 32-bit unsigned integer. In either case, you can change the byte order without damaging the data.
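A rough sketch of this idea follows. It is not the code from Listing B. `WireHeader` and `EndianFlipper` are hypothetical names, and the sketch assumes a runtime that provides System.Buffers.Binary.BinaryPrimitives (.NET Core 2.1 or later). It walks the public fields of a structure through reflection and reverses only the 16-bit and 32-bit unsigned integers.

```csharp
using System;
using System.Buffers.Binary;
using System.Reflection;

// Hypothetical header with two integer fields and one string field.
public struct WireHeader
{
    public ushort Version;
    public uint Length;
    public string Label;
}

public static class EndianFlipper
{
    // Returns a copy of the value with every ushort and uint field byte-swapped.
    public static T Flip<T>(T value) where T : struct
    {
        // Box once so that SetValue modifies this copy rather than a temporary.
        object boxed = value;

        foreach (FieldInfo field in
                 typeof(T).GetFields(BindingFlags.Public | BindingFlags.Instance))
        {
            object current = field.GetValue(boxed);

            if (current is ushort u16)
                field.SetValue(boxed, BinaryPrimitives.ReverseEndianness(u16));
            else if (current is uint u32)
                field.SetValue(boxed, BinaryPrimitives.ReverseEndianness(u32));
            // Any other field type, such as a string, is left unchanged.
        }

        return (T)boxed;
    }
}
```

For example, flipping a header whose Version is 0x1234 and whose Length is 0x11223344 gives 0x3412 and 0x44332211, while Label is returned as it was. Reflection is slow compared with hand-written swaps, so a design like this trades speed for not having to list the fields by hand.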

Note that the code does nothing with strings. Byte order does not affect strings, so those fields are left untouched by the flipping code. The code deals only with unsigned integers, because negative numbers are not represented in the same way on every system. Negative numbers can be stored in one's complement, but far more commonly they are stored in two's complement. This makes negative numbers harder to handle across platforms. Fortunately, negative numbers are rarely used in binary files.

One more aside: floating-point numbers are likewise not always represented in a standard way. Most systems use the IEEE floating-point format, but some older systems use other formats.

Overcome Difficulties
Although C# still has some problems, you can use it to read binary files. In fact, because C# can access an object's metadata, it is in some ways a better language for reading binary files: it can fix the byte order of an entire data structure automatically.
