This is a translation of http://www.catb.org/esr/structure-packing, written by a well-known hacker. Although it covers a basic C concept, it is well organized and still rewarding to read, especially the notion of a struct's stride address, which I had not noticed before!
1. Who should read this article?
This article describes a technique for reducing the memory footprint of C programs: manually reordering the member declarations of C structs to reduce their size. You need basic knowledge of C to follow it.
You need to know this technique if you write code for memory-constrained embedded systems or operating system kernels. It is also useful if your application data sets are so large that your programs routinely hit memory limits, or if you want to reduce the number of cache misses.
Finally, understanding this technique is a gateway to other esoteric C topics. You are not an advanced C programmer until you have grasped it. You are not a master of C until you could have written this article yourself and can criticize it intelligently.
2. Why did I write this article?
I wrote this article because in late 2013 I found myself heavily applying a C optimization technique that I had learned more than two decades earlier and had rarely used since.
I needed to reduce the memory footprint of a program that used thousands, sometimes tens of thousands, of C struct instances. The program was cvs-fast-export, and it was dying with out-of-memory errors on large repositories.
In situations like this there are ways to reduce memory usage significantly, such as carefully rearranging the order of structure members. The gains can be substantial: in my case I cut the working-set size by about 40%, which let the program handle much larger repositories without dying.
As I worked through the problem and recalled what I had done, I realized that this technique is mostly forgotten today. A simple web search suggests that C programmers no longer talk about it much, at least not where a search engine can see them. A few Wikipedia entries touch on the topic, but I did not find any that covered it comprehensively.
There are reasons for this. Computer science courses (rightly) steer people toward finding better algorithms rather than micro-optimizing. Falling hardware prices have made squeezing memory usage less necessary. And hackers used to learn this technique by bumping their noses on strange hardware architectures, an experience that is less common now.
But the technique still has value in important situations, and will as long as memory is limited. This article is meant to save C programmers from having to rediscover it, so they can spend their effort on more important things.
3. Alignment requirements
The first thing to understand is that, on modern processors, the way your C compiler lays out basic data types in memory is constrained in order to make memory accesses faster.
On x86 or ARM, storage for the basic data types does not normally start at arbitrary byte addresses. Every type except char has an alignment requirement: a char can start at any byte address, but a 2-byte short must start at an even address, a 4-byte int or float must start at an address divisible by 4, and an 8-byte long or double must start at an address divisible by 8. Signed or unsigned makes no difference.
The jargon for this is that basic C types on x86 and ARM are self-aligned. Pointers, whether 32-bit (4 bytes) or 64-bit (8 bytes), are self-aligned too.
Self-alignment makes access faster because it allows a typed datum to be fetched or stored with a single instruction. Without alignment constraints, the code might have to make two or more accesses that span machine-word boundaries. Characters are a special case: they are equally expensive to access wherever they sit inside a machine word, which is why they have no preferred alignment.
I said "on modern processors" because on some older ones, forcing your C program to violate alignment rules (say, by casting an odd address to an int pointer and trying to use it) did not just slow your code down, it caused an illegal instruction fault. This was the behavior on Sun SPARC chips, for example. In fact, with sufficient determination and the right hardware flag (e18) set on the processor, you can still trigger this on x86.
Self-alignment is not the only possible rule, either. Historically, some processors (especially those lacking barrel shifters) have had more restrictive ones. If you do embedded work, you may trip over one of these reefs. Be prepared.
Sometimes you can coerce your compiler into not using the processor's normal alignment rules, usually with a pragma such as #pragma pack. Do not do this casually, because it forces the generation of slower and more expensive code. Usually you can save as much memory, or almost as much, with the techniques I describe here.
The only good reason for #pragma pack is when you have to match your C data layout exactly to some hardware or protocol requirement, such as a memory-mapped hardware port, and violating normal alignment is required for that to work. If you are in that situation and do not already understand what this article explains, you are in deep trouble, and I wish you luck.
4. Padding
Now let's look at a simple example of how variables are laid out in memory. Consider the following series of variable declarations at the top level of a C module:
    char *p;
    char c;
    int x;
If you did not know about data alignment, you might assume that these three variables occupy consecutive bytes in memory: on a 32-bit machine, a 4-byte pointer immediately followed by a 1-byte char, immediately followed by a 4-byte int. On a 64-bit machine the only difference would be that the pointer is 8 bytes.
Here is what actually happens (on x86, ARM, or any other machine with self-aligned types). The storage for p starts on a self-aligned 4-byte or 8-byte boundary, depending on the machine's word size. This is pointer alignment, the strictest possible.
The storage for c follows immediately. But the 4-byte alignment requirement of x forces a gap in the layout, as though a fourth variable had been inserted:
    char *p;      /* 4 or 8 bytes */
    char c;       /* 1 byte */
    char pad[3];  /* 3 bytes */
    int x;        /* 4 bytes */
The pad[3] character array represents three bytes of wasted space. The old-school term for this is "slop".
Compare what happens if x is a 2-byte short:
    char *p;
    char c;
    short x;
In that case, the actual layout will be this:
    char *p;      /* 4 or 8 bytes */
    char c;       /* 1 byte */
    char pad[1];  /* 1 byte */
    short x;      /* 2 bytes */
On the other hand, if x is a long on a 64-bit machine:
    char *p;
    char c;
    long x;
we end up with this:
    char *p;      /* 8 bytes */
    char c;       /* 1 byte */
    char pad[7];  /* 7 bytes */
    long x;       /* 8 bytes */
If you have followed carefully, you are probably now wondering what happens when the shorter variable is declared first:
    char c;
    char *p;
    int x;
If the actual memory layout were written like this:
    char c;
    char pad1[M];
    char *p;
    char pad2[N];
    int x;
what can we say about M and N?
First, N is zero in this case. The address of x, coming right after p, is guaranteed to be pointer-aligned, which is never less strict than int alignment.
The value of M is less predictable, but c is most likely to be mapped to the first byte of a machine word. In that case M is whatever padding p needs to be pointer-aligned: 3 on a 32-bit machine, 7 on a 64-bit machine.
If you wanted these variables to take up less space, you could swap x with c in the original sequence:
    char *p;      /* 8 bytes */
    long x;       /* 8 bytes */
    char c;       /* 1 byte */
Usually, for the small number of scalar variables in a C program, the few bytes you can gain by changing the order of declaration will not matter much. The technique becomes more interesting when applied to nonscalar variables, especially structs.
Before we get to those, a word about arrays of scalars. On a platform with self-aligned types, arrays of char, short, int, long, or pointers have no internal padding; each member is automatically self-aligned at the end of the previous one.
In the next section we will see that the same is not necessarily true of arrays of structs.
5. Structure alignment and padding
In general, a struct instance has the alignment of its widest scalar member. Compilers do this as the easiest way to ensure that all the members are self-aligned for fast access.
Also, in C the address of a struct is the same as the address of its first member: there is no leading padding. Beware: in C++, classes that look like structs may break this rule! (Whether they do depends on how base classes and virtual member functions are implemented, and varies by compiler.)
(When in doubt about this sort of thing, ANSI C provides the offsetof() macro, which reports the offset of a struct member.)
Consider this structure:
    struct foo1 {
        char *p;
        char c;
        long x;
    };
Assuming a 64-bit machine, any instance of struct foo1 has 8-byte alignment. The memory layout looks like this:
    struct foo1 {
        char *p;      /* 8 bytes */
        char c;       /* 1 byte */
        char pad[7];  /* 7 bytes */
        long x;       /* 8 bytes */
    };
It is laid out exactly as though variables of these types had been declared separately. But if we put c first, that is no longer true:
    struct foo2 {
        char c;       /* 1 byte */
        char pad[7];  /* 7 bytes */
        char *p;      /* 8 bytes */
        long x;       /* 8 bytes */
    };
If the members were separate variables, c could start at any byte boundary and the size of pad might vary. Because struct foo2 has the pointer alignment of its widest member, that is no longer possible. Now c has to be pointer-aligned, and the following 7 bytes of padding are locked in.
Now let's talk about trailing padding on structs. To explain this, I need to introduce a basic concept that I will call the stride address of a struct. It is the first address following the struct's data that has the same alignment as the struct. The general rule of trailing padding is this: the compiler behaves as though the struct has trailing padding out to its stride address. This rule controls what sizeof() returns.
Consider this example on a 64-bit x86 or ARM machine:
    struct foo3 {
        char *p;      /* 8 bytes */
        char c;       /* 1 byte */
    };
    struct foo3 singleton;
    struct foo3 quad[4];
You might think that sizeof(struct foo3) should be 9, but it is actually 16. The stride address of quad[0] is the address of quad[1].p. Thus, in the quad array, each element has 7 bytes of trailing padding, because the first member of each following struct wants to be self-aligned on an 8-byte boundary. The memory layout is as though the struct had been declared like this:
    struct foo3 {
        char *p;      /* 8 bytes */
        char c;       /* 1 byte */
        char pad[7];
    };
For contrast, consider the following example:
    struct foo4 {
        short s;      /* 2 bytes */
        char c;       /* 1 byte */
    };
Because s only needs to be 2-byte aligned, the stride address is just one byte after c, and struct foo4 as a whole needs only one byte of trailing padding. It is laid out like this:
    struct foo4 {
        short s;      /* 2 bytes */
        char c;       /* 1 byte */
        char pad[1];
    };
and sizeof(struct foo4) returns 4.
Now let's consider bitfields. They let you declare struct members smaller than a char, down to 1 bit, like this:
    struct foo5 {
        short s;
        char c;
        int flip:1;
        int nybble:4;
        int septet:7;
    };
The thing to know about bitfields is that they are implemented with word-level and byte-level mask and shift instructions. From the compiler's point of view, the bitfields in struct foo5 look like a 2-byte, 16-bit character array with only 12 bits in use. That is followed in turn by padding to make the length of the struct a multiple of sizeof(short), the size of its widest member. (The exact placement of bitfields is implementation-defined, so treat this as one plausible layout rather than a guarantee.)
    struct foo5 {
        short s;       /* 2 bytes */
        char c;        /* 1 byte */
        int flip:1;    /* total 1 bit */
        int nybble:4;  /* total 5 bits */
        int septet:7;  /* total 12 bits */
        int pad1:4;    /* total 16 bits = 2 bytes */
        char pad2;     /* 1 byte */
    };
Here is a last important detail: if your struct has struct members, the inner structs want to have the alignment of their longest scalar too. Suppose you write this:
    struct foo6 {
        char c;
        struct foo6_inner {
            char *p;
            short x;
        } inner;
    };
The char *p member of the inner struct forces the outer struct to be pointer-aligned as well as the inner. The actual layout on a 64-bit machine is this:
    struct foo6 {
        char c;            /* 1 byte */
        char pad1[7];      /* 7 bytes */
        struct foo6_inner {
            char *p;       /* 8 bytes */
            short x;       /* 2 bytes */
            char pad2[6];  /* 6 bytes */
        } inner;
    };
This struct gives us a hint of the savings that may be possible from repacking structs. Of 24 bytes, 13 are padding. That is more than 50% waste!
6. Structure reordering
Now that you know how and why compilers insert padding in and after your structs, we will examine what you can do to squeeze out the slop. This is the art of structure packing.
The first thing to notice is that slop only happens in two places. One is where storage bound to a larger data type (with stricter alignment requirements) follows storage bound to a smaller one. The other is where a struct naturally ends before its stride address, requiring padding so that the next one will be properly aligned.
The simplest way to eliminate slop is to reorder the struct members by decreasing alignment. That is: put all the pointer-aligned members first, because on a 64-bit machine they are 8 bytes; then the 4-byte ints; then the 2-byte shorts; then the character fields.
So, for example, consider this simple linked-list struct:
    struct foo7 {
        char c;
        struct foo7 *p;
        short x;
    };
With the implied slop made explicit, here it is:
    struct foo7 {
        char c;          /* 1 byte */
        char pad1[7];    /* 7 bytes */
        struct foo7 *p;  /* 8 bytes */
        short x;         /* 2 bytes */
        char pad2[6];    /* 6 bytes */
    };
That is 24 bytes. If we reorder by size, we get this:
    struct foo8 {
        struct foo8 *p;
        short x;
        char c;
    };
Considering self-alignment, we see that none of the data fields need padding. This is because the stride address of a longer field with stricter alignment is always a validly aligned start address for a shorter field with less strict requirements. All the repacked struct actually requires is trailing padding:
    struct foo8 {
        struct foo8 *p;  /* 8 bytes */
        short x;         /* 2 bytes */
        char c;          /* 1 byte */
        char pad[5];     /* 5 bytes */
    };
The repacked struct is 16 bytes, down from 24.
Note that reordering is not guaranteed to produce savings. Applying the technique to a variant of an earlier example, struct foo6, we get this:
    struct foo9 {
        struct foo9_inner {
            char *p;      /* 8 bytes */
            int x;        /* 4 bytes */
        } inner;
        char c;           /* 1 byte */
    };
With the padding written out, this is:
    struct foo9 {
        struct foo9_inner {
            char *p;      /* 8 bytes */
            int x;        /* 4 bytes */
            char pad[4];  /* 4 bytes */
        } inner;
        char c;           /* 1 byte */
        char pad[7];      /* 7 bytes */
    };
It is still 24 bytes, because c cannot back into the inner struct's trailing padding. To collect that gain you would need to redesign your data structures.
7. Weird Data Types
Using enumerated types instead of #defines is a good idea, if only because symbolic debuggers can then show the enumeration names rather than raw integers. But while enums are guaranteed to be compatible with an integer type, the C standard does not specify which integer type is used.
Be aware when repacking your structs that, while enumerated-type variables are usually ints, this is compiler-dependent; they could be shorts, longs, or even chars by default. Your compiler may have a pragma or command-line option to force the size.
The long double type is a similar trouble spot. Some C platforms implement it in 80 bits, some in 128, and some of the 80-bit platforms pad it to 96 or 128 bits.
In both cases it is best to use sizeof() to check the storage size.
8. Readability and cache locality
While reordering by size is the simplest way to eliminate slop, it is not necessarily the right thing. There are two more issues: readability and cache locality.
Programs are not just communications to a computer, they are communications to other human beings. Code readability matters even (or especially!) when the only audience is your future self.
A clumsy, mechanical reordering of your struct can harm readability. When possible, it is better to reorder fields so that semantically related pieces of data stay close together and form a coherent group. Ideally, the design of your struct should communicate the design of your program.
When your program frequently accesses a struct, or parts of one, it helps performance if the accesses tend to fit within a cache line, the block of memory your processor fetches whenever it is told to get any single address within the block. On 64-bit x86 a cache line is 64 bytes beginning on a self-aligned address; on other platforms it is often 32 bytes.
The things you should do to preserve readability, grouping related and co-accessed data in adjacent fields, also improve cache-line locality. Both are reasons to reorder intelligently, with awareness of your code's data-access patterns.
If your code accesses a struct concurrently from multiple threads, there is a third issue: cache line bouncing. To minimize expensive bus traffic, you should arrange your data so that reads come from one cache line and writes go to another in your tighter loops.
And yes, this sometimes contradicts the earlier advice about grouping related data in the same cache-line-sized block. Multithreading is hard. Cache-line bouncing and other multithread optimization issues are very advanced topics that deserve a tutorial of their own. The best I can do here is make you aware that these issues exist.
9. Other packing techniques
Reordering works best when combined with other techniques for slimming your structs. If you have several boolean flags in a struct, for example, consider reducing them to 1-bit bitfields and packing them into a place in the struct that would otherwise be slop.
You will take a small access-time penalty for this, but if it squeezes the working set enough smaller, that penalty will be swamped by your gains from avoided cache misses.
More generally, look for ways to shorten data field sizes. In cvs-fast-export, for example, one squeeze I applied was to use the knowledge that RCS and CVS repositories did not exist before 1982. I dropped a 64-bit Unix time_t (zero date at the beginning of 1970) in favor of a 32-bit time offset from 1982-01-01T00:00:00; this will cover dates to 2118. (Note: if you pull a trick like this, do a bounds check whenever you set the field, to prevent nasty bugs!)
Each such field shortening not only decreases the explicit size of your struct, it may remove slop and create additional opportunities for gains from field reordering. Virtuous cascades of such effects are not hard to trigger.
The riskiest form of packing is to use unions. If you know that certain fields in your struct are never used in combination with certain other fields, consider using a union to make them share storage. But be extra careful and verify your work with regression testing, because if your lifetime analysis is even slightly wrong you will get bugs ranging from crashes to (much worse) subtle data corruption.
10. Tools
There is a tool called pahole that I have not used myself, but that several respondents report to be quite good. It works with the compiler to produce reports on your structs that describe padding, alignment, and cache line boundaries.
11. Proof and exceptional cases
You can download the source code for a little program that demonstrates the assertions about scalar and struct sizes made above. It is packtest.c.
If you look through enough strange combinations of compilers, options, and unusual hardware, you will find exceptions to some of the rules I have described. They get more common as you go back in time to older processor designs.
The next level beyond knowing these rules is knowing how and when to expect that they will be broken. In the years when I learned them (the early 1980s) we spoke of people who did not get this as victims of "all-the-world's-a-VAX syndrome". Remember that not all the world is a PC.
12. Version
1.3 @
Add
1.2 @
Correct an incorrect address calculation.
1.1 @
Explained why aligned accesses are faster. Mentioned offsetof. Added packtest.c.
1.0 @
The initial version.