Program Design Cornerstones and Practice series: The Lost Art of C Structure Packing


Translated from the English original by Eric S. Raymond, The Lost Art of C Structure Packing.

Who should read this article

This article describes a technique for reducing the memory footprint of C programs: manually repacking C structure declarations to reduce their size. You need a basic knowledge of C to read it.

You need to know this technique if you intend to write code for memory-constrained embedded systems or for operating-system kernels. It is useful if you are working with application data sets so large that your programs routinely hit memory limits. It is also good to know in any application where you really need to minimize cache-line misses.

Finally, knowing this technique is a gateway to other esoteric C topics. You are not an advanced C programmer until you have grasped it. You are not a master of C until you could have written this document yourself and can criticize it intelligently.

Why I wrote this article

This article exists because in late 2013 I found myself heavily applying a C optimization technique that I had learned more than two decades earlier and had not used much since.

I needed to reduce the memory footprint of a program that used thousands, sometimes hundreds of thousands, of C struct instances. The program was cvs-fast-export, and the problem was that it was dying with out-of-memory errors on large repositories.

There are ways to reduce memory usage significantly in situations like this, such as carefully rearranging the order of structure members. This can lead to dramatic gains: in my case I was able to cut the working-set size by around 40%, enabling the program to handle much larger repositories without dying.

But as I worked, and thought about what I was doing, it began to dawn on me that the technique I was using has been largely forgotten. A little web research confirmed that C programmers do not seem to talk about it much any more, at least not where a search engine can see them. A couple of Wikipedia entries touch on the topic, but I found nobody who covered it comprehensively.

There are actually good reasons for this. Computer science courses (rightly) steer people away from micro-optimization and towards finding better algorithms. The plunging price of machine resources has made squeezing memory usage less necessary. And the way hackers used to learn this technique was by bumping their noses on unfamiliar hardware architectures, a less common experience now.

But the technique still has value in important situations, and will as long as memory is finite. This document is intended to save C programmers from having to rediscover the technique, so they can concentrate their effort on more important things.

Alignment requirements

The first thing to understand is that, on modern processors, the way your C compiler lays out basic C data types in memory is constrained in order to make memory accesses faster.

Storage for the basic C data types on an x86 or ARM processor does not normally start at arbitrary byte addresses in memory. Rather, each type except char has an alignment requirement: chars can start on any byte address, but 2-byte shorts must start on an even address, 4-byte ints or floats must start on an address divisible by 4, and 8-byte longs or doubles must start on an address divisible by 8. Signed or unsigned makes no difference.

The jargon for this is that basic C types on x86 and ARM are self-aligned. Pointers, whether 32-bit (4-byte) or 64-bit (8-byte), are self-aligned too.
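
The self-alignment rule can be checked directly on whatever platform you are compiling for. The following is a minimal sketch, assuming a C11 compiler (for _Alignof); the macro and function names are made up for illustration and are not part of the original article.

```c
#include <stddef.h>

/* A type is self-aligned when its required alignment equals its size. */
#define IS_SELF_ALIGNED(type) (_Alignof(type) == sizeof(type))

/* Required alignment, in bytes, of a few basic types on this platform. */
size_t alignment_of_char(void)    { return _Alignof(char); }
size_t alignment_of_short(void)   { return _Alignof(short); }
size_t alignment_of_int(void)     { return _Alignof(int); }
size_t alignment_of_pointer(void) { return _Alignof(char *); }

/* Returns 1 if short, int, and pointers are all self-aligned here. */
int basic_types_are_self_aligned(void)
{
    return IS_SELF_ALIGNED(short) && IS_SELF_ALIGNED(int)
        && IS_SELF_ALIGNED(char *);
}
```

Calling basic_types_are_self_aligned() from a test program is a quick way to find out whether the rest of this article's byte counts are likely to apply to your target.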

Self-alignment makes access faster because it allows a single instruction to fetch or store a typed datum. Without alignment constraints, the code might end up having to do two or more accesses spanning machine-word boundaries. Characters are a special case: they are equally expensive to access from anywhere inside a single machine word. That is why they have no preferred alignment.

I said "on modern processors" because on some older ones, forcing your C program to violate alignment rules (say, by casting an odd address into an int pointer and trying to use it) did not just slow your code down, it caused an illegal-instruction fault. This was the behavior on Sun SPARC chips, for example. In fact, with sufficient determination and the right (e18) hardware flag set on the processor, you can still trigger this on x86.

Also, self-alignment is not the only possible rule. Historically, some processors (especially those lacking barrel shifters) have had more restrictive ones. If you do embedded systems, you might trip over one of these traps. Be aware that this is possible.

Sometimes you can coerce your compiler into not using the processor's normal alignment rules by using a pragma, usually #pragma pack. Do not do this casually, as it forces the generation of more expensive and slower code. Usually you can save as much memory, or almost as much, with the techniques I describe here.

The only good reason for #pragma pack is if you have to exactly match your C data layout to some kind of bit-level hardware or protocol requirement, such as a memory-mapped hardware port, and violating normal alignment is required for that to work. If you are in that situation and you do not already understand everything I am writing about here, you are in deep trouble, and I can only wish you luck.
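
To make the effect of the pragma concrete, here is a small sketch comparing a naturally laid out struct with a packed one. It assumes a compiler that accepts the #pragma pack(push, 1) / #pragma pack(pop) form (GCC, Clang, and MSVC do); the struct and function names are hypothetical.

```c
#include <stddef.h>

/* Natural layout: the compiler pads after tag so that value is aligned. */
struct natural_record {
    char tag;
    int  value;
};

/* Packed layout: padding is suppressed, so value sits at offset 1 and
 * every access to it may be a misaligned one. */
#pragma pack(push, 1)
struct packed_record {
    char tag;
    int  value;
};
#pragma pack(pop)

size_t natural_record_size(void) { return sizeof(struct natural_record); }
size_t packed_record_size(void)  { return sizeof(struct packed_record); }
size_t packed_value_offset(void) { return offsetof(struct packed_record, value); }
```

The packed version is exactly sizeof(char) + sizeof(int) bytes, which is what a wire format or device register block may demand; the price is the slower access described above.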

Padding

Now we will look at a simple example of variable layout in memory. Consider the following series of variable declarations at the top level of a C module:

char *p;
char c;
int x;

If you knew nothing about data alignment, you might assume that these three variables would occupy a continuous span of bytes in memory. That is, on a 32-bit machine 4 bytes of pointer would be immediately followed by 1 byte of char, which would be immediately followed by 4 bytes of int. A 64-bit machine would differ only in that the pointer would be 8 bytes.

Here is what actually happens (on x86, ARM, or anything else with self-aligned types). The storage for p starts on a self-aligned 4- or 8-byte boundary, depending on the machine word size. This is pointer alignment, the strictest possible.

The storage for c follows immediately. But the 4-byte alignment requirement of x forces a gap in the layout; it comes out as though there were a fourth intervening variable, like this:

char *p;      /* 4 or 8 bytes */
char c;       /* 1 byte */
char pad[3];  /* 3 bytes */
int x;        /* 4 bytes */

The pad[3] character array represents the fact that there are three bytes of waste space in the layout. The old-school term for this is "slop" (a "puddle").

Compare what happens if x is a 2-byte short:

char *p;
char c;
short x;

In that case, the actual layout will be this:

char *p;      /* 4 or 8 bytes */
char c;       /* 1 byte */
char pad[1];  /* 1 byte */
short x;      /* 2 bytes */

On the other hand, if x is a long on a 64-bit machine

char *p;
char c;
long x;

we end up with this:

char *p;      /* 8 bytes */
char c;       /* 1 byte */
char pad[7];  /* 7 bytes */
long x;       /* 8 bytes */

If you have been following carefully, you are probably now wondering about the case where the shorter variable declarations come first:

char c;
char *p;
int x;

If the actual memory layout were written like this:

char c;
char pad1[M];
char *p;
char pad2[N];
int x;

what can we say about M and N?

First, in this case N will be zero. The address of x, coming right after p, is guaranteed to be pointer-aligned, which is never less strict than int alignment.

The value of M is less predictable. If the compiler happened to map c to the last byte of a machine word, the next byte (the first of p) would be the first byte of the next word and properly pointer-aligned. M would be zero.

It is more likely that c will be mapped to the first byte of a machine word. In that case M will be whatever padding is needed to ensure that p has pointer alignment: 3 on a 32-bit machine, 7 on a 64-bit machine.

If you wanted to make those variables take up less space, you could get that effect by swapping x with c in the original sequence.

char *p;      /* 8 bytes */
long x;       /* 8 bytes */
char c;       /* 1 byte */

Usually, for the small number of scalar variables in a C program, the few bytes you can squeeze out by changing the order of declaration will not save you enough to be significant. The technique becomes more interesting when applied to nonscalar variables, especially structs.

Before we get to those, let us dispose of arrays of scalars. On a platform with self-aligned types, arrays of char, short, int, long, and pointers have no internal padding; each member is automatically self-aligned at the end of the previous one. (Translator's note: the original reads "self-aligned at the end of the next one", which appears to be a mistake.)
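
The claim about scalar arrays is easy to confirm: the distance between neighbouring elements is exactly the element size, so the array as a whole contains no padding. A minimal sketch, with function names invented for this example:

```c
#include <stddef.h>

/* Distance in bytes between two consecutive elements of a long array. */
size_t long_array_stride(void)
{
    long values[4] = {0, 0, 0, 0};
    return (size_t)((const char *)&values[1] - (const char *)&values[0]);
}

/* Bytes in an array of 5 shorts beyond what the 5 elements need. */
size_t short_array_padding(void)
{
    short values[5];
    return sizeof values - 5 * sizeof(short);
}
```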

In the next section we will see that the same is not necessarily true of structure arrays.

Structure alignment and padding

In general, a struct instance will have the alignment of its widest scalar member. Compilers do this as the easiest way to ensure that all the members are self-aligned for fast access.

Also, in C the address of a struct is the same as the address of its first member: there is no leading padding. Beware: in C++, classes that look like structs may break this rule! (Whether they do depends on how base classes and virtual member functions are implemented, and varies by compiler.)

(When you are in doubt about this sort of thing, ANSI C provides an offsetof() macro that can be used to read out structure member offsets.)

Consider this structure:

struct foo1 {
    char *p;
    char c;
    long x;
};

Assuming a 64-bit machine, any instance of struct foo1 will have 8-byte alignment. The memory layout of one of these looks, unsurprisingly, like this:

struct foo1 {
    char *p;      /* 8 bytes */
    char c;       /* 1 byte */
    char pad[7];  /* 7 bytes */
    long x;       /* 8 bytes */
};

It is laid out exactly as though variables of these types had been declared separately. But if we put c first, that is no longer true.

struct foo2 {
    char c;       /* 1 byte */
    char pad[7];  /* 7 bytes */
    char *p;      /* 8 bytes */
    long x;       /* 8 bytes */
};

If the members were separate variables, c could start at any byte boundary and the size of pad might vary. Because struct foo2 has the pointer alignment of its widest member, that is no longer possible. Now c has to be pointer-aligned, and the 7 bytes of padding that follow are locked in.
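
Rather than trusting the hand-drawn layouts, you can ask the compiler where it put each member. The sketch below uses offsetof() on the two structs just discussed; the helper function names are invented for illustration, and the byte counts it reports depend on your platform.

```c
#include <stddef.h>

struct foo1 { char *p; char c; long x; };
struct foo2 { char c; char *p; long x; };

/* Padding bytes the compiler inserted between c and the member after it. */
size_t foo1_pad_after_c(void)
{
    return offsetof(struct foo1, x) - (offsetof(struct foo1, c) + sizeof(char));
}

size_t foo2_pad_after_c(void)
{
    return offsetof(struct foo2, p) - (offsetof(struct foo2, c) + sizeof(char));
}

/* Returns 1 when a struct instance and its first member share an address. */
int foo2_starts_at_first_member(void)
{
    struct foo2 instance = {0};
    return (void *)&instance == (void *)&instance.c;
}
```

On a 64-bit machine with self-aligned types, both padding functions should report the 7 bytes shown in the layouts above.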

Now let us talk about trailing padding on structures. To explain this, I need to introduce a basic concept which I will call the stride address of a structure. It is the first address following the structure data that has the same alignment as the structure.

The general rule of trailing structure padding is this: the compiler will behave as though the structure has trailing padding out to its stride address. This rule controls what sizeof() will return.

Consider this example on a 64-bit x86 or ARM machine:

struct foo3 {
    char *p;      /* 8 bytes */
    char c;       /* 1 byte */
};

struct foo3 singleton;
struct foo3 quad[4];

You might think that sizeof(struct foo3) should be 9, but it is actually 16. The stride address is that of (&p)[2]. Thus, in the quad array, each member has 7 bytes of trailing padding, because the first member of each following struct wants to be self-aligned on an 8-byte boundary. The memory layout is as though the structure had been declared like this:

struct foo3 {
    char *p;      /* 8 bytes */
    char c;       /* 1 byte */
    char pad[7];
};

For contrast, consider the following example:

struct foo4 {
    short s;      /* 2 bytes */
    char c;       /* 1 byte */
};

Because s only needs to be 2-byte aligned, the stride address is just one byte after c, and struct foo4 as a whole needs only one byte of trailing padding. It will be laid out like this:

struct foo4 {
    short s;      /* 2 bytes */
    char c;       /* 1 byte */
    char pad[1];
};

and sizeof(struct foo4) will return 4.
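
Trailing padding can be measured the same way as interior padding: it is whatever lies between the end of the last member and sizeof. A minimal sketch, with helper names that are assumptions of this example:

```c
#include <stddef.h>

struct foo3 { char *p; char c; };
struct foo4 { short s; char c; };

/* Trailing padding: bytes between the end of the last member and sizeof. */
size_t foo3_trailing_pad(void)
{
    return sizeof(struct foo3) - (offsetof(struct foo3, c) + sizeof(char));
}

size_t foo4_trailing_pad(void)
{
    return sizeof(struct foo4) - (offsetof(struct foo4, c) + sizeof(char));
}

/* Distance in bytes between consecutive array elements (the stride). */
size_t foo3_array_stride(void)
{
    struct foo3 quad[4];
    return (size_t)((const char *)&quad[1] - (const char *)&quad[0]);
}
```

The stride of the array always equals sizeof(struct foo3), which is why the trailing padding has to be counted as part of the struct.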

Now let us consider bitfields. They let you declare structure fields narrower than a character, down to 1 bit, like this:

struct foo5 {
    short s;
    char c;
    int flip:1;
    int nybble:4;
    int septet:7;
};

The thing to know about bitfields is that they are implemented with word- and byte-level mask and shift instructions. From the compiler's point of view, the bitfields in struct foo5 look like a 2-byte, 16-bit character array with only 12 bits in use. That in turn is followed by padding to make the byte length of the structure a multiple of sizeof(short), the size of the longest member.

struct foo5 {
    short s;       /* 2 bytes */
    char c;        /* 1 byte */
    int flip:1;    /* total 1 bit */
    int nybble:4;  /* total 5 bits */
    int septet:7;  /* total 12 bits */
    int pad1:4;    /* total 16 bits = 2 bytes */
    char pad2;     /* 1 byte */
};

Here is a last important detail: if your structure has structure members, the inner structs also need to have the alignment of their longest scalar. Suppose you write this:

struct foo9 {
    char c;
    struct foo9_inner {
        char *p;
        short x;
    } inner;
};

The char *p member in the inner struct forces the outer struct to be pointer-aligned as well as the inner. On a 64-bit machine the actual layout will be like this:

struct foo9 {
    char c;           /* 1 byte */
    char pad1[7];     /* 7 bytes */
    struct foo9_inner {
        char *p;      /* 8 bytes */
        short x;      /* 2 bytes */
        char pad2[6]; /* 6 bytes */
    } inner;
};

This structure gives us a hint of the savings that might be possible from repacking structures. Of 24 bytes, 13 are padding. That is more than 50% waste space!
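
The waste in a nested struct can be totalled up by subtracting the useful bytes from sizeof. This is a sketch under the assumption of a C11 compiler; the function names are hypothetical.

```c
#include <stddef.h>

struct foo9 {
    char c;
    struct foo9_inner {
        char *p;
        short x;
    } inner;
};

/* Bytes of useful data: one char, one pointer, one short. */
size_t foo9_payload_bytes(void)
{
    return sizeof(char) + sizeof(char *) + sizeof(short);
}

/* Everything else in the struct is padding. */
size_t foo9_padding_bytes(void)
{
    return sizeof(struct foo9) - foo9_payload_bytes();
}

/* The inner struct's alignment decides where it starts in the outer one. */
size_t foo9_inner_offset(void)
{
    return offsetof(struct foo9, inner);
}
```

On a 64-bit platform with 8-byte pointers and 2-byte shorts, foo9_payload_bytes() comes to 11 and foo9_padding_bytes() to 13, matching the count in the text.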

Structure reordering

Now that you know how and why compilers insert padding in and after your structures, we will examine what you can do to squeeze out the slop. This is the art of structure packing.

The first thing to notice is that slop only happens in two places. One is where storage bound to a larger data type (with stricter alignment requirements) follows storage bound to a smaller one. The other is where a struct naturally ends before its stride address, requiring padding so that the next instance will be properly aligned.

The simplest way to eliminate slop is to reorder the structure members by decreasing alignment. That is, make all the pointer-aligned subfields come first, because on a 64-bit machine they will be 8 bytes. Then the 4-byte ints; then the 2-byte shorts; then the character fields.

So, for example, consider this simple linked-list structure:

struct foo10 {
    char c;
    struct foo10 *p;
    short x;
};

With the implied slop made explicit, here it is:

struct foo10 {
    char c;           /* 1 byte */
    char pad1[7];     /* 7 bytes */
    struct foo10 *p;  /* 8 bytes */
    short x;          /* 2 bytes */
    char pad2[6];     /* 6 bytes */
};

That is 24 bytes. If we reorder by size, we get this:

struct foo11 {
    struct foo11 *p;
    short x;
    char c;
};

Considering self-alignment, we see that none of the data fields need padding. This is because the stride address for a longer field with stricter alignment is always a validly aligned start address for a shorter field with less strict requirements. All the repacked struct actually requires is trailing padding:

struct foo11 {
    struct foo11 *p;  /* 8 bytes */
    short x;          /* 2 bytes */
    char c;           /* 1 byte */
    char pad[5];      /* 5 bytes */
};
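
The effect of the reordering can be verified, and the absence of interior padding pinned down, at compile time. The sketch below assumes a C11 compiler for _Static_assert; bytes_saved() is a helper invented for this example.

```c
#include <stddef.h>

struct foo10 { char c; struct foo10 *p; short x; };
struct foo11 { struct foo11 *p; short x; char c; };

size_t foo10_size(void) { return sizeof(struct foo10); }
size_t foo11_size(void) { return sizeof(struct foo11); }

/* Bytes saved across `count` list nodes by using the reordered layout. */
size_t bytes_saved(size_t count)
{
    return count * (sizeof(struct foo10) - sizeof(struct foo11));
}

/* With members in descending alignment order, no interior padding remains. */
_Static_assert(offsetof(struct foo11, x) == sizeof(struct foo11 *),
               "x should follow p directly");
_Static_assert(offsetof(struct foo11, c)
                   == sizeof(struct foo11 *) + sizeof(short),
               "c should follow x directly");
```

With the 64-bit sizes used in this section (24 and 16 bytes), bytes_saved(200000) comes to 1,600,000 bytes.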

Our repacking drops the size from 24 to 16 bytes. This might not seem like a lot, but suppose you have a linked list of 200K of these? The savings add up fast.

Note that reordering is not guaranteed to produce savings. Applying this technique to an earlier example, struct foo9, we get this:

struct foo12 {
    struct foo12_inner {
        char *p;      /* 8 bytes */
        int x;        /* 4 bytes */
    } inner;
    char c;           /* 1 byte */
};

With padding written out, this is

struct foo12 {
    struct foo12_inner {
        char *p;      /* 8 bytes */
        int x;        /* 4 bytes */
        char pad[4];  /* 4 bytes */
    } inner;
    char c;           /* 1 byte */
    char pad[7];      /* 7 bytes */
};

It is still 24 bytes, because c cannot back into the inner struct's trailing padding. To collect that gain you would need to redesign your data structures.

Since shipping the first version of this guide I have been asked why, if reordering for minimal slop is so simple, C compilers do not do it automatically. The answer: C is a language originally designed for writing operating systems and other code close to the hardware. Automatic reordering would interfere with a systems programmer's ability to lay out structures that exactly match the byte- and bit-level layout of memory-mapped device control blocks.

Awkward scalar cases

Using enumerated types instead of #defines is a good idea, if only because symbolic debuggers have those symbols available and can show them rather than raw integers. But while enums are guaranteed to be compatible with an integral type, the C standard does not specify which underlying integral type is used for them.

Be aware when repacking your structs that while enumerated-type variables are usually ints, this is compiler-dependent; they could be shorts, longs, or even chars by default. Your compiler may have a pragma or command-line option to force the size.

The long double type is a similar trouble spot. Some C platforms implement it in 80 bits, some in 128, and some of the 80-bit platforms pad it to 96 or 128 bits.

In both cases it is best to use sizeof() to check the storage size.

Finally, under x86 Linux doubles are sometimes an exception to the self-alignment rule: an 8-byte double may require only 4-byte alignment within a struct, even though standalone double variables have 8-byte self-alignment. This depends on compiler and options.
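
All three awkward cases can be probed with sizeof and offsetof instead of being guessed. The following is only a sketch: the enum, the struct, and the function names are hypothetical, and the values reported will differ between compilers and ABIs, which is the point.

```c
#include <stddef.h>

enum color { RED, GREEN, BLUE };   /* hypothetical enum for illustration */

struct has_double {
    char c;
    double d;
};

/* Storage the compiler chose for the enum; the C standard does not fix it. */
size_t enum_size(void) { return sizeof(enum color); }

/* Storage for long double, including any padding the platform adds. */
size_t long_double_size(void) { return sizeof(long double); }

/* Where a double lands after a char inside a struct: typically 8 on
 * 64-bit targets, but possibly 4 under some 32-bit x86 Linux ABIs. */
size_t double_offset_in_struct(void) { return offsetof(struct has_double, d); }

/* Pin an assumption down at compile time instead of trusting it silently. */
_Static_assert(sizeof(enum color) <= sizeof(long), "enum wider than expected");
```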

Readability and cache locality

While reordering by size is the simplest way to eliminate slop, it is not necessarily the right thing. There are two more issues: readability and cache locality.

Programs are not just communications to a computer; they are communications to other human beings. Code readability is important even (or especially!) when the audience of the communication is only your future self.

A clumsy, mechanical reordering of your structure can harm its readability. When possible, it is better to reorder fields so that they remain in coherent groups, with semantically related pieces of data kept close together. Ideally, the design of your structure should communicate the design of your program.

When your program frequently accesses a structure, or parts of a structure, it helps performance if the accesses tend to fall within a cache line (the memory block fetched by your processor when it is told to get any single address within the block). On 64-bit x86 a cache line is 64 bytes beginning on a self-aligned address; on other platforms it is often 32 bytes.

The things you should do to preserve readability (grouping related and co-accessed data in adjacent fields) also improve cache-line locality. These are both reasons to reorder intelligently, with awareness of your code's data-access patterns.
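
One way to keep such a grouping honest is to state it in the struct and let the compiler check it. The sketch below is an illustration only: struct session and its fields are invented, and the 64-byte line size is the 64-bit x86 figure quoted above. The check covers the span of the hot fields within the struct; they share a cache line at run time only if the instance itself starts on a cache-line boundary.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* assumption: typical for 64-bit x86 */

/* Hypothetical record: fields touched on every lookup come first. */
struct session {
    /* hot: read together on each request */
    unsigned long id;
    unsigned int  state;
    unsigned int  flags;
    void         *handler;
    /* cold: only used for logging and teardown */
    char          label[48];
    long          created_at;
};

/* Bytes from the start of the struct to the end of the last hot field. */
size_t session_hot_span(void)
{
    return offsetof(struct session, handler) + sizeof(void *);
}

_Static_assert(offsetof(struct session, handler) + sizeof(void *)
                   <= CACHE_LINE_BYTES,
               "hot fields no longer fit in one cache line");
```

If someone later inserts a large field in the middle of the hot group, the build fails instead of the program quietly getting slower.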

If your code does concurrent access to a structure from multiple threads, there is a third issue: cache-line bouncing. To minimize expensive bus traffic, you should arrange your data so that reads come from one cache line and writes go to another in your tighter loops.

And yes, this sometimes contradicts the previous guidance about grouping related data in the same cache-line-sized block. Multithreading is hard. Cache-line bouncing and other multithread optimization issues are very advanced topics that deserve an entire tutorial of their own. The best I can do here is make you aware that these issues exist.
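
As a small illustration of separating read-mostly data from frequently written data, C11's _Alignas can push a field onto its own cache line. This is a sketch, not a recipe: struct queue_state is hypothetical, the 64-byte line size is an assumption, and the layout deliberately spends padding to buy the separation, which is the opposite trade from the rest of this article.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* assumption: typical for 64-bit x86 */

/* Hypothetical shared queue state. */
struct queue_state {
    /* read-mostly: consulted by every thread in its inner loop */
    const void *buffer;
    size_t      capacity;
    /* frequently written: forced onto its own cache line */
    _Alignas(CACHE_LINE_BYTES) size_t head;
};

/* Index of the cache line, relative to the struct start, holding an offset. */
size_t cache_line_of(size_t offset)
{
    return offset / CACHE_LINE_BYTES;
}
```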

Other packing techniques

Reordering works best when combined with other techniques for slimming your structures. If you have several boolean flags in a struct, for example, consider reducing them to 1-bit bitfields and packing them into a place in the structure that would otherwise be slop.

You will take a small access-time penalty for this, but if it squeezes the working set enough smaller, that penalty will be swamped by your gains from avoided cache misses.
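
Here is a sketch of the flag-shrinking idea. The directory-entry structs are made up for this example, and exactly where a compiler places bitfields is implementation-defined, so treat the size comparison as something to measure on your own platform rather than a guarantee.

```c
#include <stddef.h>

/* Hypothetical directory entry with one int per flag. */
struct entry_wide {
    char *name;
    short id;
    int   is_dir;
    int   is_hidden;
    int   is_dirty;
};

/* Same data with the flags shrunk to 1-bit fields that sit after id. */
struct entry_packed {
    char *name;
    short id;
    unsigned int is_dir    : 1;
    unsigned int is_hidden : 1;
    unsigned int is_dirty  : 1;
};

/* Set and read back a flag to show that access syntax is unchanged. */
int packed_flag_roundtrip(void)
{
    struct entry_packed e = {0};
    e.is_hidden = 1;
    return e.is_hidden == 1 && e.is_dir == 0 && e.is_dirty == 0;
}
```

Code that reads or writes the flags does not change; only the declaration does.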

More generally, look for ways to shorten data field sizes. In cvs-fast-export, for example, one squeeze I applied was to use the knowledge that RCS and CVS repositories did not exist before 1982. I dropped a 64-bit Unix time_t (zero date at the beginning of 1970) for a 32-bit time offset from 1982-01-01T00:00:00; this will cover dates to 2118. (Note: if you pull a trick like this, do a bounds check whenever you set the field to prevent nasty bugs!)
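
The time-offset trick, including the bounds check, might look roughly like this. It is a sketch of the idea and not the actual cvs-fast-export code; the function names are invented, and int64_t stands in for a 64-bit time_t.

```c
#include <stdint.h>

/* Unix time of 1982-01-01T00:00:00Z: 4383 days after 1970-01-01. */
#define EPOCH_1982 INT64_C(378691200)

/* Returns 1 if the timestamp can be stored as a 32-bit offset from 1982. */
int time_fits(int64_t unix_time)
{
    return unix_time >= EPOCH_1982
        && unix_time - EPOCH_1982 <= (int64_t)UINT32_MAX;
}

/* Compress a timestamp; callers must check time_fits() before every store. */
uint32_t compress_time(int64_t unix_time)
{
    return (uint32_t)(unix_time - EPOCH_1982);
}

/* Recover the original 64-bit Unix timestamp. */
int64_t expand_time(uint32_t stored)
{
    return EPOCH_1982 + (int64_t)stored;
}
```

Keeping the range check in its own function makes it easy to call at every place the field is set, which is where the nasty bugs would otherwise creep in.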

Each such field shortening not only decreases the explicit size of your structure, it may remove slop and/or create additional opportunities for gains from field reordering. Virtuous cascades of such effects are not very hard to trigger.

The riskiest form of packing is to use unions. If you know that certain fields in your structure are never used in combination with certain other fields, consider using a union to make them share storage. But be extra careful and verify your work with regression testing, because if your lifetime analysis is even slightly wrong you will get bugs ranging from crashes to (much worse) subtle data corruption.
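
One common way to reduce the risk is to store a tag next to the union that records which member is live, and to read members only through accessors that check it. The sketch below uses a made-up struct node for illustration; the tag costs some space, so whether it is affordable depends on your layout.

```c
#include <stddef.h>

enum node_kind { NODE_FILE, NODE_SYMLINK };   /* hypothetical tag */

/* Hypothetical node: a file has a size, a symlink has a target, never both. */
struct node {
    enum node_kind kind;          /* says which union member is live */
    union {
        long        file_size;    /* valid when kind == NODE_FILE */
        const char *link_target;  /* valid when kind == NODE_SYMLINK */
    } u;
};

/* Read the size only when the tag says that member is live; else -1. */
long node_file_size(const struct node *n)
{
    return n->kind == NODE_FILE ? n->u.file_size : -1;
}

/* The two members share storage, so the union is as big as the larger one. */
size_t node_union_size(void)
{
    return sizeof(((struct node *)0)->u);
}
```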

Tools

The clang compiler has a -Wpadded option that causes it to generate messages about alignment holes and padding.

I have not used it myself, but several respondents speak well of a program called pahole. This tool works with a compiler to produce reports on your structures that describe padding, alignment, and cache-line boundaries.

Proof and exceptional cases

You can download source code for a little program that demonstrates the assertions about scalar and structure sizes made above. It is packtest.c.

If you look through enough strange combinations of compilers, options, and unusual hardware, you will find exceptions to some of the rules I have described. They get more common as you go back in time to older processor designs.

The next level beyond knowing these rules is knowing how and when to expect that they will be broken. In the years when I learned them (the early 1980s) we spoke of people who did not get this as victims of "all-the-world's-a-VAX syndrome". Remember that not all the world is a PC.

Version history

1.5 @ 2014-01-03: Explain why C does not reorder structure members automatically.

1.4 @ 2014-01-06: Note about doubles under x86 Linux.

1.3 @ 2014-01-03: New sections on awkward scalar cases, readability and cache locality, and tools.

1.2 @ 2014-01-02: Correct an erroneous address calculation.

1.1 @ 2014-01-01: Explain why aligned accesses are faster. Mention offsetof. Various minor fixes, including the packtest.c download link.

1.0 @ 2014-01-01: Initial release.


For more discussion and exchange on the cornerstone and practice of program design, please follow this blog and Sina Weibo songzi_tea.
