C Language Performance Optimization
(1) Is data alignment faster?
From the first day of learning data structures, the books tell us that aligned data can be accessed faster. I had always taken this on faith without ever being quite clear about the specific reasons. Having become obsessed with performance optimization after the recent TreeLink competition, I took the opportunity to look into this question more closely.
First, let's look at the following code:
#include <stdint.h>
#include <time.h>

#define OP |

using namespace std;
using namespace ups_util;

/* Force 1-byte packing for the first struct, then restore the default. */
#pragma pack(push)
#pragma pack(1)
struct NotAlignedStruct
{
    char     a;
    char     b;
    char     c;
    uint32_t d;
};
#pragma pack(pop)

struct AlignedStruct
{
    char     a;
    char     b;
    char     c;
    uint32_t d;
};

struct FirstStruct
{
    char a;
    char b;
    char c;
};

struct SecondStruct
{
    char     a;
    uint64_t b;
    uint32_t c;
    uint32_t d;
};

struct ThirdStruct
{
    char     a;
    uint32_t b;
    uint64_t c;
};

void case_one( NotAlignedStruct * array, uint32_t array_length, uint32_t * sum )
{
    uint32_t value = 0;
    for ( uint32_t i = 0; i < array_length; ++i ) {
        value = value OP array[i].d;
    }
    *sum = *sum OP value;
}

void case_two( AlignedStruct * array, uint32_t array_length, uint32_t * sum )
{
    uint32_t value = 0;
    for ( uint32_t i = 0; i < array_length; ++i ) {
        value = value OP array[i].d;
    }
    *sum = *sum OP value;
}
Assume the input array has 100,000 elements. The timings after running each case 100,000 times are:
case_one: [ sum = 131071, cost = 12764585 us]
case_two: [ sum = 131071, cost = 10501603 us]
case_two runs about 17% faster than case_one.
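For completeness, timings like these can be collected with a small driver along the following lines. This is a minimal sketch of my own: the original measurement code is not shown, so the array fill, the gettimeofday-based timing, and the driver structure are all assumptions.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Hypothetical driver for case_one; swap in case_two to time the aligned version. */
int main( void )
{
    const uint32_t length = 100000;   /* array size, per the text   */
    const uint32_t rounds = 100000;   /* repetitions, per the text  */
    NotAlignedStruct * array =
        (NotAlignedStruct *) malloc( length * sizeof( NotAlignedStruct ) );
    for ( uint32_t i = 0; i < length; ++i ) {
        array[i].d = i;               /* arbitrary fill (assumed)   */
    }

    uint32_t sum = 0;
    struct timeval begin, end;
    gettimeofday( &begin, NULL );
    for ( uint32_t r = 0; r < rounds; ++r ) {
        case_one( array, length, &sum );
    }
    gettimeofday( &end, NULL );

    long cost = ( end.tv_sec - begin.tv_sec ) * 1000000L
              + ( end.tv_usec - begin.tv_usec );
    printf( "case_one: [ sum = %u, cost = %ld us]\n", sum, cost );
    free( array );
    return 0;
}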
Before defining NotAlignedStruct, we use #pragma pack(1) to force 1-byte alignment, so sizeof(NotAlignedStruct) = 7. Before defining AlignedStruct, #pragma pack(pop) restores the compiler's default alignment rules (what the default rules are will be explained later), so sizeof(AlignedStruct) = 8.
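Both sizes are easy to verify. A minimal check of my own (not part of the original program):

#include <stdio.h>

int main( void )
{
    /* Packed: 3 chars + a 4-byte d with no padding = 7 bytes. */
    printf( "%zu\n", sizeof( NotAlignedStruct ) );  /* prints 7 */
    /* Default: 1 byte of padding after c so that d starts at offset 4 = 8 bytes. */
    printf( "%zu\n", sizeof( AlignedStruct ) );     /* prints 8 */
    return 0;
}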
Why is AlignedStruct faster than NotAlignedStruct? Simply put, the CPU has a minimum memory access granularity (MAG) when it reads memory. If the size of a structure is an integer multiple of the MAG, the CPU can fetch each element in a fixed number of accesses. Conversely, if the structure's size has no such multiple relationship with the MAG, the CPU may have to waste an extra access on some elements.
For example, suppose the CPU's MAG is 8, the structure size is 7, and we traverse an array a[4] of four such structures. Assume the array starts at address 0, so the elements start at addresses 0, 7, 14, and 21. When accessing a[0], the CPU needs a single memory read. Accessing a[1] is different: the CPU first reads bytes 0-7 and discards bytes 0-6, keeping only byte 7; it then reads bytes 8-15 and discards bytes 14-15, keeping bytes 8-13; finally it combines byte 7 with bytes 8-13 to obtain a[1]. Accessing a[2] and a[3] works the same way. If the structure size were 8 instead, the CPU could fetch a[0], a[1], a[2], and a[3] with just four straightforward reads. Now you can see why memory alignment improves access speed.
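To make the arithmetic concrete, the little sketch below (my own illustration, not from the original) prints which 8-byte-aligned blocks each packed element's bytes fall into; any element whose bytes straddle two blocks costs two reads:

#include <stdint.h>
#include <stdio.h>

int main( void )
{
    const uint32_t elem_size = 7;  /* sizeof(NotAlignedStruct)            */
    const uint32_t mag       = 8;  /* assumed memory access granularity   */
    for ( uint32_t i = 0; i < 4; ++i ) {
        uint32_t first = i * elem_size;          /* first byte of a[i]    */
        uint32_t last  = first + elem_size - 1;  /* last byte of a[i]     */
        printf( "a[%u]: bytes %2u-%2u, blocks %u-%u -> %s\n",
                i, first, last, first / mag, last / mag,
                ( first / mag == last / mag ) ? "1 read" : "2 reads" );
    }
    return 0;
}

Running it shows a[0] fits in block 0, while a[1], a[2], and a[3] each span two blocks, matching the walkthrough above.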
By default, the compiler aligns data in memory. So what rules does the compiler follow?
Let's use the following examples to illustrate the rules of gcc (4.1.2).
struct FirstStruct
{
    char a;
    char b;
    char c;
};

struct SecondStruct
{
    char     a;
    uint64_t b;
    uint32_t c;
    uint32_t d;
};

struct ThirdStruct
{
    char     a;
    uint32_t b;
    uint64_t c;
};
sizeof(FirstStruct) = 3, sizeof(SecondStruct) = 24, sizeof(ThirdStruct) = 16.
Here is my understanding: starting from the first member of the struct, the starting offset of each member must be a multiple of that member's own size, with all members placed as compactly as possible; the final size of the struct is then padded up to a multiple of the size of its largest member.
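This rule can be verified directly with offsetof; the check below is my own addition:

#include <stddef.h>
#include <stdio.h>

int main( void )
{
    /* Each offset is a multiple of the member's own size. */
    printf( "SecondStruct: a@%zu b@%zu c@%zu d@%zu, sizeof=%zu\n",
            offsetof( SecondStruct, a ),    /* 0                  */
            offsetof( SecondStruct, b ),    /* 8, multiple of 8   */
            offsetof( SecondStruct, c ),    /* 16, multiple of 4  */
            offsetof( SecondStruct, d ),    /* 20, multiple of 4  */
            sizeof( SecondStruct ) );       /* 24, multiple of 8  */
    printf( "ThirdStruct:  a@%zu b@%zu c@%zu, sizeof=%zu\n",
            offsetof( ThirdStruct, a ),     /* 0                  */
            offsetof( ThirdStruct, b ),     /* 4, multiple of 4   */
            offsetof( ThirdStruct, c ),     /* 8, multiple of 8   */
            sizeof( ThirdStruct ) );        /* 16, multiple of 8  */
    return 0;
}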
Understanding the compiler's alignment rules is of great benefit when defining data structures and improving program performance. However, this conclusion has a major premise: you must have enough memory to hold the data you want to access. If memory is tight, it is worth aligning to 1 byte to save space, because once data spills to disk, whether a spinning disk (milliseconds per access) or a solid-state disk (tens of microseconds per access), access becomes several orders of magnitude slower than memory (tens of nanoseconds per access).
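One further note of my own (not from the original text): before resorting to 1-byte packing, it is often enough to order members from largest to smallest, which removes padding without incurring misaligned accesses. A minimal sketch:

#include <stdint.h>

/* 1 + 7(pad) + 8 + 1 + 7(pad) = 24 bytes under default alignment. */
struct Wasteful
{
    char     a;
    uint64_t b;
    char     c;
};

/* Same members, largest first: 8 + 1 + 1 + 6(pad) = 16 bytes. */
struct Compact
{
    uint64_t b;
    char     a;
    char     c;
};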
(2) How to speed up loops
Let's look at an example: how can we quickly compute the sum of all the elements of a float array with 1M elements?
Let's look at the most intuitive answer:
#define OP +

void case_one( float * array, uint32_t length, float * sum )
{
    float value = 1;
    uint32_t i = 0;
    for ( ; i < length; ++i ) {
        value = value OP array[i];
    }
    *sum = *sum OP value;
}
Running it 1000 times, the total time is about 1221869 us.
Obviously, the most time-consuming part of this code is the loop, so optimization must start there. The most effective loop optimization technique is loop unrolling: increasing the step size of each iteration and processing several more elements in the loop body. Loop unrolling has two main benefits. First, it reduces the number of loop-condition checks, which reduces the number of CPU branch predictions and saves time. Second, it lets us manually rearrange the code inside the loop body to increase the parallelism of the computation, making fuller use of the CPU pipeline. Let's look at the methods and effects of both in turn.
Answer 2:
void case_two( float * array, uint32_t length, float * sum )
{
    float value = 1;
    uint32_t i = 0;
    uint32_t num = length - ( length % 4 );
    for ( ; i < num; i += 4 ) {
        value = value OP array[i];
        value = value OP array[i+1];
        value = value OP array[i+2];
        value = value OP array[i+3];
    }
    /* Handle the tail elements left over when length is not a multiple of 4. */
    for ( ; i < length; ++i ) {
        value = value OP array[i];
    }
    *sum = *sum OP value;
}
In the code above, we increase the loop step to 4, which obviously saves about 3/4 of the loop-condition checks.
Running it 1000 times, the total time is about 1221701 us.
Although the result improves slightly, the effect is not obvious. The main reason is that in our case, the cost of the conditional check is tiny compared with the computation in the loop body (floating-point addition), so merely increasing the step size yields little benefit. Looking carefully at the loop body, it is not hard to see a strict sequential dependency among the four statements: the CPU must compute the 1st statement before it can compute the 2nd, and so on through the 4th. Anyone familiar with computer architecture knows that superscalar execution and pipelining allow the CPU to perform instruction-level parallel computation, but this way of writing the loop cannot exploit that capability, which wastes resources. In fact, adding the four elements within one iteration has no inherent ordering constraint and can be fully parallelized at the code level. This leads to Answer 3.
Answer 3:
void case_three( float * array, uint32_t length, float * sum )
{
    float value = 1;
    uint32_t i = 0;
    uint32_t num = length - ( length % 4 );
    float value1 = 1.0f;
    float value2 = 1.0f;
    for ( ; i < num; i += 4 ) {
        value1 = array[i]   OP array[i+1];
        value2 = array[i+2] OP array[i+3];
        value  = value OP value1 OP value2;
    }
    for ( ; i < length; ++i ) {
        value = value OP array[i];
    }
    *sum = *sum OP value;
}
In this code, we introduce two variables, value1 and value2, with no dependencies between them. In each iteration, value1 and value2 each compute the sum of two elements, and the two partial sums are then added to value. This way, the four additions per iteration can proceed in parallel.
Running it 1000 times, the total time is 643581 us; performance nearly doubles.
We have now briefly demonstrated the two benefits of loop unrolling; you can refer to these two techniques when you face loop optimization problems in the future. A reminder, though: excessive unrolling can have the opposite effect. First, it makes the code uglier. Second, too many temporary variables in the loop body may exceed the available registers, causing a register spill, where temporaries are stored in memory instead. Since memory access is one or two orders of magnitude slower than register access, this increases the time spent in the loop body.
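As an illustration of how far this can reasonably be pushed (a hypothetical variant of my own, not from the original), a four-accumulator version keeps four fully independent addition chains; unrolling much further mainly adds temporaries, and thus spill risk, without adding parallelism:

/* Hypothetical further-unrolled variant: four independent accumulators. */
void case_four( float * array, uint32_t length, float * sum )
{
    float v0 = 0.0f, v1 = 0.0f, v2 = 0.0f, v3 = 0.0f;
    uint32_t i = 0;
    uint32_t num = length - ( length % 4 );
    for ( ; i < num; i += 4 ) {
        v0 = v0 OP array[i];      /* four chains with no cross-dependency */
        v1 = v1 OP array[i+1];
        v2 = v2 OP array[i+2];
        v3 = v3 OP array[i+3];
    }
    float value = v0 OP v1 OP v2 OP v3;
    for ( ; i < length; ++i ) {
        value = value OP array[i];
    }
    *sum = *sum OP value;
}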