This article describes the optimization called Dead Code Elimination, abbreviated as DCE. As the name implies: if the result of a computation is never used by the program, the computation is discarded.
At this point you may object that your code only calculates useful results. Nobody adds useless code without a reason; for example, nobody computes the first 1000 digits of pi on the side while doing something useful. So when would an optimization that eliminates dead code ever be useful?
The reason I am covering DCE so early is that, without a clear picture of it, it can cause havoc and confusion when we explore other, more interesting optimizations. Let's take a look at the following small example, file Sum.cpp:
- int main() {
- long long s = 0;
- for (long long i = 1; i <= 1000000000; ++i) s += i;
- }
(We are interested in how fast this loop runs when it sums the first billion integers. Admittedly this is silly: we learned in high school that there is a closed formula for the result, but that is not the point.)
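For reference, the closed formula mentioned above is n(n + 1)/2. The following is a minimal sketch, not part of the original Sum.cpp; the function names closed_form_sum and loop_sum are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Hypothetical helper: sum of 1..n via the closed formula n(n + 1) / 2.
// For n = 1000000000 the intermediate product is about 1e18, which still
// fits in a signed 64-bit long long.
long long closed_form_sum(long long n) {
    assert(n >= 0);
    return n * (n + 1) / 2;
}

// Hypothetical helper: the same sum computed the way Sum.cpp does it.
long long loop_sum(long long n) {
    assert(n >= 0);
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;
    return s;
}
```

Both functions should agree for any n whose sum fits in a long long; the loop version is kept here only because its running time is what the article measures.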
Build the program with the command CL /Od /FA Sum.cpp and run it with the command Sum. Note that optimization is disabled for this build by the /Od switch. On my PC the program took about 4 seconds to run. Now compile the optimized code with CL /O2 /FA Sum.cpp. This time the program runs so fast that there is no noticeable delay. Has the compiler really done such a good job of optimizing our code? The answer is no (but in a surprising way).
Let's take a look at the code generated by /Od, which is saved in Sum.asm. I trimmed the listing and added comments so that it shows only the loop body:
These instructions look much as you would expect. The variable i is kept on the stack, at offset i$1 from the address held in the RSP register; elsewhere in the asm file we find that i$1 = 0. The RAX register is used to increment i. Similarly, the variable s is kept on the stack at offset s$ from RSP, where s$ = 8. The running sum for each iteration is calculated in RCX.
Notice that each iteration first loads the value of i from the stack and then writes the new value back. The same goes for the variable s. You could say this code is naive: it is what a very simple-minded compiler generates, which is exactly what you get when optimization is disabled. For example, it could keep the variables i and s in registers instead of accessing memory on every iteration.
So much for the unoptimized code. What does the generated code look like after optimization? Let's take a look at the Sum.asm file for the program built with /O2, again trimmed down to just the loop body.
The result is:
- ;; there’s nothing here!
Yes, it is empty and there is no instruction to calculate s.
You may say that this answer is definitely wrong. But how would we know it is wrong? The optimizer has worked out that the program never uses s, so it does not bother to calculate it. You cannot say the answer is wrong unless you actually check the answer, right?
We have been caught out by DCE: if you never observe the result of a computation, the program does not perform the computation.
The optimizer's behavior resembles a basic principle of quantum physics, often paraphrased in popular science articles as: "If a tree falls in the forest and nobody is around, does it make a sound?"
We can observe the result of the calculation by adding a statement that prints the variable s. The code is as follows:
- #include <stdio.h>
- int main() {
- long long s = 0;
- for (long long i = 1; i <= 1000000000; ++i) s += i;
- printf("%lld ", s);
- }
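The article uses printf to make s observable. Another approach that is often used for the same purpose is to store the result in a volatile object, since a volatile write counts as observable behavior that the compiler must preserve. The sketch below is an assumption on my part, not something from the original post; the names g_sink and sum_and_publish are hypothetical.

```cpp
#include <cassert>

// Hypothetical sink: a write to a volatile object is a side effect the
// optimizer is not allowed to remove.
volatile long long g_sink = 0;

// Hypothetical helper: sums 1..n and publishes the result to the sink,
// so the value of s is "used" even if the caller ignores the return value.
long long sum_and_publish(long long n) {
    assert(n >= 0);
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;
    g_sink = s;  // keeps the result observable, so s cannot be treated as dead
    return s;
}
```

Note that this only guarantees that the final value is produced. The compiler remains free to compute it in a smarter way than the loop as written, so the generated code should still be inspected with /FA.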
The /Od version of this program takes about 4 seconds to print the correct result. The /O2 version prints the same result, but much faster (see the optional sections below; in fact, it is about seven times faster).
So far, I have explained the main point of this article: be very careful when analyzing compiler optimizations, and do not let DCE mislead you when measuring their benefit. The following are four steps for noticing and guarding against DCE:
In any case, we have learned some interesting things from this example. The following four sections are optional.
Optional 1: code generated by /O2
Here is the loop generated for the /O2 version:
- xor edx, edx
- mov eax, 1
- mov ecx, edx
- mov r8d, edx
- mov r9d, edx
- npad 13
- $LL3@main:
- inc r9
- add r8, 2
- add rcx, 3
- add r9, rax ;; r9 = 2 8 18 32 50 ...
- add r8, rax ;; r8 = 3 10 21 36 55 ...
- add rcx, rax ;; rcx = 4 12 24 40 60 ...
- add rdx, rax ;; rdx = 1 6 15 28 45 ...
- add rax, 4 ;; rax = 1 5 9 13 17 ...
- cmp rax, 1000000000 ;; i <= 1000000000 ?
- jle SHORT $LL3@main ;; yes, so loop back
Note that the loop body contains about the same number of instructions as the unoptimized version. So why is it so much faster? One reason is that the instructions in the optimized loop body use registers instead of memory locations. As we all know, accessing a register is much faster than accessing memory. The following latencies show how memory access can slow your program to a snail's pace:
| Location | Latency |
| Register | 1 cycle |
| L1 | 4 cycles |
| L2 | 10 cycles |
| L3 | 75 cycles |
| DRAM | 60 ns |
So the unoptimized version, which reads and writes stack locations on every iteration, pays memory-access latency, while the optimized version works entirely in registers.
But there is another reason. Note that the /Od version increments the loop counter by 1 on each iteration, whereas the /O2 version (which keeps the counter in the RAX register) increments it by 4.
The optimizer has unrolled the loop so that each iteration adds four items, like this:
s = (1 + 2 + 3 + 4) + (5 + 6 + 7 + 8) + (9 + 10 + 11 + 12) + (13 + . . .
Because the loop is unrolled, the loop test is performed once for every four original iterations instead of once per iteration, so the CPU spends more of its time doing useful work and less of it checking whether to loop again.
In addition, the optimizer does not accumulate the result in a single place. It uses four independent registers to accumulate four separate partial sums:
RDX = 1 + 5 + 9 + 13 + ... = 1, 6, 15, 28 ...
R9 = 2 + 6 + 10 + 14 + ... = 2, 8, 18, 32 ...
R8 = 3 + 7 + 11 + 15 + ... = 3, 10, 21, 36 ...
RCX = 4 + 8 + 12 + 16 + ... = 4, 12, 24, 40 ...
When the loop ends, the four registers are added together to get the final result.
(An exercise for the reader: what will the optimizer do if the total number of iterations is not a multiple of 4?)
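The unrolling described above can be written out by hand in C++. This is only a sketch of the idea, not the code the compiler actually emits; the function name sum_unrolled and the variable names are hypothetical. The trailing loop shows one common way to handle a count that is not a multiple of 4; the optimizer may well choose a different strategy.

```cpp
#include <cassert>

// Hypothetical hand-unrolled version of the Sum.cpp loop.
// a, b, c, d stand in for the four partial sums the /O2 code keeps in
// RDX, R9, R8 and RCX respectively.
long long sum_unrolled(long long n) {
    assert(n >= 0);
    long long a = 0, b = 0, c = 0, d = 0;
    long long i = 1;
    // Main loop: four additions per loop test, counter advances by 4.
    for (; i + 3 <= n; i += 4) {
        a += i;      // 1, 5,  9, ...
        b += i + 1;  // 2, 6, 10, ...
        c += i + 2;  // 3, 7, 11, ...
        d += i + 3;  // 4, 8, 12, ...
    }
    long long s = a + b + c + d;  // combine the partial sums
    // Remainder loop: at most three leftover items when n % 4 != 0.
    for (; i <= n; ++i) s += i;
    return s;
}
```

Keeping four separate accumulators means the four additions in one iteration do not depend on each other, which is why the generated code uses four registers instead of one.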
Optional 2: accurate performance testing
Earlier I said that the /O2 program without the printf ran "so fast that you don't notice any delay". The following example makes that statement more precise:
- #include <stdio.h>
- #include <windows.h>
- int main() {
- LARGE_INTEGER start, stop;
- QueryPerformanceCounter(&start);
- long long s = 0;
- for (long long i = 1; i <= 1000000000; ++i) s += i;
- QueryPerformanceCounter(&stop);
- double diff = stop.QuadPart - start.QuadPart;
- printf("%f", diff);
- }
The program uses QueryPerformanceCounter to measure the elapsed time. (This is a simplified version of the high-resolution timer I described in a previous blog post.) There are many caveats to keep in mind when measuring performance (I have previously written up a list of them), but they do not matter for this particular example, as we will see in a moment:
I ran the /Od program on my PC and it printed a diff of about 7 million. (The units do not matter; all you need to know is that the larger the value, the longer the program ran.) With /O2, diff is 0, which is again due to DCE.
To prevent DCE, we add a printf that displays s. The /O2 version then reports a diff of about 1 million, a speedup of about seven times.
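For readers not on Windows, the same measurement can be sketched with std::chrono::steady_clock from the C++ standard library instead of QueryPerformanceCounter. This substitution is mine, not the article's; the type TimedSum and the function timed_sum are hypothetical names used only here.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical result type: the sum plus the elapsed time in milliseconds.
struct TimedSum {
    long long sum;
    double milliseconds;
};

// Hypothetical helper: times the Sum.cpp loop for a given upper bound n.
// The sum is returned alongside the timing so the caller can print it,
// which keeps the computation from being dead code.
TimedSum timed_sum(long long n) {
    assert(n >= 0);
    const auto start = std::chrono::steady_clock::now();
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return {s, elapsed.count()};
}
```

A caller would invoke timed_sum(1000000000) and print both fields. If the caller prints only the time and ignores sum, an optimizing build may again remove the loop, which is exactly the trap this article describes.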
Optional 3: x64 assembler "extension"
Let's look back at the assembly code shown earlier in this article. The part that initializes the registers may look a little strange:
- xor edx, edx ;; rdx = 0 (64-bit!)
- mov eax, 1 ;; rax = i = 1 (64-bit!)
- mov ecx, edx ;; rcx = 0 (64-bit!)
- mov r8d, edx ;; r8 = 0 (64-bit!)
- mov r9d, edx ;; r9 = 0 (64-bit!)
- npad 13 ;; multi-byte nop alignment padding
- $LL3@main:
Remember that the original C++ program uses long long variables for the loop counter and the sum. In the VC++ compiler these map to 64-bit integers, so we would expect the generated code to use the 64-bit registers of x64.
As I mentioned in a previous post, the instruction xor reg, reg is an efficient way to set reg to 0. But the first instruction applies xor to EDX, which is the low 32 bits of the RDX register. The next instruction sets EAX, the low 32 bits of the RAX register, to 1. The following three instructions follow the same pattern. On the face of it, this leaves the high 32 bits of each destination register holding arbitrary leftover values, yet the loop body computes on the full 64-bit registers. How can the result be correct?
The answer is that the x64 instruction set, originally published by AMD, automatically zero-extends a 32-bit result into the high 32 bits of the 64-bit destination register. Here are the two relevant points from section 3.4.5 of the manual:
1. Zero extension of 32-bit results: if the destination is a 32-bit register, the high 32 bits of the corresponding 64-bit general-purpose register are automatically set to zero.
2. No extension of 8-bit and 16-bit results: if the destination is an 8-bit or 16-bit register, the remaining bits of the 64-bit general-purpose register are left unchanged.
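The two rules can be modeled in plain C++ with bit masks. This is only a model of the register behavior, not real register access (which would need assembly); the function names write32, write16 and write8 are hypothetical.

```cpp
#include <cstdint>

// Model of rule 1: writing a 32-bit sub-register (for example EAX)
// zero-extends, so the old upper 32 bits of the 64-bit register are discarded.
constexpr std::uint64_t write32(std::uint64_t /*old_value*/, std::uint32_t v) {
    return v;
}

// Model of rule 2: writing a 16-bit sub-register (for example AX)
// replaces only the low 16 bits and keeps the other 48 bits.
constexpr std::uint64_t write16(std::uint64_t old_value, std::uint16_t v) {
    return (old_value & ~std::uint64_t{0xFFFF}) | v;
}

// Model of rule 2: writing the low 8-bit sub-register (for example AL)
// replaces only the low 8 bits and keeps the other 56 bits.
constexpr std::uint64_t write8(std::uint64_t old_value, std::uint8_t v) {
    return (old_value & ~std::uint64_t{0xFF}) | v;
}
```

In this model, mov eax, 1 corresponds to write32(old, 1), which yields exactly 1 whatever old contained. That is why the 32-bit initializations in the listing are enough to set up the 64-bit registers.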
Finally, note that npad 13 is not a real instruction but a pseudo-operation, a directive to the assembler. It ensures that the next instruction (the start of the loop body) is aligned on a 16-byte boundary in memory, which can improve performance (sometimes, on some microarchitectures).
Optional 4: printf versus std::cout
You may ask why the experiments above used C's printf function rather than C++'s std::cout. You can try both, but the asm file generated for the latter is much larger and therefore harder to browse: compared with the 1.7 KB file from the printf version, the file generated for the std::cout version is 0.7 MB.
Http://blog.jobbole.com/47231/.