Today we continue with the fourth common technique in this series. The compilation method and the full code are in the notes at the end of this article.
We know that the compiler may unroll loops automatically, but how it unrolls them is largely outside our control. If we unroll by hand, writing the loop body out again and again, the code becomes bloated and ugly, with large repeated blocks. This article therefore introduces a common way to unroll with macros. You can extend it to any level; here I only go as far as do16. See the do4, do8 and do16 macros in the code.
Experiments show that with manual unrolling, the code compiled at -O3 can still be faster than the version without manual unrolling. You can try it yourself with the code below.
Readers may ask how much time this saves and how large the benefit is. Please run the experiment yourself; practice is the best proof. If you read high-quality open-source code regularly, you will see this technique often. I hope that once you understand it you will use it more in your own work.
If you are interested, unroll to different levels, find out which level works best, and explain why. If you can, congratulations: you have reached a new level.
Notes: 1) This article uses some code introduced in earlier posts, so new readers are advised to read the earlier posts in the series first. The link is at the end of this article.
2) For more information, see: http://blog.csdn.net/pennyliang/archive/2010/10/30/5975678.aspx
After compiling with -O3, use objdump -D test_m1_o3 to inspect the generated code. [The code between the two rdtsc instructions is the main computation.]
400730: 83 fd 02        cmp $0x2,%ebp
400733: 89 c6           mov %eax,%esi
400735: 4c 8d 63 fc     lea 0xfffffffffffffffc(%rbx),%r12 // 0xfffffffffffffffc is -4 in hexadecimal. A later post will explain what this instruction is for; it is not covered here.
400739: 7e 21           jle 40075c <main+0xac>
40073b: 8d 45 fd        lea 0xfffffffffffffffd(%rbp),%eax
40073e: 4c 8d 63 fc     lea 0xfffffffffffffffc(%rbx),%r12
400742: 31 d2           xor %edx,%edx
400744: 48 8d 48 01     lea 0x1(%rax),%rcx
400748: 8b 44 93 04     mov 0x4(%rbx,%rdx,4),%eax
40074c: 03 04 93        add (%rbx,%rdx,4),%eax // the loop is not unrolled
40074f: 89 44 93 08     mov %eax,0x8(%rbx,%rdx,4)
400753: 48 83 c2 01     add $0x1,%rdx // equivalent to i++
400757: 48 39 ca        cmp %rcx,%rdx
40075a: 75 ec           jne 400748 <main+0x98>
Use objdump -D test_m3_o3 to inspect the generated code. [The code between the two rdtsc instructions is the main computation.]
400726: 89 c7           mov %eax,%edi
400728: 8d 45 0f        lea 0xf(%rbp),%eax
40072b: 85 ed           test %ebp,%ebp
40072d: 89 ea           mov %ebp,%edx
40072f: 4d 8d 6c 24 fc  lea 0xfffffffffffffffc(%r12),%r13
400734: be 02 00 00 00  mov $0x2,%esi
400739: 0f 48 d0        cmovs %eax,%edx
40073c: c1 fa 04        sar $0x4,%edx
40073f: 83 fa 02        cmp $0x2,%edx
400742: 7e 79           jle 4007bd <main+0x10d>
400744: 4d 8d 6c 24 fc  lea 0xfffffffffffffffc(%r12),%r13
400749: be 02 00 00 00  mov $0x2,%esi
40074e: 66 90           xchg %ax,%ax
400750: 8b 43 04        mov 0x4(%rbx),%eax // eax is the accumulator; the unrolled code is plain to see
400753: 03 03           add (%rbx),%eax
400755: 83 c6 10        add $0x10,%esi
400758: 89 43 08        mov %eax,0x8(%rbx)
40075b: 03 43 04        add 0x4(%rbx),%eax
40075e: 89 43 0c        mov %eax,0xc(%rbx)
400761: 03 43 08        add 0x8(%rbx),%eax
400764: 89 43 10        mov %eax,0x10(%rbx)
400767: 03 43 0c        add 0xc(%rbx),%eax
40076a: 89 43 14        mov %eax,0x14(%rbx)
40076d: 03 43 10        add 0x10(%rbx),%eax
400770: 89 43 18        mov %eax,0x18(%rbx)
400773: 03 43 14        add 0x14(%rbx),%eax
400776: 89 43 1c        mov %eax,0x1c(%rbx)
400779: 03 43 18        add 0x18(%rbx),%eax
40077c: 89 43 20        mov %eax,0x20(%rbx)
40077f: 03 43 1c        add 0x1c(%rbx),%eax
400782: 89 43 24        mov %eax,0x24(%rbx)
400785: 03 43 20        add 0x20(%rbx),%eax
400788: 89 43 28        mov %eax,0x28(%rbx)
40078b: 03 43 24        add 0x24(%rbx),%eax
40078e: 89 43 2c        mov %eax,0x2c(%rbx)
400791: 03 43 28        add 0x28(%rbx),%eax
400794: 89 43 30        mov %eax,0x30(%rbx)
400797: 03 43 2c        add 0x2c(%rbx),%eax
40079a: 89 43 34        mov %eax,0x34(%rbx)
40079d: 03 43 30        add 0x30(%rbx),%eax
4007a0: 89 43 38        mov %eax,0x38(%rbx)
4007a3: 03 43 34        add 0x34(%rbx),%eax
4007a6: 89 43 3c        mov %eax,0x3c(%rbx)
4007a9: 03 43 38        add 0x38(%rbx),%eax
4007ac: 89 43 40        mov %eax,0x40(%rbx)
4007af: 03 43 3c        add 0x3c(%rbx),%eax
4007b2: 89 43 44        mov %eax,0x44(%rbx)
4007b5: 48 83 c3 40     add $0x40,%rbx
4007b9: 39 f2           cmp %esi,%edx
4007bb: 7f 93           jg 400750 <main+0xa0>
4007bd: 39 f5           cmp %esi,%ebp
4007bf: 7e 27           jle 4007e8 <main+0x138>
4007c1: 48 63 c6        movslq %esi,%rax
4007c4: 48 c1 e0 02     shl $0x2,%rax
4007c8: 49 8d 4c 05 00  lea 0x0(%r13,%rax,1),%rcx
4007cd: 49 8d 54 04 f8  lea 0xfffffffffffffff8(%r12,%rax,1),%rdx
4007d2: 8b 01           mov (%rcx),%eax
4007d4: 03 02           add (%rdx),%eax
4007d6: 83 c6 01        add $0x1,%esi
4007d9: 48 83 c1 04     add $0x4,%rcx
4007dd: 89 42 08        mov %eax,0x8(%rdx)
4007e0: 48 83 c2 04     add $0x4,%rdx
4007e4: 39 f5           cmp %esi,%ebp
4007e6: 7f ea           jg 4007d2 <main+0x122>
The unrolled code is plain to see. The difference also shows up in the size of the compiled executable: the more code is expanded inline, the larger the executable.
------------------------------------------ Compilation method ---------------------------------
Compile in debug mode:
g++ -g test.cpp -o test_m1 -D M_1
g++ -g test.cpp -o test_m2 -D M_2
g++ -g test.cpp -o test_m3 -D M_3
Optimized compilation with -O3:
g++ -O3 test.cpp -o test_m1_o3 -D M_1
g++ -O3 test.cpp -o test_m2_o3 -D M_2
g++ -O3 test.cpp -o test_m3_o3 -D M_3
----------------------------------------- Running method -------------------------------------
./test_m1 1000000 // compute the Fibonacci sequence up to the 1000000th term
./test_m2 1000000
./test_m3 1000000
------------------------------------------- Code -------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define do(x) x
#define do4(x) x x x x
#define do8(x) do4(x) do4(x)
#define do16(x) do8(x) do8(x)
const int max = 512*1024*1024;
const float cpu_mhz = 3000.164; // use cat /proc/cpuinfo to get the value
const float cpu_tick_count_per_msecond = cpu_mhz * 1000;
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile ("rdtsc" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
#endif
int main(int argc, char **argv)
{
    if (argc != 2)
    {
        printf("command line: test N, N no more than %d\n", max);
        return 0;
    }
    int *f = (int *)malloc(max * sizeof(int));
    if (f == NULL)
    {
        printf("malloc failed\n");
        return 1;
    }
    memset(f, 0, max * sizeof(int)); // just warm up the cache, to make the measurement more accurate
    f[0] = 1;
    f[1] = 1;
    int fx = atoi(argv[1]);
    // 64-bit counters, so the tick count is not truncated
    unsigned long long start = 0;
    unsigned long long end = 0;
    start = rdtsc();
#ifdef M_1
    for (int i = 2; i < fx; ++i) // compute step by step in a plain loop
    {
        f[i] = f[i-1] + f[i-2];
    }
#endif
#ifdef M_2
    int r = fx % 4;
    int idx = fx / 4;
    int i = 2;
    int j = 0;
    for (; j < idx; ++j)
    {
        do4(f[i] = f[i-1] + f[i-2]; i++;); // unrolled into 4 copies of the body; the loop count drops to 1/4 of the original
    }
    for (; i < fx; i++)
    {
        f[i] = f[i-1] + f[i-2];
    }
#endif
#ifdef M_3
    int r = fx % 16;
    int idx = fx / 16;
    int i = 2;
    int j = 0;
    for (; j < idx; ++j)
    {
        do16(f[i] = f[i-1] + f[i-2]; i++;); // unrolled into 16 copies of the body
    }
    for (; i < fx; i++)
    {
        f[i] = f[i-1] + f[i-2];
    }
#endif
    end = rdtsc();
    printf("Run tick count: %llu\n", end - start);
    printf("ret: %d\n", f[fx-1]);
    free(f);
    return 0;
}
For a more in-depth discussion, see: http://blog.csdn.net/pennyliang/archive/2010/10/30/5975678.aspx
Other recommended articles in this clever-tricks series:
http://blog.csdn.net/pennyliang/category/746545.aspx