108 Odd Tricks for Linux Programming, Part 4 (Compile-Time Unrolling)

Today we continue with the fourth trick in this series: unrolling code at compile time. For more background, see the notes at the end of this article.

We know that the compiler can unroll loops automatically, but how it unrolls them is largely outside our control. If we unroll by hand, the code becomes bloated and ugly, with large repeated sections. This article therefore introduces a common macro-based way to unroll code. You can extend it to any level; here I only go as far as DO16. See the DO macro definitions at the top of the code listing.
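To make the macro idea concrete, here is a minimal sketch of how the DO macros multiply a statement. It assumes DO4 pastes its argument four times and that each higher level doubles the previous one; the helper names copies_from_do16 and sum16 are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Assumed definitions: each level pastes its argument twice as often as the previous one.
#define DO(x)   x
#define DO4(x)  x x x x
#define DO8(x)  DO4(x) DO4(x)
#define DO16(x) DO8(x) DO8(x)

// Hypothetical helper: counts how many copies of a statement DO16 produces.
int copies_from_do16() {
    int count = 0;
    DO16(++count;)   // the preprocessor emits sixteen "++count;" statements, no loop
    return count;
}

// Hypothetical helper: sums 16 consecutive elements with a single macro invocation.
int sum16(const int* p) {
    assert(p != 0);
    int i = 0;
    int total = 0;
    DO16(total += p[i]; ++i;)
    return total;
}
```

The macro argument may contain several statements separated by semicolons, but a top-level comma would be read as a second macro argument, so the body is kept comma-free here.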

Experiments show that even when compiling with -O3, the manually unrolled version can still be faster than the version that is not unrolled by hand. You can try this yourself with the code below.

Readers may ask how much time this saves and how large the benefit is. Please run the experiment yourself to find out; the real answer comes from practice. If you often read high-quality open-source code, you will see this technique regularly. I hope that, once you understand it, you will use it more in your own work.
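One way to run such an experiment without inline assembly is sketched below. It is only an assumption-laden sketch: it times with std::chrono instead of rdtsc, uses unsigned values so that wrap-around is well defined, and the names make_buffer, fib_plain, fib_unrolled4 and time_ms are hypothetical. It makes no claim about which version wins; that depends on the compiler, the flags and the machine.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

#define DO4(x) x x x x   // assumed definition: paste the argument four times

// Hypothetical helper: buffer of n elements with f[0] = f[1] = 1 (requires n >= 2).
std::vector<unsigned> make_buffer(std::size_t n) {
    assert(n >= 2);
    std::vector<unsigned> f(n, 0u);
    f[0] = f[1] = 1u;
    return f;
}

// Plain loop, one element per iteration.
unsigned fib_plain(std::vector<unsigned>& f) {
    for (std::size_t i = 2; i < f.size(); ++i) f[i] = f[i - 1] + f[i - 2];
    return f.back();
}

// Manually unrolled loop, four elements per iteration plus a tail loop.
unsigned fib_unrolled4(std::vector<unsigned>& f) {
    const std::size_t n = f.size();
    std::size_t i = 2;
    while (i + 4 <= n) {
        DO4(f[i] = f[i - 1] + f[i - 2]; ++i;)
    }
    for (; i < n; ++i) f[i] = f[i - 1] + f[i - 2];
    return f.back();
}

// Hypothetical timing helper: runs fn once and returns the elapsed milliseconds.
template <typename Fn>
double time_ms(Fn fn, std::vector<unsigned>& f, unsigned& result) {
    const auto t0 = std::chrono::steady_clock::now();
    result = fn(f);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Call time_ms(fib_plain, a, ra) and time_ms(fib_unrolled4, b, rb) on two buffers from make_buffer, check that ra == rb, and compare the two timings over several runs rather than trusting a single measurement.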

If you are interested, try unrolling to different levels, see which level works best, and explain why. If you can, congratulations: you have reached a new level.
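For trying several unroll levels without writing a new macro each time, one option is to make the factor a template parameter. Note that this swaps in a different technique (template recursion instead of the DO macros), and the names Steps and fib_with_factor are hypothetical. The compiler is expected, but not guaranteed, to inline the N nested calls into straight-line code at normal optimization levels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Performs N consecutive Fibonacci steps through recursive template instantiation.
template <int N>
struct Steps {
    static void run(unsigned* f, std::size_t& i) {
        f[i] = f[i - 1] + f[i - 2];
        ++i;
        Steps<N - 1>::run(f, i);
    }
};

// Recursion terminator: zero steps left.
template <>
struct Steps<0> {
    static void run(unsigned*, std::size_t&) {}
};

// Fills f[0..n-1] with f[0] = f[1] = 1, processing N elements per loop iteration.
template <int N>
std::vector<unsigned> fib_with_factor(std::size_t n) {
    static_assert(N >= 1, "unroll factor must be at least 1");
    assert(n >= 2);
    std::vector<unsigned> f(n, 0u);
    f[0] = f[1] = 1u;
    std::size_t i = 2;
    while (i + N <= n) Steps<N>::run(f.data(), i);   // main loop, N steps at a time
    for (; i < n; ++i) f[i] = f[i - 1] + f[i - 2];   // tail loop for the remainder
    return f;
}
```

With this in place, fib_with_factor<4>(n), fib_with_factor<16>(n) and fib_with_factor<64>(n) can be timed side by side to look for the best level.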

 

Notes:

1) Some code introduced in earlier posts is reused here, so new readers are advised to read the earlier posts in this series first. The link is at the end of this article.

2) For more information, see: http://blog.csdn.net/pennyliang/archive/2010/10/30/5975678.aspx

After compiling with -O3, use objdump -d test_m1_o3 to inspect the generated code. [The code between the two rdtsc instructions is the main computation.]

400730: 83 fd 02          cmp    $0x2,%ebp
400733: 89 c6             mov    %eax,%esi
400735: 4c 8d 63 fc       lea    0xfffffffffffffffc(%rbx),%r12   // 0xfffffffffffffffc is -4 in hexadecimal. A later post will explain what this instruction is for; it is not covered here.
400739: 7e 21             jle    40075c <main+0xac>
40073b: 8d 45 fd          lea    0xfffffffffffffffd(%rbp),%eax
40073e: 4c 8d 63 fc       lea    0xfffffffffffffffc(%rbx),%r12
400742: 31 d2             xor    %edx,%edx
400744: 48 8d 48 01       lea    0x1(%rax),%rcx
400748: 8b 44 93 04       mov    0x4(%rbx,%rdx,4),%eax
40074c: 03 04 93          add    (%rbx,%rdx,4),%eax   // the loop is not unrolled
40074f: 89 44 93 08       mov    %eax,0x8(%rbx,%rdx,4)
400753: 48 83 c2 01       add    $0x1,%rdx   // equivalent to i++
400757: 48 39 ca          cmp    %rcx,%rdx
40075a: 75 ec             jne    400748 <main+0x98>


Use objdump -d test_m3_o3 to inspect the code. [The code between the two rdtsc instructions is the main computation.]

400726: 89 c7             mov    %eax,%edi
400728: 8d 45 0f          lea    0xf(%rbp),%eax
40072b: 85 ed             test   %ebp,%ebp
40072d: 89 ea             mov    %ebp,%edx
40072f: 4d 8d 6c 24 fc    lea    0xfffffffffffffffc(%r12),%r13
400734: be 02 00 00 00    mov    $0x2,%esi
400739: 0f 48 d0          cmovs  %eax,%edx
40073c: c1 fa 04          sar    $0x4,%edx
40073f: 83 fa 02          cmp    $0x2,%edx
400742: 7e 79             jle    4007bd <main+0x10d>
400744: 4d 8d 6c 24 fc    lea    0xfffffffffffffffc(%r12),%r13
400749: be 02 00 00 00    mov    $0x2,%esi
40074e: 66 90             xchg   %ax,%ax
400750: 8b 43 04          mov    0x4(%rbx),%eax   // eax is the accumulator; the unrolled code is clearly visible
400753: 03 03             add    (%rbx),%eax
400755: 83 c6 10          add    $0x10,%esi
400758: 89 43 08          mov    %eax,0x8(%rbx)
40075b: 03 43 04          add    0x4(%rbx),%eax
40075e: 89 43 0c          mov    %eax,0xc(%rbx)
400761: 03 43 08          add    0x8(%rbx),%eax
400764: 89 43 10          mov    %eax,0x10(%rbx)
400767: 03 43 0c          add    0xc(%rbx),%eax
40076a: 89 43 14          mov    %eax,0x14(%rbx)
40076d: 03 43 10          add    0x10(%rbx),%eax
400770: 89 43 18          mov    %eax,0x18(%rbx)
400773: 03 43 14          add    0x14(%rbx),%eax
400776: 89 43 1c          mov    %eax,0x1c(%rbx)
400779: 03 43 18          add    0x18(%rbx),%eax
40077c: 89 43 20          mov    %eax,0x20(%rbx)
40077f: 03 43 1c          add    0x1c(%rbx),%eax
400782: 89 43 24          mov    %eax,0x24(%rbx)
400785: 03 43 20          add    0x20(%rbx),%eax
400788: 89 43 28          mov    %eax,0x28(%rbx)
40078b: 03 43 24          add    0x24(%rbx),%eax
40078e: 89 43 2c          mov    %eax,0x2c(%rbx)
400791: 03 43 28          add    0x28(%rbx),%eax
400794: 89 43 30          mov    %eax,0x30(%rbx)
400797: 03 43 2c          add    0x2c(%rbx),%eax
40079a: 89 43 34          mov    %eax,0x34(%rbx)
40079d: 03 43 30          add    0x30(%rbx),%eax
4007a0: 89 43 38          mov    %eax,0x38(%rbx)
4007a3: 03 43 34          add    0x34(%rbx),%eax
4007a6: 89 43 3c          mov    %eax,0x3c(%rbx)
4007a9: 03 43 38          add    0x38(%rbx),%eax
4007ac: 89 43 40          mov    %eax,0x40(%rbx)
4007af: 03 43 3c          add    0x3c(%rbx),%eax
4007b2: 89 43 44          mov    %eax,0x44(%rbx)
4007b5: 48 83 c3 40       add    $0x40,%rbx
4007b9: 39 f2             cmp    %esi,%edx
4007bb: 7f 93             jg     400750 <main+0xa0>
4007bd: 39 f5             cmp    %esi,%ebp
4007bf: 7e 27             jle    4007e8 <main+0x138>
4007c1: 48 63 c6          movslq %esi,%rax
4007c4: 48 c1 e0 02       shl    $0x2,%rax
4007c8: 49 8d 4c 05 00    lea    0x0(%r13,%rax,1),%rcx
4007cd: 49 8d 54 04 f8    lea    0xfffffffffffffff8(%r12,%rax,1),%rdx
4007d2: 8b 01             mov    (%rcx),%eax
4007d4: 03 02             add    (%rdx),%eax
4007d6: 83 c6 01          add    $0x1,%esi
4007d9: 48 83 c1 04       add    $0x4,%rcx
4007dd: 89 42 08          mov    %eax,0x8(%rdx)
4007e0: 48 83 c2 04       add    $0x4,%rdx
4007e4: 39 f5             cmp    %esi,%ebp
4007e6: 7f ea             jg     4007d2 <main+0x122>


The unrolled code is clearly visible. The difference also shows up in the size of the compiled executable: the more code is expanded inline, the larger the executable.
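The accumulator pattern in the -O3 listing can be mirrored in source form. The sketch below is a hypothetical 4-way version written for illustration only (the listing itself is 16-way); the names fib_reference and fib_accumulate4 are assumptions, and unsigned is used so that wrap-around is well defined. Each step adds one value read from memory to a running value and stores the result, just as the add/mov pairs on eax do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward loop, used as a reference.
void fib_reference(unsigned* f, std::size_t n) {
    assert(n >= 2);
    for (std::size_t i = 2; i < n; ++i) f[i] = f[i - 1] + f[i - 2];
}

// Hypothetical 4-way version that keeps the newest value in a local accumulator.
void fib_accumulate4(unsigned* f, std::size_t n) {
    assert(n >= 2);
    std::size_t i = 2;
    while (i + 4 <= n) {
        unsigned* p = f + i - 2;      // p[0] and p[1] are already computed
        unsigned acc = p[1] + p[0];   // mov 0x4(%rbx),%eax ; add (%rbx),%eax
        p[2] = acc;                   // mov %eax,0x8(%rbx)
        acc += p[1]; p[3] = acc;      // add 0x4(%rbx),%eax ; mov %eax,0xc(%rbx)
        acc += p[2]; p[4] = acc;      // add 0x8(%rbx),%eax ; mov %eax,0x10(%rbx)
        acc += p[3]; p[5] = acc;      // add 0xc(%rbx),%eax ; mov %eax,0x14(%rbx)
        i += 4;                       // the listing advances rbx by 0x40 for 16 elements
    }
    for (; i < n; ++i) f[i] = f[i - 1] + f[i - 2];   // tail loop, as after the jg in the listing
}
```

Because the newest value stays in a register, each step needs only one load of the older operand instead of two.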

 

------------------------------------------ Compilation method ---------------------------------

Compile in debug mode:

g++ -g test.cpp -o test_m1 -D M_1

g++ -g test.cpp -o test_m2 -D M_2

g++ -g test.cpp -o test_m3 -D M_3

Compile with -O3 optimization:

g++ -O3 test.cpp -o test_m1_o3 -D M_1

g++ -O3 test.cpp -o test_m2_o3 -D M_2

g++ -O3 test.cpp -o test_m3_o3 -D M_3

 

----------------------------------------- Running method -------------------------------------

./test_m1 1000000 // compute the 1000000th Fibonacci number

./test_m2 1000000

./test_m3 1000000
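One thing to keep in mind when reading the RET value: the program stores its results in int, so for large N the printed number is no longer a true Fibonacci number (signed overflow is formally undefined, and in practice the value wraps). Only the tick count is meaningful for such runs. The sketch below, with the hypothetical name first_index_overflowing_int, works out where this starts, assuming f[0] = f[1] = 1 as in the program.

```cpp
#include <climits>
#include <cstdint>

// Returns the smallest index i for which f[i] no longer fits in an int,
// given f[0] = f[1] = 1. The sequence is tracked in 64 bits so it cannot wrap here.
int first_index_overflowing_int() {
    std::uint64_t prev = 1;   // f[i - 1]
    std::uint64_t cur = 1;    // f[i]
    int i = 1;
    while (cur <= static_cast<std::uint64_t>(INT_MAX)) {
        const std::uint64_t next = prev + cur;
        prev = cur;
        cur = next;
        ++i;
    }
    return i;
}
```

With a 32-bit int this returns 46, so any run with N greater than 46 prints a wrapped value.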

------------------------------------------- Code -------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Unrolling macros: each level pastes its argument twice as many times as the previous one.
#define DO(x) x
#define DO4(x) x x x x
#define DO8(x) DO4(x) DO4(x)
#define DO16(x) DO8(x) DO8(x)

const int MAX = 512*1024*1024;
const float cpu_mhz = 3000.164; // use cat /proc/cpuinfo to get the value
const float cpu_tick_count_per_msecond = cpu_mhz * 1000;

#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile ("rdtsc" : "=A" (x)); // "=A" returns the 64-bit counter in edx:eax
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
#endif

int main(int argc, char **argv)
{
    if (argc != 2)
    {
        printf("command line: test N, N no more than %d\n", MAX);
        return 0;
    }

    int *f = (int *)malloc(MAX * sizeof(int));
    if (f == NULL)
    {
        printf("malloc failed\n");
        return 1;
    }
    memset(f, 0, MAX * sizeof(int)); // touch the memory first to warm it up, so the timing is more accurate

    f[0] = 1;
    f[1] = 1;
    int fx = atoi(argv[1]);
    // The unrolled loops below may compute up to two elements past fx, so leave room for them.
    if (fx < 2 || fx > MAX - 2)
    {
        printf("N must be between 2 and %d\n", MAX - 2);
        free(f);
        return 0;
    }
    unsigned long long start = 0; // 64-bit, so the tick counter is not truncated
    unsigned long long end = 0;
    start = rdtsc();
#ifdef M_1
    for (int i = 2; i < fx; ++i) // compute step by step in a plain loop
    {
        f[i] = f[i-1] + f[i-2];
    }
#endif

#ifdef M_2
    int idx = fx / 4;
    int i = 2;

    int j = 0;
    for (; j < idx; ++j)
    {
        DO4(f[i] = f[i-1] + f[i-2]; i++;); // unrolled into 4 copies, so the loop runs 1/4 as many iterations
    }
    for (; i < fx; i++) // handle any remaining elements
    {
        f[i] = f[i-1] + f[i-2];
    }
#endif

#ifdef M_3
    int idx = fx / 16;
    int i = 2;

    int j = 0;
    for (; j < idx; ++j)
    {
        DO16(f[i] = f[i-1] + f[i-2]; i++;); // unrolled into 16 copies
    }
    for (; i < fx; i++) // handle any remaining elements
    {
        f[i] = f[i-1] + f[i-2];
    }
#endif
    end = rdtsc();
    printf("Run tick count: %llu\n", end - start);
    printf("RET: %d\n", f[fx-1]);
    free(f);
    return 0;
}
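The listing defines cpu_mhz and cpu_tick_count_per_msecond but prints only raw ticks. A small sketch of how a tick count could be turned into milliseconds is given below. The function name ticks_to_ms is hypothetical, and the conversion assumes the time-stamp counter advances at the nominal CPU frequency read from /proc/cpuinfo, which may not hold on every machine.

```cpp
#include <cassert>
#include <cmath>

// Converts a tick count to milliseconds for a CPU frequency given in MHz.
// 1 MHz = 1,000,000 ticks per second = 1,000 ticks per millisecond.
double ticks_to_ms(unsigned long long ticks, double cpu_mhz) {
    assert(cpu_mhz > 0.0);
    const double ticks_per_ms = cpu_mhz * 1000.0;
    return static_cast<double>(ticks) / ticks_per_ms;
}
```

For example, with cpu_mhz = 3000.164, a difference of 3000164 ticks corresponds to roughly one millisecond.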

 

For a more in-depth discussion, see: http://blog.csdn.net/pennyliang/archive/2010/10/30/5975678.aspx

Other articles in this odd-tricks series:

http://blog.csdn.net/pennyliang/category/746545.aspx
