Is i-- faster than i++?

Source: Internet
Author: User

On Weibo today, someone told me that i-- is faster than i++. I tested it with a program, and it really was faster. Is subtraction actually faster than addition? That seemed impossible in principle, so I studied it in depth and finally found the cause.

Take a look at the test code:

 
#include <stdio.h>
#include <time.h>

int main()
{
    int count = 1000000000;
    clock_t cl = clock();

    for (int i = count; i > 0; i--) {}
    printf("Elapse %u ms\r\n", (clock() - cl));

    cl = clock();
    for (int i = 0; i < count; i++) {}
    printf("Elapse %u ms\r\n", (clock() - cl));

    return 0;
}

 

The code above was compiled with VC 2008 with optimization disabled (with optimization enabled, the compiler removes both loops entirely because they do nothing).
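As an aside, disabling optimization is not the only way to keep an empty loop alive. The following is a minimal sketch, not what the original test did: it assumes two hypothetical helpers, spin_down and spin_up, whose loop counter is declared volatile. Every read and write of a volatile object counts as an observable access, so the compiler has to keep the loop even in an optimized build.

```c
/* Hypothetical helpers, not part of the original test program.
 * The loop counter is volatile, so each read and write of i must
 * really happen and the compiler cannot delete the empty loop. */

/* Counts down from n to 0 and returns the final counter value. */
int spin_down(int n)
{
    volatile int i;
    for (i = n; i > 0; i--) {
        /* intentionally empty */
    }
    return i;
}

/* Counts up from 0 to n and returns the final counter value. */
int spin_up(int n)
{
    volatile int i;
    for (i = 0; i < n; i++) {
        /* intentionally empty */
    }
    return i;
}
```

Because the counter is forced through memory on every access, the generated code resembles the unoptimized build analyzed in this article, so the timings remain comparable in spirit rather than identical.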

The result of running it is:
Elapse 2267 ms
Elapse 2569 ms

That is, over 1 billion iterations the decrementing loop is about 300 ms faster than the incrementing loop, a difference of more than 10%.
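One caveat about the numbers: the test prints the raw difference of two clock() values and labels it as milliseconds. That is correct with the Microsoft C runtime, where CLOCKS_PER_SEC is 1000, but on other platforms a tick is often a different unit. A minimal sketch of a portable conversion, assuming a hypothetical helper named ticks_to_ms that is not in the original program:

```c
#include <time.h>

/* Hypothetical helper, not part of the original test program.
 * Converts a clock() tick count to milliseconds using CLOCKS_PER_SEC,
 * so the printed value means the same thing on every platform. */
unsigned long ticks_to_ms(clock_t ticks)
{
    return (unsigned long)((double)ticks * 1000.0 / CLOCKS_PER_SEC);
}
```

With this helper, the print statement could become printf("Elapse %lu ms\r\n", ticks_to_ms(clock() - cl)); which also makes the format specifier match the argument type.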

From the C language's point of view, the two loops are almost identical. At first I was stumped for more than a minute. Then I compared the two loops carefully and saw that they differ in two places: one uses subtraction and the other addition, and the loop condition of the for statement compares against an immediate value (0) in one case and against a variable (count) in the other. From what I know about computer hardware, I first ruled out the first difference as the cause of the performance gap. That leaves the second difference as the likely culprit, because I know that in assembly language two memory variables cannot be compared directly; one of them must first be loaded into a register, which costs at least one extra instruction. The problem is probably there. To verify this, let's look at the assembly generated for the code above:

 

 

 
 1: int main()
 2: {
 3: 00cc1000 push ebp
 4: 00cc1001 mov ebp, esp
 5: 00cc1003 sub esp, 10h
 6:
 7:     int count = 1000000000;
 8: 00cc1006 mov dword ptr [count], 3b9aca00h
 9:
10:
11:     clock_t cl = clock();
12: 00cc100d call dword ptr [__imp__clock (0cc209ch)]
13: 00cc1013 mov dword ptr [cl], eax
14:
15:     for (int i = count; i > 0; i--)
16: 00cc1016 mov eax, dword ptr [count]
17: 00cc1019 mov dword ptr [i], eax
18: 00cc101c jmp main+27h (0cc1027h)
19: 00cc101e mov ecx, dword ptr [i]      // copy the value of i from memory into the ecx register
20: 00cc1021 sub ecx, 1                  // subtract 1 from ecx
21: 00cc1024 mov dword ptr [i], ecx      // copy ecx back to the memory address of i; this completes i--
22: 00cc1027 cmp dword ptr [i], 0        // compare the value of i in memory with 0
23: 00cc102b jle main+2fh (0cc102fh)     // if it is less than or equal to 0, jump to line 29 and leave the loop
24:     {
25:     }
26: 00cc102d jmp main+1eh (0cc101eh)     // otherwise (greater than 0), jump to line 19 and continue the loop
27:
28:     printf("Elapse %u ms", (clock() - cl));
29: 00cc102f call dword ptr [__imp__clock (0cc209ch)]
30: 00cc1035 sub eax, dword ptr [cl]
31: 00cc1038 push eax
32: 00cc1039 push offset ___xi_z+30h (0cc20f4h)
33: 00cc103e call dword ptr [__imp__printf (0cc20a4h)]
34: 00cc1044 add esp, 8
35:
36:     cl = clock();
37: 00cc1047 call dword ptr [__imp__clock (0cc209ch)]
38: 00cc104d mov dword ptr [cl], eax
39:
40:     for (int i = 0; i < count; i++)
41: 00cc1050 mov dword ptr [i], 0
42: 00cc1057 jmp main+62h (0cc1062h)
43: 00cc1059 mov edx, dword ptr [i]      // copy the value of i from memory into the edx register
44: 00cc105c add edx, 1                  // add 1 to edx
45: 00cc105f mov dword ptr [i], edx      // copy edx back to the memory address of i
46: 00cc1062 mov eax, dword ptr [i]      // copy the value of i into the eax register
47: 00cc1065 cmp eax, dword ptr [count]  // compare eax with the value at the address of count
48: 00cc1068 jge main+6ch (0cc106ch)     // if it is greater than or equal to count, leave the loop
49:     {
50:     }
51: 00cc106a jmp main+59h (0cc1059h)     // otherwise, jump to line 43 and continue the loop

 

I added comments to the loop portions of the assembly listing (lines 19 to 26 and lines 43 to 51). We can clearly see that the second loop executes 7 assembly instructions per iteration while the first executes 6, so the first loop does about 1/7 less work than the second. This is roughly consistent with the measured results.
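The comparison between the instruction counts and the measured times can be made explicit. The sketch below assumes two hypothetical helpers, predicted_ratio and measured_ratio, and treats every instruction as costing the same time, which is only a rough model of a real CPU.

```c
/* Hypothetical helpers for checking the article's arithmetic.
 * Simplifying assumption: every instruction takes the same time. */

/* Time ratio (fast loop / slow loop) predicted from the number of
 * instructions each loop executes per iteration. */
double predicted_ratio(int fast_instructions, int slow_instructions)
{
    return (double)fast_instructions / (double)slow_instructions;
}

/* Time ratio (fast loop / slow loop) actually observed. */
double measured_ratio(unsigned fast_ms, unsigned slow_ms)
{
    return (double)fast_ms / (double)slow_ms;
}
```

Calling predicted_ratio(6, 7) gives about 0.857, while measured_ratio(2267, 2569) gives about 0.882. The two values are close but not equal, which fits the description of the model as approximately consistent with the test.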

Now let's look at why the compiler needs one more machine instruction. The reason is that an assembly instruction cannot compare two memory values directly; a memory value can only be compared with a register or an immediate value. This is determined by the hardware architecture, and it forces the compiler to add an instruction that first loads the memory value into a register.

Looking further, the compiler seems rather stupid. If it copied dword ptr [count] into a register before the loop, for example ecx, it could use cmp ecx, dword ptr [i] directly at line 46, and the instruction on line 47 would no longer be needed (with the operands in that order, the jump on line 48 would also have to change from jge to jle). In fact the compiler is probably not that stupid. As mentioned earlier, I disabled compiler optimization, because with optimization on both for loops are eliminated completely and never executed, and the measured time is 0. Since we told the compiler not to optimize, it does not optimize this instruction either. If it did apply the optimization described above, changing the value of count inside the loop while debugging would become harder: the debugger would have to do some of the compiler's work.

Looking further still, this part of the assembly could also be optimized:

21: 00cc1024 mov dword ptr [i], ecx      // copy ecx back to the memory address of i; this completes i--
22: 00cc1027 cmp dword ptr [i], 0        // compare the value of i in memory with 0

Line 22 could be optimized to cmp ecx, 0.

We know that reading and writing registers is fastest, followed by the L1 cache, the L2 cache, the L3 cache, then main memory, and finally disk.

If line 22 were optimized to cmp ecx, 0, it would certainly run faster than cmp dword ptr [i], 0, because the latter has to compute an address and read the data from the cache (if the CPU has one). If the data is not in the cache, it has to be read from main memory, which is slower still.

 

Finally, we changed the i++ loop to

for (int i = 0; i < 1000000000; i++)

and tested again. The result is:

Elapse 2334 ms
Elapse 2290 ms

We can see that the two loops now take basically the same time.

 
    for (int i = 0; i < 1000000000; i++)
01201050 mov dword ptr [i], 0
01201057 jmp main+62h (1201062h)
01201059 mov edx, dword ptr [i]
0120105c add edx, 1
0120105f mov dword ptr [i], edx
01201062 cmp dword ptr [i], 3b9aca00h
01201069 jge main+6dh (120106dh)
    {
    }
0120106b jmp main+59h (1201059h)

 

Looking at the assembly again: once the condition of the for loop compares against an immediate value, the loop takes six instructions, the same as the i-- loop. That is why the times are basically the same.

 

Conclusion:

There is no performance difference between i++ and i-- themselves. The i-- loop only appeared faster because, at the assembly level, the i++ loop executed one extra machine instruction per iteration: it compared i against a variable instead of an immediate value. Along the way we also learned a little about optimizing assembly instructions. I hope this helps.

 

Weibo: http://weibo.com/hubbledotnet
