Is i-- faster than i++?

Last Update:2018-12-07 Source: Internet

Author: User


On Weibo today, someone told me that i-- is faster than i++. I wrote a test program and it really was faster. Could subtraction be faster than addition? That seemed impossible in principle, so I studied it in depth and finally found the cause.

Take a look at the test code:

    #include <stdio.h>
    #include <time.h>

    int main()
    {
        int count = 1000000000;

        clock_t cl = clock();
        for (int i = count; i > 0; i--)
        {
        }
        printf("Elapse %u ms\r\n", (clock() - cl));

        cl = clock();
        for (int i = 0; i < count; i++)
        {
        }
        printf("Elapse %u ms\r\n", (clock() - cl));

        return 0;
    }

The code above was compiled with VC 2008 with optimization disabled. If optimization is left on, the compiler removes both loops entirely, because they do nothing.

The result after running is:

    Elapse 2267 ms
    Elapse 2569 ms

That is to say, over 1 billion iterations the decrementing loop is about 300 ms faster than the incrementing loop, a difference of more than 10%.

At the C level, the two loops look almost the same. At first I was stunned for more than a minute. Then I compared the two loops carefully and found that they differ in two places. The first is addition versus subtraction. The second is the loop condition: one loop compares i with an immediate value (0), while the other compares i with a variable (count). Based on what I know about computer hardware, I first ruled out the first difference as the cause of the performance gap. The problem was therefore probably the second difference. In assembly language two memory variables cannot be compared directly; one of them must first be loaded into a register, which costs at least one extra instruction. To verify this judgment, let's look at the assembly generated for the code above:

     1: int main()
     2: {
     3: 00cc1000  push  ebp
     4: 00cc1001  mov   ebp, esp
     5: 00cc1003  sub   esp, 10h
     6:
     7:     int count = 1000000000;
     8: 00cc1006  mov   dword ptr [count], 3b9aca00h
     9:
    10:
    11:     clock_t cl = clock();
    12: 00cc100d  call  dword ptr [__imp__clock (0cc209ch)]
    13: 00cc1013  mov   dword ptr [cl], eax
    14:
    15:     for (int i = count; i > 0; i--)
    16: 00cc1016  mov   eax, dword ptr [count]
    17: 00cc1019  mov   dword ptr [i], eax
    18: 00cc101c  jmp   main+27h (0cc1027h)
    19: 00cc101e  mov   ecx, dword ptr [i]      // copy the value of i from memory into ecx
    20: 00cc1021  sub   ecx, 1                  // subtract 1 from ecx
    21: 00cc1024  mov   dword ptr [i], ecx      // store ecx back to i; the i-- operation is now complete
    22: 00cc1027  cmp   dword ptr [i], 0        // compare the value of i in memory with 0
    23: 00cc102b  jle   main+2fh (0cc102fh)     // if i is less than or equal to 0, jump to line 29 and leave the loop
    24:     {
    25:     }
    26: 00cc102d  jmp   main+1eh (0cc101eh)     // if i is greater than 0, jump to line 19 and continue the loop
    27:
    28:     printf("Elapse %u ms", (clock() - cl));
    29: 00cc102f  call  dword ptr [__imp__clock (0cc209ch)]
    30: 00cc1035  sub   eax, dword ptr [cl]
    31: 00cc1038  push  eax
    32: 00cc1039  push  offset ___xi_z+30h (0cc20f4h)
    33: 00cc103e  call  dword ptr [__imp__printf (0cc20a4h)]
    34: 00cc1044  add   esp, 8
    35:
    36:     cl = clock();
    37: 00cc1047  call  dword ptr [__imp__clock (0cc209ch)]
    38: 00cc104d  mov   dword ptr [cl], eax
    39:
    40:     for (int i = 0; i < count; i++)
    41: 00cc1050  mov   dword ptr [i], 0
    42: 00cc1057  jmp   main+62h (0cc1062h)
    43: 00cc1059  mov   edx, dword ptr [i]      // copy the value of i from memory into edx
    44: 00cc105c  add   edx, 1                  // add 1 to edx
    45: 00cc105f  mov   dword ptr [i], edx      // store edx back to the address of i
    46: 00cc1062  mov   eax, dword ptr [i]      // copy the value of i into eax
    47: 00cc1065  cmp   eax, dword ptr [count]  // compare eax with the value at the address of count
    48: 00cc1068  jge   main+6ch (0cc106ch)     // if i is greater than or equal to count, leave the loop
    49:     {
    50:     }
    51: 00cc106a  jmp   main+59h (0cc1059h)     // otherwise, jump to line 43 and continue the loop

I added comments to the loop portions of the listing (lines 19 to 26 and lines 43 to 51). We can clearly see that each iteration of the second loop executes 7 instructions, while each iteration of the first executes 6. By instruction count, the first loop should be about 1/7 faster than the second, which is basically consistent with the measured results.

Now let's look at why the compiler needs the extra machine instruction. The reason is that an x86 cmp instruction cannot compare two memory operands directly. A memory value can only be compared with a register or an immediate value. This is determined by the hardware instruction set, and it forces the compiler to add an instruction that first loads one of the memory values into a register.

Looking further, the compiler seems rather stupid. If it copied dword ptr [count] into a register such as ecx before the loop, then line 46 could compare ecx with dword ptr [i] directly, and the instruction on line 47 would not be needed. But in fact the compiler is probably not that stupid. As mentioned earlier, I disabled compiler optimization, because with optimization on the two for loops are removed completely and never executed, and the measured time is 0. Since we told the compiler not to optimize, it does not optimize this instruction either. If it did apply the optimization described above, changing the value of count from the debugger in the middle of the loop would become harder, because the debugger would have to take over some of the compiler's work.

Looking further still, this part of the assembly can also be optimized:

    21: 00cc1024  mov   dword ptr [i], ecx      // store ecx back to i; the i-- operation is now complete
    22: 00cc1027  cmp   dword ptr [i], 0        // compare the value of i in memory with 0

Line 22 can be optimized to cmp ecx, 0.

We know that reading and writing registers is the fastest, followed by first-level cache, second-level cache, third-level cache, then memory, and finally disk.

If line 22 were optimized to cmp ecx, 0, it would certainly run faster than cmp dword ptr [i], 0, because the latter has to compute an address and read the data from the cache (if the CPU has one). If the data is not in the cache, it has to be read from memory, which is slower still.

Finally, we changed the i++ loop to

    for (int i = 0; i < 1000000000; i++)

and ran the test again. The result is:

    Elapse 2334 ms
    Elapse 2290 ms

We can see that the two loops now take basically the same time.

    for (int i = 0; i < 1000000000; i++)
    01201050  mov   dword ptr [i], 0
    01201057  jmp   main+62h (1201062h)
    01201059  mov   edx, dword ptr [i]
    0120105c  add   edx, 1
    0120105f  mov   dword ptr [i], edx
    01201062  cmp   dword ptr [i], 3b9aca00h
    01201069  jge   main+6dh (120106dh)
    {
    }
    0120106b  jmp   main+59h (1201059h)

Looking at the assembly again, once the condition of the for loop is changed to a comparison with an immediate value, each iteration executes six instructions, the same as the i-- loop. That is why the times are basically equal.

Conclusion:

There is no performance difference between i++ and i-- themselves. The i-- loop appeared faster only because, at the assembly level, the i++ loop in this test executed one extra machine instruction per iteration, caused by comparing i with a variable instead of an immediate value. Along the way we also learned a little about how assembly instructions can be optimized. I hope this helps.

Weibo: http://weibo.com/hubbledotnet

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
