Swifter C#: to inline or not, that is the question

Last Update:2018-12-08 Source: Internet

Author: User


If the question is how C# can be as fast as C++, then the real question is where C# is slow. Inlining is one of many factors that affect C# performance. If a large number of small, frequently called functions are not inlined, performance suffers badly, because the time spent setting up and tearing down the stack frame, pushing arguments, and jumping can easily exceed the time spent executing the function body itself.
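The call overhead described above can be observed with a small experiment. The following is only a sketch, not code from the original article: the class and method names (CallOverheadDemo, Square, SquareNoInline) are made up for illustration, and it relies on the standard [MethodImpl(MethodImplOptions.NoInlining)] attribute to stop the JIT from inlining one of two otherwise identical functions.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Hypothetical demo class; not part of the test program discussed in this article.
public static class CallOverheadDemo
{
    // Left to the JIT: a body this small is normally a candidate for inlining.
    public static double Square(double x)
    {
        return x * x;
    }

    // Same body, but the attribute forbids inlining, so every call is a real call.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static double SquareNoInline(double x)
    {
        return x * x;
    }

    public static double SumSquares(int n)
    {
        double sum = 0;
        for (int i = 0; i < n; i++) sum += Square(i);
        return sum;
    }

    public static double SumSquaresNoInline(int n)
    {
        double sum = 0;
        for (int i = 0; i < n; i++) sum += SquareNoInline(i);
        return sum;
    }

    // Times both loops. The numbers depend on the runtime, the CPU and the build mode.
    public static void Run(int n)
    {
        Stopwatch sw = Stopwatch.StartNew();
        SumSquares(n);
        long jitDecidesMs = sw.ElapsedMilliseconds;

        sw = Stopwatch.StartNew();
        SumSquaresNoInline(n);
        long noInliningMs = sw.ElapsedMilliseconds;

        Console.WriteLine("JIT decides: {0} ms, NoInlining: {1} ms", jitDecidesMs, noInliningMs);
    }
}
```

Calling CallOverheadDemo.Run with a large n from a release build, run without a debugger attached, should make any difference visible. A debug build disables most JIT optimizations, so both loops would then pay the call cost.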

Milo Yip's "C++/C#/F#/Java/JS/Lua/Python/Ruby rendering comparison" is a good practical example: a typical compute-intensive application with a large number of calls to small vector-math functions. The C# result is disappointing: it is more than twice as slow as VC++. Even after I changed the code to pass structs by out/ref (see Milo's article for the code), performance improved only slightly and the gap remained large. My first suspicion was that the difference came from the .NET CLR not inlining the small functions. That is quick to check in the debugger if you know how; if you do not know how to read the ASM generated by the JIT, see Clayman's article. It turned out my guess was wrong: the .NET JIT compiler had already inlined these functions. Below is the code generated for a component-wise vector multiplication:

                       Vec.mul(out rad, ref f, ref rad);
0000067e  fld         qword ptr [ebp-78h]
00000681  fmul        qword ptr [ebp+FFFFFF58h]
00000687  fstp        qword ptr [ebp+FFFFFF58h]
0000068d  fld         qword ptr [ebp-70h]
00000690  fmul        qword ptr [ebp+FFFFFF60h]
00000696  fstp        qword ptr [ebp+FFFFFF60h]
0000069c  fld         qword ptr [ebp-68h]
0000069f  fmul        qword ptr [ebp+FFFFFF68h]
000006a5  fstp        qword ptr [ebp+FFFFFF68h]

So the performance difference does not seem to come from a lack of inlining, which made me want to look deeper into inlining itself. Not every function gets inlined, so what are the .NET JIT's inlining rules? There must be a better way to find out than rolling dice. A Google search turned up a good article on inlining in the .NET CLR, "To Inline or not to Inline: That is the question". Its author, Vance Morrison, describes himself as an architect on the .NET Runtime who works on .NET Runtime performance, which sounds impressive. His main points are as follows:

Inlining is not always a win. It does reduce the total number of instructions executed, but it also increases code size, and when the code is large this can lower the instruction cache hit rate. If an instruction misses the L1 cache and has to be fetched from L2, 3-10 clock cycles are wasted; if it also misses L2 and has to be fetched from memory, the cost is higher still. Larger code also slows program startup. The .NET JIT has no hard-and-fast rule about which functions get inlined; the .NET team ran many experiments to decide when inlining pays off. When the JIT decides whether to inline, it does not have enough information about how the whole program will run, so its decision will not always be right. The following, however, is clear:

1. If inlining makes the code smaller, the call is always inlined. Note that this refers to the size of the native code, not the size of the IL.

2. The more often a call is executed, the more it benefits from inlining. For example, calls inside a loop are more likely to be inlined than calls outside it.

3. If inlining is likely to expose further optimizations, the call is more likely to be inlined. For example, functions with value-type parameters are more likely to be inlined, because inlining them usually enables better optimization.

The JIT uses the following heuristic to decide:

1. Estimate the size of the call site if the function is not inlined.

2. Estimate the size of the call site if the function is inlined. This estimate is based on the IL and uses a simple state machine (a Markov model, which I take to be a hidden Markov model) whose logic was derived from a large amount of measured data.

3. Compute a coefficient (multiplier). The default value is 1.

4. If the code is in a loop, increase the coefficient (5x).

5. Increase the coefficient if struct optimizations look likely (original: "Increase the multiplier if it looks like struct optimizations will kick in"). It is not clear to me whether this means structural optimizations or optimizations on struct value types.

6. If the inlined size <= the non-inlined size * the coefficient, the function is inlined.
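The six steps above can be sketched as a small decision function. This is an illustration of the steps as described here, not the JIT's actual code: the name ShouldInline and its parameters are hypothetical, the size estimates are simply passed in as numbers, and the factor used for step 5 is an assumed placeholder because the article gives no value for it.

```csharp
// Hypothetical sketch of the heuristic described above; not the real JIT implementation.
public static class InlineHeuristicSketch
{
    // nonInlineSize / inlineSize: estimated native code size of the call site
    // without and with inlining (steps 1 and 2), supplied by the caller.
    public static bool ShouldInline(int nonInlineSize, int inlineSize,
                                    bool inLoop, bool structOptimizationLikely)
    {
        double multiplier = 1.0;                 // step 3: default coefficient

        if (inLoop)
            multiplier = 5.0;                    // step 4: 5x for code in a loop

        if (structOptimizationLikely)
            multiplier *= 2.0;                   // step 5: assumed factor, real value unknown

        // step 6: inline when the inlined size fits within the scaled budget
        return inlineSize <= nonInlineSize * multiplier;
    }
}
```

With these assumptions, a call whose inlined size is 40 against a non-inlined size of 10 would be rejected outside a loop (40 > 10 * 1) and accepted inside one (40 <= 10 * 5), which matches the observation that the same function can be inlined at one call site and not at another.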

The conclusions are simple:

1. Inlining is transparent to C# code; the JIT takes care of it. Trust the JIT.

2. Small functions are inlined more readily, because inlining them does not increase code size much.

3. Function calls inside a loop body are inlined more readily.

4. Functions that take value-type parameters are more likely to be inlined.

I verified the points above, with the following results:

1. In practice, the same function is indeed usually inlined when called inside a loop but not when called outside one. Take the vector's normal() function as an example.

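public static void mul(out Vec result, ref Vec a, ref Vec b)
{
    result.x = a.x * b.x;
    result.y = a.y * b.y;
    result.z = a.z * b.z;
}

public void normal()
{
    // The third argument is a double, so this call binds to a scalar
    // overload of mul that is not shown in this excerpt.
    mul(out this, ref this, 1 / Math.Sqrt(x * x + y * y + z * z));
}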

A. Not inlined: the call is made from the main function, so it runs only once in the whole program:

Rd.normal();
0000007d  lea         ecx,[ebp-40h]
00000080  call        dword ptr ds:[00143978h]

B. Inlined: the call is in the radiance function, and radiance is called inside the nested loops of the main function:

U.normal();
000003e9  fld         qword ptr [ebp+FFFFFF28h]
000003ef  fmul        st,st(0)
000003f1  fld         qword ptr [ebp+FFFFFF30h]
000003f7  fmul        st,st(0)
000003f9  faddp       st(1),st
000003fb  fld         qword ptr [ebp+FFFFFF38h]
00000401  fmul        st,st(0)
00000403  faddp       st(1),st
00000405  fsqrt
00000407  fld1
00000409  fdivrp      st(1),st
0000040b  fld         st(0)
0000040d  fmul        qword ptr [ebp+FFFFFF28h]
00000413  fstp        qword ptr [ebp+FFFFFF28h]
00000419  fld         st(0)
0000041b  fmul        qword ptr [ebp+FFFFFF30h]
00000421  fstp        qword ptr [ebp+FFFFFF30h]
00000427  fmul        qword ptr [ebp+FFFFFF38h]
0000042d  fstp        qword ptr [ebp+FFFFFF38h]

So a call inside a loop body really is more likely to be inlined. Note that normal is a relatively large function, so you cannot tell whether a function will be inlined just by looking at the function itself; where it is called from matters as well.
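To reproduce the two call sites compared in cases A and B, a test harness along the following lines could be used. It is a sketch under assumptions: the Vec struct is reduced to what normal() needs, the scalar mul overload is assumed from the way normal() calls it, and CallSiteDemo with its two methods is a hypothetical stand-in for the real main and radiance functions.

```csharp
using System;

public struct Vec
{
    public double x, y, z;

    public Vec(double x, double y, double z)
    {
        this.x = x; this.y = y; this.z = z;
    }

    // Scalar overload, assumed here because normal() passes a double as the third argument.
    public static void mul(out Vec result, ref Vec a, double b)
    {
        result.x = a.x * b;
        result.y = a.y * b;
        result.z = a.z * b;
    }

    public void normal()
    {
        mul(out this, ref this, 1 / Math.Sqrt(x * x + y * y + z * z));
    }
}

// Hypothetical harness; the method names are illustrative only.
public static class CallSiteDemo
{
    // Like case A: a single call that is executed once.
    public static Vec NormalizeOnce(Vec v)
    {
        v.normal();
        return v;
    }

    // Like case B: the same call inside a hot loop.
    public static double NormalizeMany(int n)
    {
        double sum = 0;
        for (int i = 1; i <= n; i++)
        {
            Vec u = new Vec(i, 2 * i, 3 * i);
            u.normal();
            sum += u.x;   // use the result so the loop is not optimized away
        }
        return sum;
    }
}
```

Setting a breakpoint on each normal() call and opening the disassembly window shows what the JIT generated at that site. Whether the two sites actually differ depends on the runtime version and on running optimized code, so the result reported above is not guaranteed on other setups.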

2. I tested .NET 4 CP and .NET 3.5 (CLR 2.0) and found that the inlined code generated by the JIT differs between them. Take a call to the mul function as an example:

Code generated under .NET 2.0, 3.0, and 3.5:

Vec.mul(out x, ref r.d, t);
000000db  lea         ecx,[esp+10h]
000000df  lea         edx,[ebp+8]
000000e2  cmp         byte ptr [edx],al
000000e4  add         edx,18h
000000e7  mov         eax,edx
000000e9  fld         qword ptr [esp]
000000ec  fstp        qword ptr [esp+000003B0h]
000000f3  fld         qword ptr [eax]
000000f5  fmul        qword ptr [esp+000003B0h]
000000fc  fstp        qword ptr [ecx]
000000fe  fld         qword ptr [eax+8]
00000101  fmul        qword ptr [esp+000003B0h]
00000108  fstp        qword ptr [ecx+8]
0000010b  fld         qword ptr [eax+10h]
0000010e  fmul        qword ptr [esp+000003B0h]
00000115  fstp        qword ptr [ecx+10h]

Code generated under .NET 4:

Vec.mul(out x, ref r.d, t);
000000d2  fld         qword ptr [ebp-14h]
000000d5  lea         eax,[ebp+20h]
000000d8  fld         qword ptr [eax]
000000da  fmul        st,st(1)
000000dc  fstp        qword ptr [ebp-30h]
000000df  lea         eax,[ebp+20h]
000000e2  fld         qword ptr [eax+8]
000000e5  fmul        st,st(1)
000000e7  fstp        qword ptr [ebp-28h]
000000ea  lea         eax,[ebp+20h]
000000ed  fld         qword ptr [eax+10h]
000000f0  fmulp       st(1),st
000000f2  fstp        qword ptr [ebp-20h]

The code generated by CLR 4.0 differs from that of CLR 2.0, and the inlined code generated by the .NET 4 JIT is more efficient. This may explain why the test program performs so differently on 3.5 and 4: it takes 86 seconds on 3.5 and 67 seconds on 4. I carefully checked the calls in the test program. Every computing function called frequently inside a loop was inlined, and only the calls that run once outside the loops were not. The JIT appears to do this job very well, so we can safely leave inlining to it.

Since inlining is not the cause, why does C# perform so poorly in this test? Is it because C#'s two-stage compilation cannot optimize as deeply and thoroughly as C++, or is there some other reason? That needs further exploration.

To be continued...

http://www.cnblogs.com/miloyip/archive/2010/07/07/languages_brawl_GI.html
