The Art of 0.033 Seconds: Traps in the XNA Math Library
 
 
 
For personal use only, do not reprint, do not use for any commercial purposes. 
 
 
 
The previous article described how to view the ASM code of a .NET program and analyzed some functions under System.Math. This time, we take a closer look at how to use the math library in XNA efficiently. The following uses Matrix and Vector3 as examples; other types can be reasoned about in the same way. For testing and demonstration purposes, I wrote some fairly "idiotic" code:
 
 
 
Test 1:
 
  Code  
AABB box = new AABB(new Vector3(34.4f, 4, 23));
Vector3 seed = box.Center;

public class AABB
{
    private Vector3 center;
    public Vector3 Center
    {
        get { return center; }
        set { center = value; }
    }
    public AABB(Vector3 center)
    {
        this.center = center;
    }
}
 
 
The purpose of Test 1 is to check whether the JIT inlines simple properties that return a Vector3, such as Center. Unfortunately, the answer is no.
 
 
Test 2:
 
 Code 
public static Vector3 Foo(Vector3 v, float radius)
{
    Matrix A;
    Matrix.CreateTranslation(ref v, out A);

    Matrix B = Matrix.CreateTranslation(v);

    Matrix.Multiply(ref A, ref B, out A);

    Vector3 F1 = B.Forward;

    Vector3 F2;
    F2.X = B.M31;
    F2.Y = B.M32;
    F2.Z = B.M33;

    Vector3 F3;
    Vector3.Add(ref F1, ref F2, out F3);
    Vector3.Transform(ref F3, ref A, out F1);
    return F1;
}
 
 
Test 2 is the focus of this article. To make the discussion easier, we divide the code of Test 2 into several sections:
 
 
Section 1:
 
Code 
public static Vector3 Foo(Vector3 v, float radius)
{
    Matrix A;
    Matrix.CreateTranslation(ref v, out A);
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 sub esp,0F4h
0000000b mov esi,ecx
0000000d lea edi,[ebp+FFFFFF54h]
00000013 mov ecx,29h // 29h = 41
00000018 xor eax,eax // eax = 0
0000001a rep stos dword ptr es:[edi] // repeat 41 times: store a zeroed float
0000001c mov ecx,esi
0000001e mov dword ptr [ebp+FFFFFF04h],ecx
00000024 cmp dword ptr ds:[03A97E8Ch],0
0000002b je 00000032
0000002d call 76CFD6C9 // throw exception here ???
00000032 lea ecx,[ebp+0Ch] // pass arg
00000035 lea edx,[ebp-48h] // pass arg
00000038 call dword ptr ds:[00F46FB8h] // call CreateTranslation
 
This part of the code confused me for a long time at first, because only three instructions are related to calling CreateTranslation: 032 ~ 038. What is the pile of code before them? Is it a compiler bug? I had thought that with the ASM in hand I would not need to check the IL, but it was only after looking at the IL again that I realized I had forgotten about the initialization of the function's local stack. Foo uses five local variables: two Matrix values and three Vector3 values. The most interesting thing is that, instead of initializing five separate structs, the compiler initializes 41 independent float values. Instructions 018 ~ 01a perform the initialization, setting all 41 floats to 0. The 000 ~ 00d part is not very important; you can ignore it. The strange part is 01c ~ 02d; I never figured out what this code is for. My guess is that it is some kind of safety check, for example for a stack overflow, and that it jumps to the function at 76cfd6c9 if the check fails.
 
 
The conclusion is: the number of local variables in a function affects its efficiency. The rep stos at 01a repeats 41 times to zero all of them.
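
The figure of 41 can be checked against the sizes of the locals: a Matrix holds 16 floats and a Vector3 holds 3. Below is a minimal sketch of that arithmetic; the class and constant names are hypothetical and exist only for illustration.

```csharp
public static class FooLocals
{
    // Field counts of the two XNA value types used in Foo.
    public const int FloatsPerMatrix = 16;  // M11 .. M44
    public const int FloatsPerVector3 = 3;  // X, Y, Z

    // Foo declares two Matrix locals (A, B) and three Vector3 locals (F1, F2, F3).
    public const int MatrixLocals = 2;
    public const int Vector3Locals = 3;

    // Number of 4-byte slots the prologue zeroes: 2 * 16 + 3 * 3 = 41 = 29h.
    public const int ZeroedFloats =
        MatrixLocals * FloatsPerMatrix + Vector3Locals * FloatsPerVector3;
}
```

Each additional Matrix local would add 16 more slots to zero on every call, which is why reusing a local (as Foo does with A in Matrix.Multiply) keeps the prologue cheaper.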
 
 
Section 2:
 
Code 
Matrix B = Matrix.CreateTranslation(v);
0000003e lea eax,[ebp+0Ch]
00000041 push dword ptr [eax+8]
00000044 sub esp,8
00000047 movq xmm0,mmword ptr [eax]
0000004b movq mmword ptr [esp],xmm0
00000050 lea ecx,[ebp+FFFFFF14h]
00000056 call dword ptr ds:[00F46FACh]
0000005c lea edi,[ebp+FFFFFF78h] // address of B
00000062 lea esi,[ebp+FFFFFF14h] // address of return value
00000068 mov ecx,10h // 10h = 16
0000006d rep movs dword ptr es:[edi],dword ptr [esi] // copy return value to B

Obviously, this CreateTranslation call is much more complicated than the one above. I am not an assembly expert, so I am not sure what 041 ~ 050 do (I hope someone can give me some advice); my best reading is that 041 ~ 04b copy the three floats of v onto the stack as the by-value argument, and 050 loads the address of a temporary that receives the return value. In 05c ~ 06d, the return value of CreateTranslation is copied into B; note that 16 floats are copied here. What is less obvious is that the address of the CreateTranslation called this time is completely different from the previous one. If you look at the ASM of CreateTranslation(Vector3), you will find that it does two more things than CreateTranslation(ref Vector3, out Matrix): 1. it sets up local stack space the size of 16 floats; 2. when the calculation is complete, it copies the values from the local stack into a temporary, which takes another 16 movs.
The conclusion is that the non-ref version of the function executes 48 more instructions than the ref version! For value types, the way arguments are passed can by itself make a huge performance difference.
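
To make the two calling styles concrete, here is a minimal, self-contained sketch. Vec3 and Mat4 are hypothetical stand-ins written for this example; they are not the XNA types, and the by-value overload simply forwards to the ref/out one, which is an assumption of the sketch rather than a description of how XNA implements it.

```csharp
public struct Vec3
{
    public float X, Y, Z;
    public Vec3(float x, float y, float z) { X = x; Y = y; Z = z; }
}

public struct Mat4
{
    public float M11, M12, M13, M14;
    public float M21, M22, M23, M24;
    public float M31, M32, M33, M34;
    public float M41, M42, M43, M44;

    // By-value style: the argument is copied in and the 16-float result
    // is copied back out to the caller.
    public static Mat4 CreateTranslation(Vec3 position)
    {
        Mat4 result;
        CreateTranslation(ref position, out result);
        return result;
    }

    // ref/out style: only addresses are passed, and the result is written
    // directly into the caller's variable.
    public static void CreateTranslation(ref Vec3 position, out Mat4 result)
    {
        result.M11 = 1f; result.M12 = 0f; result.M13 = 0f; result.M14 = 0f;
        result.M21 = 0f; result.M22 = 1f; result.M23 = 0f; result.M24 = 0f;
        result.M31 = 0f; result.M32 = 0f; result.M33 = 1f; result.M34 = 0f;
        result.M41 = position.X;
        result.M42 = position.Y;
        result.M43 = position.Z;
        result.M44 = 1f;
    }
}

public static class PassingDemo
{
    // Builds the same matrix both ways and reports whether the translation rows agree.
    public static bool BothStylesAgree()
    {
        Vec3 v = new Vec3(1f, 2f, 3f);

        Mat4 a;
        Mat4.CreateTranslation(ref v, out a);   // no struct copies at the call site

        Mat4 b = Mat4.CreateTranslation(v);     // convenient, but copies v in and the result out

        return a.M41 == b.M41 && a.M42 == b.M42 && a.M43 == b.M43 && a.M44 == b.M44;
    }
}
```

Both styles produce the same matrix; the difference is only in how much data is moved around the call, so the ref/out form is mainly worth the extra verbosity in code that runs every frame.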
 
 
Section 3:
 
 Code 
Matrix.Multiply(ref A, ref B, out A);
0000006f lea eax,[ebp-48h]
00000072 push eax
00000073 lea ecx,[ebp-48h]
00000076 lea edx,[ebp+FFFFFF78h]
0000007c call dword ptr ds:[00F472A8h]
 
This confirms the preceding conclusion once more. For the ref version of a function, the caller only needs to pass the addresses of the arguments and call the function.
 
 
Section 4:
 
Code 
Part 1:
Vector3 F1 = B.Forward;
00000082 lea ecx,[ebp+FFFFFF78h]
00000088 lea edx,[ebp+FFFFFF08h]
0000008e call dword ptr ds:[00F46F28h]
00000094 lea edi,[ebp+FFFFFF6Ch]
0000009a lea esi,[ebp+FFFFFF08h]
000000a0 movq xmm0,mmword ptr [esi]
000000a4 movq mmword ptr [edi],xmm0
000000a8 add esi,8
000000ab add edi,8
000000ae movs dword ptr es:[edi],dword ptr [esi]


Part 2:
Vector3 F2;
F2.X = B.M31;
000000af fld dword ptr [ebp-68h]
000000b2 fstp dword ptr [ebp+FFFFFF60h]
F2.Y = B.M32;
000000b8 fld dword ptr [ebp-64h]
000000bb fstp dword ptr [ebp+FFFFFF64h]
F2.Z = B.M33;
000000c1 fld dword ptr [ebp-60h]
000000c4 fstp dword ptr [ebp+FFFFFF68h]
 
The two short pieces of code here accomplish the same thing. The first version uses the Matrix property directly, while the second accesses the underlying elements by hand. (Strictly speaking, XNA's Forward returns the negated third row, so an exact manual equivalent would negate M31, M32 and M33; that does not change the conclusion here.) Forward is not inlined either, but Forward is not quite the same as the Center property discussed in Test 1. Center is much simpler, directly returning an existing value, whereas Forward has to assemble a value to return, so the lack of inlining is just about acceptable. Surprisingly, 10 instructions are used to access such a property. If you count the function called at 08e, you need to add another 14, for a total of 24! The code we inlined by hand uses only six instructions. Big win.
 
 
The conclusion is: applying manual inlining where appropriate improves performance.
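
The pattern from Section 4 can be sketched in isolation as follows. Float3, Transform, ThirdRow and InlineDemo are hypothetical names made up for this example, not XNA types; ThirdRow stands in for a property that, like Forward, assembles a new value on every access.

```csharp
public struct Float3
{
    public float X, Y, Z;
}

public struct Transform
{
    // Only the third row is modelled; a full matrix would have 16 fields.
    public float M31, M32, M33;

    // A property that builds a new struct each time it is read.
    public Float3 ThirdRow
    {
        get
        {
            Float3 r;
            r.X = M31;
            r.Y = M32;
            r.Z = M33;
            return r;
        }
    }
}

public static class InlineDemo
{
    // Goes through the property getter: a call plus a struct copy
    // whenever the JIT does not inline the getter.
    public static Float3 ViaProperty(ref Transform m)
    {
        return m.ThirdRow;
    }

    // Manually inlined: reads the fields directly, so the result
    // does not depend on the JIT's inlining decisions.
    public static Float3 ManuallyInlined(ref Transform m)
    {
        Float3 r;
        r.X = m.M31;
        r.Y = m.M32;
        r.Z = m.M33;
        return r;
    }
}
```

Both methods return the same value. Whether the getter is actually inlined depends on the runtime and JIT version, so the manual form is only worth its reduced readability in hot paths where the disassembly shows a real difference.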
 
 
Section 5: This part has little significance of its own; it is only there to prevent the preceding code from being optimized away by the compiler.
 
By now you should have a rough idea of how to use the math library in XNA correctly and how to optimize code that relies on it. If you study the ASM code further, you will surely discover more :)