SSE Instruction Optimization

Source: Internet
Author: User

SSE reference: http://docs.sun.com/app/docs/doc/817-5477/6mkuavhrm?a=view

 

From: http://goutie.blog.sohu.com/82010206.html

 

4.2.3 Cross Product
Another frequently used vector operation is the cross product (see any analytic geometry text for details). Given two arbitrary vectors in 3D space, the cross product yields a third vector. If the two input vectors are not parallel, the result is perpendicular to both of them.
A typical use is computing the normal vector of a plane or polygon for lighting calculations. Once again, SSE shows its efficiency advantage over plain C++.
inline void ZFXVector::Cross(const ZFXVector &u, const ZFXVector &v)
{
    if (!g_bSSE)
    {
        x = u.y * v.z - u.z * v.y;
        y = u.z * v.x - u.x * v.z;
        z = u.x * v.y - u.y * v.x;
        w = 1.0f;
        // Translator's note: this is the standard cross-product formula. In another PDF
        // the parameters are named v1 and v2 instead of u and v, so I suspect the author
        // copied it from there. Readers, please take note.
    }
    else
    {
        __asm {
            mov    esi, u               ; esi = address of u
            mov    edi, v               ; edi = address of v

            movups xmm0, [esi]          ; xmm0 = u
            movups xmm1, [edi]          ; xmm1 = v
            movaps xmm2, xmm0           ; xmm2 = copy of u
            movaps xmm3, xmm1           ; xmm3 = copy of v

            shufps xmm0, xmm0, 0xc9     ; (u.y, u.z, u.x, u.w)
            shufps xmm1, xmm1, 0xd2     ; (v.z, v.x, v.y, v.w)
            mulps  xmm0, xmm1

            shufps xmm2, xmm2, 0xd2     ; (u.z, u.x, u.y, u.w)
            shufps xmm3, xmm3, 0xc9     ; (v.y, v.z, v.x, v.w)
            mulps  xmm2, xmm3

            subps  xmm0, xmm2           ; xmm0 = u x v (the w lane becomes 0)
            mov    esi, this
            movups [esi], xmm0          ; store the result
        }
        w = 1.0f;
    }
}
  
In the code above, we first load the two input vectors, each into two XMM registers, because one copy of each vector is shuffled and then multiplied by a shuffled copy of the other. The two intermediate products end up in xmm0 and xmm2. If you are not familiar with the cross product, the C++ branch shows exactly what is being computed. The last step uses subps to subtract the two product vectors, which leaves the final cross product in xmm0. Finally, remember to copy the result back to memory.
Note: the shuffle control values are written here with the 0x prefix instead of the h suffix used earlier. Otherwise the compiler can have trouble recognizing the hexadecimal notation. This should not be a problem for real programmers. (To put it bluntly: try to solve a problem yourself before asking others without having analyzed it first. Otherwise you will never grow. Asking for help is sometimes necessary, but it is not the most important skill.)
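To make the shuffle control values less mysterious, here is a minimal sketch in plain C++ that emulates what shufps does when both operands are the same register. The names Vec4, Shuffle and CrossViaShuffles are hypothetical and not part of the ZFX library; they only illustrate how 0xc9 and 0xd2 select lanes.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;   // lanes 0..3 hold x, y, z, w

// Emulates "shufps r, r, imm": result lane i takes the source lane
// selected by bits [2i+1 : 2i] of the control byte.
Vec4 Shuffle(const Vec4 &src, unsigned imm)
{
    assert(imm <= 0xFFu);            // the control value is a single byte
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
        r[i] = src[(imm >> (2 * i)) & 3u];
    return r;
}

// Cross product following the same register steps as the assembly.
Vec4 CrossViaShuffles(const Vec4 &u, const Vec4 &v)
{
    const Vec4 a = Shuffle(u, 0xC9);   // (u.y, u.z, u.x, u.w)
    const Vec4 b = Shuffle(v, 0xD2);   // (v.z, v.x, v.y, v.w)
    const Vec4 c = Shuffle(u, 0xD2);   // (u.z, u.x, u.y, u.w)
    const Vec4 d = Shuffle(v, 0xC9);   // (v.y, v.z, v.x, v.w)
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] * b[i] - c[i] * d[i];   // mulps, mulps, subps
    r[3] = 1.0f;                       // the original sets w = 1 afterwards
    return r;
}
```

Calling CrossViaShuffles with the x axis and the y axis returns the z axis, which is a quick way to check that the two control bytes are paired correctly.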
  
4.2.4 Vector Matrix Multiplication
  
We often need to multiply a vector by a 4x4 matrix, so a faster implementation is worthwhile here as well. We use the 4x4 matrix class ZFXMatrix from the library to describe the operation briefly.
I assume you know how to multiply a vector by a matrix. If not, refer to the C++ branch of the function below.
ZFXVector ZFXVector::operator*(const ZFXMatrix &m) const
{
    ZFXVector vcResult;

    if (!g_bSSE)
    {
        vcResult.x = x * m._11 + y * m._21 + z * m._31 + m._41;
        vcResult.y = x * m._12 + y * m._22 + z * m._32 + m._42;
        vcResult.z = x * m._13 + y * m._23 + z * m._33 + m._43;
        vcResult.w = x * m._14 + y * m._24 + z * m._34 + m._44;

        vcResult.x = vcResult.x / vcResult.w;
        vcResult.y = vcResult.y / vcResult.w;
        vcResult.z = vcResult.z / vcResult.w;
        vcResult.w = 1.0f;
    }
    else
    {
        float *ptrRet = (float *)&vcResult;
        __asm {
            mov    ecx, this            ; vector
            mov    edx, m               ; matrix
            movss  xmm0, [ecx]
            mov    eax, ptrRet          ; result vector
            shufps xmm0, xmm0, 0        ; broadcast x
            movss  xmm1, [ecx + 4]
            mulps  xmm0, [edx]          ; x * row 1 (memory operand must be 16-byte aligned)
            shufps xmm1, xmm1, 0        ; broadcast y
            movss  xmm2, [ecx + 8]
            mulps  xmm1, [edx + 16]     ; y * row 2
            shufps xmm2, xmm2, 0        ; broadcast z
            movss  xmm3, [ecx + 12]
            mulps  xmm2, [edx + 32]     ; z * row 3
            shufps xmm3, xmm3, 0        ; broadcast w
            addps  xmm0, xmm1
            mulps  xmm3, [edx + 48]     ; w * row 4
            addps  xmm2, xmm3
            addps  xmm0, xmm2
            movups [eax], xmm0          ; save the result
            mov    dword ptr [eax + 12], 0x3f800000 ; w = 1.0f (w is at byte offset 12)
        }
    }
    return vcResult;
}
  
The SSE knowledge covered so far is enough to implement the multiplication of a vector by a matrix. There appear to be many shuffle instructions, but each one is simply a broadcast of one vector component to all four lanes of a register.
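The broadcast-and-accumulate pattern can also be sketched with compiler intrinsics, which is useful where inline assembly is not available (for example in 64-bit MSVC builds). This is an illustration under assumptions, not the library's code: Vec4, Mat4 and Transform are hypothetical names, the matrix is assumed to be stored row by row, and the code targets x86 or x86-64 only.

```cpp
#include <xmmintrin.h>   // SSE intrinsics (x86 / x86-64 only)

struct Vec4 { float x, y, z, w; };

// Hypothetical row-major 4x4 matrix: m[0..3] is row 1 (_11.._14), and so on.
struct Mat4 { float m[16]; };

// Row vector times matrix: broadcast each component, multiply it by one
// matrix row, then add the four partial products.
Vec4 Transform(const Vec4 &v, const Mat4 &mat)
{
    // _mm_set1_ps plays the role of "movss + shufps reg, reg, 0".
    const __m128 r0 = _mm_mul_ps(_mm_set1_ps(v.x), _mm_loadu_ps(mat.m + 0));
    const __m128 r1 = _mm_mul_ps(_mm_set1_ps(v.y), _mm_loadu_ps(mat.m + 4));
    const __m128 r2 = _mm_mul_ps(_mm_set1_ps(v.z), _mm_loadu_ps(mat.m + 8));
    const __m128 r3 = _mm_mul_ps(_mm_set1_ps(v.w), _mm_loadu_ps(mat.m + 12));
    const __m128 sum = _mm_add_ps(_mm_add_ps(r0, r1), _mm_add_ps(r2, r3));

    float out[4];
    _mm_storeu_ps(out, sum);   // unaligned store, like movups
    return Vec4{out[0], out[1], out[2], out[3]};
}
```

Unlike the assembly version, this sketch uses unaligned loads for the matrix rows and does not overwrite w, so the caller decides what to do with the fourth component.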
By now you have a path forward: by building functions around these vectors you have gradually become familiar with the SSE instructions and how to use them. Note that the code worth optimizing is the code that runs very frequently, meaning it may be called hundreds of times per frame. That is why an SSE implementation of the vector class is justified: it is somewhat more complicated, but it pays off. That is enough about vectors. Can we turn to matrices now?

 

The following is reposted from: http://www.gameres.com/document.asp?TopicID=84488

Efficient 3D Graphics Math Library (1): Vector Overview
I've been lurking for a long time, so it is time to contribute something. I wrote this article recently and am sharing it with you:

 

I have been reading a lot of assembly lately, and looking at the assembly generated from C++ source code is simply torture. This pushed me to reimplement my entire math library in assembly, including CPUID detection and use of the extended instruction sets. I compared the results against the D3DX9 math functions and they are satisfactory. Except for matrix multiplication, which always differs from D3DXMatrixMultiply by about 7%, the rest are on par or even far ahead (maybe I am wrong; interested readers can test this themselves). My skills are limited and my benchmarking method is simple, so corrections are welcome!
First, let me introduce my Vector class. Here is the declaration:

struct __declspec(dllexport) Vector
{

/****************** Variables ******************/

    float x, y, z, w;

/****************** Constructors ******************/

    // Constructor
    Vector() {}
    // Constructor
    Vector(const float *v);
    // Constructor
    Vector(float _x, float _y, float _z, float _w);

/****************** Functions ******************/

    // Set the vector
    void SetVector(const float *v);
    // Set the vector
    void SetVector(float _x, float _y, float _z, float _w);
    // Subtraction
    void Difference(const Vector *pSrc, const Vector *pDest);
    // Negate the vector
    void Inverse();
    // Normalize to a unit vector
    void Normalize();
    // Is this a unit vector?
    bool IsNormalized();
    // Vector length (slow)
    float GetLength();
    // The square of the vector length (fast)
    float GetLengthSq();
    // Calculate the cross product of two vectors and save the result to this vector.
    void Cross(const Vector *pU, const Vector *pV);
    // Calculate the angle between two vectors
    float AngleWith(Vector &v);

/****************** Operators ******************/

    // Operator overload
    void operator+=(Vector &v);
    // Operator overload
    void operator-=(Vector &v);
    // Operator overload
    void operator*=(float v);
    // Operator overload
    void operator/=(float v);
    // Operator overload
    Vector operator+(Vector &v) const;
    // Operator overload
    Vector operator-(Vector &v) const;
    // Operator overload (dot product)
    float operator*(Vector &v) const;
    // Operator overload
    void operator*=(GaiaMatrix &m);
    // Operator overload
    Vector operator*(float f) const;
    // Operator overload
    bool operator==(Vector &v);
    // Operator overload
    bool operator!=(Vector &v);
    // Operator overload
    // void operator=(Vector &v);
};

Next come the simple inline functions:

// Constructor
inline Vector::Vector(const float *v)
    : x(v[0])
    , y(v[1])
    , z(v[2])
    , w(v[3])
{
}

// Constructor
inline Vector::Vector(float _x, float _y, float _z, float _w)
    : x(_x)
    , y(_y)
    , z(_z)
    , w(_w)
{
}

// Set the vector (w is left unchanged)
inline void Vector::SetVector(const float *v)
{
    x = v[0]; y = v[1]; z = v[2];
}

// Set the vector
inline void Vector::SetVector(float _x, float _y, float _z, float _w)
{
    x = _x; y = _y; z = _z; w = _w;
}

// Subtraction
inline void Vector::Difference(const Vector *pSrc, const Vector *pDest)
{
    x = pDest->x - pSrc->x;
    y = pDest->y - pSrc->y;
    z = pDest->z - pSrc->z;   // the original listing assigned this to x by mistake
}

// Negate the vector
inline void Vector::Inverse()
{
    x = -x; y = -y; z = -z;
}

// Is this a unit vector?
inline bool Vector::IsNormalized()
{
    return CmpFloatSame(x * x + y * y + z * z, 1.0f);
}

// Operator overload
inline void Vector::operator+=(Vector &v)
{
    x += v.x; y += v.y; z += v.z;
}
// Operator overload
inline void Vector::operator-=(Vector &v)
{
    x -= v.x; y -= v.y; z -= v.z;
}
// Operator overload
inline void Vector::operator*=(float f)
{
    x *= f; y *= f; z *= f;
}
// Operator overload
inline void Vector::operator/=(float f)
{
    f = 1.0f / f;             // one division, then three multiplications
    x *= f; y *= f; z *= f;
}
// Operator overload
inline Vector Vector::operator+(Vector &v) const
{
    return Vector(x + v.x, y + v.y, z + v.z, w);
}
// Operator overload
inline Vector Vector::operator-(Vector &v) const
{
    return Vector(x - v.x, y - v.y, z - v.z, w);
}
// Operator overload (dot product)
inline float Vector::operator*(Vector &v) const
{
    return (x * v.x + y * v.y + z * v.z);
}
// Operator overload
inline Vector Vector::operator*(float f) const
{
    return Vector(x * f, y * f, z * f, w);
}
// Operator overload: equal when every component differs by less than FLOAT_EPS
inline bool Vector::operator==(Vector &v)
{
    return ((x - v.x) < FLOAT_EPS && (x - v.x) > -FLOAT_EPS) &&
           ((y - v.y) < FLOAT_EPS && (y - v.y) > -FLOAT_EPS) &&
           ((z - v.z) < FLOAT_EPS && (z - v.z) > -FLOAT_EPS);
}
// Operator overload: not equal when any component differs by FLOAT_EPS or more
inline bool Vector::operator!=(Vector &v)
{
    return !(((x - v.x) < FLOAT_EPS && (x - v.x) > -FLOAT_EPS) &&
             ((y - v.y) < FLOAT_EPS && (y - v.y) > -FLOAT_EPS) &&
             ((z - v.z) < FLOAT_EPS && (z - v.z) > -FLOAT_EPS));
}

Several important optimizations are at work here. They also serve as general rules for writing code:

1. Use const wherever you can! The compiler can use it for optimization.
2. When returning a value, construct it directly in the return statement wherever possible. For example:
return Vector(x + v.x, y + v.y, z + v.z, w);
3. When several numbers are divided by the same value, write it the way Vector::operator/=(float f) does: compute the reciprocal once, then multiply.
4. Small functions like these must be inline!

Follow these four rules; otherwise the compiled code is terrible and the difference in performance is enormous. Keep them in mind.
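Rules 2 and 3 can be shown in isolation with a minimal sketch. Vec3 here is a hypothetical three-component type used only for illustration; it is not the article's Vector class.

```cpp
#include <cassert>

struct Vec3
{
    float x, y, z;

    Vec3(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}

    // Rule 3: one division to get the reciprocal, then three multiplications.
    // The result can differ from three true divisions in the last bit.
    void operator/=(float f)
    {
        assert(f != 0.0f);
        const float inv = 1.0f / f;
        x *= inv; y *= inv; z *= inv;
    }

    // Rule 2: build the result directly in the return statement so the
    // compiler can construct it in place instead of copying a named temporary.
    Vec3 operator+(const Vec3 &v) const
    {
        return Vec3(x + v.x, y + v.y, z + v.z);
    }
};
```

The trade-off in rule 3 is precision for speed: multiplying by a rounded reciprocal is not always bit-identical to dividing, which is usually acceptable for graphics work.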

Next are the advanced functions of Vector:

// The square of the vector length (fast)
float Vector::GetLengthSq()     // potential danger
{
    __asm
    {
        fld   dword ptr [ecx]        ; st(0) = x (relies on ecx = this)
        fmul  dword ptr [ecx]        ; st(0) = x * x
        fld   dword ptr [ecx + 4]
        fmul  dword ptr [ecx + 4]    ; st(0) = y * y
        faddp st(1), st
        fld   dword ptr [ecx + 8]
        fmul  dword ptr [ecx + 8]    ; st(0) = z * z
        faddp st(1), st              ; the sum stays in st(0) as the return value
    }
    // return x * x + y * y + z * z;
}

// Vector length (slow)
float Vector::GetLength()
{
    float f;
    if (g_bUseSSE2)
    {
        __asm
        {
            lea    ecx, f
            mov    eax, this
            mov    dword ptr [eax + 12], 0          ; w = 0.0f

            movups xmm0, [eax]
            mulps  xmm0, xmm0                       ; (x*x, y*y, z*z, 0)
            movaps xmm1, xmm0
            shufps xmm1, xmm1, 4eh                  ; shuffle: (z*z, 0, x*x, y*y)
            addps  xmm0, xmm1                       ; (x*x+z*z, y*y, x*x+z*z, y*y)
            movaps xmm1, xmm0
            shufps xmm1, xmm1, 11h                  ; shuffle: (y*y, x*x+z*z, y*y, x*x+z*z)
            addss  xmm0, xmm1                       ; lowest element = x*x + y*y + z*z

            sqrtss xmm0, xmm0                       ; square root of the lowest element
            movss  dword ptr [ecx], xmm0            ; store the lowest element to f

            mov    dword ptr [eax + 12], 3f800000h  ; w = 1.0f (3f800000h is the bit pattern of 1.0f)
        }
    }
    else
    {
        f = (float)sqrt(x * x + y * y + z * z);
    }
    return f;
}
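The shuffle-and-add sequence in GetLength is a horizontal sum. As a sketch only, the same steps can be written with intrinsics; LengthSse is a hypothetical free function, it assumes an x86 or x86-64 target, and unlike the member function it does not touch w.

```cpp
#include <xmmintrin.h>   // SSE intrinsics (x86 / x86-64 only)

// Length of (x, y, z) using the same shuffle-and-add pattern as the assembly.
float LengthSse(float x, float y, float z)
{
    const __m128 v  = _mm_set_ps(0.0f, z, y, x);   // lanes: x, y, z, 0
    const __m128 sq = _mm_mul_ps(v, v);            // (x*x, y*y, z*z, 0)

    // 0x4E swaps the low and high pairs: (z*z, 0, x*x, y*y)
    __m128 t = _mm_shuffle_ps(sq, sq, 0x4E);
    __m128 s = _mm_add_ps(sq, t);                  // (x*x+z*z, y*y, x*x+z*z, y*y)

    // 0x11 moves lane 1 into lane 0: (y*y, x*x+z*z, y*y, x*x+z*z)
    t = _mm_shuffle_ps(s, s, 0x11);
    s = _mm_add_ss(s, t);                          // lane 0 = x*x + y*y + z*z

    return _mm_cvtss_f32(_mm_sqrt_ss(s));          // square root of lane 0 only
}
```

Working on a local register copy avoids the temporary write of w = 0 to memory that the member function performs, so the caller's w is preserved.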

// Normalize to a unit vector
void Vector::Normalize()
{
    if (g_bUseSSE2)
    {
        __asm
        {
            mov    eax, this
            mov    dword ptr [eax + 12], 0          ; w = 0.0f

            movups xmm0, [eax]
            movaps xmm2, xmm0                       ; keep a copy of the vector
            mulps  xmm0, xmm0                       ; (x*x, y*y, z*z, 0)
            movaps xmm1, xmm0
            shufps xmm1, xmm1, 4eh
            addps  xmm0, xmm1
            movaps xmm1, xmm0
            shufps xmm1, xmm1, 11h
            addps  xmm0, xmm1                       ; every element = x*x + y*y + z*z

            rsqrtps xmm0, xmm0                      ; approximate 1 / sqrt in every element
            mulps  xmm2, xmm0                       ; scale the vector
            movups [eax], xmm2

            mov    dword ptr [eax + 12], 3f800000h  ; w = 1.0f
        }
    }
    else
    {
        float f = (float)sqrt(x * x + y * y + z * z);
        if (f != 0.0f)
        {
            f = 1.0f / f;
            x *= f; y *= f; z *= f;
        }
    }
}

// Calculate the cross product of two vectors and save the result to this vector.
void Vector::Cross(const Vector *pU, const Vector *pV)
{
    if (g_bUseSSE2)
    {
        __asm
        {
            mov    eax, pU
            mov    edx, pV

            movups xmm0, [eax]
            movups xmm1, [edx]
            movaps xmm2, xmm0
            movaps xmm3, xmm1

            shufps xmm0, xmm0, 0xc9     ; (u.y, u.z, u.x, u.w)
            shufps xmm1, xmm1, 0xd2     ; (v.z, v.x, v.y, v.w)
            mulps  xmm0, xmm1

            shufps xmm2, xmm2, 0xd2     ; (u.z, u.x, u.y, u.w)
            shufps xmm3, xmm3, 0xc9     ; (v.y, v.z, v.x, v.w)
            mulps  xmm2, xmm3

            subps  xmm0, xmm2

            mov    eax, this
            movups [eax], xmm0

            mov    dword ptr [eax + 12], 3f800000h  ; w = 1.0f
        }
    }
    else
    {
        x = pU->y * pV->z - pU->z * pV->y;
        y = pU->z * pV->x - pU->x * pV->z;
        z = pU->x * pV->y - pU->y * pV->x;
        w = 1.0f;
    }
}

// Operator overload
void Vector::operator*=(GaiaMatrix &m)     // potential danger
{
#ifdef _DEBUG
    assert(w == 1.0f || w == 0.0f);        // w must mark either a point (1) or a direction (0)
#endif

    if (g_bUseSSE2)
    {
        __asm
        {
            mov    ecx, this
            mov    edx, m
            movss  xmm0, [ecx]
            // lea eax, vr
            shufps xmm0, xmm0, 0        ; xmm0 = (x, x, x, x)

            movss  xmm1, [ecx + 4]
            mulps  xmm0, [edx]
            shufps xmm1, xmm1, 0        ; xmm1 = (y, y, y, y)

            movss  xmm2, [ecx + 8]
            mulps  xmm1, [edx + 16]
            shufps xmm2, xmm2, 0        ; xmm2 = (z, z, z, z)

            movss  xmm3, [ecx + 12]
            mulps  xmm2, [edx + 32]
            shufps xmm3, xmm3, 0        ; xmm3 = (w, w, w, w)

            addps  xmm0, xmm1
            mulps  xmm3, [edx + 48]

            addps  xmm0, xmm2
            addps  xmm0, xmm3           ; xmm0 = result
            movups [ecx], xmm0
            mov    dword ptr [ecx + 12], 3f800000h  ; w = 1.0f
        }

    }
    else
    {
        Vector vr;
        vr.x = x * m._11 + y * m._21 + z * m._31 + w * m._41;
        vr.y = x * m._12 + y * m._22 + z * m._32 + w * m._42;
        vr.z = x * m._13 + y * m._23 + z * m._33 + w * m._43;
        vr.w = x * m._14 + y * m._24 + z * m._34 + w * m._44;

        x = vr.x;
        y = vr.y;
        z = vr.z;
        w = 1.0f;
    }
}

// Calculate the angle between two vectors
float Vector::AngleWith(Vector &v)
{
    // angle = acos(dot / (|a| * |b|)); the stray "* 2.0f" factor in the original listing is dropped
    return (float)acosf((*this * v) / (this->GetLength() * v.GetLength()));
}

Three of these functions need some explanation: GetLengthSq, operator*=, and AngleWith.
GetLengthSq is potentially dangerous because I wrote it against the code generated by the .NET 2003 compiler. I know that ecx = this, and that a float return value is handed to the caller directly on the floating-point register stack, where the caller pops it with fstp. So I wrote it that way and did not even write a return statement! You may not be using the same compiler as I am, so once you understand the idea, implement your math library in whatever way is sound for your toolchain. All the later functions are written in a compiler-independent way.

The potential danger of the operator*= overload is that Vector is 4D and can represent either a 3D direction vector or a 3D point. For a direction vector, w = 0, and it is affected only by rotation and scaling. For a point, w = 1, and it is subject to all transformations: translation, rotation, and scaling. Direction vectors cannot be translated, and for the sake of speed the function does not check which case applies, so the caller of the math library has to pay attention to this.
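The difference between w = 0 and w = 1 is easy to see with a translation matrix. The following is a minimal scalar sketch; Vec4, Mat4, Mul and Translation are hypothetical names, and the matrix layout assumes the row-vector convention used above, with the translation stored in the fourth row (_41, _42, _43).

```cpp
struct Vec4 { float x, y, z, w; };

// Row-major 4x4 matrix; m[3][0..2] holds the translation (_41, _42, _43).
struct Mat4 { float m[4][4]; };

// Row vector times matrix, written out component by component.
Vec4 Mul(const Vec4 &v, const Mat4 &a)
{
    Vec4 r;
    r.x = v.x * a.m[0][0] + v.y * a.m[1][0] + v.z * a.m[2][0] + v.w * a.m[3][0];
    r.y = v.x * a.m[0][1] + v.y * a.m[1][1] + v.z * a.m[2][1] + v.w * a.m[3][1];
    r.z = v.x * a.m[0][2] + v.y * a.m[1][2] + v.z * a.m[2][2] + v.w * a.m[3][2];
    r.w = v.x * a.m[0][3] + v.y * a.m[1][3] + v.z * a.m[2][3] + v.w * a.m[3][3];
    return r;
}

// Builds a pure translation matrix.
Mat4 Translation(float tx, float ty, float tz)
{
    Mat4 t = {{{1.0f, 0.0f, 0.0f, 0.0f},
               {0.0f, 1.0f, 0.0f, 0.0f},
               {0.0f, 0.0f, 1.0f, 0.0f},
               {tx,   ty,   tz,   1.0f}}};
    return t;
}
```

With Translation(10, 0, 0), the point (1, 2, 3, 1) moves to (11, 2, 3, 1), while the direction (1, 2, 3, 0) is unchanged. Note that the operator*= listed earlier stores w = 1 after the multiplication, so a direction passed through it comes back marked as a point; this sketch keeps the computed w instead.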

AngleWith is not inline because I will optimize this code further in a later article. Neither GetLength nor acosf is an inline function, so I will have to expand them into assembly and reorganize the code. The D3DX9 math library does not seem to have an equivalent function, so there is nothing to compare against.
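For reference, the angle between two vectors follows from the dot product: angle = acos(a . b / (|a| |b|)). The sketch below is a plain scalar version with two safeguards that the listing does not have, a zero-length check and clamping of the cosine. Vec3, Dot and AngleBetween are hypothetical names, not part of the article's library.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

float Dot(const Vec3 &a, const Vec3 &b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Angle in radians between a and b; returns 0 if either vector has zero length.
float AngleBetween(const Vec3 &a, const Vec3 &b)
{
    const float lenProduct = std::sqrt(Dot(a, a) * Dot(b, b));   // |a| * |b|
    if (lenProduct == 0.0f)
        return 0.0f;
    // Rounding can push the cosine slightly outside [-1, 1], where acos returns NaN.
    const float c = std::max(-1.0f, std::min(1.0f, Dot(a, b) / lenProduct));
    return std::acos(c);
}
```

Computing one square root of the product of the squared lengths, instead of two separate lengths, also saves a square root per call.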

Compared with the D3DX library, the performance of the functions above is roughly as follows:
GetLengthSq is slightly faster than D3DX.
GetLength is about twice as fast as D3DX, because the D3DX library does not use SSE instructions here.
Normalize and Cross are much faster than D3DX, for the same reason: the D3DX library does not use SSE instructions here.
operator*= is about 7% slower than D3DXVec3Transform, so there may be room for improvement. On inspection, the D3DX library uses 3DNow! instructions, which turn out to be faster than SSE here, probably because of my AMD 3000+ CPU. On Intel CPUs the speeds should be almost the same.
AngleWith cannot be evaluated because there is nothing to compare it with.

After manually rescheduling many of the algorithms, I found that the order of instructions has a huge impact on performance! Be careful when changing the instruction order, and keep a copy of the original. Otherwise long stretches of assembly code will make your head spin.
By the way, here are a few points that confuse many people:
1. Code written with C++ library functions such as _mm_mov_ps() is garbage! If you want performance, never use them. Learn assembly and write the code yourself. The code those library functions generate is terrible!
2. The performance gap between movups and movaps is negligible! Do not declare your vectors or matrices as __m128 for a speedup of about 1%. You will pay for it later when you create arrays of them!
3. My testing method is very crude: loop 10 million times, measure with timeGetTime(), run it several times, and take the average. As a result, once functions are inlined in release mode, their performance cannot be measured this way. If you have time, test it yourself. I estimate that the inline functions are already close to the performance limit and are not worth further optimization.
If you doubt my results, take the code and measure it yourself. Results may differ on a different CPU. I welcome criticism from anyone.
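The measurement loop described in point 3 can be sketched portably. This version swaps in std::chrono::steady_clock for timeGetTime() and writes the accumulated result to a volatile variable so that an inlined function is not optimized away entirely. AverageNanoseconds is a hypothetical helper, and the numbers it produces are only indicative.

```cpp
#include <cassert>
#include <chrono>

// Calls fn `iterations` times and returns the average time per call in
// nanoseconds. fn takes a float and returns a float.
template <typename Fn>
double AverageNanoseconds(Fn fn, int iterations)
{
    assert(iterations > 0);
    volatile float sink = 0.0f;   // keeps the optimizer from discarding the loop
    float acc = 0.0f;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        acc += fn(static_cast<float>(i));
    const auto stop = std::chrono::steady_clock::now();

    sink = acc;                   // the accumulated value must be computed
    (void)sink;

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

As with the original method, it is sensible to run the measurement several times and average the results, since a single run is easily disturbed by other activity on the machine.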
