SSE instructions and C++ applications

Source: Internet
Author: User

SSE is the CPU instruction set that Intel introduced after MMX (a new generation at the time, although that was already some years ago). It first appeared in the Pentium III series and is now supported by the Intel Pentium III, Pentium 4, Celeron, and Xeon, the AMD Athlon and Duron, and other CPU families. The newer SSE2 instruction set is supported only by the Pentium 4 series, which is one reason this article covers SSE rather than SSE2. The other reason is that the two instruction sets are very similar: SSE2 adds only a small number of extra floating-point functions, 64-bit floating-point operations, and 64-bit integer operations.

Why is SSE faster than traditional floating-point code? It uses 128-bit storage units, each of which holds four 32-bit floating-point numbers, so an SSE calculation is carried out on four floats at a time. This batch processing naturally improves efficiency. Recall the full name of SSE: Streaming SIMD Extensions. SIMD stands for Single Instruction, Multiple Data, so the full name reads as "streaming single-instruction, multiple-data extensions". The name itself gives a good idea of how SSE works.
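As a minimal sketch of the "four floats at a time" idea, the function below adds two groups of four floats with one SSE addition. The name add4 is hypothetical and used only for illustration. It uses the unaligned load and store intrinsics, so it makes no assumption about how the caller's arrays are aligned.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// add4 is a hypothetical helper, not part of any library.
// It computes out[k] = a[k] + b[k] for k = 0..3 using one SSE addition.
void add4(const float* a, const float* b, float* out)
{
    __m128 va  = _mm_loadu_ps(a);      // load four floats (no alignment required)
    __m128 vb  = _mm_loadu_ps(b);      // load four more
    __m128 sum = _mm_add_ps(va, vb);   // four additions in a single operation
    _mm_storeu_ps(out, sum);           // write the four results back
}
```

Calling add4 once replaces four separate scalar additions, which is where the potential speedup comes from.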

In theory SSE is faster than traditional floating-point code, but it has a number of limitations. First, although one SSE operation does the work of four traditional floating-point operations, the speedup is smaller than you might imagine. To see the benefit you need what the word "streaming" implies: a large stream of data for SIMD to work through. Second, SSE operates on groups of four 32-bit floats (128 bits in total), that is, float[4] in C and C++, and the group must be aligned on a 16-byte boundary (the code for this is shown later; for the concept of boundary alignment itself, see other articles on the forum, as I will not go into it here). This makes input and output considerably more troublesome. In practice, what hurts SSE performance most is the constant copying of data into the layout it requires.
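The alignment requirement can be sketched as follows. This is an illustration under assumptions, not the article's own code: is_aligned16 and double4 are hypothetical names, and the sketch relies on the fact that SSE offers both aligned (_mm_load_ps, _mm_store_ps) and unaligned (_mm_loadu_ps, _mm_storeu_ps) load and store intrinsics. The aligned forms require a 16-byte boundary; the unaligned forms accept any address, traditionally at some cost in speed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdint>      // std::uintptr_t

// Hypothetical helper: true if p lies on a 16-byte boundary.
bool is_aligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) % 16) == 0;
}

// Hypothetical helper: doubles the four floats starting at p,
// choosing the aligned or unaligned load/store based on the address.
void double4(float* p)
{
    const __m128 two = _mm_set_ps1(2.0f);
    if (is_aligned16(p)) {
        // Aligned path: p must be on a 16-byte boundary.
        _mm_store_ps(p, _mm_mul_ps(_mm_load_ps(p), two));
    } else {
        // Unaligned path: works for any address.
        _mm_storeu_ps(p, _mm_mul_ps(_mm_loadu_ps(p), two));
    }
}
```

With a C++11 compiler, a buffer declared as alignas(16) float buf[8] takes the aligned path for buf and the unaligned path for buf + 1. On VC7.1 the equivalent declaration uses __declspec(align(16)).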

I am a C++ programmer and not very familiar with assembly, but I want to use SSE to optimize my programs. What should I do? Fortunately, VC++ .NET wraps the instructions in convenient C-level functions and C-style data types. We only need to declare variables and call functions, just as in ordinary C++ code, to make good use of the SSE instructions.

Of course, we need to include a header file, which includes the data types and function declarations we need:

#include <xmmintrin.h>
 
There is only one standard data type for SSE operations, namely:

__m128, which is defined as follows:

typedef struct __declspec(intrin_type) __declspec(align(16)) __m128 {
    float m128_f32[4];
} __m128;
 
Put more simply, you can think of it as:

struct __m128
{
    float m128_f32[4];
};
 
For example, to define an __m128 variable and assign four float values to it, you can write:

__m128 S1 = { 1.0f, 2.0f, 3.0f, 4.0f };
 
To change element 2 (counting from 0), you can write:

S1.m128_f32[2] = 6.0f;
 
In addition, we will use several assignment functions that make this data structure easier to work with:

S1 = _mm_set_ps1(2.0f);
 
This sets all four elements of S1.m128_f32 to 2.0f, which is much faster than assigning them one by one.

S1 = _mm_setzero_ps();
 
This sets all four floating-point numbers in S1 to zero.
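The assignment functions above can be exercised with a short sketch. The helper names fill4, zero4, and set4 are hypothetical. Because the m128_f32 member is specific to the Microsoft compiler, this sketch reads results back by storing them into an ordinary float array, which should also work with other compilers that provide xmmintrin.h.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: writes `value` into out[0..3] using a broadcast.
void fill4(float* out, float value)
{
    _mm_storeu_ps(out, _mm_set_ps1(value));
}

// Hypothetical helper: writes 0.0f into out[0..3].
void zero4(float* out)
{
    _mm_storeu_ps(out, _mm_setzero_ps());
}

// Hypothetical helper: writes four distinct values into out[0..3].
// Note that _mm_set_ps takes its arguments from the highest element
// down to the lowest, so they are passed in reverse order here.
void set4(float* out, float e0, float e1, float e2, float e3)
{
    _mm_storeu_ps(out, _mm_set_ps(e3, e2, e1, e0));
}
```

The reversed argument order of _mm_set_ps is easy to get wrong, which is one reason the brace initializer or element-by-element assignment can be the clearer choice.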

There are other assignment functions, but they are no more convenient than assigning the elements one by one and are only needed for special purposes. For more information, see the "Streaming SIMD Extensions (SSE)" chapter under MSDN -> Visual C++ Reference -> C/C++ Language -> C++ Language Reference -> Compiler Intrinsics -> MMX, SSE, and SSE2 Intrinsics.

Generally, the name of every SSE instruction function consists of three parts separated by underscores:

_mm_set_ps1
 
mm indicates the multimedia extension instruction set.

set is an abbreviation of what the function does.

ps1 describes how the function affects the result variable. It is built from two letters. The first letter describes the effect on the result: p (packed) means the result is treated as a group of data in which every element takes part in the operation, while s (scalar) means only the first element of the result takes part. The second letter gives the data type involved: s means 32-bit floating point, d means 64-bit floating point, i32 means 32-bit integer, and i64 means 64-bit integer. Because SSE supports only 32-bit floating point, you will probably not find any modifier other than s among these wrapper functions, but you will meet the others in the MMX and SSE2 instructions.
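The difference between the ps and ss suffixes can be seen by comparing _mm_mul_ps with _mm_mul_ss. The sketch below is illustrative only, and the wrapper names mul_packed and mul_scalar are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical wrapper: "ps" form, multiplies all four element pairs.
void mul_packed(const float* a, const float* b, float* out)
{
    _mm_storeu_ps(out, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// Hypothetical wrapper: "ss" form, multiplies only the first element pair.
// The other three elements of the result are copied unchanged from a.
void mul_scalar(const float* a, const float* b, float* out)
{
    _mm_storeu_ps(out, _mm_mul_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

For a = {1, 2, 3, 4} and b = {10, 10, 10, 10}, mul_packed produces {10, 20, 30, 40}, while mul_scalar produces {10, 2, 3, 4}.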

Next, I will use an example to show how the SSE instruction functions are used. Note that the following code was written on the VC7.1 platform and is not guaranteed to be fully compatible with other development platforms such as Dev-C++ and Borland C++.

To make the speed comparison easy, I implement the algorithm both the conventional way and with SSE optimization, and time both versions with a small timing class, CTimer.

The algorithm scales a set of float values. The ScaleValue1 function is optimized with SSE instructions, while the ScaleValue2 function is not. We test both with a float array of 10000 elements, running each function 100000 times (the loop count used in the code). The following is the test program and result:

#include <xmmintrin.h>
#include <windows.h>
#include <string.h>   // memset
#include <iostream>   // cout, endl

using namespace std;

// Simple stopwatch built on the Windows high-resolution performance counter.
class CTimer
{
public:
    __forceinline CTimer(void)
    {
        QueryPerformanceFrequency(&m_Frequency);
        QueryPerformanceCounter(&m_StartCount);
    }

    // Restart timing from now.
    __forceinline void Reset(void)
    {
        QueryPerformanceCounter(&m_StartCount);
    }

    // Seconds elapsed since construction or the last Reset().
    __forceinline double End(void)
    {
        static __int64 nCurCount;
        QueryPerformanceCounter((PLARGE_INTEGER)&nCurCount);
        return double(nCurCount - *(__int64*)&m_StartCount) / double(*(__int64*)&m_Frequency);
    }

private:
    LARGE_INTEGER m_Frequency;
    LARGE_INTEGER m_StartCount;
};

// SSE version: scales four floats per iteration.
// pArray must be 16-byte aligned. Elements past the last full group of four are not processed.
void ScaleValue1(float* pArray, DWORD dwCount, float fScale)
{
    DWORD dwGroupCount = dwCount / 4;
    __m128 e_Scale = _mm_set_ps1(fScale);
    for (DWORD i = 0; i < dwGroupCount; i++)
    {
        *(__m128*)(pArray + i * 4) = _mm_mul_ps(*(__m128*)(pArray + i * 4), e_Scale);
    }
}

// Conventional version: scales one float per iteration.
void ScaleValue2(float* pArray, DWORD dwCount, float fScale)
{
    for (DWORD i = 0; i < dwCount; i++)
    {
        pArray[i] *= fScale;
    }
}

#define ARRAYCOUNT 10000

int __cdecl main()
{
    float __declspec(align(16)) Array[ARRAYCOUNT];
    memset(Array, 0, sizeof(float) * ARRAYCOUNT);
    CTimer t;
    double dTime;

    t.Reset();
    for (int i = 0; i < 100000; i++)
    {
        ScaleValue1(Array, ARRAYCOUNT, 1000.0f);
    }
    dTime = t.End();
    cout << "Use SSE:" << dTime << "seconds" << endl;

    t.Reset();

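    for (int i = 0; i < 100000; i++)
    {
        ScaleValue2(Array, ARRAYCOUNT, 1000.0f);
    }
    // The source text breaks off inside the loop above. The ScaleValue2 call,
    // the output label below, and the end of main() are assumed by symmetry
    // with the SSE loop; the original results are not available.
    dTime = t.End();
    cout << "Not use SSE:" << dTime << "seconds" << endl;
    return 0;
}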

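Two properties of ScaleValue1 are worth keeping in mind: it processes only dwCount / 4 full groups, so up to three trailing elements are left unscaled when the count is not a multiple of four, and it casts the pointer to __m128*, which assumes 16-byte alignment. The following is a hedged sketch of one way to relax both constraints, not the author's method. The name scale_with_tail is hypothetical; it uses unaligned loads and stores and finishes the leftover elements with ordinary scalar code.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>      // std::size_t

// Hypothetical variant of the scaling routine.
// Works for any count and any alignment of `data`.
void scale_with_tail(float* data, std::size_t count, float scale)
{
    const __m128 factor = _mm_set_ps1(scale);  // broadcast the scale to all four lanes
    std::size_t i = 0;

    // Main loop: four floats per iteration, unaligned load and store.
    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, factor));
    }

    // Tail loop: the remaining 0 to 3 elements, one at a time.
    for (; i < count; ++i) {
        data[i] *= scale;
    }
}
```

Whether the unaligned forms cost anything measurable depends on the CPU, so timing both variants on the target machine is the only reliable way to compare them.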