Using SSE Instruction Set Acceleration in C++ Code

Source: Internet
Author: User
Tags scalar

To use SSE instructions, you first need to understand the instructions that load data into the registers and store register contents back to memory.

Most SSE instructions operate on the XMM registers (XMM0 to XMM7, plus XMM8 to XMM15 in 64-bit mode), so data must be loaded from memory into these registers before use.

1. Load series: load data from memory into a register

__m128 _mm_load_ss (float *p)
__m128 _mm_load_ps (float *p)
__m128 _mm_load1_ps (float *p)
__m128 _mm_loadh_pi (__m128 a, __m64 *p)
__m128 _mm_loadl_pi (__m128 a, __m64 *p)
__m128 _mm_loadr_ps (float *p)
__m128 _mm_loadu_ps (float *p)

These are the load-series functions listed in the manual. In detail:

_mm_load_ss performs a scalar load: it loads one single-precision floating-point value into the lowest element of the register and clears the other three elements to zero (r0 := *p, r1 := r2 := r3 := 0.0).

_mm_load_ps performs a packed load (as do all of the functions below). The address p must be 16-byte aligned; otherwise the load may fault or read incorrectly (r0 := p[0], r1 := p[1], r2 := p[2], r3 := p[3]).

_mm_load1_ps loads the value at address p into all four elements of the register. It takes multiple instructions to complete, so for performance reasons avoid it in inner loops (r0 := r1 := r2 := r3 := *p).

_mm_loadh_pi and _mm_loadl_pi load two values from p into the high or low half of the result, respectively, and take the other half from the first argument. See the manual for details.

_mm_loadr_ps loads in the reverse order of _mm_load_ps and takes more than one instruction to complete. The address must again be 16-byte aligned (r0 := p[3], r1 := p[2], r2 := p[1], r3 := p[0]).

_mm_loadu_ps loads the same way as _mm_load_ps but does not require the address to be 16-byte aligned. The corresponding instruction is movups.
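As a rough illustration of the element layouts described above, the sketch below applies several load functions to the same four floats and copies each result back out for inspection. The names lanes, LoadDemo, and load_demo are made up for this example, and alignas(16) (C++11) is used here only to satisfy the alignment requirement of _mm_load_ps and _mm_loadr_ps.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <array>

// Hypothetical helper: copy the four elements r0..r3 of v into an array.
std::array<float, 4> lanes(__m128 v) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), v);  // unaligned store, valid for any address
    return out;
}

// Hypothetical demo type: one result per load function.
struct LoadDemo {
    std::array<float, 4> ss, ps, load1, loadr, loadu, loadh;
};

LoadDemo load_demo() {
    alignas(16) float p[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // 16-byte aligned
    float q[5] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f};        // q + 1 need not be aligned

    LoadDemo d;
    d.ss    = lanes(_mm_load_ss(p));       // r0 = p[0], others zero
    d.ps    = lanes(_mm_load_ps(p));       // r0..r3 = p[0..3]
    d.load1 = lanes(_mm_load1_ps(p));      // all four elements = p[0]
    d.loadr = lanes(_mm_loadr_ps(p));      // r0..r3 = p[3..0]
    d.loadu = lanes(_mm_loadu_ps(q + 1));  // same as load_ps, no alignment needed
    // _mm_loadh_pi keeps the low half of its first argument and loads
    // two floats from memory into the high half.
    d.loadh = lanes(_mm_loadh_pi(_mm_setzero_ps(),
                                 reinterpret_cast<const __m64*>(p)));
    return d;
}
```

Calling load_demo() and printing each array is a quick way to check the r0 to r3 formulas given in the text against what your compiler actually produces.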

2. Set series: also used to load data. Most of these functions take more than one instruction, but they do not require 16-byte alignment.

__m128 _mm_set_ss (float w)
__m128 _mm_set_ps (float z, float y, float x, float w)
__m128 _mm_set1_ps (float w)
__m128 _mm_setr_ps (float z, float y, float x, float w)
__m128 _mm_setzero_ps ()

These functions work much like the load operations but may compile to multiple instructions. Their advantage is that you do not have to think about alignment.

_mm_set_ss corresponds to _mm_load_ss. It does not require alignment and takes multiple instructions (r0 = w, r1 = r2 = r3 = 0.0).

_mm_set_ps corresponds to _mm_load_ps. Its parameters are four separate single-precision floating-point values, so it does not require alignment, and it takes more than one instruction (r0 = w, r1 = x, r2 = y, r3 = z; note the order).

_mm_set1_ps corresponds to _mm_load1_ps. It does not require alignment and takes multiple instructions (r0 = r1 = r2 = r3 = w).

_mm_setzero_ps clears the register to zero and needs only one instruction (r0 = r1 = r2 = r3 = 0.0).
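The argument order of _mm_set_ps is easy to get backwards, so a small sketch may help. It stores the result of each set function to memory so the elements r0 to r3 can be read as out[0] to out[3]. SetDemo and set_demo are hypothetical names used only here.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical demo type: elements r0..r3 produced by each set function.
struct SetDemo {
    float ss[4], ps[4], setr[4], set1[4], zero[4];
};

SetDemo set_demo() {
    SetDemo d;
    _mm_storeu_ps(d.ss, _mm_set_ss(5.0f));  // r0 = 5, others zero
    // _mm_set_ps takes its arguments from the highest element down to r0,
    // so the LAST argument lands in r0 ...
    _mm_storeu_ps(d.ps, _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f));
    // ... while _mm_setr_ps takes them in memory order, FIRST argument in r0.
    _mm_storeu_ps(d.setr, _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f));
    _mm_storeu_ps(d.set1, _mm_set1_ps(7.0f));  // all four elements = 7
    _mm_storeu_ps(d.zero, _mm_setzero_ps());   // all four elements = 0
    return d;
}
```

With these arguments, d.ps and d.setr end up holding the same values in the same order, which is the point of the "note the order" remark above.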

3. Store series: store the contents of an SSE register, such as a calculation result, to memory.

void _mm_store_ss (float *p, __m128 a)
void _mm_store_ps (float *p, __m128 a)
void _mm_store1_ps (float *p, __m128 a)
void _mm_storeh_pi (__m64 *p, __m128 a)
void _mm_storel_pi (__m64 *p, __m128 a)
void _mm_storer_ps (float *p, __m128 a)
void _mm_storeu_ps (float *p, __m128 a)
void _mm_stream_ps (float *p, __m128 a)

These functions mirror the load-series functions and are essentially the reverse process.

_mm_store_ss: one instruction, *p = a0.

_mm_store_ps: one instruction, p[i] = a[i]; p must be 16-byte aligned.

_mm_store1_ps: multiple instructions, p[i] = a0.

_mm_storeh_pi, _mm_storel_pi: store the high or low half of a to p.

_mm_storer_ps: stores in reverse order, multiple instructions.

_mm_storeu_ps: one instruction, p[i] = a[i]; does not require 16-byte alignment.

_mm_stream_ps: writes directly to memory, bypassing the cache.
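The following sketch runs most of the store functions on one register holding a0..a3 = 1, 2, 3, 4. StoreDemo and store_demo are hypothetical names, and the alignas(16) members (C++11) are an assumption made here so that the aligned stores have valid destinations.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical demo type: one destination per store function.
struct StoreDemo {
    alignas(16) float ps[4];      // written by _mm_store_ps  (needs alignment)
    alignas(16) float store1[4];  // written by _mm_store1_ps (needs alignment)
    alignas(16) float storer[4];  // written by _mm_storer_ps (needs alignment)
    float ss;                     // written by _mm_store_ss
    float hi[2];                  // written by _mm_storeh_pi
    float lo[2];                  // written by _mm_storel_pi
};

StoreDemo store_demo() {
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  // a0..a3 = 1, 2, 3, 4

    StoreDemo d;
    _mm_store_ps(d.ps, a);        // p[i] = a[i]
    _mm_store1_ps(d.store1, a);   // p[i] = a0
    _mm_storer_ps(d.storer, a);   // p[i] = a[3 - i]
    _mm_store_ss(&d.ss, a);       // *p = a0
    _mm_storeh_pi(reinterpret_cast<__m64*>(d.hi), a);  // high two elements
    _mm_storel_pi(reinterpret_cast<__m64*>(d.lo), a);  // low two elements
    return d;
}
```

_mm_stream_ps is left out because its effect on the cache is not visible in the stored values; it takes the same arguments as _mm_store_ps and also needs a 16-byte aligned address.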

(2) Arithmetic instructions

SSE provides a large number of floating-point instructions, including addition, subtraction, multiplication, division, square root, maximum, minimum, approximate reciprocal, and approximate reciprocal square root, which shows how powerful the instruction set is. Once you know the load and store functions above, these arithmetic instructions are easy to use. Addition serves as the example here.

The instructions for floating-point addition in SSE are:

__m128 _mm_add_ss (__m128 a, __m128 b)
__m128 _mm_add_ps (__m128 a, __m128 b)

Here _mm_add_ss is the scalar form and _mm_add_ps is the packed form.
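To make the scalar and packed forms concrete, here is a minimal sketch that applies both to the same inputs. The wrapper names add_scalar and add_packed are invented for this example; unaligned loads and stores are used so the callers need no special alignment.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Scalar form: only element 0 is added; elements 1..3 are copied from a.
void add_scalar(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// Packed form: all four elements are added in parallel.
void add_packed(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_scalar yields {11, 2, 3, 4} while add_packed yields {11, 22, 33, 44}.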

In general, code written with SSE instructions follows three steps: use the load/set functions to bring data from memory into SSE registers, use the relevant SSE instructions to do the calculation, and use the store functions to write the results from the registers back to memory for later use.
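The same three steps extend naturally to arrays longer than four elements. The sketch below is one possible way to do it, not the only one: add_arrays is a hypothetical helper that processes four floats per iteration and falls back to plain scalar code for any leftover elements, and it uses the unaligned load/store functions so it makes no assumption about the caller's buffers.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* dst, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // step 1: load
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_add_ps(va, vb);   // step 2: calculate
        _mm_storeu_ps(dst + i, vc);       // step 3: store
    }
    for (; i < n; ++i) {                  // scalar tail when n is not a multiple of 4
        dst[i] = a[i] + b[i];
    }
}
```

If the buffers are known to be 16-byte aligned, the aligned variants _mm_load_ps and _mm_store_ps could be substituted in the main loop.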

Here is an example of completing the addition:

    #include <intrin.h>  // MSVC header that provides the SSE intrinsics
    #include <stdio.h>   // printf

    int main(int argc, char* argv[])
    {
        float op1[4] = {1.0, 2.0, 3.0, 4.0};
        float op2[4] = {1.0, 2.0, 3.0, 4.0};
        float result[4];

        __m128 a;
        __m128 b;
        __m128 c;

        // Load
        a = _mm_loadu_ps(op1);
        b = _mm_loadu_ps(op2);

        // Calculate
        c = _mm_add_ps(a, b);  // c = a + b

        // Store
        _mm_storeu_ps(result, c);

        /* // Using the __m128 union to get the result.
        printf("0: %lf\n", c.m128_f32[0]);
        printf("1: %lf\n", c.m128_f32[1]);
        printf("2: %lf\n", c.m128_f32[2]);
        printf("3: %lf\n", c.m128_f32[3]);
        */
        printf("0: %lf\n", result[0]);
        printf("1: %lf\n", result[1]);
        printf("2: %lf\n", result[2]);
        printf("3: %lf\n", result[3]);

        return 0;
    }

This is similar to the addition example written earlier. The commented-out printf section reads the values directly through the __m128 data type.

In MSVC this type is a union; see the relevant header file for its definition. In practice, however, the value is often an intermediate result that later calculations need, and in that case you should use a store function, which is more efficient.

The code above uses _mm_loadu_ps and _mm_storeu_ps, which do not require alignment. If you switch to _mm_load_ps and _mm_store_ps without aligning the data, the program may crash or produce incorrect results. The following version specifies the alignment explicitly:

    #include <intrin.h>  // MSVC header that provides the SSE intrinsics
    #include <stdio.h>   // printf

    int main(int argc, char* argv[])
    {
        __declspec(align(16)) float op1[4] = {1.0, 2.0, 3.0, 4.0};
        __declspec(align(16)) float op2[4] = {1.0, 2.0, 3.0, 4.0};
        _MM_ALIGN16 float result[4];  // A macro, same as "__declspec(align(16))"

        __m128 a;
        __m128 b;
        __m128 c;

        // Load (aligned)
        a = _mm_load_ps(op1);
        b = _mm_load_ps(op2);

        // Calculate
        c = _mm_add_ps(a, b);  // c = a + b

        // Store (aligned)
        _mm_store_ps(result, c);

        /* // Using the __m128 union to get the result.
        printf("0: %lf\n", c.m128_f32[0]);
        printf("1: %lf\n", c.m128_f32[1]);
        printf("2: %lf\n", c.m128_f32[2]);
        printf("3: %lf\n", c.m128_f32[3]);
        */
        printf("0: %lf\n", result[0]);
        printf("1: %lf\n", result[1]);
        printf("2: %lf\n", result[2]);
        printf("3: %lf\n", result[3]);

        return 0;
    }
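The __declspec alignment syntax and the _MM_ALIGN16 macro are specific to MSVC. As a hedged alternative that is not part of the original article, the sketch below checks alignment at run time before using the aligned load and store functions; callers on any C++11 compiler can declare their buffers with alignas(16). The names is_aligned16 and add4_aligned are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdint>

// Hypothetical helper: true if p lies on a 16-byte boundary.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: result[i] = a[i] + b[i] for i in 0..3 using the
// aligned load/store functions. Returns false and leaves result untouched
// if any pointer is not 16-byte aligned.
bool add4_aligned(const float* a, const float* b, float* result) {
    if (!is_aligned16(a) || !is_aligned16(b) || !is_aligned16(result)) {
        return false;
    }
    _mm_store_ps(result, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
    return true;
}
```

A caller would declare its arrays as, for example, alignas(16) float op1[4] and pass them in; passing op1 + 1 instead makes the function return false rather than risk a fault.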

Original address: http://blog.csdn.net/gengshenghong/article/details/7011373
