Using SSE Instruction Set Acceleration in C++ Code

Last Update:2016-02-19 Source: Internet

Author: User

To use SSE instructions, you first need to understand the class of instructions that load data into registers and save register data back to memory.

Most SSE instructions operate on the XMM registers (XMM0 to XMM7; 64-bit mode adds XMM8 to XMM15), so data must be loaded from memory into these registers before use.

1. Load series: loads data from memory into a register

    __m128 _mm_load_ss (float *p)
    __m128 _mm_load_ps (float *p)
    __m128 _mm_load1_ps (float *p)
    __m128 _mm_loadh_pi (__m128 a, __m64 *p)
    __m128 _mm_loadl_pi (__m128 a, __m64 *p)
    __m128 _mm_loadr_ps (float *p)
    __m128 _mm_loadu_ps (float *p)

These are the load series functions as listed in the manual. Of these:

_mm_load_ss is a scalar load: it loads one single-precision floating-point number into the lowest element of the register and clears the other three elements to 0. (r0 := *p, r1 := r2 := r3 := 0.0)

_mm_load_ps is a packed load (as are all of the following). The address p must be 16-byte aligned; otherwise the load fails, typically by crashing the program. (r0 := p[0], r1 := p[1], r2 := p[2], r3 := p[3])

_mm_load1_ps loads the value at address p into all four elements of the register. It takes multiple instructions to complete, so for performance reasons avoid it in inner loops. (r0 := r1 := r2 := r3 := *p)

_mm_loadh_pi and _mm_loadl_pi combine their two parameters: they load 64 bits from p into the high or low half of the result, respectively, and take the other half from a. See the manual for details.

_mm_loadr_ps loads in the reverse order of _mm_load_ps. It takes more than one instruction to complete, and the address must likewise be 16-byte aligned. (r0 := p[3], r1 := p[2], r2 := p[1], r3 := p[0])

_mm_loadu_ps loads the same way as _mm_load_ps but does not require the address to be 16-byte aligned; the corresponding instruction is movups.
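The element layouts described above can be checked with a short sketch. It assumes a C++11 compiler on an x86 target with SSE enabled and the <xmmintrin.h> header; Lanes, lanes_of and check_load_series are hypothetical names introduced only for this illustration, and the asserts state the layout each load is expected to produce.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (MSVC, GCC, Clang)

// Hypothetical helper, not part of SSE: copies the four elements of v into
// a plain array so they can be inspected; r[0] is the lowest element (r0).
struct Lanes { float r[4]; };

static Lanes lanes_of(__m128 v) {
    Lanes out;
    _mm_storeu_ps(out.r, v);
    return out;
}

// Exercises each load intrinsic; the asserts document the expected elements.
static void check_load_series() {
    alignas(16) float p[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // 16-byte aligned source
    float pair[2] = {9.0f, 8.0f};                       // 64 bits for loadh/loadl

    Lanes ss = lanes_of(_mm_load_ss(p));    // r0 = p[0], r1 = r2 = r3 = 0
    assert(ss.r[0] == 1.0f && ss.r[1] == 0.0f && ss.r[2] == 0.0f && ss.r[3] == 0.0f);

    Lanes ps = lanes_of(_mm_load_ps(p));    // r0..r3 = p[0..3]; p must be aligned
    assert(ps.r[0] == 1.0f && ps.r[1] == 2.0f && ps.r[2] == 3.0f && ps.r[3] == 4.0f);

    Lanes l1 = lanes_of(_mm_load1_ps(p));   // all four elements = p[0]
    assert(l1.r[0] == 1.0f && l1.r[1] == 1.0f && l1.r[2] == 1.0f && l1.r[3] == 1.0f);

    Lanes lr = lanes_of(_mm_loadr_ps(p));   // reversed; p must be aligned
    assert(lr.r[0] == 4.0f && lr.r[1] == 3.0f && lr.r[2] == 2.0f && lr.r[3] == 1.0f);

    // loadh_pi: low two elements come from the first argument, high two from memory.
    Lanes lh = lanes_of(_mm_loadh_pi(_mm_load_ps(p), reinterpret_cast<const __m64*>(pair)));
    assert(lh.r[0] == 1.0f && lh.r[1] == 2.0f && lh.r[2] == 9.0f && lh.r[3] == 8.0f);

    // loadl_pi: low two elements come from memory, high two from the first argument.
    Lanes ll = lanes_of(_mm_loadl_pi(_mm_load_ps(p), reinterpret_cast<const __m64*>(pair)));
    assert(ll.r[0] == 9.0f && ll.r[1] == 8.0f && ll.r[2] == 3.0f && ll.r[3] == 4.0f);

    // loadu_ps accepts any address; buf + 1 is not guaranteed to be 16-byte aligned.
    float buf[5] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f};
    Lanes lu = lanes_of(_mm_loadu_ps(buf + 1));
    assert(lu.r[0] == 1.0f && lu.r[3] == 4.0f);
}
```

Calling check_load_series() from main runs every load once. The alignas(16) specifier is used here only to satisfy the alignment requirement of _mm_load_ps and _mm_loadr_ps.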

2. Set series: also loads data; most of these functions need more than one instruction, but they do not require 16-byte alignment.

    __m128 _mm_set_ss (float w)
    __m128 _mm_set_ps (float z, float y, float x, float w)
    __m128 _mm_set1_ps (float w)
    __m128 _mm_setr_ps (float z, float y, float x, float w)
    __m128 _mm_setzero_ps ()

These functions are similar to the load operations but may take multiple instructions to complete. Their convenience is that you do not have to think about alignment.

_mm_set_ss corresponds to _mm_load_ss; it does not require alignment and takes multiple instructions. (r0 = w, r1 = r2 = r3 = 0.0)

_mm_set_ps corresponds to _mm_load_ps; its parameters are four separate single-precision floating-point numbers, so it does not require alignment and takes more than one instruction. (r0 = w, r1 = x, r2 = y, r3 = z; note the order)

_mm_set1_ps corresponds to _mm_load1_ps; it does not require alignment and takes multiple instructions. (r0 = r1 = r2 = r3 = w)

_mm_setzero_ps clears the register to 0 and needs only one instruction. (r0 = r1 = r2 = r3 = 0.0)
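The argument order of _mm_set_ps is easy to get wrong, so a small check may help. This is a sketch assuming a C++11 compiler with SSE enabled; Lanes, lanes_of and check_set_series are hypothetical names used only here. It also shows _mm_setr_ps, which takes the same arguments but places them in the reverse order.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (MSVC, GCC, Clang)

// Hypothetical helper, not part of SSE: r[0] is the lowest element (r0).
struct Lanes { float r[4]; };

static Lanes lanes_of(__m128 v) {
    Lanes out;
    _mm_storeu_ps(out.r, v);
    return out;
}

// The asserts document where each argument ends up.
static void check_set_series() {
    // _mm_set_ps(z, y, x, w): the LAST argument lands in r0.
    Lanes s = lanes_of(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f));
    assert(s.r[0] == 1.0f && s.r[1] == 2.0f && s.r[2] == 3.0f && s.r[3] == 4.0f);

    // _mm_setr_ps(z, y, x, w): the FIRST argument lands in r0.
    Lanes r = lanes_of(_mm_setr_ps(4.0f, 3.0f, 2.0f, 1.0f));
    assert(r.r[0] == 4.0f && r.r[1] == 3.0f && r.r[2] == 2.0f && r.r[3] == 1.0f);

    // _mm_set1_ps(w): every element is w.
    Lanes b = lanes_of(_mm_set1_ps(7.0f));
    assert(b.r[0] == 7.0f && b.r[1] == 7.0f && b.r[2] == 7.0f && b.r[3] == 7.0f);

    // _mm_set_ss(w): r0 = w, the other elements are 0.
    Lanes ss = lanes_of(_mm_set_ss(7.0f));
    assert(ss.r[0] == 7.0f && ss.r[1] == 0.0f && ss.r[2] == 0.0f && ss.r[3] == 0.0f);

    // _mm_setzero_ps(): every element is 0.
    Lanes z = lanes_of(_mm_setzero_ps());
    assert(z.r[0] == 0.0f && z.r[1] == 0.0f && z.r[2] == 0.0f && z.r[3] == 0.0f);
}
```

A practical consequence: _mm_setr_ps(a, b, c, d) matches the memory order of a float array {a, b, c, d} loaded with _mm_loadu_ps, while _mm_set_ps needs the arguments reversed.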

3. Store series: stores the data in SSE registers, such as calculation results, to memory.

    void _mm_store_ss (float *p, __m128 a)
    void _mm_store_ps (float *p, __m128 a)
    void _mm_store1_ps (float *p, __m128 a)
    void _mm_storeh_pi (__m64 *p, __m128 a)
    void _mm_storel_pi (__m64 *p, __m128 a)
    void _mm_storer_ps (float *p, __m128 a)
    void _mm_storeu_ps (float *p, __m128 a)
    void _mm_stream_ps (float *p, __m128 a)

These functions mirror the load series; each is essentially the reverse operation.

_mm_store_ss: one instruction, *p = a0.

_mm_store_ps: one instruction, p[i] = a[i].

_mm_store1_ps: multiple instructions, p[i] = a0.

_mm_storeh_pi, _mm_storel_pi: store the high or low two elements of a, respectively.

_mm_storer_ps: stores in reverse order, multiple instructions.

_mm_storeu_ps: one instruction, p[i] = a[i]; does not require 16-byte alignment.

_mm_stream_ps: writes directly to memory, bypassing the cache so that cached data is not displaced.
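As a rough illustration of the store series, the sketch below writes one register to memory with each function and asserts what ends up there. It assumes a C++11 compiler with SSE enabled; check_store_series is a hypothetical name. The aligned buffer is used for _mm_store_ps, _mm_store1_ps, _mm_storer_ps and _mm_stream_ps, all of which expect a 16-byte-aligned address.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics, including _mm_sfence

// Stores one register with each store intrinsic; the asserts document the result.
static void check_store_series() {
    const __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  // a0=1, a1=2, a2=3, a3=4
    alignas(16) float p[4] = {0.0f, 0.0f, 0.0f, 0.0f};      // 16-byte aligned target

    _mm_store_ss(p, a);        // *p = a0; p[1..3] are left untouched
    assert(p[0] == 1.0f && p[1] == 0.0f);

    _mm_store_ps(p, a);        // p[i] = a[i]
    assert(p[0] == 1.0f && p[1] == 2.0f && p[2] == 3.0f && p[3] == 4.0f);

    _mm_store1_ps(p, a);       // p[i] = a0
    assert(p[0] == 1.0f && p[1] == 1.0f && p[2] == 1.0f && p[3] == 1.0f);

    _mm_storer_ps(p, a);       // reverse order
    assert(p[0] == 4.0f && p[1] == 3.0f && p[2] == 2.0f && p[3] == 1.0f);

    float half[2] = {0.0f, 0.0f};
    _mm_storeh_pi(reinterpret_cast<__m64*>(half), a);  // high two elements: a2, a3
    assert(half[0] == 3.0f && half[1] == 4.0f);
    _mm_storel_pi(reinterpret_cast<__m64*>(half), a);  // low two elements: a0, a1
    assert(half[0] == 1.0f && half[1] == 2.0f);

    float u[5] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    _mm_storeu_ps(u + 1, a);   // u + 1 is not guaranteed to be 16-byte aligned
    assert(u[0] == 0.0f && u[1] == 1.0f && u[4] == 4.0f);

    _mm_stream_ps(p, a);       // non-temporal store, p[i] = a[i]
    _mm_sfence();              // make the streaming store globally visible
    assert(p[0] == 1.0f && p[3] == 4.0f);
}
```

The _mm_sfence call after _mm_stream_ps is a common precaution when other threads will read the data; it is not needed for the single-threaded check itself.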

(2) Arithmetic instructions

SSE provides a large number of floating-point arithmetic instructions, including addition, subtraction, multiplication, division, square root, maximum, minimum, approximate reciprocal, and approximate reciprocal square root, which shows how capable the instruction set is. Once you know the load and store instructions above, these arithmetic instructions are easy to use. Addition serves as the example here.
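Before turning to addition, here is a brief sketch of several of the other operations just listed. It assumes a C++11 compiler with SSE enabled; Lanes, lanes_of and check_arithmetic are hypothetical names for this illustration. The reciprocal and reciprocal square root intrinsics return approximations, so they are compared with a tolerance rather than for equality.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>  // SSE intrinsics (MSVC, GCC, Clang)

// Hypothetical helper, not part of SSE: r[0] is the lowest element (r0).
struct Lanes { float r[4]; };

static Lanes lanes_of(__m128 v) {
    Lanes out;
    _mm_storeu_ps(out.r, v);
    return out;
}

// Applies several packed arithmetic intrinsics element by element.
static void check_arithmetic() {
    const __m128 a = _mm_setr_ps(1.0f, 4.0f, 9.0f, 16.0f);
    const __m128 b = _mm_setr_ps(2.0f, 2.0f, 12.0f, 4.0f);

    Lanes sub = lanes_of(_mm_sub_ps(a, b));   // a[i] - b[i]
    assert(sub.r[0] == -1.0f && sub.r[1] == 2.0f && sub.r[2] == -3.0f && sub.r[3] == 12.0f);

    Lanes mul = lanes_of(_mm_mul_ps(a, b));   // a[i] * b[i]
    assert(mul.r[0] == 2.0f && mul.r[1] == 8.0f && mul.r[2] == 108.0f && mul.r[3] == 64.0f);

    Lanes div = lanes_of(_mm_div_ps(a, b));   // a[i] / b[i]
    assert(div.r[0] == 0.5f && div.r[1] == 2.0f && div.r[2] == 0.75f && div.r[3] == 4.0f);

    Lanes sq = lanes_of(_mm_sqrt_ps(a));      // sqrt(a[i])
    assert(sq.r[0] == 1.0f && sq.r[1] == 2.0f && sq.r[2] == 3.0f && sq.r[3] == 4.0f);

    Lanes mx = lanes_of(_mm_max_ps(a, b));    // max(a[i], b[i])
    assert(mx.r[0] == 2.0f && mx.r[1] == 4.0f && mx.r[2] == 12.0f && mx.r[3] == 16.0f);

    Lanes mn = lanes_of(_mm_min_ps(a, b));    // min(a[i], b[i])
    assert(mn.r[0] == 1.0f && mn.r[1] == 2.0f && mn.r[2] == 9.0f && mn.r[3] == 4.0f);

    Lanes rc = lanes_of(_mm_rcp_ps(a));       // approximately 1 / a[i]
    assert(std::fabs(rc.r[1] - 0.25f) < 1e-3f);

    Lanes rs = lanes_of(_mm_rsqrt_ps(a));     // approximately 1 / sqrt(a[i])
    assert(std::fabs(rs.r[1] - 0.5f) < 1e-3f);
}
```

Each of these packed intrinsics also has a scalar counterpart with the _ss suffix (for example _mm_mul_ss) that operates only on the lowest element.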

The instructions for floating-point addition in SSE are:

    __m128 _mm_add_ss (__m128 a, __m128 b)
    __m128 _mm_add_ps (__m128 a, __m128 b)

Here _mm_add_ss is the scalar form and _mm_add_ps is the packed form.
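The difference between the scalar and packed forms can be seen in a small sketch, assuming a C++11 compiler with SSE enabled. Lanes, lanes_of and check_add are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (MSVC, GCC, Clang)

// Hypothetical helper, not part of SSE: r[0] is the lowest element (r0).
struct Lanes { float r[4]; };

static Lanes lanes_of(__m128 v) {
    Lanes out;
    _mm_storeu_ps(out.r, v);
    return out;
}

// Compares packed and scalar addition on the same inputs.
static void check_add() {
    const __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    const __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

    // Packed: all four elements are added.
    Lanes ps = lanes_of(_mm_add_ps(a, b));
    assert(ps.r[0] == 11.0f && ps.r[1] == 22.0f && ps.r[2] == 33.0f && ps.r[3] == 44.0f);

    // Scalar: only the lowest element is added; the upper three are copied from a.
    Lanes ss = lanes_of(_mm_add_ss(a, b));
    assert(ss.r[0] == 11.0f && ss.r[1] == 2.0f && ss.r[2] == 3.0f && ss.r[3] == 4.0f);
}
```

Note that the upper elements of the scalar result come from the first operand, so _mm_add_ss(a, b) and _mm_add_ss(b, a) differ in elements 1 to 3.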

In general, writing code with SSE instructions takes three steps: use the load/set functions to bring data from memory into SSE registers, use the relevant SSE instructions to perform the calculation, and use the store functions to save the results from the registers back to memory for later use.

Here is an example of completing the addition:

    #include <stdio.h>
    #include <intrin.h>

    int main(int argc, char* argv[])
    {
        float op1[4] = {1.0, 2.0, 3.0, 4.0};
        float op2[4] = {1.0, 2.0, 3.0, 4.0};
        float result[4];

        __m128 a;
        __m128 b;
        __m128 c;

        // Load
        a = _mm_loadu_ps(op1);
        b = _mm_loadu_ps(op2);

        // Calculate
        c = _mm_add_ps(a, b);   // c = a + b

        // Store
        _mm_storeu_ps(result, c);

        /*
        // Using the __m128 union to get the result.
        printf("0: %lf\n", c.m128_f32[0]);
        printf("1: %lf\n", c.m128_f32[1]);
        printf("2: %lf\n", c.m128_f32[2]);
        printf("3: %lf\n", c.m128_f32[3]);
        */
        printf("0: %lf\n", result[0]);
        printf("1: %lf\n", result[1]);
        printf("2: %lf\n", result[2]);
        printf("3: %lf\n", result[3]);

        return 0;
    }

A similar addition example was written previously. In this one, the commented-out printf calls read the result values directly through the __m128 data type.

This type is a union; see the relevant header file for its definition. In practice, however, the value is often an intermediate result that later calculations need, in which case you have to use a store function, which is also more efficient.

The example program above uses _mm_loadu_ps and _mm_storeu_ps, which do not require alignment. If you switch it to _mm_load_ps and _mm_store_ps without aligning the arrays, you will find that the program crashes or does not produce the correct results. The following version specifies the alignment explicitly:

    #include <stdio.h>
    #include <intrin.h>

    int main(int argc, char* argv[])
    {
        __declspec(align(16)) float op1[4] = {1.0, 2.0, 3.0, 4.0};
        __declspec(align(16)) float op2[4] = {1.0, 2.0, 3.0, 4.0};
        _MM_ALIGN16 float result[4];   // A macro, same as "__declspec(align(16))"

        __m128 a;
        __m128 b;
        __m128 c;

        // Load
        a = _mm_load_ps(op1);
        b = _mm_load_ps(op2);

        // Calculate
        c = _mm_add_ps(a, b);   // c = a + b

        // Store
        _mm_store_ps(result, c);

        /*
        // Using the __m128 union to get the result.
        printf("0: %lf\n", c.m128_f32[0]);
        printf("1: %lf\n", c.m128_f32[1]);
        printf("2: %lf\n", c.m128_f32[2]);
        printf("3: %lf\n", c.m128_f32[3]);
        */
        printf("0: %lf\n", result[0]);
        printf("1: %lf\n", result[1]);
        printf("2: %lf\n", result[2]);
        printf("3: %lf\n", result[3]);

        return 0;
    }
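The two programs above add exactly four floats. As a sketch of how the same load, calculate, store pattern might extend to arrays of any length, the hypothetical helper add_arrays below (not from the original article) processes four elements per iteration and finishes the remainder with ordinary scalar code. It assumes a C++11 compiler with SSE enabled and uses the unaligned load and store functions so that callers need not align their arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (MSVC, GCC, Clang)

// Hypothetical helper: out[i] = a[i] + b[i] for every i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;

    // Main loop: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // load
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // calculate and store
    }

    // Tail: the 0 to 3 elements that do not fill a whole register.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Small self-check with a length that is not a multiple of four.
void check_add_arrays() {
    const float a[6] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    const float b[6] = {10.0f, 20.0f, 30.0f, 40.0f, 50.0f, 60.0f};
    float out[6] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};

    add_arrays(a, b, out, 6);
    assert(out[0] == 11.0f && out[3] == 44.0f);  // handled by the SSE loop
    assert(out[4] == 55.0f && out[5] == 66.0f);  // handled by the scalar tail
}
```

If the arrays are known to be 16-byte aligned, for example because they are declared with the C++11 alignas(16) specifier (a portable counterpart to __declspec(align(16))), the loop could use _mm_load_ps and _mm_store_ps instead.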

Original address: http://blog.csdn.net/gengshenghong/article/details/7011373
