Using SSE and other instruction sets in C++ code (4): using the SSE instruction set intrinsic functions


Several manuals are listed in http://blog.csdn.net/gengshenghong/article/details/7008682. Among them, the Intel Intrinsics Guide lets you look up every intrinsic function, its corresponding assembly instruction, and how to use it. This article therefore does not analyze everything. It covers only a subset, enough to show the basic method of using these advanced instruction sets in C++ code. For the remaining instructions, the manual is easy to consult.

Note: the following covers only the SSE instruction set itself, not the whole SSE family (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, etc.). In addition, all the examples below use SSE's floating-point instructions (as mentioned earlier, the new SSE instructions fall into four classes, and floating-point operations are only one major class).

(1) FP Load/store/set

To use SSE instructions, first understand the instructions that load initial data into registers and save register contents back to memory. Most SSE instructions operate on the XMM registers (XMM0 to XMM7, or XMM0 to XMM15 in 64-bit mode), so data must be loaded from memory into these registers before it can be used.

1. Load series: loads data from memory into registers.

    __m128 _mm_load_ss(float *p)
    __m128 _mm_load_ps(float *p)
    __m128 _mm_load1_ps(float *p)
    __m128 _mm_loadh_pi(__m128 a, __m64 *p)
    __m128 _mm_loadl_pi(__m128 a, __m64 *p)
    __m128 _mm_loadr_ps(float *p)
    __m128 _mm_loadu_ps(float *p)

These are the load-series functions as listed in the manual. Among them:

_mm_load_ss is a scalar load: it loads one single-precision floating-point number into the lowest element of the register and clears the other three elements to 0. (r0 := *p, r1 := r2 := r3 := 0.0)

_mm_load_ps is a packed load (all of the following are packed as well). The address p must be 16-byte aligned; otherwise the access typically raises a fault instead of returning data. (r0 := p[0], r1 := p[1], r2 := p[2], r3 := p[3])

_mm_load1_ps loads the value at address p into all four elements of the register. It takes more than one instruction, so for performance reasons avoid it in inner loops. (r0 := r1 := r2 := r3 := *p)

_mm_loadh_pi and _mm_loadl_pi combine their two parameters: two floats are loaded from p into the high (or low) half of the result, and the other half is copied from a. See the manual for details.

_mm_loadr_ps is _mm_load_ps in reverse order. It takes more than one instruction, and the address must also be 16-byte aligned. (r0 := p[3], r1 := p[2], r2 := p[1], r3 := p[0])

_mm_loadu_ps loads the same data as _mm_load_ps, but the address does not have to be 16-byte aligned. The corresponding instruction is movups.
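The differences between the load functions are easiest to see side by side. The following is a minimal sketch, assuming an x86 compiler that provides <xmmintrin.h>; the function name load_demo and the rows array are hypothetical and exist only for this illustration. Each row receives the register contents after one kind of load, with index 0 holding r0.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper for this sketch: runs each load variant on the same
// input and writes the four resulting elements into one row of `rows`
// (index 0 of a row is r0, the lowest element).
inline void load_demo(float rows[7][4]) {
    // alignas(16) gives the 16-byte alignment that _mm_load_ps and
    // _mm_loadr_ps require.
    alignas(16) const float p[8] = {1.0f, 2.0f, 3.0f, 4.0f,
                                    5.0f, 6.0f, 7.0f, 8.0f};

    _mm_storeu_ps(rows[0], _mm_load_ss(p));   // {1, 0, 0, 0}
    _mm_storeu_ps(rows[1], _mm_load_ps(p));   // {1, 2, 3, 4}
    _mm_storeu_ps(rows[2], _mm_load1_ps(p));  // {1, 1, 1, 1}
    _mm_storeu_ps(rows[3], _mm_loadr_ps(p));  // {4, 3, 2, 1}

    // p + 1 is only 4-byte aligned, so the unaligned load is needed here.
    _mm_storeu_ps(rows[4], _mm_loadu_ps(p + 1));  // {2, 3, 4, 5}

    // loadh/loadl read two floats from memory into one half of the result
    // and copy the other half from the first argument (all zeros here).
    const __m128 zero = _mm_setzero_ps();
    const __m64* two_floats = reinterpret_cast<const __m64*>(p + 4);
    _mm_storeu_ps(rows[5], _mm_loadh_pi(zero, two_floats));  // {0, 0, 5, 6}
    _mm_storeu_ps(rows[6], _mm_loadl_pi(zero, two_floats));  // {5, 6, 0, 0}
}
```

Calling load_demo on a float[7][4] array and printing the rows shows one load variant per row.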

2. Set series: also used to load data. Most of these functions take more than one instruction, but they do not require 16-byte alignment.

    __m128 _mm_set_ss(float w)
    __m128 _mm_set_ps(float z, float y, float x, float w)
    __m128 _mm_set1_ps(float w)
    __m128 _mm_setr_ps(float z, float y, float x, float w)
    __m128 _mm_setzero_ps()

These functions are essentially load-like operations. They may compile to several instructions, but they are convenient because alignment does not have to be considered.

_mm_set_ss corresponds to _mm_load_ss. It does not require alignment and takes more than one instruction. (r0 = w, r1 = r2 = r3 = 0.0)

_mm_set_ps corresponds to _mm_load_ps. Its parameters are four separate single-precision floating-point numbers, so alignment is not an issue; it takes more than one instruction. (r0 = w, r1 = x, r2 = y, r3 = z; note the order)

_mm_setr_ps is _mm_set_ps with the arguments in reverse order. (r0 = z, r1 = y, r2 = x, r3 = w)

_mm_set1_ps corresponds to _mm_load1_ps. It does not require alignment and takes more than one instruction. (r0 = r1 = r2 = r3 = w)

_mm_setzero_ps clears the register to zero and needs only one instruction. (r0 = r1 = r2 = r3 = 0.0)
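The argument order of _mm_set_ps is a common source of mistakes, so it may help to see the set functions next to each other. This is a small sketch under the same assumption of an x86 compiler with <xmmintrin.h>; set_demo and rows are hypothetical names used only here. Each row is written with an unaligned store, so index 0 of a row is r0.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper for this sketch: stores the result of each set
// function into one row (index 0 of a row is r0, the lowest element).
inline void set_demo(float rows[5][4]) {
    // Only r0 is set; the other three elements are cleared.
    _mm_storeu_ps(rows[0], _mm_set_ss(1.0f));                     // {1, 0, 0, 0}

    // The LAST argument of _mm_set_ps lands in r0.
    _mm_storeu_ps(rows[1], _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f));   // {1, 2, 3, 4}

    // The FIRST argument of _mm_setr_ps lands in r0.
    _mm_storeu_ps(rows[2], _mm_setr_ps(4.0f, 3.0f, 2.0f, 1.0f));  // {4, 3, 2, 1}

    // One value broadcast to all four elements.
    _mm_storeu_ps(rows[3], _mm_set1_ps(7.0f));                    // {7, 7, 7, 7}

    // All elements zero.
    _mm_storeu_ps(rows[4], _mm_setzero_ps());                     // {0, 0, 0, 0}
}
```

When the values are written in the same order as they should appear in memory, _mm_setr_ps is usually the less surprising choice.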

3. Store series: used to save the contents of SSE registers, such as calculation results, to memory.

    void _mm_store_ss(float *p, __m128 a)
    void _mm_store_ps(float *p, __m128 a)
    void _mm_store1_ps(float *p, __m128 a)
    void _mm_storeh_pi(__m64 *p, __m128 a)
    void _mm_storel_pi(__m64 *p, __m128 a)
    void _mm_storer_ps(float *p, __m128 a)
    void _mm_storeu_ps(float *p, __m128 a)
    void _mm_stream_ps(float *p, __m128 a)

These functions mirror the load series; each one is basically the reverse operation.

_mm_store_ss: one instruction, *p = a0.

_mm_store_ps: one instruction, p[i] = a[i]; p must be 16-byte aligned.

_mm_store1_ps: more than one instruction, p[i] = a0.

_mm_storeh_pi, _mm_storel_pi: store the high or the low two elements of a.

_mm_storer_ps: stores in reverse order, more than one instruction.

_mm_storeu_ps: one instruction, p[i] = a[i]; does not require 16-byte alignment.

_mm_stream_ps: writes directly to memory, bypassing the cache (a non-temporal store).
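The store functions can be compared in the same way as the loads. The sketch below assumes an x86 compiler with <xmmintrin.h>; the Row struct and store_demo are hypothetical names introduced only for this illustration. Every row starts filled with -1 so that elements a store does not touch remain visible.

```cpp
#include <xmmintrin.h>  // SSE intrinsics, including _mm_sfence

// Hypothetical 16-byte-aligned row type, so that the aligned stores
// (_mm_store_ps, _mm_store1_ps, _mm_storer_ps, _mm_stream_ps) are valid.
struct alignas(16) Row { float v[4]; };

// Hypothetical helper for this sketch: applies one store variant per row.
inline void store_demo(Row rows[8]) {
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 4; ++j)
            rows[i].v[j] = -1.0f;  // sentinel for "not written"

    const __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  // r0..r3 = 1..4

    _mm_store_ss(rows[0].v, a);   // {1, -1, -1, -1}: only a0 is written
    _mm_store_ps(rows[1].v, a);   // {1, 2, 3, 4}
    _mm_store1_ps(rows[2].v, a);  // {1, 1, 1, 1}: a0 in every element
    _mm_storer_ps(rows[3].v, a);  // {4, 3, 2, 1}
    _mm_storeu_ps(rows[4].v, a);  // {1, 2, 3, 4}, alignment not required

    // High and low halves of a, written to the first two floats of the row.
    _mm_storeh_pi(reinterpret_cast<__m64*>(rows[5].v), a);  // {3, 4, -1, -1}
    _mm_storel_pi(reinterpret_cast<__m64*>(rows[6].v), a);  // {1, 2, -1, -1}

    // Non-temporal store; the fence makes it visible before later accesses.
    _mm_stream_ps(rows[7].v, a);  // {1, 2, 3, 4}
    _mm_sfence();
}
```

_mm_stream_ps tends to pay off only for large buffers that will not be read again soon; for small results like these, the ordinary stores are the usual choice.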

(2) Arithmetic instructions

SSE provides a large number of floating-point instructions: addition, subtraction, multiplication, division, square root, maximum, minimum, approximate reciprocal, approximate reciprocal square root, and so on, which shows how powerful the SSE instructions are. Once the load and store instructions above are understood, these arithmetic instructions are easy to use. The following takes addition as an example.

The instructions for floating point addition in SSE are:

    __m128 _mm_add_ss(__m128 a, __m128 b)
    __m128 _mm_add_ps(__m128 a, __m128 b)

_mm_add_ss uses the scalar execution mode, and _mm_add_ps uses the packed execution mode.
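The difference between the scalar and packed modes can be seen by adding the same two registers both ways. This is a minimal sketch, assuming an x86 compiler with <xmmintrin.h>; add_demo and its output parameters are hypothetical names for this illustration only.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper for this sketch: adds the same inputs in scalar and
// in packed mode and writes both results out (index 0 is r0).
inline void add_demo(float scalar_out[4], float packed_out[4]) {
    const __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    const __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

    // Scalar: only the lowest element is added; r1..r3 are copied from a.
    _mm_storeu_ps(scalar_out, _mm_add_ss(a, b));  // {11, 2, 3, 4}

    // Packed: all four elements are added.
    _mm_storeu_ps(packed_out, _mm_add_ps(a, b));  // {11, 22, 33, 44}
}
```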

In general, code written with SSE instructions follows three steps: use the load/set functions to bring data from memory into SSE registers, use the relevant SSE instructions to do the calculation, and use the store functions to save the result from the registers back to memory for later use.

Here is an example of completing an addition:

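    #include <xmmintrin.h>  // SSE intrinsics (on MSVC, <intrin.h> also provides them)

    int main(int argc, char* argv[])
    {
        float op1[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float op2[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float result[4];

        __m128 a;
        __m128 b;
        __m128 c;

        // Load (unaligned, because op1 and op2 are not guaranteed to be 16-byte aligned)
        a = _mm_loadu_ps(op1);
        b = _mm_loadu_ps(op2);

        // Calculate
        c = _mm_add_ps(a, b);  // c = a + b

        // Store
        _mm_storeu_ps(result, c);  // result = {2.0, 4.0, 6.0, 8.0}

        // The result could also be read through the __m128 union.
        return 0;
    }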
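The same load, calculate, store pattern extends to whole arrays and to the other arithmetic instructions. The following sketch processes four floats per iteration and finishes the leftover elements with ordinary scalar code. The function magnitude and its parameters are hypothetical, chosen only to combine _mm_mul_ps, _mm_add_ps and _mm_sqrt_ps in one place; it is one possible arrangement, not a tuned implementation.

```cpp
#include <cmath>        // std::sqrt for the scalar tail
#include <cstddef>      // std::size_t
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical function for this sketch:
// out[i] = sqrt(x[i]*x[i] + y[i]*y[i]) for i in [0, n).
inline void magnitude(const float* x, const float* y, float* out,
                      std::size_t n) {
    std::size_t i = 0;

    // Packed part: four elements per iteration. Unaligned loads and stores
    // are used because nothing is assumed about the pointers' alignment.
    for (; i + 4 <= n; i += 4) {
        const __m128 a = _mm_loadu_ps(x + i);                     // load
        const __m128 b = _mm_loadu_ps(y + i);
        const __m128 sum = _mm_add_ps(_mm_mul_ps(a, a),           // calculate
                                      _mm_mul_ps(b, b));
        _mm_storeu_ps(out + i, _mm_sqrt_ps(sum));                 // store
    }

    // Scalar tail: the remaining n % 4 elements.
    for (; i < n; ++i) {
        out[i] = std::sqrt(x[i] * x[i] + y[i] * y[i]);
    }
}
```

If the arrays are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps can replace the unaligned variants in the packed loop.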
