Using SSE and other instruction sets in C/C++ code (4): using the SSE intrinsic functions

Source: Internet
Author: User

All intrinsic functions, their corresponding assembly instructions, and their usage can be looked up in the Intel Intrinsics Guide. We therefore do not analyze every function below. The goal is to learn how to use these advanced instruction sets in C/C++ code; once that is understood, using further instructions is just a matter of consulting the manual.

Note: The instructions used below come only from the original SSE instruction set, not from the whole SSE family (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and so on). In addition, all of the examples use the floating-point instructions in SSE (as mentioned earlier, the new SSE instructions fall into four categories, and floating-point operations are only one major part).

(1) FP load/store/set

To use the SSE instructions, you must first understand the instructions that initialize and load data and that save register contents back to memory. Most SSE instructions operate on the registers xmm0 to xmm7 (xmm0 to xmm15 in 64-bit mode). Before using these instructions, you need to load data from memory into these registers.

1. Load series: used to load data from memory into a register.

    __m128 _mm_load_ss (float *p)
    __m128 _mm_load_ps (float *p)
    __m128 _mm_load1_ps (float *p)
    __m128 _mm_loadh_pi (__m128 a, __m64 *p)
    __m128 _mm_loadl_pi (__m128 a, __m64 *p)
    __m128 _mm_loadr_ps (float *p)
    __m128 _mm_loadu_ps (float *p)

These are the load functions listed in the manual:

_mm_load_ss loads a scalar: one single-precision float is loaded into the lowest element of the register, and the other three elements are set to 0 (r0 := *p, r1 := r2 := r3 := 0.0).

_mm_load_ps is a packed load (as are all of the functions below). The address p must be 16-byte aligned; otherwise the load fails, typically by crashing the program (r0 := p[0], r1 := p[1], r2 := p[2], r3 := p[3]). The corresponding instruction is movaps.

_mm_load1_ps loads the value at address p into all four elements of the register. It requires multiple instructions, so for performance reasons avoid it in inner loops (r0 := r1 := r2 := r3 := *p).

_mm_loadh_pi and _mm_loadl_pi build the result from both parameters: two floats (64 bits) are loaded from p into the high or low half of the register respectively, and the other half is copied from a. See the manual for details.

_mm_loadr_ps loads in the reverse order of _mm_load_ps. It requires multiple instructions, and the address must again be 16-byte aligned (r0 := p[3], r1 := p[2], r2 := p[1], r3 := p[0]).

_mm_loadu_ps loads in the same way as _mm_load_ps, but the address does not have to be 16-byte aligned. The corresponding instruction is movups.

2. Set series: also used to load data. Most of these functions require multiple instructions, but they do not require 16-byte alignment.

    __m128 _mm_set_ss (float w)
    __m128 _mm_set_ps (float z, float y, float x, float w)
    __m128 _mm_set1_ps (float w)
    __m128 _mm_setr_ps (float z, float y, float x, float w)
    __m128 _mm_setzero_ps ()

These functions are similar to the load functions, but may compile to multiple instructions. Their convenience is that alignment does not have to be considered.

_mm_set_ss corresponds to _mm_load_ss. It has no alignment requirement and requires multiple instructions (r0 = w, r1 = r2 = r3 = 0.0).

_mm_set_ps corresponds to _mm_load_ps. Its parameters are four independent single-precision floats, so there is no alignment requirement, and multiple instructions are required (r0 = w, r1 = x, r2 = y, r3 = z; pay attention to the order). _mm_setr_ps takes the same parameters but fills the register in the reverse order (r0 = z, r1 = y, r2 = x, r3 = w).

_mm_set1_ps corresponds to _mm_load1_ps. It has no alignment requirement and requires multiple instructions (r0 = r1 = r2 = r3 = w).

_mm_setzero_ps clears the register to 0. Only one instruction is required (r0 = r1 = r2 = r3 = 0.0).

3. Store series: used to save the contents of SSE registers, such as computation results, to memory.

    void _mm_store_ss (float *p, __m128 a)
    void _mm_store_ps (float *p, __m128 a)
    void _mm_store1_ps (float *p, __m128 a)
    void _mm_storeh_pi (__m64 *p, __m128 a)
    void _mm_storel_pi (__m64 *p, __m128 a)
    void _mm_storer_ps (float *p, __m128 a)
    void _mm_storeu_ps (float *p, __m128 a)
    void _mm_stream_ps (float *p, __m128 a)

These functions mirror the load functions and are essentially the reverse process.

_mm_store_ss: one instruction, *p = a0.

_mm_store_ps: one instruction, p[i] = a[i]. The address p must be 16-byte aligned.

_mm_store1_ps: multiple instructions, p[i] = a0. The address p must be 16-byte aligned.

_mm_storeh_pi, _mm_storel_pi: store the high or low half (two floats) of a to p.

_mm_storer_ps: stores in reverse order, multiple instructions. The address p must be 16-byte aligned.

_mm_storeu_ps: one instruction, p[i] = a[i], with no 16-byte alignment requirement.

_mm_stream_ps: writes the data directly to memory without polluting the cache. The address p must be 16-byte aligned.

(2) Arithmetic instructions

SSE provides a large number of floating-point arithmetic instructions, including addition, subtraction, multiplication, division, square root, maximum, minimum, approximate reciprocal, and approximate reciprocal square root, which shows how powerful the SSE instructions are. Once you understand the data loading and storing functions above, these arithmetic instructions are easy to use. The following uses addition as an example.

The floating-point addition functions in SSE are:

    __m128 _mm_add_ss (__m128 a, __m128 b)
    __m128 _mm_add_ps (__m128 a, __m128 b)

_mm_add_ss is the scalar form (only the lowest element is added), and _mm_add_ps is the packed form (all four elements are added).

In general, code written with SSE intrinsics follows this procedure: use the load/set functions to bring data from memory into SSE registers, use the relevant SSE instructions to perform the computation, and then use the store functions to write the results from the registers back to memory for later use.

The following example performs an addition:

    #include <stdio.h>      // printf
    #include <xmmintrin.h>  // SSE intrinsics (MSVC's <intrin.h> also provides them)

    int main(int argc, char* argv[])
    {
        float op1[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float op2[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float result[4];

        __m128 a;
        __m128 b;
        __m128 c;

        // Load (unaligned loads: op1 and op2 need not be 16-byte aligned)
        a = _mm_loadu_ps(op1);
        b = _mm_loadu_ps(op2);

        // Calculate
        c = _mm_add_ps(a, b);   // c = a + b

        // Store (unaligned store)
        _mm_storeu_ps(result, c);

        /*
        // Using the __m128 union to get the result (MSVC only).
        printf("0: %f\n", c.m128_f32[0]);
        printf("1: %f\n", c.m128_f32[1]);
        printf("2: %f\n", c.m128_f32[2]);
        printf("3: %f\n", c.m128_f32[3]);
        */
        printf("0: %f\n", result[0]);
        printf("1: %f\n", result[1]);
        printf("2: %f\n", result[2]);
        printf("3: %f\n", result[3]);

        return 0;
    }

This example performs a simple packed addition. The printf calls inside the comment read the values directly from the __m128 data type, which in the Microsoft headers is a union type; see the relevant header files for the exact definition. In practice, however, the value is often an intermediate result that is used in later computation, so it is normally written back with a store function, which is also more efficient. The example uses _mm_loadu_ps and _mm_storeu_ps, which do not require alignment. If _mm_load_ps and _mm_store_ps were used here instead, the program could crash or produce incorrect results, because the arrays are not guaranteed to be 16-byte aligned. The following version specifies the alignment explicitly:

    #include <stdio.h>      // printf
    #include <xmmintrin.h>  // SSE intrinsics and the _MM_ALIGN16 macro (MSVC)

    int main(int argc, char* argv[])
    {
        // __declspec(align(16)) is MSVC-specific syntax for 16-byte alignment.
        __declspec(align(16)) float op1[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        __declspec(align(16)) float op2[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        _MM_ALIGN16 float result[4];    // A macro, same as "__declspec(align(16))"

        __m128 a;
        __m128 b;
        __m128 c;

        // Load (aligned loads: the arrays are 16-byte aligned)
        a = _mm_load_ps(op1);
        b = _mm_load_ps(op2);

        // Calculate
        c = _mm_add_ps(a, b);   // c = a + b

        // Store (aligned store)
        _mm_store_ps(result, c);

        /*
        // Using the __m128 union to get the result (MSVC only).
        printf("0: %f\n", c.m128_f32[0]);
        printf("1: %f\n", c.m128_f32[1]);
        printf("2: %f\n", c.m128_f32[2]);
        printf("3: %f\n", c.m128_f32[3]);
        */
        printf("0: %f\n", result[0]);
        printf("1: %f\n", result[1]);
        printf("2: %f\n", result[2]);
        printf("3: %f\n", result[3]);

        return 0;
    }

(3) Other instructions

In addition to the arithmetic instructions above, SSE has other instructions for floating-point processing, such as floating-point comparison, data conversion, and logical operations. They are used in a similar way, so we do not analyze them one by one. The key point is to master the load/set/store functions; the other computation and processing instructions then follow easily.

(4) Other instruction sets

Once you know how to use the functions of the SSE instruction set, the other instruction sets are easy to pick up. The Intel Intrinsics Guide mentioned above covers the intrinsic functions for all Intel processor instruction sets, including MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, and so on.
