SSE Instruction Set Learning: Compiler Intrinsics


Most functions live in libraries, but an intrinsic function is built into the compiler itself.

1. Intrinsic Function

An intrinsic function behaves like an inline function: its code is inserted directly at the call site, avoiding function-call overhead, and it is lowered to efficient machine instructions. Because information about the behavior of intrinsic functions is built into the compiler's optimizer, the optimizer can perform optimizations on intrinsics that are impossible for inline assembly, so intrinsic functions are usually faster than equivalent inline assembly code. The optimizer can also adapt an intrinsic function to its context, for example expanding it into different instruction sequences or keeping operands in the appropriate registers.
Using intrinsic functions does affect code portability: some intrinsics are available only in Visual C++ and not in other compilers, and some target a specific CPU architecture rather than every platform. These factors hurt the portability of code containing intrinsics, but porting such code is still far easier than porting inline assembly. In addition, MSVC no longer supports inline assembly on 64-bit targets.

2. SSE intrinsic

Both VS and GCC support SSE intrinsics. SSE has several versions, and the corresponding intrinsics live in different header files; if you know exactly which SSE version you need, you can include just the corresponding header.


Referenced from: http://www.cnblogs.com/zyl910/archive/2012/02/28/vs_intrin_table.html

For example, to use SSSE3:

#include <tmmintrin.h>

If you do not care which SSE version you are using, you can include everything:

#include <intrin.h> 
2.1 Data types

The intrinsic data types correspond to the register widths:
* 64-bit: used by the MMX instruction set
* 128-bit: used by the SSE instruction set
* 256-bit: used by the AVX instruction set

Likewise, the AVX-512 instruction set has 512-bit registers, so its corresponding intrinsic data types are 512 bits wide.
The specific data types and their descriptions are as follows:
1. __m64: the 64-bit data type, usable only by the MMX instruction set. Since the MMX integer operations were extended to the 128-bit registers of the SSE instruction set, this data type is now rarely used.
2. __m128/__m128i/__m128d: these three are 128-bit data types. Because the SSE instruction set can operate on both integers and floating-point values (single and double precision), the three types represent different operand types according to the suffix: __m128 holds single-precision floats, __m128i holds integers, and __m128d holds double-precision floats.

The 256-bit and 512-bit data types are similar to the 128-bit ones, differing only in how many elements they hold, so they are not repeated here.
Knowing the lengths of the various data types and what their names mean, what do they look like in code? Consider:

__m128i yy;

yy is of type __m128i. In the MSVC debugger you can see that __m128i is a union whose members interpret the same bits as different element types: 8-bit, 16-bit, 32-bit, and 64-bit signed/unsigned integers (__m128i is the integer type, so it has only integer members; floating-point data uses __m128). Each member is an array of the corresponding element type, and the array lengths differ with the element width (array length = 128 / element width in bits). When using these types you must pay close attention to the element type, i.e. the element width: the same variable yy interpreted as 4 signed 32-bit integers holds 0, 0, 1024, 1024; but interpreted as signed 64-bit integers it holds 0, 4398046512128 — greatly different.
Under MSVC you can use yy.m128i_i32[0] to extract the first 32-bit integer element. Native intrinsics do not provide this; it is an MSVC extension, in typical Microsoft style — convenient but inefficient — so it is unavailable under GCC/Clang. Under MSVC you can avoid this style of data extraction where performance matters, but it is very handy when debugging, for example to see how the same 128 bits of data take different values under different element types.

2.2 Naming of intrinsic functions

The naming of intrinsic functions follows a pattern. An intrinsic name usually consists of 3 parts, with the following meanings:
1. The first part is the prefix _mm, indicating an intrinsic for the SSE instruction set. _mm256 and _mm512 are the prefixes for the AVX and AVX-512 instruction sets; only SSE is discussed here, so those are omitted.
2. The second part names the operation of the corresponding instruction, such as _add, _mul, _load. Some operations carry modifiers, e.g. loadu loads an operand that is not 16-byte aligned into a register.
3. The third part names the operand type and width: _ps operates on packed single-precision floats; _pd on packed double-precision floats; _pixx (xx = 8/16/32/64) on packed xx-bit signed integers in a 64-bit register; _epixx (xx = the width) on packed xx-bit signed integers in a 128-bit register; _epuxx on packed xx-bit unsigned integers; _ss operates only on the first (scalar) single-precision float; and so on.

Combining the three parts gives a complete intrinsic name, e.g. _mm_mul_epi32, which multiplies 32-bit signed integers in its arguments, producing 64-bit products.

The SSE instruction set handles branching very poorly, and extracting individual elements from 128 bits of data is expensive, so it is not well suited to operations with complex logic.

3. Intrinsic bilinear interpolation

The previous article, SSE instruction set optimization learning: bilinear interpolation, used SSE assembly instructions to optimize a bilinear interpolation algorithm; here that code is converted to the intrinsic version.

3.1 Calculating (y * width + x) * depth

Each destination pixel maps to the 4 nearest pixels around a position in the source image, so here the offsets of those 4 nearest source pixels are computed simultaneously.

__m128i wwidth = _mm_set_epi32(0, width, 0, width);
__m128i yy = _mm_set_epi32(0, y2, 0, y1);
yy = _mm_mul_epi32(yy, wwidth);                      // y1*width 0 y2*width 0
yy = _mm_shuffle_epi32(yy, 0xd8);                    // y1*width y2*width 0 0
yy = _mm_unpacklo_epi32(yy, yy);                     // y1*width y2*width y1*width y2*width
yy = _mm_shuffle_epi32(yy, _MM_SHUFFLE(3, 1, 2, 0));
__m128i xx = _mm_set_epi32(x2, x2, x1, x1);
xx = _mm_add_epi32(xx, yy);                          // (x1,y1) (x1,y2) (x2,y1) (x2,y2)
__m128i x1x1 = _mm_shuffle_epi32(xx, 0x50);          // (x1,y1) (x1,y2)
__m128i x2x2 = _mm_shuffle_epi32(xx, 0xFA);          // (x2,y1) (x2,y2)
    1. The set functions populate an __m128i with the required data.
    2. The mul function multiplies; the product of two 32-bit integers is a 64-bit integer.
    3. Since pixel offsets are being computed, 32-bit integers are sufficient; shuffle rearranges the data within the __m128i, and the unpack functions regroup it, combining the data into the desired layout.
    4. _MM_SHUFFLE is a macro that makes it easy to generate the immediate operand required by shuffle. For example
_mm_shuffle_epi32(yy, _MM_SHUFFLE(3, 1, 2, 0));

swaps the 2nd and 3rd 32-bit integers stored in yy.

3.2 Conversion of data types

SSE assembly instructions and their intrinsic functions mostly correspond one to one, so converting the assembly implementation to intrinsics is quite simple, and listing all that code again here would add little. Instead, here is a note on the biggest problem encountered along the way: conversions between data types.
In image processing, pixel channel values are 8-bit unsigned integers, while the computations on them are often done in floating point. This requires converting 8-bit unsigned integers to floating-point numbers, and after the computation the results must be written back to the image channels as 8-bit unsigned integers, which also involves truncating values that exceed 8 bits. Failing to pay attention to this at first cost me dearly.
Type conversions are mainly the following:
1. Conversion between floating-point numbers and integers, and between 32-bit and 64-bit floating-point numbers. This is straightforward and only requires calling the corresponding intrinsic.
2. Widening of signed integers: extending 8-, 16-, and 32-bit signed integers to 16, 32, and 64 bits.
3. Narrowing of signed integers: compressing 16-, 32-, and 64-bit signed integers.
4. Extending unsigned integers to signed integers.
The above type conversions take the following forms as intrinsic functions:
* _mm_cvtepixx_epixx (xx = 8/16/32/64): conversion between signed integers
* _mm_cvtepixx_ps / _mm_cvtepixx_pd: signed integer to single-/double-precision float
* _mm_cvtepuxx_epixx: unsigned integer to signed integer, zero-extending the high bits to the target width. There is no operation converting e.g. a 32-bit unsigned integer down to a 16-bit signed integer.
* _mm_cvtepuxx_ps / _mm_cvtepuxx_pd: unsigned integer to single-/double-precision float.

One more conversion remains: integer saturating conversion. In a saturating conversion, values beyond the representable range clamp to the extreme; for example, an 8-bit unsigned integer has a maximum of 255, so any value greater than 255 converted to 8-bit unsigned becomes 255.
There are two types of saturation conversions for integers:
* Between signed integers, SSE provides two intrinsic functions:

__m128i _mm_packs_epi32(__m128i a, __m128i b)
__m128i _mm_packs_epi16(__m128i a, __m128i b)

These saturate 32-/16-bit signed integers down to 16-/8-bit signed integers.
* Signed to unsigned

__m128i _mm_packus_epi32(__m128i a, __m128i b)
__m128i _mm_packus_epi16(__m128i a, __m128i b)

These saturate 32-/16-bit signed integers down to 16-/8-bit unsigned integers.

4. Comparison of SSE assembly instructions and intrinsic functions

This is only a rough comparison; after all, I am just a beginner. First, under Debug, pure assembly SSE code is much faster, presumably because without compiler optimization, hand-written assembly still holds a big efficiency advantage. Under Release, however, as mentioned earlier, the optimizer has built-in knowledge of intrinsic behavior and can optimize intrinsics very aggressively, so the difference disappears. PS: probably owing to the choice of test data, plain C++ code, SSE assembly code, and intrinsic functions all ran at about the same speed under Release; the compiler's own optimization is very powerful.

4.1 Intrinsic functions cause multiple memory reads and writes

Another problem compared with assembly is data access. With SSE assembly, intermediate results can be kept in XMM registers and read directly when needed. Intrinsic functions cannot name XMM registers, so (without optimization) each result must be written back to memory and re-loaded into an XMM register when used.

yy = _mm_mul_epi32(yy, wwidth);

The code above is a 32-bit signed integer multiply whose result is saved in yy. Disassembling it gives the corresponding assembly:

000B0428  movaps  xmm0, xmmword ptr [ebp-1B0h]
000B042F  pmuldq  xmm0, xmmword ptr [ebp-190h]
000B0438  movaps  xmmword ptr [ebp-7A0h], xmm0
000B043F  movaps  xmm0, xmmword ptr [ebp-7A0h]
000B0446  movaps  xmmword ptr [ebp-1B0h], xmm0

The assembly above contains several movaps operations, while the same computation needs only one instruction when writing assembly directly:

pmuldq xmm0, xmm1;

Each intrinsic call performs at least one memory read, loading its operands from memory into XMM registers, plus a memory write when the result is saved back to a variable. So for a very simple computation (e.g. multiplying 4 32-bit floats) intrinsics differ little from SSE assembly instructions, but when the logic is more complex or many intrinsic functions are called, the extra memory reads and writes cost a noticeable amount of efficiency.

4.2 Comparison of intrinsic and SSE directives for simple operations

A more extreme example. The plain, unoptimized C++ code is as follows:

_MM_ALIGN16 float a[] = {1.0f, 2.0f, 3.0f, 4.0f};
_MM_ALIGN16 float b[] = {5.0f, 6.0f, 7.0f, 8.0f};
const int count = 1000000000;
float c[4] = {0, 0, 0, 0};
cout << "Normal time (ms): ";
double tStart = static_cast<double>(clock());
for (int i = 0; i < count; i++)
    for (int j = 0; j < 4; j++)
        c[j] = a[j] + b[j];
double tEnd = static_cast<double>(clock());

Two arrays of 4 single-precision floats are added repeatedly; the result is the same whether the addition runs once or a billion times. The version using SSE assembly instructions:

    for (int i = 0; i < count; i++)
        _asm
        {
            movaps xmm0, [a];
            movaps xmm1, [b];
            addps xmm0, xmm1;
        }

Code that uses the intrinsic function:

    __m128 a1, b2;
    __m128 c1;
    for (int i = 0; i < count; i++)
    {
        a1 = _mm_load_ps(a);
        b2 = _mm_load_ps(b);
        c1 = _mm_add_ps(a1, b2);
    }

Running under Debug gives the expected ordering: SSE assembly < intrinsic functions < C++. The SSE assembly version is nearly 1/3 faster than the intrinsic version. Here is the disassembly of the intrinsic code:

    a1 = _mm_load_ps(a);
00FB2570  movaps  xmm0, xmmword ptr [a]
00FB2574  movaps  xmmword ptr [ebp-...h], xmm0
00FB257B  movaps  xmm0, xmmword ptr [ebp-...h]
00FB2582  movaps  xmmword ptr [a1], xmm0
    b2 = _mm_load_ps(b);
00FB2586  movaps  xmm0, xmmword ptr [b]
00FB258A  movaps  xmmword ptr [ebp-...h], xmm0
00FB2591  movaps  xmm0, xmmword ptr [ebp-...h]
00FB2598  movaps  xmmword ptr [b2], xmm0
    c1 = _mm_add_ps(a1, b2);
00FB259F  movaps  xmm0, xmmword ptr [a1]
00FB25A3  addps   xmm0, xmmword ptr [b2]
00FB25AA  movaps  xmmword ptr [ebp-260h], xmm0
00FB25B1  movaps  xmm0, xmmword ptr [ebp-260h]
00FB25B8  movaps  xmmword ptr [c1], xmm0

There are 12 movaps instructions and one addps instruction in total, while the SSE assembly version has only 2 movaps instructions and one addps; the time difference is evidently due mainly to the intrinsic version's memory reads and writes.
The Debug result is no surprise; the Release result, however, genuinely is.

Under Release, the SSE assembly version is the slowest and the plain C++ version is faster — the compiler's optimization really is impressive. And the intrinsic version's time is 0. What happened? Inspecting the disassembly shows the addition was performed only once, not a billion times: the optimizer, knowing the intrinsic's behavior, deduced that the subsequent iterations were meaningless. (A classmate told me he works on compiler code-generation optimization and branch prediction, still being implemented; I am not sure whether what he described is exactly what applies here.)

5. Summary

I have studied the SSE instruction set for nearly two weeks and written two study notes — still just an introduction. A summary of this period:
1. As its name, Streaming SIMD Extensions, suggests, SSE's greatest strength is performing the same operation on multiple operands in parallel with a single instruction; how many depends on the operand width and the register width. For 32-bit signed integers, a 128-bit register (the register most commonly used with the SSE instruction set) can process 4 at once, the 256-bit registers of AVX can process 8, and the 512-bit registers of AVX-512 can process 16.
2. Pay attention to operand types when using SSE instructions: for integers, distinguish signed from unsigned; for floating point, note single versus double precision. Also mind the operand width — the same 128-bit string has several different interpretations depending on its type and width.
3. As mentioned several times, the compiler's optimizer is very strong; do not reach for SSE instructions gratuitously. When you do use SSE, remember that its power lies in its parallelism.

Another sunny Friday afternoon. They said heavy rain was coming today, so I didn't dare ride my bike to work this morning — which means squeezing onto a crowded bus to get home. Speaking of which, why do we say "squeeze onto the bus" rather than simply "take the bus"?

