Compile the SIMD Instruction Program on Linux

Last Update:2018-12-06 Source: Internet

Author: User

Tags builtin


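(1) Make sure the __MMX__ and __SSE__ macros are defined when compiling with g++ (GCC predefines them when the corresponding extensions are enabled);

(2) Build with the GCC options -march=pentium4 -mmmx -msse -m3dnow;

(3) Include the intrinsics header file, for example xmmintrin.h.

If you develop in Eclipse, make sure the project's build settings carry the same options.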
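Whether the options from step (2) actually took effect can be checked through the macros from step (1). The following is a minimal sketch, assuming GCC or a compatible compiler that predefines `__MMX__` and `__SSE__` when those extensions are enabled; the function names `simd_macros_enabled` and `print_simd_macros` are hypothetical and not part of the original article.

```c
#include <stdio.h>

/* Hypothetical helper: returns a bit mask of the SIMD macros that the
 * compiler predefined for this translation unit.
 * Bit 0 is set when __MMX__ is defined, bit 1 when __SSE__ is defined. */
int simd_macros_enabled(void)
{
    int mask = 0;
#ifdef __MMX__
    mask |= 1;
#endif
#ifdef __SSE__
    mask |= 2;
#endif
    return mask;
}

/* Prints one line per macro so that a build configuration can be checked. */
void print_simd_macros(void)
{
    int mask = simd_macros_enabled();
    printf("__MMX__: %s\n", (mask & 1) ? "defined" : "not defined");
    printf("__SSE__: %s\n", (mask & 2) ? "defined" : "not defined");
}
```

Calling `print_simd_macros()` from `main` in a file built for an x86 target with `-mmmx -msse` should report both macros as defined; if one is reported as missing, the build settings are the first thing to check.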

IA-32 Intel architecture instructions fall into the following main categories:

General-purpose
x87 FPU
MMX technology
SSE/SSE2/SSE3 extensions

The MMX and SSE extensions introduce the SIMD (single instruction, multiple data) execution model, which can be used to accelerate multimedia applications. The following describes the execution environment and features of these instructions.

8 32-bit general-purpose registers, which all SIMD extensions can use;
MMX: 8 64-bit MMX registers (mm0-mm7), which the SSE extensions can also use;
- Data are packed integers, up to two 32-bit values per register
- No register or flag indicates that an operation overflowed.
SSE: 8 128-bit XMM registers, the MXCSR register, and the EFLAGS register
- Supports single-precision floating point
- MXCSR contains the rounding control and the overflow flag
- Supports 64-bit SIMD integers
SSE2: the execution environment is the same as for SSE.
- Double-precision floating point
- 128-bit SIMD integers
- Double-precision conversions
SSE3: 13 instructions, introduced with the Intel Prescott processor.
- Mainly improves video decoding, 3D graphics, and Hyper-Threading performance.

MMX technology appeared first, and almost all current x86 processors support it, including embedded x86. The discussion below is therefore based mainly on MMX, but the methods apply equally to SSEn and to other SIMD extensions such as AMD's 3DNow!.

MMX instructions are divided into the following types:

Data transfer: movd, movq
Data conversion: packsswb, packssdw, packuswb, punpckhbw, punpckhwd, punpckhdq, punpcklbw, punpcklwd, punpckldq
Parallel arithmetic: paddb, paddw, paddd, paddsb, paddsw, paddusb, paddusw, psubb, psubw, psubd, psubsb, psubsw, psubusb, psubusw, pmulhw, pmullw, pmaddwd
Parallel comparison: pcmpeqb, pcmpeqw, pcmpeqd, pcmpgtb, pcmpgtw, pcmpgtd
Parallel logic: pand, pandn, por, pxor
Shift and rotate: psllw, pslld, psllq, psrlw, psrld, psrlq, psraw, psrad
State management: emms

Besides what each instruction does, pay attention to the data types it operates on. The above is background only; for details, refer to the manual.
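The parallel arithmetic instructions differ mainly in how each element handles overflow. The following scalar C model is only an illustration of a single 16-bit element of paddw (wraparound), paddsw (signed saturation), and paddusw (unsigned saturation). The function names are hypothetical, and no MMX instruction is executed.

```c
#include <stdint.h>

/* One element of paddw: the sum wraps around modulo 2^16.
 * The final conversion to int16_t relies on the two's-complement
 * wrapping that GCC and Clang implement. */
int16_t lane_paddw(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* One element of paddsw: the sum is clamped to [-32768, 32767]. */
int16_t lane_paddsw(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX)
        return INT16_MAX;
    if (sum < INT16_MIN)
        return INT16_MIN;
    return (int16_t)sum;
}

/* One element of paddusw: the unsigned sum is clamped to [0, 65535]. */
uint16_t lane_paddusw(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return (sum > UINT16_MAX) ? (uint16_t)UINT16_MAX : (uint16_t)sum;
}
```

For example, 32767 + 1 yields -32768 in the wrapping model and 32767 in the signed saturating model, which is why the choice of instruction matters as much as the data type.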

Performance Optimization

When an embedded application is written entirely in C/C++, performance problems are never far away. In that case, use a profiling tool (such as gprof) to find the functions that cause the bottlenecks and rewrite those functions completely in assembly. The MPEG-4 codec project Xvid [4] uses this method and provides different optimizations for different processors and instruction sets, which is exactly why the project is first-rate in both functionality and performance. This is deep optimization.

On pipelined, VLIW, and SIMD architectures (such as some DSPs), hand-optimizing entire functions can improve performance by a factor of several to several dozen. When the performance requirements allow it, however, you can give only the key operations of the critical functions a special implementation. This still yields a marked performance gain while keeping the high-level features of C/C++ and shortening the development cycle. The following mixed-programming methods apply MMX instructions when using GCC:

Intel C/C++ compiler intrinsics
GCC built-in operations
Inline assembly (the asm construct)

Intel C/C++ compiler intrinsics

In the IA-32 Intel instruction set manual, the description of some instructions includes an "Intel C/C++ compiler intrinsic equivalent" entry that names the intrinsic corresponding to the instruction. In a C/C++ program an intrinsic has the syntax of a function call, and the compiler translates it directly into the MMX instruction, which is the most direct mapping. Without intrinsics, several C/C++ statements may be required, and the compiler cannot guarantee that they will produce the most efficient MMX instructions. Not every MMX instruction has an equivalent intrinsic; the appendix of the manual lists all of them. Intrinsics are classified as simple or composite: each simple intrinsic corresponds to one instruction, while a composite intrinsic corresponds to several.

GCC supports the Intel C/C++ compiler intrinsics. Example:

#include <stdio.h>
#include <xmmintrin.h> /* This header file must be included */
/* gcc -Wall -march=pentium4 -mmmx -o ins mmx_ins.c */
int main(int argc, char *argv[])
{
    /* Use MMX to compute the dot product of the following vectors */
    short in1[] = {1, 2, 3, 4};
    short in2[] = {2, 3, 4, 5};
    int out1;
    int out2;
    __m64 m1;    /* MMX supports 64-bit integer mm registers */
    __m64 m2;    /* MMX operations require mm registers */
    __m128 m128; /* For SSEn only */
    /* Load two shorts into an mm register each time. Note: only two shorts */
    m1 = _mm_cvtsi32_si64(((int *)in1)[0]);
    m2 = _mm_cvtsi32_si64(((int *)in2)[0]);
    /* One instruction multiplies and adds four 16-bit integers */
    /* and produces two 32-bit integers */
    m2 = _mm_madd_pi16(m1, m2);
    /* Put the low 32-bit integer in a general-purpose register */
    out1 = _mm_cvtsi64_si32(m2);
    /* Shift the high 32-bit integer right and put it in a general-purpose register */
    m2 = _mm_srli_si64(m2, 32);
    out2 = _mm_cvtsi64_si32(m2);
    /* Clear the MMX state */
    _mm_empty();
    /* Add the two 32-bit values; the result is 8 */
    out1 += out2;
    printf("a: %d\n", out1);
    return (0);
}

Notes:

Even if you are not on a Pentium 4 platform, use the following options when compiling:
/*gcc -Wall -march=pentium4 -mmmx -o ins mmx_ins.c*/
Otherwise, a message similar to the following appears:
...xmmintrin.h:34:3: #error "SSE instruction set not enabled"
The final result is not the sum over all four pairs but only over the first two. The intrinsic _mm_cvtsi32_si64 places only a 32-bit value in the low half of the mm register and sets the high 32 bits to zero. The MMX instruction movq can transfer 64 bits of data, but it has no corresponding intrinsic, which again shows that not every instruction has an equivalent intrinsic.
When the vectors are two pairs of 0x8000, 0x8000, that is (-2^15) * (-2^15) + (-2^15) * (-2^15), the result should be 2^31, but the computed value is -2^31 because of an overflow that goes unreported. Pay special attention to this when using MMX: an overflow raises no flag, and a very large value silently becomes a very small one. SSE improves on this.
When the program no longer needs MMX, use the emms instruction to clear the MMX state.
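The notes above point out that the example only sums the first two pairs. The following is a minimal sketch of a variant that multiplies all four pairs, assuming the mmintrin.h intrinsics `_mm_set_pi16`, `_mm_madd_pi16`, and `_mm_srli_si64` are available. The function name `dot4_mmx` and the scalar fallback for compilers without MMX are additions for illustration, not part of the original program.

```c
#include <stdint.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical helper: dot product of two vectors of four 16-bit integers.
 * The 32-bit result wraps around on overflow, as pmaddwd and paddd do. */
int dot4_mmx(const short a[4], const short b[4])
{
#if defined(__MMX__)
    /* _mm_set_pi16 takes the highest element first */
    __m64 va = _mm_set_pi16(a[3], a[2], a[1], a[0]);
    __m64 vb = _mm_set_pi16(b[3], b[2], b[1], b[0]);
    /* pmaddwd: low half = a0*b0 + a1*b1, high half = a2*b2 + a3*b3 */
    __m64 prod = _mm_madd_pi16(va, vb);
    int lo = _mm_cvtsi64_si32(prod);
    int hi = _mm_cvtsi64_si32(_mm_srli_si64(prod, 32));
    /* Clear the MMX state before returning to ordinary code */
    _mm_empty();
    return (int)((uint32_t)lo + (uint32_t)hi);
#else
    /* Scalar fallback with the same wrapping behaviour */
    uint32_t sum = 0;
    int i;
    for (i = 0; i < 4; i++)
        sum += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
    return (int)sum;
#endif
}
```

With the vectors {1, 2, 3, 4} and {2, 3, 4, 5} this returns 40. With {0x8000, 0x8000, 0, 0} on both sides it returns -2^31, which reproduces the unreported overflow described in the notes.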

GCC built-in operations

What is a built-in operation? It treats MMX operands like basic data types such as int and float, with correspondingly defined operations such as addition (+), subtraction (-), and conversion between data types. For more information, see the GNU GCC manual [5]: Extensions to the C Language Family -> Built-in Functions -> X86 Built-in Functions.

Some MMX instructions have corresponding built-in operations. The following code is an example:

#include <stdio.h>
/* No special header file is required; the operations are built in */
/* gcc -Wall -o bins builtinmmx.c */
/* Define a vector data type. HI means 16-bit elements, and 4 means four of them */
typedef int v4hi __attribute__((mode(V4HI)));
/* Define a vector of two 32-bit elements. SI means 32-bit */
typedef int v2si __attribute__((mode(V2SI)));
int main(int argc, char *argv[])
{
    short pa[4] = {0x8000, 0x8000, 1, -1};
    short pb[4] = {0x8000, 0x7fff, -1, -2};

    v4hi va, vb;
    v4hi vsum;

    va = ((v4hi *)pa)[0];
    vb = ((v4hi *)pb)[0];

    /* Saturating addition of four 16-bit values */
    // vsum = __builtin_ia32_paddsw(va, vb);
    /* Four 16-bit values can be added directly, but this differs from adding two long long values */
    vsum = va + vb;

    /* To print the vector, it must be cast to long long */
    printf("... with MMX instructions ... to compute vec_add: %llx\n", (long long)vsum);

    // Result 1 (saturating paddsw): 0xfffd0000ffff8000
    // Result 2 (the + operator):    0xfffd0000ffff0000

    return (0);
}

Notes:

GCC's built-in vector types and their operations keep improving as GCC develops. To use the example above, use GCC 3.4 or later;
Using built-in functions looks similar to using intrinsics, but the two are fundamentally different. Here the two vectors are added with the '+' operator, which shows that a vector behaves like any other data type that the compiler supports directly. The addition is applied to the four elements separately: a carry out of a lower element does not affect the adjacent higher element;
A vector can also be cast to an ordinary data type.
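The element-wise behaviour described in the notes can be checked element by element. The following is a minimal sketch, assuming GCC or Clang, which document the `vector_size` attribute as the way to declare vector types. It uses unsigned 16-bit elements so that wraparound is well defined in C; the type name `v4hu` and the function name `add4_lanes` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Vector of four unsigned 16-bit elements (8 bytes in total) */
typedef uint16_t v4hu __attribute__((vector_size(8)));

/* Hypothetical helper: adds four 16-bit elements independently.
 * A carry out of one element never reaches its neighbour. */
void add4_lanes(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
    v4hu va, vb, vsum;

    /* memcpy avoids the pointer cast used in the listing above */
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);

    /* The compiler applies '+' to each element separately */
    vsum = va + vb;

    memcpy(out, &vsum, sizeof vsum);
}
```

For the bit patterns of the listing above, {0x8000, 0x8000, 1, 0xffff} plus {0x8000, 0x7fff, 0xffff, 0xfffe}, the elements come out as 0x0000, 0xffff, 0x0000, 0xfffd, which corresponds to Result 2 when read from the highest element down.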


Inline assembly

From early on, GCC has allowed assembly instructions to be embedded in C code. This is not limited to MMX instructions, but for MMX technology it is clearly a convenient method. For the detailed syntax, see the GNU GCC manual [5] or the "Inline Assembly" section of GCC: The Complete Reference [6]. The following is a dot product example:

#include <stdio.h>
/** gcc -o ins inlinemmx.c **/
int main(int argc, char *argv[])
{
    int i;
    int result;
    short a[] = {1, 2, 3, 4, 5, 6, 7, 8};
    short b[] = {1, 1, 1, 1, 1, 1, 1, 1};
    printf("... with MMX instructions ...\n");

    /* First clear the accumulator register for the dot product. Is it 0 by default? */
    asm("pandn %%mm5, %%mm5;" ::);
    /* Read a and b, multiply four pairs at a time, and add them in two groups to form two partial sums */
    /* The loop control is implemented in C */
    for (i = 0; i < sizeof(a) / sizeof(short); i += 4) {
        asm("movq %0, %%mm0;"
            "movq %1, %%mm1;"
            "pmaddwd %%mm1, %%mm0;"
            "paddd %%mm0, %%mm5; # multiply and add"
            :
            : "m"(a[i]), "m"(b[i]));
    }
    /* Separate the two partial sums and add them */
    asm("movq %%mm5, %%mm0;"
        "psrlq $32, %%mm5;"
        "paddd %%mm0, %%mm5;"
        "movd %%mm5, %0;"
        "emms"
        : "=r"(result)
        :);
    printf("result: 0x%x\n", result);
    // The result is 0x24.
    return (0);
}

Notes:

This is a typical example of mixing C and assembly in one function;
Pay attention to the order of the operands in the assembly instructions;
Instructions such as movq can be used directly here, without intrinsics or built-ins;
Do not put comments in the middle of the asm instruction sequence, because this may lead to incorrect generated code.

MMX practical example: synthesis filter in x86 SIMD instructions

The following shows how a synthesis filter is optimized. Synthesis filters are widely used in speech codecs and account for a large share of the algorithm's total running time.

for (i = 0; i < lg; i++)
{
    s = L_mult(x[i], a[0]); /* L_mult multiplies and then shifts left */
    for (j = 1; j <= M; j++) { /* M is fixed at 10 */
        s = L_msu(s, a[j], yy[-j]); /* L_msu multiplies, shifts left, and subtracts the product from s */
    }

    s = L_shl(s, 3); /* shift left by three bits */
    *yy++ = g729round(s);
}
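The loop above relies on fixed-point helper operations whose definitions are not shown in the article. The following sketch gives one plausible scalar reading of them, assuming they behave like the ITU-T style basic operators (multiply, double the product, and saturate to 32 bits). The `_ref` names and the saturation details are assumptions for illustration and are not taken from the article's code base.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate value to the 32-bit signed range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX)
        return INT32_MAX;
    if (v < INT32_MIN)
        return INT32_MIN;
    return (int32_t)v;
}

/* Assumed model of L_mult: multiply, then shift left by one bit. */
int32_t L_mult_ref(int16_t a, int16_t b)
{
    return sat32(2 * (int64_t)a * (int64_t)b);
}

/* Assumed model of L_mac: acc + L_mult(a, b), saturated. */
int32_t L_mac_ref(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc + L_mult_ref(a, b));
}

/* Assumed model of L_msu: acc - L_mult(a, b), saturated. */
int32_t L_msu_ref(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc - L_mult_ref(a, b));
}
```

Under these assumptions `L_msu_ref(s, a, y)` equals `L_mac_ref(s, -a, y)` whenever `-a` is representable, which is the identity that lets a multiply-subtract loop be rewritten as a multiply-accumulate loop over negated coefficients.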

Because the inner loop of the code above always runs 10 times, you can unroll it and express every step as a multiply-accumulate operation.

/* To use multiply-accumulate, the order of the 10 coefficients must be rearranged */
for (i = 0; i < M; i++)
    ta[i] = -a[M - i];
ta[11] = 0;
ta[10] = a[0];
for (i = 0; i < lg; i++) {
    *yy = x[i];
    yy[1] = 0;
    s = 0; /* start each output sample from zero */
    s = L_mac(s, ta[11], yy[1]);
    s = L_mac(s, ta[10], yy[0]);
    s = L_mac(s, ta[9], yy[-1]);
    s = L_mac(s, ta[8], yy[-2]);
    s = L_mac(s, ta[7], yy[-3]);
    s = L_mac(s, ta[6], yy[-4]);
    s = L_mac(s, ta[5], yy[-5]);
    s = L_mac(s, ta[4], yy[-6]);
    s = L_mac(s, ta[3], yy[-7]);
    s = L_mac(s, ta[2], yy[-8]);
    s = L_mac(s, ta[1], yy[-9]);
    s = L_mac(s, ta[0], yy[-10]);

    s = L_shl(s, 3);
    *yy++ = g729round(s);
}

The loop kernel above can make use of all eight MMX registers.

/* To use multiply-accumulate, the order of the 10 coefficients must be rearranged */
for (i = 0; i < M; i++)
    ta[i] = -a[M - i];
ta[11] = 0;
ta[10] = a[0];
/* Put the 11 coefficients into three MMX registers and pad with a zero */
asm("movq %0, %%mm0;"
    "movq %1, %%mm1;"
    "movq %2, %%mm2"
    :
    : "m"(ta[0]), "m"(ta[4]), "m"(ta[8]));

/* Use MMX technology to perform the core filter operation */
for (i = 0; i < lg; i++) {
    *yy = x[i];
    yy[1] = 0;
    asm("pandn %%mm6, %%mm6;"
        "movq %1, %%mm3;"
        "movq %2, %%mm4;"
        "movq %3, %%mm5;"
        "pmaddwd %%mm0, %%mm3;"
        "pmaddwd %%mm1, %%mm4;"
        "pmaddwd %%mm2, %%mm5;"
        "paddd %%mm3, %%mm6;"
        "paddd %%mm4, %%mm6;"
        "paddd %%mm5, %%mm6;"
        "movq %%mm6, %%mm7;"
        "psrlq $32, %%mm6;"
        "paddd %%mm7, %%mm6;"
        "movd %%mm6, %0;"
        "emms"
        :
        : "r"(s), "m"(yy[-10]), "m"(yy[-6]), "m"(yy[-2]));
    /* pmaddwd does not perform the saturating left shift that L_mac does, so s has not been shifted yet; shift by one extra bit with the saturating left shift below */
    s = L_shl(s, 4);
    *yy++ = g729round(s);
}

Notes:

In the inline assembly above, the result s is listed among the input operands. This is how the code was written in practice, although an operand that the assembly writes is normally declared as an output ("=r");
MMX has no DSP-style instructions such as multiply with left shift, and its multiply-accumulate does not saturate; SSE is better in this respect;
In theory the operations above may overflow, so the original saturating left shift is kept to reduce the risk;
The operations in the code above can clearly run in parallel, which is very useful on VLIW systems;
This forms the core of the overall optimization of the filter.
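The data flow of the loop kernel, three pmaddwd results accumulated with paddd and then folded into one 32-bit sum, can also be written with intrinsics. The following is only a sketch, assuming the mmintrin.h intrinsics shown are available. The function name `dot12_sketch`, the argument `y` (standing for the 12 samples yy[-10] to yy[1]), and the scalar fallback are hypothetical. Like the assembly, it neither saturates nor shifts left.

```c
#include <stdint.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical helper: wrapping 32-bit dot product of 12 coefficients
 * with 12 samples, processed as three groups of four 16-bit values. */
int32_t dot12_sketch(const int16_t ta[12], const int16_t y[12])
{
#if defined(__MMX__)
    __m64 acc = _mm_setzero_si64();
    int32_t s;
    int k;
    for (k = 0; k < 12; k += 4) {
        __m64 c = _mm_set_pi16(ta[k + 3], ta[k + 2], ta[k + 1], ta[k]);
        __m64 d = _mm_set_pi16(y[k + 3], y[k + 2], y[k + 1], y[k]);
        /* pmaddwd followed by paddd: two 32-bit partial sums */
        acc = _mm_add_pi32(acc, _mm_madd_pi16(c, d));
    }
    /* Fold the high partial sum onto the low one (psrlq, paddd) */
    acc = _mm_add_pi32(acc, _mm_srli_si64(acc, 32));
    s = _mm_cvtsi64_si32(acc);
    /* emms: clear the MMX state */
    _mm_empty();
    return s;
#else
    /* Scalar fallback with the same wrapping behaviour */
    uint32_t sum = 0;
    int k;
    for (k = 0; k < 12; k++)
        sum += (uint32_t)((int32_t)ta[k] * (int32_t)y[k]);
    return (int32_t)sum;
#endif
}
```

In the filter loop this would be called as `dot12_sketch(ta, &yy[-10])` after `*yy` and `yy[1]` have been set, followed by the same `L_shl(s, 4)` step as in the listing above.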

Summary

To make full use of SIMD technology, you may need to write more assembly-level code. However, several techniques for mixing a high-level language with assembly can help: some give higher performance, some are more elegant in form, and all are reasonably efficient. They are all good methods, and we suggest you try them.

This matches the current trend. On the one hand, CPUs support more and more SIMD instruction set extensions; on the other hand, GCC keeps improving its support for these extensions to make them easier to use. GCC 3.4.1 was used here, and in our experience it works well.

About this document

Application of SIMD Instructions in GCC


The translation was initiated on 2004-12-13.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
