3.8 floating point considerations
When programming floating-point applications, it is best to start with a high-level programming language such as C, C++, or Fortran. Many compilers perform floating-point scheduling and optimization where possible. However, the compiler may need assistance in order to generate optimal code.
3.8.1 principles for optimizing floating-point code
User/Source Coding Rule 13: Enable the compiler to use SSE2 and SSE3 instructions with the appropriate switches.
Follow this procedure to investigate the performance of your floating-point application:
● Understand how the compiler handles floating-point code.
● Look at the assembly dump and see what transforms have already been performed on the program.
● Study the loop nests in the application that dominate the execution time.
● Determine why the compiler did not create the fastest code.
● See whether there is a dependence that can be resolved.
● Determine the problem area: bus bandwidth, cache locality, trace cache bandwidth, or instruction latency. Focus on optimizing the problem area. For example, adding prefetch instructions will not help if the bus is already saturated. If trace cache bandwidth is the problem, added prefetch micro-ops may degrade performance.
In addition, follow the general coding recommendations discussed in this chapter, including:
● Blocking the cache
● Using prefetch
● Enabling vectorization
● Unrolling loops
User/Source Coding Rule 14: Make sure your application stays in range to avoid denormal values and underflows. Out-of-range numbers can cause very high overhead.
User/Source Coding Rule 15: Do not use double precision unless necessary. Set the precision control (PC) field in the x87 FPU control word to "single precision". This allows single-precision (32-bit) computation to complete faster on some operations (for example, some divides). However, be careful of introducing more than a total of two values for the floating-point control word, or there will be a large performance penalty. See Section 3.8.3.
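As a rough sketch (assuming the Microsoft CRT's _controlfp_s interface on a 32-bit x87 build; other toolchains expose similar controls, and the routine name here is illustrative), the PC field can be set to single precision without disturbing the remaining FCW bits:

    #include <float.h>   /* Microsoft CRT: _controlfp_s, _PC_24, _MCW_PC */

    /* Set the x87 precision-control (PC) field to single precision (24-bit
       significand) so that operations such as divides complete sooner. */
    static void set_x87_single_precision(void)
    {
        unsigned int current;
        _controlfp_s(&current, _PC_24, _MCW_PC);   /* change only the PC bits */
    }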
User/Source Coding Rule 16: Use fast float-to-int routines, FISTTP, or SSE2 instructions. If coding these routines yourself, use the FISTTP instruction if SSE3 is available, or the CVTTSS2SI and CVTTSD2SI instructions if coding with SSE2.
Many libraries generate x87 code that does more work than is necessary. The FISTTP instruction in SSE3 can convert floating-point values to 16-bit, 32-bit, or 64-bit integers using truncation without accessing the floating-point control word (FCW). The instructions CVTTSS2SI and CVTTSD2SI save many micro-ops and some store-forwarding delays over some compiler implementations. This avoids changing the rounding mode.
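The sketch below (assuming a compiler that provides the SSE/SSE2 intrinsic headers; the function names are illustrative) shows truncating conversions that map to CVTTSS2SI and CVTTSD2SI, so the rounding mode in the FCW or MXCSR is never touched:

    #include <xmmintrin.h>   /* SSE:  _mm_cvttss_si32, _mm_set_ss */
    #include <emmintrin.h>   /* SSE2: _mm_cvttsd_si32, _mm_set_sd */

    static int float_to_int_trunc(float f)
    {
        return _mm_cvttss_si32(_mm_set_ss(f));   /* compiles to CVTTSS2SI */
    }

    static int double_to_int_trunc(double d)
    {
        return _mm_cvttsd_si32(_mm_set_sd(d));   /* compiles to CVTTSD2SI */
    }

With SSE2 code generation enabled, a plain C cast such as (int)d generally produces the same instructions.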
User/Source Coding Rule 17: Removing data dependences enables the out-of-order engine to extract more ILP from the code. When summing up the elements of an array, use partial sums instead of a single accumulator.
For example, to calculate z = a + b + c + d, do not use:
x = a + b;
y = x + c;
z = y + d;
Instead, use:
x = a + b;
y = c + d;
z = x + y;
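Applied to summing an array, the same idea looks like the following sketch (illustrative code, not from the manual; note that re-associating the additions may change the rounded result slightly):

    /* Four partial accumulators break the serial dependence chain, so the
       out-of-order engine can overlap the additions.  For brevity, n is
       assumed to be a multiple of 4. */
    static float sum_partial(const float *a, int n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        int i;
        for (i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }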
User/Source Coding Rule 18: Usually, math libraries take advantage of the transcendental instructions (for example, FSIN) when evaluating elementary functions. If there is no critical need to evaluate the transcendental functions using the extended 80-bit precision, applications should consider an alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision and the size of the look-up table, and by taking advantage of the parallelism of the SSE and SSE2 instructions.
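One possible shape of such a software approach is sketched below (illustrative only; the table size, range reduction, and accuracy target are application-specific decisions, and a real implementation could also evaluate several interpolations in parallel with SSE/SSE2):

    #include <math.h>

    #define TABLE_SIZE 1024                   /* precision vs. table-size trade-off */
    static float sin_table[TABLE_SIZE + 1];   /* samples of sin on [0, 2*pi] */

    static void init_sin_table(void)
    {
        const double two_pi = 6.283185307179586;
        int i;
        for (i = 0; i <= TABLE_SIZE; i++)
            sin_table[i] = (float)sin(i * (two_pi / TABLE_SIZE));
    }

    /* Approximate sin(x) for x in [0, 2*pi) by linear interpolation. */
    static float fast_sin(float x)
    {
        float pos  = x * (TABLE_SIZE / 6.283185307179586f);
        int   idx  = (int)pos;              /* truncation gives the table index */
        float frac = pos - (float)idx;      /* fractional distance between samples */
        return sin_table[idx] + frac * (sin_table[idx + 1] - sin_table[idx]);
    }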
3.8.2 floating point mode and exceptions
When working with floating-point numbers, high-speed processors frequently must deal with situations that need special handling either in hardware or in code.
3.8.2.1 floating point exceptions
The most frequent cause of performance degradation is the use of masked floating-point exception conditions, such as:
● Arithmetic overflow
● Arithmetic underflow
● Denormalized operands
Denormalized floating-point numbers impact performance in two ways:
● Directly, when they are used as operands.
● Indirectly, when they are produced as the result of an underflow situation.
If a floating-point application never underflows, the only denormals it encounters can come from floating-point constants.
User/Source Coding Rule 19: Denormalized floating-point constants should be avoided as much as possible.
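A small illustration (the constant names are hypothetical): FLT_MIN is roughly 1.18e-38, so a single-precision literal below that limit is stored as a denormal, and every use of it becomes a denormal operand.

    #include <math.h>    /* fpclassify, FP_SUBNORMAL (C99) */

    static const float tiny_threshold = 1.0e-42f;   /* avoid: stored as a denormal */
    static const float zero_threshold = 0.0f;       /* prefer: a zero of the same sign */

    static int constant_is_denormal(void)
    {
        return fpclassify(tiny_threshold) == FP_SUBNORMAL;   /* evaluates to 1 */
    }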
Denormal and arithmetic underflow exceptions can occur during the execution of x87 instructions or SSE/SSE2/SSE3 instructions. Processors based on Intel NetBurst microarchitecture handle these exceptions more efficiently when executing SSE/SSE2/SSE3 instructions and when speed is more important than complying with the IEEE standard. The following paragraphs give recommendations on how to optimize your code to reduce performance degradations related to floating-point exceptions.
3.8.2.2 dealing with floating point exceptions in x87 FPU code
Every special situation listed in Section 3.8.2.1 is costly in terms of performance. For that reason, x87 FPU code should be written to avoid these situations.
There are basically three ways to reduce the impact of overflow/underflow situations with x87 FPU code:
● Choose floating-point data types that are large enough to accommodate the results without generating arithmetic overflow and underflow exceptions (see the sketch below).
● Scale the range of operands/results to reduce as much as possible the number of arithmetic overflow/underflow situations.
● Keep intermediate results on the x87 FPU register stack until the final results have been computed and stored in memory. Overflow or underflow is less likely to happen when intermediate results are kept in the x87 FPU register stack (this is because data on the stack is stored in double extended-precision format, and overflow/underflow conditions are detected accordingly).
Denormalized floating-point constants (which are read-only and hence never change) should be avoided and replaced, if possible, with zeros of the same sign.
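For the first of these (choosing a data type large enough), a minimal sketch (the function and data are illustrative): squaring 1.0e20f overflows single precision, since FLT_MAX is about 3.4e38, but fits comfortably in double precision, so no overflow exception is raised while accumulating.

    static double sum_of_squares(const float *a, int n)
    {
        double acc = 0.0;                 /* double holds the intermediate squares */
        int i;
        for (i = 0; i < n; i++)
            acc += (double)a[i] * (double)a[i];
        return acc;
    }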
3.8.2.3 floating point exceptions in SSE/SSE2/SSE3 code
Most special situations that involve masked floating-point exceptions are handled efficiently in hardware. When a masked overflow exception occurs while executing SSE/SSE2/SSE3 code, the processor hardware can handle it without any performance penalty.
Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification, but this can incur a significant performance delay. If a programmer is willing to trade pure IEEE 754 compliance for speed, two non-IEEE-754-compliant modes are provided with the SSE/SSE2/SSE3 extensions to speed up situations where underflows and denormal inputs are frequent: FTZ mode and DAZ mode.
When the FTZ mode is enabled, an underflow result is automatically converted to a zero with the correct sign. Although this behavior is not compliant with IEEE 754, it is provided for use in applications where performance matters more than exact conformance. Since denormal results are not produced when FTZ mode is enabled, the only denormal floating-point numbers that can be encountered in FTZ mode are the ones specified as constants (read only).
The DAZ mode is provided to handle denormal source operands efficiently when running a SIMD floating-point application. When the DAZ mode is enabled, input denormals are treated as zeros with the same sign. Enabling the DAZ mode is the way to deal with denormal floating-point constants when performance is the objective.
If departing from the IEEE 754 specification is acceptable and performance is critical, run SSE/SSE2/SSE3 applications with the FTZ and DAZ modes enabled.
Note: The DAZ mode is available with both the SSE and SSE2 extensions, although the speed improvement expected from this mode is fully realized only in SSE code.
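One common way to enable both modes from C is shown below (a sketch assuming a compiler that provides the SSE/SSE3 intrinsic headers and a processor that supports DAZ; the function name is illustrative):

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE, _MM_DENORMALS_ZERO_ON */

    /* Sets the FTZ and DAZ bits in MXCSR for the calling thread.  SSE/SSE2/SSE3
       code executed afterwards flushes underflowed results to zero and treats
       denormal inputs as zero, trading IEEE 754 compliance for speed. */
    static void enable_ftz_daz(void)
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }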
3.8.3 floating point modes
On the Pentium III processor, the FLDCW instruction is an expensive operation. In early generations of Pentium 4 processors, FLDCW is improved only for the case where an application alternates between two constant values of the x87 FPU control word (FCW), such as when performing conversions to integers. In Pentium M, Intel Core Duo, and Intel Core 2 Duo processors, FLDCW is improved over previous generations.
Specifically, the optimization for FLDCW in the first two generations of Pentium 4 processors allows programmers to alternate between two constant values efficiently. For the FLDCW optimization to be effective, the two constant FCW values are only allowed to differ on the following 5 bits in the FCW:
FCW[8-9]: Precision control
FCW[10-11]: Rounding control
FCW[12]: Infinity control
If programmers need to modify other bits in the FCW (for example, the mask bits), the FLDCW instruction is still an expensive operation.
In situations where an application cycles between three (or more) constant FCW values, the FLDCW optimization does not apply, and the performance degradation occurs for each FLDCW instruction.
One solution to this problem is to choose two constant FCW values, take advantage of the optimization of the FLDCW instruction to alternate between only these two constant FCW values, and devise some means to accomplish the task that requires the third FCW value without actually changing the FCW to a third constant value. An alternative solution is to structure the code so that, for periods of time, the application alternates between only two constant FCW values. When the application later alternates between a pair of different FCW values, the performance degrades only during the transition.
It is expected that SIMD applications are unlikely to alternate between the FTZ and DAZ mode values. Consequently, the SIMD control word does not have the short latencies that the floating-point control register does. A read of the MXCSR register has a fairly long latency, and a write to the register is a serializing instruction.
There is no separate control word for single precision and double precision; both use the same modes. Notably, this applies to both the FTZ and DAZ modes.
Assembly/Compiler Coding Rule 60: Minimize changes to bits 8-12 of the floating-point control word. Changes for more than two values (each value being a combination of the following bits: precision, rounding, and infinity control, and the rest of the bits in the FCW) lead to delays that are on the order of the pipeline depth.
3.8.3.1 rounding mode
Many libraries provide routines that convert floating-point values to integers. Many of these libraries conform to ANSI C coding standards, which state that the rounding mode should be truncation. On the Pentium 4 processor, you can use the CVTTSD2SI and CVTTSS2SI instructions to convert operands with truncation without ever needing to change the rounding mode. The cost savings of using these instructions over the methods below is enough to justify using SSE and SSE2 wherever possible when truncation is involved.
For x87 floating point, the FIST instruction uses the rounding mode represented in the floating-point control word (FCW). The rounding mode is generally "round to nearest", so many compiler writers implement a change in the rounding mode in the processor in order to conform to the C and FORTRAN standards. This implementation requires changing the control word on the processor using the FLDCW instruction. For a change in the rounding, precision, and infinity bits, use the FSTCW instruction to store the floating-point control word. Then use the FLDCW instruction to change the rounding mode to truncation.
In a typical code sequence that changes the rounding mode in the FCW, an FSTCW instruction is usually followed by a load operation. The load operation from memory should be a 16-bit operand to prevent store-forwarding problems. If the load operation on the previously stored FCW word involves either an 8-bit or a 32-bit operand, this will cause a store-forwarding problem due to the mismatch of the data size between the store operation and the load operation.
To avoid store-forwarding problems, make sure that the write and read to the FCW are both 16-bit operations.
If there is more than one change to the rounding, precision, and infinity bits, and the rounding mode is not important to the result, use the algorithm in Example 3-58 to avoid the synchronization issue, the overhead of the FLDCW instruction, and having to change the rounding mode. Note that the example suffers from a store-forwarding problem, which will lead to a performance penalty. However, its performance is still better than changing the rounding, precision, and infinity bits among more than two values.
Example 3-58: Algorithm to Avoid Changing the Rounding Mode

    _fto132proc
        lea     ecx, [esp-8]
        sub     esp, 16                 ; allocate frame
        and     ecx, -8                 ; align pointer on boundary of 8
        fld     st(0)                   ; duplicate FPU stack top
        fistp   qword ptr [ecx]
        fild    qword ptr [ecx]
        mov     edx, [ecx+4]            ; high dword of integer
        mov     eax, [ecx]              ; low dword of integer
        test    eax, eax
        je      integer_QnaN_or_zero
    arg_is_not_integer_QnaN:
        fsubp   st(1), st               ; TOS = d - round(d), { st(1) = st(1) - st & pop st }
        test    edx, edx                ; what is the sign of the integer
        jns     positive                ; number is negative
        fstp    dword ptr [ecx]         ; result of subtraction
        mov     ecx, [ecx]              ; dword of diff (single precision)
        add     esp, 16
        xor     ecx, 80000000h
        add     ecx, 7fffffffh          ; if diff < 0 then decrement integer
        adc     eax, 0                  ; inc eax (add CARRY flag)
        ret
    positive:
        fstp    dword ptr [ecx]         ; result of subtraction
        mov     ecx, [ecx]              ; dword of diff (single precision)
        add     esp, 16
        add     ecx, 7fffffffh          ; if diff < 0 then decrement integer
        sbb     eax, 0                  ; dec eax (subtract CARRY flag)
        ret
    integer_QnaN_or_zero:
        test    edx, 7fffffffh
        jnz     arg_is_not_integer_QnaN
        add     esp, 16
        ret
Assembly/Compiler Coding Rule 61: Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision, and infinity bits.
3.8.3.2 precision
If single precision is adequate, use it instead of double precision. This is true because:
● Single-precision operations allow the use of longer SIMD vectors, since more single-precision data elements can fit in a SIMD register.
● If the precision control (PC) field in the x87 FPU control word is set to single precision, the floating-point divider can complete a single-precision computation much faster than either a double-precision computation or an extended double-precision computation. If the PC field is set to double precision, this will enable x87 FPU operations on double-precision data to complete faster than extended double-precision computation. These characteristics affect computations including floating-point divide and square root.
Assembly/Compiler Coding Rule 62: Minimize the number of changes to the precision mode.
3.8.3.3 improving parallelism and the use of fxch
The x87 instruction set relies on the floating-point stack for one of its operands. If the dependence graph is a tree, which means each intermediate result is used only once, and the code is scheduled carefully, it is often possible to use only operands that are on the top of the stack or in memory, and to avoid using operands that are buried under the top of the stack. When operands need to be pulled from the middle of the stack, an FXCH instruction can be used to swap the operand on the top of the stack with another entry in the stack.
The FXCH instruction can also be used to enhance parallelism. Dependence chains can be overlapped to expose more independent instructions to the hardware scheduler. An FXCH instruction may be required to effectively increase the register name space so that more operands can be simultaneously live.
However, on processors based on Intel NetBurst microarchitecture, FXCH inhibits issue bandwidth in the trace cache. It does this not only because it consumes a slot, but also because of issue-slot restrictions imposed on FXCH instructions. If the application is not bound by issue or retirement bandwidth, FXCH will have no impact.
The effective instruction window size on processors based on Intel NetBurst microarchitecture is large enough to permit instructions that are as far away as the next iteration to be overlapped. This often obviates the need to use FXCH to enhance parallelism.
The FXCH instruction should be used only when it is needed to express an algorithm or to enhance parallelism. If the size of the register name space is a problem, use of the XMM registers is recommended.
Assembly/Compiler Coding Rule 63: Use FXCH only where necessary to increase the effective name space.
This in turn allows instructions to be reordered and made available for execution in parallel. Out-of-order execution precludes the need for using FXCH to move instructions for very short distances.
3.8.4 x87 vs scalar SIMD floating point trade-offs
There are a number of differences between x87 floating-point code and scalar floating-point code (using SSE and SSE2). The following differences should drive decisions about which registers and instructions to use:
● When an input operand for a SIMD floating-point instruction contains values that are less than the representable range of the data type, a denormal exception occurs. This causes a significant performance penalty. A SIMD floating-point operation has a flush-to-zero mode in which the results will not underflow. As a result, subsequent computation will not face the performance penalty of handling denormal input operands. For example, in the case of 3D applications with low lighting levels, using flush-to-zero mode can improve performance by as much as 50% for applications with a large number of underflows.
● Scalar floating-point SIMD instructions have lower latencies than equivalent x87 instructions. Scalar SIMD floating-point multiply instructions may be pipelined, while x87 multiply instructions are not.
● Only x87 supports transcendental instructions.
● x87 supports 80-bit, double extended-precision floating point. SSE supports a maximum of 32-bit precision. SSE2 supports a maximum of 64-bit precision.
● Scalar floating-point registers may be accessed directly, avoiding FXCH and top-of-stack restrictions.
● The cost of converting from floating point to integer with truncation is significantly lower with SSE and SSE2 on processors based on Intel NetBurst microarchitecture than with either changes to the rounding mode or the sequence prescribed in Example 3-58.
Assembly/Compiler Coding Rule 64: Use SSE2 or SSE unless you need an x87 feature. Most SSE2 arithmetic operations have shorter latency than their x87 counterparts, and they eliminate the overhead associated with the management of the x87 register stack.
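In practice a compiler switch achieves this for compiled C code; the sketch below (illustrative only, using SSE2 intrinsics) simply shows what scalar SSE2 arithmetic looks like: the operands live in directly addressable XMM registers, so there is no x87 stack to manage and no FXCH is needed.

    #include <emmintrin.h>   /* SSE2 scalar double-precision intrinsics */

    static double multiply_add(double a, double b, double c)
    {
        __m128d va = _mm_set_sd(a);
        __m128d vb = _mm_set_sd(b);
        __m128d vc = _mm_set_sd(c);
        return _mm_cvtsd_f64(_mm_add_sd(_mm_mul_sd(va, vb), vc));   /* a*b + c */
    }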
3.8.4.1 scalar SSE/SSE2 performance on Intel Core Solo and Intel Core Duo processors
On Intel Core Solo and Intel Core Duo processors, the combination of improved decoding and micro-op fusion allows instructions that were formerly two, three, and four micro-ops to go through all decoders. As a result, scalar SSE/SSE2 code can match the performance of x87 code executing through two floating-point units. On Pentium M processors, scalar SSE/SSE2 code can experience a performance degradation of approximately 30% relative to x87 code executing through two floating-point units.
In code sequences that have conversions from floating point to integer, divide single-precision instructions, or any precision change, x87 code generated by a compiler typically writes data to memory in single-precision format and reads it again in order to reduce precision. Using SSE/SSE2 scalar code instead of x87 code can generate a large performance benefit on the Intel NetBurst microarchitecture and a modest benefit on Intel Core Solo and Intel Core Duo processors.
Recommendation: Use a compiler switch to generate SSE2 scalar floating-point code rather than x87 code.
When writing scalar SSE/SSE2 code, pay attention to the need for clearing the content of unused slots in an XMM register and the associated performance impact.
For example, loading data from memory with MOVSS or MOVSD causes an extra micro-op for zeroing the upper part of the XMM register.
On Pentium M, Intel Core Solo, and Intel Core Duo processors, this penalty can be avoided by using MOVLPD. However, using MOVLPD causes a performance penalty on Pentium 4 processors.
Another situation occurs when mixing single-precision and double-precision code. On processors based on Intel NetBurst microarchitecture, using CVTSS2SD has a performance penalty relative to the following alternative sequence:
    XORPS   XMM1, XMM1
    MOVSS   XMM1, XMM2
    CVTPS2PD XMM1, XMM1
On Intel Core Solo and Intel Core Duo processors, using CVTSS2SD is more desirable than the alternative sequence.
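In terms of intrinsics, the two choices roughly correspond to the following sketch (illustrative; whether the compiler emits exactly the sequence above depends on the code generator):

    #include <emmintrin.h>   /* SSE/SSE2 intrinsics (emmintrin.h includes xmmintrin.h) */

    /* Widen the low single-precision element of x to double precision. */

    /* Maps to CVTSS2SD: preferred on Intel Core Solo and Intel Core Duo. */
    static __m128d widen_with_cvtss2sd(__m128d dst, __m128 x)
    {
        return _mm_cvtss_sd(dst, x);
    }

    /* Maps to XORPS + MOVSS + CVTPS2PD: preferred on Intel NetBurst processors. */
    static __m128d widen_with_cvtps2pd(__m128 x)
    {
        return _mm_cvtps_pd(_mm_move_ss(_mm_setzero_ps(), x));
    }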
3.8.4.2 x87 floating point operations with integer operands
For processors based on Intel NetBurst microarchitecture, splitting floating-point operations (FIADD, FISUB, FIMUL, and FIDIV) that take 16-bit integer operands into two instructions (FILD and a floating-point operation) is more efficient. However, for floating-point operations with 32-bit integer operands, using FIADD, FISUB, FIMUL, and FIDIV is equally efficient compared with using separate instructions.
Assembly/Compiler Coding Rule 65: Try to use 32-bit operands rather than 16-bit operands for FILD. However, do not do so at the expense of introducing a store-forwarding problem by writing the two halves of the 32-bit memory operand separately.
3.8.4.3 x87 floating point comparison instructions
The FCOMI and FCMOV instructions should be used when performing x87 floating-point comparisons. Using the FCOM, FCOMP, and FCOMPP instructions typically requires additional instructions, such as FSTSW. The latter alternative causes more micro-ops to be decoded and should be avoided.
3.8.4.4 transcendental functions
If an application needs to emulate math functions in software for performance or other reasons (see Section 3.8.1), it may be worthwhile to inline math library calls, because the CALL and the prologue/epilogue involved with such calls can significantly affect the latency of the operations.
Note that transcendental functions are supported only in x87 floating point, not in SSE or SSE2.