From: http://hi.baidu.com/sige_online/blog/item/d8fdfffc8f0033f7fd037fac.html
Undoubtedly, the math library is the cornerstone of a graphics program and one of the keys to its running efficiency. An excellent math library can make a graphics program run much more smoothly, sometimes even orders of magnitude faster. Occasionally, replacing a single division is enough to multiply efficiency, for example implementing a vector's operator/ as one multiplication by the reciprocal per component instead of one division per component. A more advanced optimization is to use SIMD for massive computations, and that is the focus of this article: SSE/SSE2 optimization.
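As a minimal sketch of that reciprocal trick, here is the idea on a plain three-float array (the function name is made up for the example; the vector types introduced below work the same way):

void divide(float v[3], float s)
{
    // Naive form: v[0] /= s; v[1] /= s; v[2] /= s;  -- three divisions.
    // Faster form: one division plus three multiplications, because
    // floating-point division has much higher latency than multiplication.
    const float inv = 1.0f / s;
    v[0] *= inv;
    v[1] *= inv;
    v[2] *= inv;
}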
Before describing SSE/SSE2 optimization, I will first introduce a general vector/matrix library structure. Of course, OpenEXR already ships a very good implementation, Imath, which you can consult for the implementation details of a math library.
In graphics programs we often encounter vector operations that standard C++ does not support directly, such as three-dimensional space vectors. Traditional C graphics programs use the "array + macro" approach:
typedef float vector[3];
In the C++ era, it is generally encapsulated as a class:
class Vector
{
public:
    Vector(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}
private:
    float x, y, z;
};
Then add common methods, such as:
const float& X(void) const {
    return x;
}
with Y() and Z() defined the same way. Standard vector algorithms such as the inner product, outer product, normalization, and length, as well as the arithmetic operators, can all be encapsulated as member functions or free operators, for example:
Vector operator+(const Vector& a, const float& b) {
    return Vector(a.X() + b, a.Y() + b, a.Z() + b);
}
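A minimal sketch of the other routines, written here as free functions for brevity and assuming the class provides Y() and Z() accessors analogous to X():

#include <cmath>

float dot(const Vector& a, const Vector& b) {             // inner product
    return a.X() * b.X() + a.Y() * b.Y() + a.Z() * b.Z();
}

Vector cross(const Vector& a, const Vector& b) {           // outer product
    return Vector(a.Y() * b.Z() - a.Z() * b.Y(),
                  a.Z() * b.X() - a.X() * b.Z(),
                  a.X() * b.Y() - a.Y() * b.X());
}

float length(const Vector& a) {
    return std::sqrt(dot(a, a));
}

Vector normalize(const Vector& a) {
    const float inv = 1.0f / length(a);                    // reciprocal trick again
    return Vector(a.X() * inv, a.Y() * inv, a.Z() * inv);
}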
Similar math libraries can be found in open-source graphics programs such as Aqsis. However, these structures are not well suited to the SSE/SSE2 optimization we will discuss next.
SSE (Streaming SIMD Extensions) is an x86 instruction set extension that Intel introduced with the Pentium III. Before SSE, x86 floating-point operations were handled by the stack-based FPU; anyone with x86 assembly experience will remember the cumbersome FLD and FST instructions. SSE lets floating-point code access registers directly, just like integer instructions such as add eax, ebx, bypassing the annoying stack, and at the same time it brings in the SIMD concept. SIMD (Single Instruction, Multiple Data), as the name implies, lets a single instruction operate on multiple data elements at once. This architecture was once very popular on mainframes: machines that frequently run massive computations often used dedicated SIMD vector hardware to speed up processing. For example, when the same transformation must be applied to a huge amount of data, SIMD lets the data layout and the operations be organized so that the overhead of certain context switches is reduced.
SIMD support at the hardware level was, to some extent, driven by the needs of games: more and more 3D games involve large numbers of vector operations, and ordinary floating-point optimization can no longer satisfy this kind of parallel workload, whereas supporting SIMD directly in the instruction set further simplifies the optimization of vector operations and improves execution efficiency. An SSE instruction such as addps performs four 32-bit floating-point additions in parallel with a latency of only 4 cycles, whereas the old FADD instruction already needs 3 cycles of latency for a single 32-bit single-precision addition, and that does not even count the latency of store instructions such as FST. (For details, refer to the instruction execution unit table below.)
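To illustrate what addps buys, here is a hedged sketch using MSVC/ICL-style inline assembly (__asm); the function and parameter names are made up, and the arrays a, b, and r are assumed to be 16-byte aligned:

void add4_asm(const float* a, const float* b, float* r)
{
    __asm {
        mov    eax, a          ; load the three pointers into registers
        mov    ebx, b
        mov    ecx, r
        movaps xmm0, [eax]     ; four floats from a
        addps  xmm0, [ebx]     ; four additions in a single instruction
        movaps [ecx], xmm0     ; four results to r
    }
}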
Apparently, SSE can bring great optimization to graphics programs, far more than the integer-based MMX or 3DNow!, which packs only two single-precision floats. However, SSE imposes demanding requirements on data organization. To unleash the full power of SSE, vector data must be aligned to 16 bytes. If we use an ordinary three-component vector, that means wasting a quarter of the storage in exchange for speed. Of course, the extra 4 bytes can be put to many uses, but you must be very careful, because every SSE operation is applied to all four components at the same time.
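For example, one common approach is to pad the vector to four components and force 16-byte alignment; a minimal sketch using MSVC/ICL syntax (GCC would use __attribute__((aligned(16))) instead, and modern C++ has alignas(16)):

__declspec(align(16)) struct Vector4
{
    float x, y, z, w;   // w is the padding component; remember that every
                        // SSE operation touches it along with x, y, z
};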
To use SSE, first check that your compiler supports the instruction set. VC6 SP6, VC.NET, VC.NET 2003, ICL, GCC, and NASM all support SSE. I recommend ICL, the Intel compiler, which optimizes best and generates the most compact and efficient code. There are two ways to use SSE. The first is to write the assembly code directly, which is difficult and requires some assembly experience. The second is to use SSE intrinsics, pseudo function calls that let C/C++ code use SSE instructions directly. In the core of heavy graphics computation, such as a raytracing kernel, hand-written assembly brings out SSE's advantages best, since it can be mixed freely with ordinary x86 instructions and its parallelism fully exploited. In most other cases, intrinsics are recommended: they are highly readable, and the compiler ultimately replaces each call with the corresponding SSE instruction, so no inline assembly is needed and code efficiency is still guaranteed.
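For example, adding two four-float vectors with intrinsics might look like the following sketch (the header and intrinsic names are standard SSE; the function and variable names are made up, and the arrays are assumed to be 16-byte aligned):

#include <xmmintrin.h>                   // SSE intrinsics

void add4_intrin(const float* a, const float* b, float* r)
{
    __m128 va = _mm_load_ps(a);          // aligned load of four floats
    __m128 vb = _mm_load_ps(b);
    __m128 vr = _mm_add_ps(va, vb);      // the compiler emits a single addps
    _mm_store_ps(r, vr);                 // aligned store of four results
}

The compiler keeps these values in XMM registers and schedules the intrinsics together with the surrounding code, which is why intrinsics usually come close to hand-written assembly in performance.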