Using the SSE Instruction Set to Improve Floating-Point Performance


SSE is a common x86 instruction set extension that dates back to the Pentium III era (SSE2 followed with the P4). Intel later released SSE2, SSE3, and SSE4 in succession. There was no Intel SSE5: the SSE5 that had been planned never shipped in that form, and Intel instead developed the new AVX instruction set as the successor to SSE. There is not much information about AVX yet and it is not widely used; after all, few CPUs support it, and my T4400, for example, does not.
Enough preamble; let's try something practical. Floating-point operations are often considerably slower than integer operations, and many fields require a large amount of floating-point computation. The CPU then becomes a significant bottleneck. To improve floating-point performance, we have two methods:
1. Convert floating point to integer: turn the original floating-point computation into an integer computation through a mathematical transformation.
2. Use the SSE instruction set: this method is the focus of this article, but method 1 will also be used.
Take the common conversion of a color image to grayscale as an example.
According to color theory, converting an RGB pixel to a gray level is really a 1*3 matrix multiplied by a 3*1 matrix. Put plainly, the process is as follows:
Let the original pixel be p0 = (r0, g0, b0) and compute s = r0 * 0.3 + g0 * 0.6 + b0 * 0.1; the new gray pixel is then p1 = (s, s, s).
Computing s takes three floating-point multiplications. We can use method 1 to temporarily turn these floating-point operations into integer operations (multiply all the coefficients by 10), that is,
s = (r0 * 3 + g0 * 6 + b0 * 1) / 10, where the division by 10 is done once at the end.
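To make the transformation concrete for a single pixel, here is a minimal sketch that puts the floating-point form and the integer form side by side. The function names gray_float and gray_int are made up for this illustration and do not appear in the original program. Because both forms truncate, their results may occasionally differ by one gray level.

```c
#include <stdint.h>

/* Floating-point form: s = 0.3*r + 0.6*g + 0.1*b, truncated to a byte. */
uint8_t gray_float(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)(0.3 * r + 0.6 * g + 0.1 * b);
}

/* Integer form (method 1): coefficients scaled by 10, one division at the end. */
uint8_t gray_int(uint8_t r, uint8_t g, uint8_t b)
{
    uint32_t sum = 3u * r + 6u * g + 1u * b;  /* at most 10 * 255 = 2550 */
    return (uint8_t)(sum / 10u);              /* integer division, result 0..255 */
}
```

Since the scaled coefficients add up to 10, the sum never exceeds 2550, so the result always fits in one byte.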
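The code for the whole image is as follows:
#include <windows.h>  // BYTE, PBYTE, DWORD

// Converts a 24-bit BGR image to grayscale in place, using integer arithmetic only.
// width, height and bitCount are not used by this routine.
void doProcess(PBYTE pIn, DWORD size, DWORD width, DWORD height, DWORD bitCount)
{
    DWORD dwRGBSum = 0;
    for (DWORD dwIndex = 0; dwIndex + 3 <= size; dwIndex += 3)
    {
        dwRGBSum =
            1 * pIn[dwIndex + 0] +  // Blue
            6 * pIn[dwIndex + 1] +  // Green
            3 * pIn[dwIndex + 2];   // Red
        dwRGBSum /= 10;             // integer division; dividing by 10.0 would go through floating point
        pIn[dwIndex + 0] = (BYTE)dwRGBSum;
        pIn[dwIndex + 1] = (BYTE)dwRGBSum;
        pIn[dwIndex + 2] = (BYTE)dwRGBSum;
    }
}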
Now let's use SSE for further optimization.
SSE operates on 128 bits at a time, that is, four single-precision floating-point numbers. We can therefore perform four of the division operations at once. The core data type is __m128, which is a union. For details, see the source code.

In SSE, the C intrinsic for packed floating-point multiplication is _mm_mul_ps. For usage instructions, refer to MSDN or the PDF manuals on Intel's official website.
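As a small illustration of how _mm_mul_ps is typically used, the sketch below multiplies four floats by 0.1f in one instruction. The helper name scale4_by_tenth is hypothetical, and the use of _mm_set1_ps and _mm_storeu_ps to build the multiplier and read back the result is an assumption of this sketch, not something taken from the original program. The detail worth noticing is the argument order of _mm_set_ps, which fills the lanes from highest to lowest.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_set_ps, _mm_mul_ps, ... */

/* Hypothetical helper: multiplies four floats by 0.1f with a single packed multiply. */
void scale4_by_tenth(float a, float b, float c, float d, float out[4])
{
    /* _mm_set_ps takes its arguments from the highest lane down to the lowest,
       so a ends up in lane 3 and d in lane 0. */
    __m128 values  = _mm_set_ps(a, b, c, d);
    __m128 tenth   = _mm_set1_ps(0.1f);          /* 0.1f in all four lanes */
    __m128 product = _mm_mul_ps(values, tenth);  /* four multiplications at once */

    /* Unaligned store: out[0] receives lane 0 (d * 0.1f), out[3] receives lane 3 (a * 0.1f). */
    _mm_storeu_ps(out, product);
}
```

Calling scale4_by_tenth(10, 20, 30, 40, out) leaves roughly 4, 3, 2, 1 in out[0] through out[3], which shows why the results have to be read back in reverse order.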

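#include <windows.h>    // BYTE, PBYTE, DWORD, UINT16
#include <xmmintrin.h>  // __m128, _mm_set_ps, _mm_set1_ps, _mm_mul_ps

void doProcess(PBYTE pIn, DWORD size, DWORD width, DWORD height, DWORD bitCount)
{
    UINT16 dwRGBSum0 = 0;
    UINT16 dwRGBSum1 = 0;
    UINT16 dwRGBSum2 = 0;
    UINT16 dwRGBSum3 = 0;

    // vec is not defined in the original listing; it is assumed here to hold 0.1f
    // in all four lanes, so that one multiplication replaces four divisions by 10.
    const __m128 vec = _mm_set1_ps(0.1f);

    // Four pixels (12 bytes) per iteration; a tail of fewer than four pixels is left untouched.
    for (DWORD idx = 0; idx + 12 <= size; idx += 12)
    {
        dwRGBSum0 =
            1 * pIn[idx + 0] +   // Blue
            6 * pIn[idx + 1] +   // Green
            3 * pIn[idx + 2];    // Red

        dwRGBSum1 =
            1 * pIn[idx + 3] +   // Blue
            6 * pIn[idx + 4] +   // Green
            3 * pIn[idx + 5];    // Red

        dwRGBSum2 =
            1 * pIn[idx + 6] +   // Blue
            6 * pIn[idx + 7] +   // Green
            3 * pIn[idx + 8];    // Red

        dwRGBSum3 =
            1 * pIn[idx + 9] +   // Blue
            6 * pIn[idx + 10] +  // Green
            3 * pIn[idx + 11];   // Red

        // _mm_set_ps fills the lanes from high to low, so dwRGBSum0 lands in lane 3.
        __m128 old = _mm_set_ps(dwRGBSum0, dwRGBSum1, dwRGBSum2, dwRGBSum3);
        __m128 ret = _mm_mul_ps(old, vec);

        // m128_f32 is the MSVC-specific member of the __m128 union.
        pIn[idx + 0] = pIn[idx + 1] = pIn[idx + 2] = (BYTE)ret.m128_f32[3];
        pIn[idx + 3] = pIn[idx + 4] = pIn[idx + 5] = (BYTE)ret.m128_f32[2];
        pIn[idx + 6] = pIn[idx + 7] = pIn[idx + 8] = (BYTE)ret.m128_f32[1];
        pIn[idx + 9] = pIn[idx + 10] = pIn[idx + 11] = (BYTE)ret.m128_f32[0];
    }
}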
The code looks much more complex than the original, but the principle is very simple: the original processes one pixel per iteration, while this version processes four at a time, which improves performance considerably.

This code can be optimized further, because besides floating-point processing SSE also provides integer processing; the corresponding data type is __m128i. I will not go into the details here.
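One possible direction for that integer route is sketched below. It makes several assumptions that are not in the original article. The function name gray8_sse2 is made up. The channels of eight pixels are assumed to be already unpacked into separate 16-bit arrays, which differs from the interleaved BGR bytes used above. The division by 10 is replaced by the multiply-and-shift identity x / 10 == (x * 52429) >> 19, which holds for every unsigned 16-bit x. The integer intrinsics used here belong to SSE2 and are declared in emmintrin.h.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics: __m128i, _mm_mullo_epi16, ... */
#include <stdint.h>

/* Hypothetical helper: gray[i] = (3*r[i] + 6*g[i] + 1*b[i]) / 10 for eight pixels.
   Each input holds one channel value (0..255) per 16-bit element. */
void gray8_sse2(const uint16_t r[8], const uint16_t g[8],
                const uint16_t b[8], uint16_t gray[8])
{
    __m128i vr = _mm_loadu_si128((const __m128i *)r);
    __m128i vg = _mm_loadu_si128((const __m128i *)g);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Weighted sum in eight 16-bit lanes; the maximum is 2550, so nothing overflows. */
    __m128i sum = _mm_add_epi16(
        _mm_add_epi16(_mm_mullo_epi16(vr, _mm_set1_epi16(3)),
                      _mm_mullo_epi16(vg, _mm_set1_epi16(6))),
        vb);

    /* Division by 10 without a divide instruction: _mm_mulhi_epu16 yields
       (sum * 52429) >> 16, and the extra shift by 3 completes the >> 19. */
    __m128i magic = _mm_set1_epi16((short)0xCCCD);  /* 52429 as an unsigned 16-bit pattern */
    __m128i quot  = _mm_srli_epi16(_mm_mulhi_epu16(sum, magic), 3);

    _mm_storeu_si128((__m128i *)gray, quot);
}
```

This handles eight pixels per call instead of four and never leaves the integer domain; unpacking the interleaved bytes into this layout is left out of the sketch.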

The performance test results for this optimization are quite clear. For a 2560*1600 image with 24-bit color depth, the original program needs nearly [value missing] ms for the conversion; after optimization only [value missing] ms is needed, which is nearly twice as fast.

This article is from the "Kevx's Blog" Blog
