Using the SSE Instruction Set to Improve Floating-Point Performance

Last Update:2013-11-25 Source: Internet

Author: User

SSE is a widely used x86 SIMD instruction set that dates back to the Pentium III era. Intel later released SSE2, SSE3, and SSE4 in succession. There is no Intel SSE5: that name belonged to an extension proposed by AMD, while Intel developed the new AVX instruction set as the successor to SSE. There is not much material about AVX yet and it is not widely used, since many CPUs, such as my T4400, do not support it.
Enough background; let's try something practical. Floating-point operations are generally slower than integer operations, and many fields need a large number of them, so the CPU becomes a significant bottleneck. There are two ways to improve floating-point performance:
1. Convert floating point to integer: replace the floating-point arithmetic with integer arithmetic through a mathematical transformation.
2. Use the SSE instruction set: this is the focus of this article, but method 1 will be used as well.
Take the common conversion of a color image to grayscale as an example.
In color theory, converting an RGB pixel to a gray level amounts to multiplying a 1*3 matrix by a 3*1 matrix. Put plainly, the process is as follows:
Let the original pixel be p0 = (r0, g0, b0) and compute s = r0 * 0.3 + g0 * 0.6 + b0 * 0.1; the new gray pixel is then p1 = (s, s, s).
Computing s takes three floating-point multiplications. Using method 1, we can temporarily turn them into integer operations by multiplying every weight by 10, that is,
s = (r0 * 3 + g0 * 6 + b0 * 1) / 10, where the division by 10 is performed once at the end.
The code is as follows:
#include <windows.h>  // PBYTE, DWORD, BYTE

// pIn holds 24-bit pixels stored as B, G, R; size is the buffer length in bytes.
void doProcess(PBYTE pIn, DWORD size, DWORD width, DWORD height, DWORD bitCount)
{
    DWORD dwRGBSum = 0;
    for (DWORD dwIndex = 0; dwIndex + 2 < size; dwIndex += 3)
    {
        dwRGBSum =
            1 * pIn[dwIndex + 0] +  // Blue
            6 * pIn[dwIndex + 1] +  // Green
            3 * pIn[dwIndex + 2];   // Red
        dwRGBSum /= 10;             // integer division, no floating point involved
        pIn[dwIndex + 0] = (BYTE)dwRGBSum;
        pIn[dwIndex + 1] = (BYTE)dwRGBSum;
        pIn[dwIndex + 2] = (BYTE)dwRGBSum;
    }
}
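To make the effect of method 1 concrete, here is a minimal sketch that computes the gray level of a single pixel both ways. The helper names gray_float and gray_int are hypothetical and not part of the original program; the portable types from <stdint.h> stand in for BYTE and DWORD.

```c
#include <stdint.h>

/* Floating-point reference: s = 0.3*r + 0.6*g + 0.1*b, truncated to a byte. */
static uint8_t gray_float(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)(0.3f * r + 0.6f * g + 0.1f * b);
}

/* Integer version: weights scaled by 10, one integer division at the end. */
static uint8_t gray_int(uint8_t r, uint8_t g, uint8_t b)
{
    uint32_t sum = 3u * r + 6u * g + 1u * b;  /* at most 10 * 255 = 2550 */
    return (uint8_t)(sum / 10u);
}
```

The two versions can differ by one gray level for some inputs, because the floating-point weights are not exactly representable and the result is truncated.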
Now let's use SSE to optimize further.
SSE operates on 128 bits at a time, that is, four single-precision floating-point numbers. We can therefore perform the division by 10 for four pixels in one step, as a multiplication by 0.1. The core data type is __m128, which MSVC defines as a union. For details, see the source code.

In SSE, the C intrinsic for packed floating-point multiplication is _mm_mul_ps. For usage details, refer to MSDN or the PDF manuals on Intel's official website.
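As a minimal illustration of _mm_mul_ps on its own, the following sketch multiplies four floats by 0.1 with a single SSE multiplication. The helper name scale4_by_tenth is hypothetical; the unaligned load and store intrinsics are used so that the example does not depend on the MSVC-specific members of __m128.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */

/* Multiply four floats by 0.1 at once (hypothetical helper for illustration). */
static void scale4_by_tenth(const float in[4], float out[4])
{
    __m128 v     = _mm_loadu_ps(in);      /* load 4 floats, no alignment required */
    __m128 tenth = _mm_set1_ps(0.1f);     /* 0.1 in all four lanes */
    __m128 r     = _mm_mul_ps(v, tenth);  /* four multiplications in one instruction */
    _mm_storeu_ps(out, r);                /* write the 4 results back to memory */
}
```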

#include <windows.h>    // PBYTE, DWORD, BYTE, UINT16
#include <xmmintrin.h>  // SSE intrinsics

void doProcess(PBYTE pIn, DWORD size, DWORD width, DWORD height, DWORD bitCount)
{
    UINT16 dwRGBSum0 = 0;
    UINT16 dwRGBSum1 = 0;
    UINT16 dwRGBSum2 = 0;
    UINT16 dwRGBSum3 = 0;

    // vec is not defined in the original listing; it is assumed here to hold
    // 0.1 in all four lanes, so the multiplication replaces the division by 10.
    const __m128 vec = _mm_set1_ps(0.1f);

    // Four pixels (12 bytes) per iteration; leftover pixels are not handled.
    for (DWORD idx = 0; idx + 11 < size; idx += 12)
    {
        dwRGBSum0 =
            1 * pIn[idx + 0] +   // Blue
            6 * pIn[idx + 1] +   // Green
            3 * pIn[idx + 2];    // Red

        dwRGBSum1 =
            1 * pIn[idx + 3] +   // Blue
            6 * pIn[idx + 4] +   // Green
            3 * pIn[idx + 5];    // Red

        dwRGBSum2 =
            1 * pIn[idx + 6] +   // Blue
            6 * pIn[idx + 7] +   // Green
            3 * pIn[idx + 8];    // Red

        dwRGBSum3 =
            1 * pIn[idx + 9] +   // Blue
            6 * pIn[idx + 10] +  // Green
            3 * pIn[idx + 11];   // Red

        // _mm_set_ps takes its arguments from the highest lane to the lowest,
        // so dwRGBSum0 ends up in lane 3 and dwRGBSum3 in lane 0.
        __m128 old = _mm_set_ps(dwRGBSum0, dwRGBSum1, dwRGBSum2, dwRGBSum3);
        __m128 ret = _mm_mul_ps(old, vec);

        // m128_f32 is the MSVC-specific float view of __m128.
        pIn[idx + 0] = pIn[idx + 1] = pIn[idx + 2] = (BYTE)ret.m128_f32[3];
        pIn[idx + 3] = pIn[idx + 4] = pIn[idx + 5] = (BYTE)ret.m128_f32[2];
        pIn[idx + 6] = pIn[idx + 7] = pIn[idx + 8] = (BYTE)ret.m128_f32[1];
        pIn[idx + 9] = pIn[idx + 10] = pIn[idx + 11] = (BYTE)ret.m128_f32[0];
    }
}
The code looks much more complex than the original, but the principle is simple: the original handles one pixel per iteration, while this version handles four, with a single SSE multiplication replacing four divisions. This improves performance considerably.
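The four-at-a-time loop only covers buffers whose pixel count is a multiple of four. The sketch below is one possible portable variant, not the author's code: it uses <stdint.h> types instead of the Windows ones, reads the results back with _mm_storeu_ps instead of the MSVC-specific m128_f32 member, and finishes the remaining zero to three pixels with the scalar loop. The names weighted_sum, gray_scalar, and gray_sse are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Weighted sum for one B, G, R pixel; at most 10 * 255 = 2550. */
static uint16_t weighted_sum(const uint8_t *p)
{
    return (uint16_t)(1 * p[0] + 6 * p[1] + 3 * p[2]);
}

/* Scalar version: one pixel per iteration, integer division by 10. */
static void gray_scalar(uint8_t *px, size_t size)
{
    for (size_t i = 0; i + 2 < size; i += 3) {
        uint8_t s = (uint8_t)(weighted_sum(px + i) / 10);
        px[i] = px[i + 1] = px[i + 2] = s;
    }
}

/* SSE version: four pixels per iteration, then a scalar tail. */
static void gray_sse(uint8_t *px, size_t size)
{
    const __m128 tenth = _mm_set1_ps(0.1f);  /* multiply by 0.1 instead of dividing by 10 */
    size_t i = 0;

    for (; i + 11 < size; i += 12) {
        /* _mm_setr_ps fills lanes in memory order: lane 0 is the first pixel. */
        __m128 sums = _mm_setr_ps((float)weighted_sum(px + i),
                                  (float)weighted_sum(px + i + 3),
                                  (float)weighted_sum(px + i + 6),
                                  (float)weighted_sum(px + i + 9));
        float out[4];
        _mm_storeu_ps(out, _mm_mul_ps(sums, tenth));

        for (int k = 0; k < 4; ++k) {
            uint8_t s = (uint8_t)out[k];     /* truncate to a gray level */
            px[i + 3 * k] = px[i + 3 * k + 1] = px[i + 3 * k + 2] = s;
        }
    }

    gray_scalar(px + i, size - i);           /* 0 to 3 leftover pixels */
}
```

Calling gray_sse on a 21-byte buffer (7 pixels), for example, processes the first four pixels with SSE and the last three with the scalar loop.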

This code can be optimized further: besides floating-point operations, SSE also provides integer operations, for which the corresponding data type is __m128i. I will not go into that here.
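As a rough idea of what an integer variant could look like (this is an assumption for illustration, not the author's code), the sketch below divides eight 16-bit weighted sums by 10 in one step using SSE2 integer intrinsics on __m128i. It relies on the identity floor(x / 10) == (x * 6554) >> 16, which holds for x below 16384 and therefore covers the 0 to 2550 range of the sums. The helper name div10_x8 is hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics: __m128i, _mm_mulhi_epu16, ... */

/* Divide eight unsigned 16-bit values by 10 at once (valid for inputs < 16384). */
static void div10_x8(const uint16_t in[8], uint16_t out[8])
{
    __m128i v     = _mm_loadu_si128((const __m128i *)in);  /* load 8 x uint16 */
    __m128i magic = _mm_set1_epi16(6554);                  /* about 65536 / 10 */
    __m128i q     = _mm_mulhi_epu16(v, magic);             /* high 16 bits of v * 6554 */
    _mm_storeu_si128((__m128i *)out, q);                   /* store 8 quotients */
}
```

Handling eight sums per instruction instead of four, and skipping the integer-to-float conversion, is where such a variant might gain over the floating-point version; whether it does in practice would need to be measured.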

The performance tests show a clear gain from this optimization. Converting a 2560*1600 image with 24-bit color depth took nearly [...] ms with the original program; after optimization it takes only [...] ms, which is nearly twice as fast.

This article is from the "Kevx's Blog" blog.

This article is an English version of an article originally written in Chinese on aliyun.com.
