Sorting out single-instruction, multiple-data (SIMD) computing in high-level languages

Source: Internet
Author: User

Hou Sisong compiled a very good article titled "Single-instruction, multiple-data (SIMD) computing in high-level languages". I have reorganized its content here to make it easier to follow.

The instruction set architecture of a traditional computer mainly implements basic functions such as arithmetic and logic operations, conditional branches, and I/O access. In many fields, however, these basic instructions are not enough for compute-intensive work, which is why DSPs, stream processors, and other high-performance processors emerged. In the 1990s, Intel developed MMX (a multimedia instruction set extension) to accelerate compute-intensive multimedia applications. Later, x86 processors gained SSE, AMD's 3DNow!(+), SSE2, SSE3, SSSE3 (initially Intel only), SSE4.1, SSE4.2, SSE4a (AMD only), AVX, and other more advanced SIMD instruction set extensions. SIMD stands for "single instruction, multiple data": one instruction processes multiple data elements independently and in parallel. For example, four pairs of integers can be added at the same time with a single instruction that takes one instruction cycle or even less.
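The four-pairs-at-once addition mentioned above can be sketched with the SSE2 intrinsics from `<emmintrin.h>`. This is only a minimal illustration: the function name `add4_i32` is hypothetical, and the scalar fallback is an assumption added so that the snippet also compiles on targets without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Adds four pairs of 32-bit integers: dst[i] = a[i] + b[i] for i = 0..3.
   add4_i32 is a hypothetical helper name used only for this sketch.
   Inputs are assumed small enough that no sum overflows. */
void add4_i32(const int32_t a[4], const int32_t b[4], int32_t dst[4])
{
#if defined(__SSE2__)
    /* Unaligned loads, so the arrays need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* A single packed add handles all four lanes independently. */
    __m128i vsum = _mm_add_epi32(va, vb);
    _mm_storeu_si128((__m128i *)dst, vsum);
#else
    /* Assumed fallback for targets without SSE2: plain scalar loop. */
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

Calling `add4_i32` with `{1, 2, 3, 4}` and `{10, 20, 30, 40}` fills `dst` with `{11, 22, 33, 44}`.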

Besides x86, other traditional processor architectures have also introduced SIMD instruction set extensions, such as AltiVec on the Power architecture, the NEON technology introduced with the ARMv7 architecture, and MIPS-3D on the MIPS architecture.

However, many low-end processors have no dedicated SIMD instruction set. In that case, we can use certain algorithms to emulate the effect of SIMD at a relatively small cost. The following examples assume a 32-bit processor.

Scenario 1: logically shift a group of 8-bit integers to the right by one bit. The straightforward approach is:

#include <stdio.h>

extern void naive1(void);

void naive1(void)
{
    // Initialize
    __attribute__((aligned(4))) unsigned char buffer[4] = { 255, 128, 11, 6 };

    // Calculate: shift each byte separately
    for(int i = 0; i < 4; i++)
        buffer[i] >>= 1;

    // Output results
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X  ", buffer[i]);
    puts("");
}

The SIMD-like version of the same computation is:

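#include <stdio.h>

extern void simd_test1(void);

void simd_test1(void)
{
    // Initialize: bytes 255, 128, 11, 6 packed into one 32-bit word
    unsigned buffer = 0x060b80ff;

    // Calculate: clear the lowest bit of every byte, then shift the whole word
    buffer &= 0xfefefefe;
    buffer >>= 1;

    // Output results
    printf("The result is: 0x%.8X\n", buffer);
}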

As we can see, the naive version loads, shifts, and stores each of the four bytes separately. The SIMD-like version needs only one 4-byte access followed by two arithmetic-logic operations, which is much leaner. The AND with 0xfefefefe clears the lowest bit of every byte first, so that no bit slides into the neighboring byte during the shift.
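In practice the data usually sits in a byte buffer rather than in a 32-bit literal. The following is a minimal sketch, under the assumption that the shift count `n` is between 0 and 7, of how the packed shift might be applied to such a buffer. The names `shr_u8x4` and `shr_bytes4` are hypothetical, and `memcpy` is used for the 4-byte load and store to avoid alignment and aliasing problems.

```c
#include <stdint.h>
#include <string.h>

/* Logical right shift of four packed unsigned bytes by n bits (0 <= n <= 7).
   shr_u8x4 is a hypothetical name for this sketch. */
uint32_t shr_u8x4(uint32_t v, unsigned n)
{
    /* Per-byte mask that clears the low n bits, replicated into all four
       lanes. For n = 1 this is 0xFE, giving 0xFEFEFEFE. */
    uint32_t lane_mask = ((0xFFu << n) & 0xFFu) * 0x01010101u;
    /* The cleared bits would otherwise slide into the top of the lane below. */
    return (v & lane_mask) >> n;
}

/* Applies the packed shift to a 4-byte buffer in place. */
void shr_bytes4(unsigned char buf[4], unsigned n)
{
    uint32_t v;
    memcpy(&v, buf, sizeof v);   /* one 4-byte load */
    v = shr_u8x4(v, n);
    memcpy(buf, &v, sizeof v);   /* one 4-byte store */
}
```

Because every lane is treated identically, the result does not depend on byte order. Without the mask, `0x060b80ff >> 1` would give `0x0305C07F`: the low bit of 0x0B leaks into the neighboring byte, which becomes 0xC0 instead of 0x40.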

The same idea naturally extends to the left shift. Next, we extend it to the arithmetic right shift, which is not covered in the original article. First, the naive algorithm:

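#include <stdio.h>

extern void naive1(void);

void naive1(void)
{
    // Initialize (signed char, because plain char may be unsigned on some targets)
    __attribute__((aligned(4))) signed char buffer[4] = { -1, -128, 11, 6 };

    // Calculate: right-shifting a negative value is implementation-defined in C;
    // common compilers perform an arithmetic shift
    for(int i = 0; i < 4; i++)
        buffer[i] >>= 1;

    // Output results (cast so that negative values print as two hex digits)
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X  ", (unsigned char)buffer[i]);
    puts("");
}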

Let's take a look at the SIMD optimized version:

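#include <stdio.h>

extern void simd_test1(void);

void simd_test1(void)
{
    // Initialize: bytes -1, -128, 11, 6 packed into one 32-bit word
    unsigned buffer = 0x060b80ff;

    // Calculate: save the sign bits, shift logically, then restore the sign bits
    unsigned mask = buffer & 0x80808080;
    buffer &= 0xfefefefe;
    buffer >>= 1;
    buffer |= mask;

    // Output results
    printf("The result is: 0x%.8X\n", buffer);
}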

An arithmetic right shift must preserve the sign bit. We therefore use a mask variable to save the sign bits of the original data, and then OR them back into the shifted result. This takes two more steps than the logical right shift, but even so, the overhead is much lower than four separate memory accesses.
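As a sanity check, the packed arithmetic right shift can be compared with a per-byte reference for every possible byte value. The sketch below also includes the left-shift variant, which clears each byte's top bit before shifting. All function names are hypothetical, and the reference computes floor(s / 2) with ordinary division so that it does not rely on shifting negative numbers.

```c
#include <stdint.h>

/* Arithmetic right shift by 1 of four packed signed bytes
   (sar1_s8x4 is a hypothetical name). */
uint32_t sar1_s8x4(uint32_t v)
{
    uint32_t sign = v & 0x80808080u;        /* keep each lane's sign bit */
    return ((v & 0xFEFEFEFEu) >> 1) | sign; /* logical shift, restore sign */
}

/* Left shift by 1 of four packed bytes: clear each lane's top bit first so
   that it cannot move into the lane above (shl1_u8x4 is a hypothetical name). */
uint32_t shl1_u8x4(uint32_t v)
{
    return (v & 0x7F7F7F7Fu) << 1;
}

/* Portable per-byte reference: floor(s / 2) for a signed 8-bit value,
   written without shifting a negative number. */
static int floor_half(int s)
{
    return (s >= 0) ? s / 2 : -((-s + 1) / 2);
}

/* Returns 1 if both packed routines match the per-byte reference for every
   byte value (replicated into all four lanes), 0 otherwise. */
int check_packed_shifts(void)
{
    for (unsigned x = 0; x < 256; x++) {
        uint32_t v = x * 0x01010101u;               /* same byte in all lanes */
        int s = (x < 128) ? (int)x : (int)x - 256;  /* two's-complement value */
        uint32_t want_sar = ((uint32_t)floor_half(s) & 0xFFu) * 0x01010101u;
        uint32_t want_shl = ((x << 1) & 0xFFu) * 0x01010101u;
        if (sar1_s8x4(v) != want_sar || shl1_u8x4(v) != want_shl)
            return 0;
    }
    return 1;
}
```

For the sample word `0x060b80ff`, `sar1_s8x4` returns `0x0305C0FF` and `shl1_u8x4` returns `0x0C1600FE`.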

Scenario 2:

Compute the complement of unsigned 8-bit integers, that is, 255 - x, where x is an unsigned 8-bit integer. The naive algorithm is:

#include <stdio.h>

extern void naive2(void);

void naive2(void)
{
    // Initialize
    __attribute__((aligned(4))) unsigned char buffer[4] = { 255, 128, 11, 6 };

    // Calculate: complement each byte separately
    for(int i = 0; i < 4; i++)
        buffer[i] = 255 - buffer[i];

    // Output results
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X  ", buffer[i]);
    puts("");
}

Optimized version:

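#include <stdio.h>

extern void simd_test2(void);

void simd_test2(void)
{
    // Initialize: bytes 255, 128, 11, 6 packed into one 32-bit word
    unsigned buffer = 0x060b80ff;

    // Calculate: one bitwise NOT complements all four bytes
    buffer = ~buffer;

    // Output results
    printf("The result is: 0x%.8X\n", buffer);
}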

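This algorithm is relatively easy to understand: for an unsigned byte x in the range 0 to 255, 255 - x is exactly the bitwise NOT of x, so a single NOT on the 32-bit word complements all four bytes.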
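Applied to a buffer of arbitrary length, the same trick might look like the sketch below. The function name `complement_bytes` is hypothetical. Four bytes are processed per step, and a scalar loop handles any leftover bytes at the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replaces every byte x in buf with 255 - x, four bytes per step.
   complement_bytes is a hypothetical name for this sketch. */
void complement_bytes(unsigned char *buf, size_t n)
{
    size_t i = 0;
    /* Main loop: 255 - x equals ~x within one byte, so a single NOT on a
       32-bit word complements four bytes with no cross-lane interaction. */
    for (; i + 4 <= n; i += 4) {
        uint32_t v;
        memcpy(&v, buf + i, sizeof v);
        v = ~v;
        memcpy(buf + i, &v, sizeof v);
    }
    /* Tail: fewer than four bytes left, handle them one at a time. */
    for (; i < n; i++)
        buf[i] = (unsigned char)(255 - buf[i]);
}
```

For the six bytes `{255, 128, 11, 6, 0, 1}`, the buffer becomes `{0, 127, 244, 249, 255, 254}`.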

Scenario 3:

Calculate the arithmetic mean of two unsigned 8-bit integers. When the sum of the two numbers is odd, there are two cases: round the result down, or round it up.

Let's first look at the round-down case, that is, (x + y) / 2:

The original algorithm is:

#include <stdio.h>

extern void naive3(void);

void naive3(void)
{
    // Initialize
    __attribute__((aligned(4))) unsigned char buffer1[4] = { 255, 128, 11, 33 };
    __attribute__((aligned(4))) unsigned char buffer2[4] = { 100, 129, 19, 55 };
    __attribute__((aligned(4))) unsigned char dstBuffer[4];

    // Calculate: widen to unsigned so that the sum cannot overflow a byte
    for(int i = 0; i < 4; i++)
        dstBuffer[i] = ((unsigned)buffer1[i] + (unsigned)buffer2[i]) / 2;

    // Output results
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X  ", dstBuffer[i]);
    puts("");
}

The SIMD-like optimization algorithm is as follows:

#include <stdio.h>

extern void simd_test3(void);

void simd_test3(void)
{
    // Initialize: the same byte values as above, packed into 32-bit words
    unsigned buffer1 = 0x210b80ff;
    unsigned buffer2 = 0x37138164;
    unsigned dstBuffer;

    // Calculate: per-byte average, rounded down
    dstBuffer = (buffer1 & buffer2) + (((buffer1 ^ buffer2) & 0xfefefefe) >> 1);

    // Output results
    printf("The result is: 0x%.8X\n", dstBuffer);
}

This very clever algorithm comes from FFmpeg.
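The formula rests on the identity x + y = 2 * (x & y) + (x ^ y): the AND collects the bit positions that produce a carry, and the XOR collects the positions where exactly one operand has a bit set. Halving both sides gives (x & y) + ((x ^ y) >> 1), and since each per-byte result is at most 255, nothing carries into the neighboring byte. The sketch below, with hypothetical function names, checks the packed formula against the plain per-byte average for all 65536 byte pairs.

```c
#include <stdint.h>

/* Per-byte average of four packed unsigned bytes, rounded down
   (avg_floor_u8x4 is a hypothetical name). */
uint32_t avg_floor_u8x4(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Returns 1 if the packed formula matches (x + y) / 2 for every pair of
   byte values (replicated into all four lanes), 0 otherwise. */
int check_avg_floor(void)
{
    for (unsigned x = 0; x < 256; x++) {
        for (unsigned y = 0; y < 256; y++) {
            uint32_t a = x * 0x01010101u;
            uint32_t b = y * 0x01010101u;
            uint32_t want = ((x + y) / 2) * 0x01010101u;
            if (avg_floor_u8x4(a, b) != want)
                return 0;
        }
    }
    return 1;
}
```

For the sample inputs `0x210b80ff` and `0x37138164`, `avg_floor_u8x4` returns `0x2C0F80B1`.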

Next, let's look at the round-up algorithm, that is: sum = x + y; result = sum / 2; if (sum % 2 != 0) result += 1;

The original algorithm is:

#include <stdio.h>

extern void naive3(void);

void naive3(void)
{
    // Initialize
    __attribute__((aligned(4))) unsigned char buffer1[4] = { 255, 128, 11, 33 };
    __attribute__((aligned(4))) unsigned char buffer2[4] = { 100, 129, 19, 55 };
    __attribute__((aligned(4))) unsigned char dstBuffer[4];

    // Calculate: add 1 to the halved sum whenever the sum is odd
    for(int i = 0; i < 4; i++)
    {
        unsigned sum = (unsigned)buffer1[i] + (unsigned)buffer2[i];
        unsigned char result = sum / 2;
        dstBuffer[i] = (sum & 1) == 0 ? result : result + 1;
    }

    // Output results
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X  ", dstBuffer[i]);
    puts("");
}

The SIMD-like optimization algorithm is as follows:

#include <stdio.h>

extern void simd_test3(void);

void simd_test3(void)
{
    // Initialize: the same byte values as above, packed into 32-bit words
    unsigned buffer1 = 0x210b80ff;
    unsigned buffer2 = 0x37138164;
    unsigned dstBuffer;

    // Calculate: per-byte average, rounded up
    dstBuffer = (buffer1 | buffer2) - (((buffer1 ^ buffer2) & 0xfefefefe) >> 1);

    // Output results
    printf("The result is: 0x%.8X\n", dstBuffer);
}
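The round-up variant can be understood through the identity x + y = 2 * (x | y) - (x ^ y), which gives (x | y) - ((x ^ y) >> 1) for the rounded-up half. Within each byte the subtracted value never exceeds (x | y), so no borrow crosses into the neighboring byte. The following sketch uses hypothetical function names, applies the formula to byte buffers through `memcpy`, and checks it against (x + y + 1) / 2 for all byte pairs.

```c
#include <stdint.h>
#include <string.h>

/* Per-byte average of four packed unsigned bytes, rounded up
   (avg_ceil_u8x4 is a hypothetical name). */
uint32_t avg_ceil_u8x4(uint32_t a, uint32_t b)
{
    return (a | b) - (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Buffer form: dst[i] = (a[i] + b[i] + 1) / 2 for i = 0..3. */
void avg_ceil_bytes4(const unsigned char a[4], const unsigned char b[4],
                     unsigned char dst[4])
{
    uint32_t va, vb, vd;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vd = avg_ceil_u8x4(va, vb);
    memcpy(dst, &vd, sizeof vd);
}

/* Returns 1 if the packed formula matches (x + y + 1) / 2 for every pair of
   byte values (replicated into all four lanes), 0 otherwise. */
int check_avg_ceil(void)
{
    for (unsigned x = 0; x < 256; x++) {
        for (unsigned y = 0; y < 256; y++) {
            uint32_t a = x * 0x01010101u;
            uint32_t b = y * 0x01010101u;
            uint32_t want = ((x + y + 1) / 2) * 0x01010101u;
            if (avg_ceil_u8x4(a, b) != want)
                return 0;
        }
    }
    return 1;
}
```

For the sample inputs `0x210b80ff` and `0x37138164`, `avg_ceil_u8x4` returns `0x2C0F81B2`.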

Scenario 4:

Blend two bytes in proportion, that is, dst = (a * (255 - s) + b * s) / 255.
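A minimal per-byte sketch of this formula is shown below. It covers only the straightforward scalar computation. The function name `blend_u8` is hypothetical, and truncating integer division is an assumption about the intended rounding.

```c
#include <stdint.h>

/* dst = (a * (255 - s) + b * s) / 255, using truncating integer division.
   blend_u8 is a hypothetical name; s = 0 yields a and s = 255 yields b. */
uint8_t blend_u8(uint8_t a, uint8_t b, uint8_t s)
{
    /* The largest intermediate value is 255 * 255 = 65025, so a 32-bit
       accumulator is more than wide enough. */
    uint32_t acc = (uint32_t)a * (255u - s) + (uint32_t)b * s;
    return (uint8_t)(acc / 255u);
}
```

For example, `blend_u8(10, 200, 0)` returns 10 and `blend_u8(10, 200, 255)` returns 200.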
