Hou Sisong compiled a very good article titled "Single-Instruction Multiple-Data Stream Computing (SIMD) in High-Level Languages". Here I reorganize its content to make it easier to follow.
The instruction set architecture of a traditional computer mainly implements basic functions such as arithmetic and logic operations, conditional branches, and I/O access. In many fields, however, these basic instructions are not enough for compute-intensive workloads, which is why DSPs, stream processors, and other high-performance processors emerged. In the 1990s, Intel developed MMX (a multimedia instruction set extension) to accelerate compute-intensive multimedia applications. Later, x86 processors gained SSE and AMD's 3DNow!(+), followed by SSE2, SSE3, SSSE3 (initially supported only by Intel), SSE4.1, SSE4.2, SSE4a (supported only by AMD), AVX, and other more advanced SIMD instruction set extensions. SIMD stands for "single instruction, multiple data": one instruction processes several data elements independently and in parallel. For example, four pairs of integers can be added at the same time with a single instruction, which may take only one cycle or even less.
Besides x86, other traditional processor architectures have also introduced SIMD instruction set extensions, such as AltiVec on the Power architecture, NEON introduced with the ARMv7 architecture, and MIPS-3D on the MIPS architecture.
However, many low-end processors have no dedicated SIMD instruction set. In that case, we can use certain algorithms to simulate the effect of SIMD at a relatively small cost. Below, we take a 32-bit processor as an example and introduce some of these algorithms.
Scenario 1: logically shift a group of 8-bit integers to the right by one bit. The usual approach is:
#include <stdio.h>

extern void naive1(void);

void naive1(void)
{
    // Initialize
    __attribute__((aligned(4))) unsigned char buffer[4] = { 255, 128, 11, 6 };

    // Calculate: shift each byte on its own
    for(int i = 0; i < 4; i++)
        buffer[i] >>= 1;

    // Output results
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X ", buffer[i]);
    puts("");
}
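A SIMD-like version of the same algorithm is:
#include <stdio.h>

extern void simd_test1(void);

void simd_test1(void)
{
    // Initialize: bytes { 255, 128, 11, 6 } packed into one little-endian word
    unsigned buffer = 0x060b80ff;

    // Calculate: clear bit 0 of every byte so that it cannot slide into
    // the neighboring byte, then shift all four bytes at once
    buffer &= 0xfefefefe;
    buffer >>= 1;

    // Output results
    printf("The result is: 0x%.8X\n", buffer);
}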
We can see that the original version accesses memory one byte at a time, four times over, shifting each byte separately. The SIMD-like version performs a single 4-byte access and then finishes the calculation in two arithmetic/logic operations, which is very compact.
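The packed shift can also be applied to a real byte buffer instead of a hard-coded constant. The following is a minimal sketch under stated assumptions: `shr1_u8x4` and `shr1_bytes` are hypothetical names that do not appear in the original article, and `memcpy` is used for the 32-bit load and store so that the code does not depend on the buffer's alignment. Because every lane gets the same treatment, the result does not depend on byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Logical right shift by 1 of four packed unsigned 8-bit lanes. */
static uint32_t shr1_u8x4(uint32_t v)
{
    /* Clear bit 0 of every byte first so that it cannot leak into
       the top bit of the neighboring lower byte after the shift. */
    return (v & 0xfefefefeu) >> 1;
}

/* Hypothetical helper: shift every byte of buf right by 1, four at a time. */
static void shr1_bytes(unsigned char *buf, size_t len)
{
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        uint32_t word;
        memcpy(&word, buf + i, 4);   /* one 32-bit load */
        word = shr1_u8x4(word);
        memcpy(buf + i, &word, 4);   /* one 32-bit store */
    }
    for (; i < len; i++)             /* leftover tail bytes, one by one */
        buf[i] >>= 1;

    /* Sanity check on the article's example word. */
    assert(shr1_u8x4(0x060b80ffu) == 0x0305407fu);
}
```

Calling `shr1_bytes(buf, sizeof buf)` on `{ 255, 128, 11, 6 }` leaves `{ 127, 64, 5, 3 }` in the buffer, the same values the per-byte loop produces.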
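Of course, the same idea extends to the left shift. Next, we extend it to the arithmetic right shift, which is not covered in the original article. First, look at the naive algorithm:
#include <stdio.h>

extern void naive1(void);

void naive1(void)
{
    // Initialize (signed char, because plain char may be unsigned)
    __attribute__((aligned(4))) signed char buffer[4] = { -1, -128, 11, 6 };

    // Calculate: >> on a negative value is implementation-defined,
    // but it is an arithmetic shift on common compilers
    for(int i = 0; i < 4; i++)
        buffer[i] >>= 1;

    // Output results (cast so that negative bytes print as two hex digits)
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X ", (unsigned char)buffer[i]);
    puts("");
}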
Let's take a look at the SIMD optimized version:
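#include <stdio.h>

extern void simd_test1(void);

void simd_test1(void)
{
    // Initialize: bytes { -1, -128, 11, 6 } packed into one little-endian word
    unsigned buffer = 0x060b80ff;

    // Calculate: save the sign bit of every byte, do the logical shift,
    // then put the sign bits back
    unsigned mask = buffer & 0x80808080;
    buffer &= 0xfefefefe;
    buffer >>= 1;
    buffer |= mask;

    // Output results
    printf("The result is: 0x%.8X\n", buffer);
}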
An arithmetic right shift has to preserve the sign bit. We therefore use a mask variable to save the sign bit of each original byte, and OR those sign bits back into the shifted result. This takes two more steps than the logical right shift, but even so the overhead is far lower than four separate memory accesses.
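The arithmetic right shift, and the left shift mentioned earlier, can be checked against a plain per-byte reference. This is only a sketch: the names `sar1_s8x4`, `shl1_u8x4`, and `check_shifts` are hypothetical, and the mask `0x7f7f7f7f` for the left shift is an assumption derived by mirroring the right-shift mask, since the article gives no left-shift code. The reference computes floor(s / 2) without shifting a negative value, so it does not rely on implementation-defined behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic right shift by 1 of four packed signed 8-bit lanes. */
static uint32_t sar1_s8x4(uint32_t v)
{
    uint32_t sign = v & 0x80808080u;          /* save each lane's sign bit */
    return ((v & 0xfefefefeu) >> 1) | sign;   /* logical shift, then restore it */
}

/* Left shift by 1 of four packed 8-bit lanes. */
static uint32_t shl1_u8x4(uint32_t v)
{
    /* Clear bit 7 of every byte so that it cannot carry into the next lane. */
    return (v & 0x7f7f7f7fu) << 1;
}

/* Compare both packed routines with a per-byte reference for all 256 values. */
static void check_shifts(void)
{
    for (uint32_t x = 0; x < 256; x++) {
        uint32_t packed = x * 0x01010101u;    /* same byte in all four lanes */

        /* Interpret x as a signed byte and compute floor(s / 2). */
        int s = (x < 128) ? (int)x : (int)x - 256;
        int half = (s >= 0) ? s / 2 : -((-s + 1) / 2);

        uint32_t sar_ref = (uint32_t)half & 0xffu;
        uint32_t shl_ref = (x << 1) & 0xffu;

        assert(sar1_s8x4(packed) == sar_ref * 0x01010101u);
        assert(shl1_u8x4(packed) == shl_ref * 0x01010101u);
    }
}
```

For the article's example word `0x060b80ff`, `sar1_s8x4` returns `0x0305c0ff`, that is, the bytes -1, -64, 5, 3.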
Scenario 2:
Take the complement of an unsigned 8-bit integer, that is, 255 - X (where X is an unsigned 8-bit integer). The naive algorithm is:
#include <stdio.h>

extern void naive2(void);

void naive2(void)
{
    // Initialize
    __attribute__((aligned(4))) unsigned char buffer[4] = { 255, 128, 11, 6 };

    // Calculate
    for(int i = 0; i < 4; i++)
        buffer[i] = 255 - buffer[i];

    // Output results
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X ", buffer[i]);
    puts("");
}
Optimized version:
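#include <stdio.h>

extern void simd_test2(void);

void simd_test2(void)
{
    // Initialize: bytes { 255, 128, 11, 6 } packed into one little-endian word
    unsigned buffer = 0x060b80ff;

    // Calculate: invert all four bytes at once
    buffer = ~buffer;

    // Output results
    printf("The result is: 0x%.8X\n", buffer);
}
This algorithm is relatively easy to understand: for a value in the range 0 to 255, computing 255 - X is the same as inverting every bit of X.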
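The equivalence between 255 - X and a bitwise NOT can be confirmed exhaustively. The sketch below is an illustration only: `complement_u8x4` and `check_complement` are hypothetical names. The reasoning is that X + (X ^ 0xff) is always 0xff for a byte, so the subtraction never borrows and the four lanes of the word never interact.

```c
#include <assert.h>
#include <stdint.h>

/* 255 - x for four packed unsigned 8-bit lanes. */
static uint32_t complement_u8x4(uint32_t v)
{
    return ~v;   /* no borrow can occur, so the lanes stay independent */
}

/* Check the identity for every possible byte value. */
static void check_complement(void)
{
    for (uint32_t x = 0; x < 256; x++) {
        /* Per byte: 255 - x equals x with all eight bits flipped. */
        assert((255u - x) == (x ^ 0xffu));

        /* Packed: the same holds in all four lanes at once. */
        assert(complement_u8x4(x * 0x01010101u) == (255u - x) * 0x01010101u);
    }
}
```

For the article's example word `0x060b80ff`, the packed version yields `0xf9f47f00`.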
Scenario 3:
Calculate the arithmetic mean of two unsigned 8-bit integers. There are two cases when the sum of the two numbers is odd: round the result down, or round it up.
Let's first look at the round-down case, that is, (x + y) / 2:
The naive algorithm is:
#include <stdio.h>

extern void naive3(void);

void naive3(void)
{
    // Initialize
    __attribute__((aligned(4))) unsigned char buffer1[4] = { 255, 128, 11, 33 };
    __attribute__((aligned(4))) unsigned char buffer2[4] = { 100, 129, 19, 55 };
    __attribute__((aligned(4))) unsigned char dstBuffer[4];

    // Calculate: widen to unsigned so that the sum cannot overflow 8 bits
    for(int i = 0; i < 4; i++)
        dstBuffer[i] = ((unsigned)buffer1[i] + (unsigned)buffer2[i]) / 2;

    // Output results
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X ", dstBuffer[i]);
    puts("");
}
The SIMD-like optimization algorithm is as follows:
#include <stdio.h>

extern void simd_test3(void);

void simd_test3(void)
{
    // Initialize: the two byte arrays packed into little-endian words
    unsigned buffer1 = 0x210b80ff;
    unsigned buffer2 = 0x37138164;
    unsigned dstBuffer;

    // Calculate: common bits plus half of the differing bits
    dstBuffer = (buffer1 & buffer2) + (((buffer1 ^ buffer2) & 0xfefefefe) >> 1);

    // Output results
    printf("The result is: 0x%.8X\n", dstBuffer);
}
This algorithm comes from FFmpeg and is very clever.
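The trick rests on the identity x + y = 2 * (x & y) + (x ^ y): the bits the two values share count twice, and the bits in which they differ count once. Halving gives floor((x + y) / 2) = (x & y) + ((x ^ y) >> 1), and the intermediate values never exceed 8 bits. The sketch below checks this for every pair of bytes; `avg_floor_u8x4` and `check_avg_floor` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded-down average of four packed unsigned 8-bit lanes. */
static uint32_t avg_floor_u8x4(uint32_t a, uint32_t b)
{
    /* Masking with 0xfefefefe drops bit 0 of each lane of (a ^ b),
       so the shift cannot move it into the neighboring lane. */
    return (a & b) + (((a ^ b) & 0xfefefefeu) >> 1);
}

/* Compare with the widened per-byte computation for all 65536 pairs. */
static void check_avg_floor(void)
{
    for (uint32_t x = 0; x < 256; x++) {
        for (uint32_t y = 0; y < 256; y++) {
            uint32_t ref = (x + y) / 2;            /* at most 255 */
            uint32_t a = x * 0x01010101u;          /* x in every lane */
            uint32_t b = y * 0x01010101u;          /* y in every lane */
            assert(avg_floor_u8x4(a, b) == ref * 0x01010101u);
        }
    }
}
```

With the article's inputs `0x210b80ff` and `0x37138164`, the function returns `0x2c0f80b1`, which matches the per-byte results 0xB1, 0x80, 0x0F, 0x2C.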
Next, let's look at the round-up algorithm, that is, sum = x + y; result = sum / 2; if (sum % 2 != 0) result += 1;
The naive algorithm is:
#include <stdio.h>

extern void naive3(void);

void naive3(void)
{
    // Initialize
    __attribute__((aligned(4))) unsigned char buffer1[4] = { 255, 128, 11, 33 };
    __attribute__((aligned(4))) unsigned char buffer2[4] = { 100, 129, 19, 55 };
    __attribute__((aligned(4))) unsigned char dstBuffer[4];

    // Calculate: add 1 to the halved sum whenever the sum is odd
    for(int i = 0; i < 4; i++)
    {
        unsigned sum = (unsigned)buffer1[i] + (unsigned)buffer2[i];
        unsigned char result = sum / 2;
        dstBuffer[i] = (sum & 1) == 0? result : result + 1;
    }

    // Output results
    printf("Results: ");
    for(int i = 0; i < 4; i++)
        printf("0x%.2X ", dstBuffer[i]);
    puts("");
}
The SIMD-like optimization algorithm is as follows:
#include <stdio.h>

extern void simd_test3(void);

void simd_test3(void)
{
    // Initialize: the two byte arrays packed into little-endian words
    unsigned buffer1 = 0x210b80ff;
    unsigned buffer2 = 0x37138164;
    unsigned dstBuffer;

    // Calculate: all set bits minus half of the differing bits
    dstBuffer = (buffer1 | buffer2) - (((buffer1 ^ buffer2) & 0xfefefefe) >> 1);

    // Output results
    printf("The result is: 0x%.8X\n", dstBuffer);
}
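The round-up variant can be understood through the identity x + y = 2 * (x | y) - (x ^ y), which gives ceil((x + y) / 2) = (x | y) - ((x ^ y) >> 1). Since (x ^ y) >> 1 is never larger than x | y within a lane, the subtraction does not borrow across lanes. The sketch below checks this for every pair of bytes; `avg_ceil_u8x4` and `check_avg_ceil` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded-up average of four packed unsigned 8-bit lanes. */
static uint32_t avg_ceil_u8x4(uint32_t a, uint32_t b)
{
    /* Masking with 0xfefefefe drops bit 0 of each lane of (a ^ b),
       so the shift cannot move it into the neighboring lane. */
    return (a | b) - (((a ^ b) & 0xfefefefeu) >> 1);
}

/* Compare with the widened per-byte computation for all 65536 pairs. */
static void check_avg_ceil(void)
{
    for (uint32_t x = 0; x < 256; x++) {
        for (uint32_t y = 0; y < 256; y++) {
            uint32_t ref = (x + y + 1) / 2;        /* at most 255 */
            uint32_t a = x * 0x01010101u;          /* x in every lane */
            uint32_t b = y * 0x01010101u;          /* y in every lane */
            assert(avg_ceil_u8x4(a, b) == ref * 0x01010101u);
        }
    }
}
```

With the article's inputs `0x210b80ff` and `0x37138164`, the function returns `0x2c0f81b2`. Only the two lanes with an odd sum (255 + 100 and 128 + 129) differ from the rounded-down result.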
Scenario 4:
Blend two bytes in proportion, that is, DST = (A * (255 - S) + B * S) / 255: