In the previous article, we used a DirectX Compute Shader to implement a parallel algorithm on the video card that renders the good-looking Mandelbrot set. What advantage does a video card actually offer for general-purpose computing? To find out, let's make a comparison by first implementing the same algorithm on the CPU. For convenience, we wrap it in a class:
class CPUCalc
{
private:
    int m_stride;
    int m_width;
    int m_height;
    float m_realMin;
    float m_imagMin;
    float m_scaleReal;
    float m_scaleImag;
    unsigned char* m_pData;

    void CalculatePoint(unsigned int x, unsigned int y);

public:
    CPUCalc(int stride, int width, int height,
            float rmin, float rmax, float imin, float imax,
            unsigned char* pData)
        : m_stride(stride), m_width(width), m_height(height),
          m_realMin(rmin), m_imagMin(imin),
          m_scaleReal(0), m_scaleImag(0), m_pData(pData)
    {
        m_scaleReal = (rmax - rmin) / width;
        m_scaleImag = (imax - imin) / height;
    }

    void Calculate();
};
The data that lived in the constant buffer in the HLSL code is now stored in member variables of the class. Note that this class can compute the Mandelbrot set over a custom interval of the complex plane: rmin and rmax give the range along the real axis, while imin and imax give the range along the imaginary axis. These parameters have the same meaning as in the HLSL code of the previous article; refer to it if you want to implement this program yourself.
The following is the implementation of the class. It mirrors the HLSL version almost line for line; a few HLSL intrinsics are replaced with equivalent C++ implementations:
#include <algorithm>
#include <math.h>
#include "CPUCalc.h"

using std::max;
typedef unsigned int uint;

const uint MAX_ITER = 4096;

struct float2
{
    float x;
    float y;
};

// C++ equivalent of the HLSL smoothstep intrinsic (linear ramp version)
inline float smoothstep(const float minv, const float maxv, const float v)
{
    if (v < minv)
        return 0.0f;
    else if (v > maxv)
        return 1.0f;
    else
        return (v - minv) / (maxv - minv);
}

inline uint ComposeColor(uint index)
{
    if (index == MAX_ITER)
        return 0xff000000;

    uint red, green, blue;
    float phase = index * 3.0f / MAX_ITER;
    red   = (uint)(max(0.0f, phase - 2.0f) * 255.0f);
    green = (uint)(smoothstep(0.0f, 1.0f, phase - 1.3f) * 255.0f);
    blue  = (uint)(max(0.0f, 1.0f - fabsf(phase - 1.0f)) * 255.0f); // fabsf: plain abs() would truncate to int
    return 0xff000000 | (red << 16) | (green << 8) | blue;
}

void CPUCalc::CalculatePoint(uint x, uint y)
{
    float2 c;
    c.x = m_realMin + (x * m_scaleReal);
    c.y = m_imagMin + ((m_height - y) * m_scaleImag); // flip y: pixel rows grow downward

    float2 z;
    z.x = 0.0f;
    z.y = 0.0f;

    float temp, lengthSqr;
    uint count = 0;
    do
    {
        temp = z.x * z.x - z.y * z.y + c.x;
        z.y  = 2 * z.x * z.y + c.y;
        z.x  = temp;
        lengthSqr = z.x * z.x + z.y * z.y;
        count++;
    } while ((lengthSqr < 4.0f) && (count < MAX_ITER));

    // write to result
    uint currentIndex = x * 4 + y * m_stride;
    uint& pPoint = *reinterpret_cast<uint*>(m_pData + currentIndex);
    pPoint = ComposeColor(static_cast<uint>(log((float)count) / log((float)MAX_ITER) * MAX_ITER));
}

void CPUCalc::Calculate()
{
    #pragma omp parallel for
    for (int y = 0; y < m_height; y++)
        for (int x = 0; x < m_width; x++)
        {
            CalculatePoint(x, y);
        }
}
Finally, the Calculate() member function drives the whole computation. It uses an OpenMP directive (#pragma omp parallel for). OpenMP is an extension of C/C++ for implementing parallel algorithms in a shared address space. Here the default static scheduling splits the for loop's iterations among several threads for parallel execution. For the Mandelbrot set this method has a drawback, which we will discuss in detail later.
Three main factors affect the program's running time: 1. the output size in pixels (controlled by the width and height parameters); 2. the maximum number of iterations (controlled by the constant MAX_ITER); 3. the selected region of the complex plane (controlled by the rmin, rmax, imin, and imax parameters). A precise complexity cannot be given, because each point of the complex plane needs a different number of iterations; it is an O(N) algorithm in the number of pixels, with a large constant factor. For this test we fix the region of the complex plane to [-1.101, -1.099] on the real axis and [2.229i, 2.231i] on the imaginary axis; its image is the last one in the group of pictures illustrating iteration counts in the previous article, and this region requires a considerable amount of computation. We then fix the maximum number of iterations and the output size in turn, varying the other parameter over several measurements, to compare the performance of the CPU and GPU implementations.
The CPU used in this test is an Intel Core i7 920: four cores at a default clock of 2.66 GHz, paired with 6 GB of DDR3-1333 memory; thanks to Hyper-Threading it can run 8 threads simultaneously. The GPU is an AMD ATI Radeon HD 5850: default core clock 725 MHz (overclocked here to 775 MHz), with 1 GB of 1250 MHz GDDR5 video memory. This card has 18 SIMD engines, for a total of 1440 stream core computing units.
First, we set the maximum number of iterations to 512, then render outputs of 512x512, 1024x1024, 2048x2048, 4096x4096, 8192x8192, and 16384x16384 pixels in sequence. The results are as follows (unit: milliseconds):
Output size  | CPU time (ms) | GPU time (ms) | Speed ratio (GPU : CPU)
512x512      | 213           | 23            | 9.26
1024x1024    | 635           | 83            | 7.65
2048x2048    | 2403          | 312           | 7.70
4096x4096    | 9279          | 1227          | 7.56
8192x8192    | 37287         | 4894          | 7.61
16384x16384  | 152015        | 35793         | 4.24
Chart of the first five rows:
We can see that the GPU has a huge performance advantage; even a Core i7 running eight threads is no match for it. We can also observe that GPU and CPU times grow at a similar rate as the pixel count increases: the GPU stays roughly 7.6 times faster than the CPU over a wide range. When the output grows to 16384x16384, however, GPU performance suddenly drops and the speedup falls to 4.24. The reason is that video memory is no longer sufficient at this size, and the driver starts using system memory as virtual video memory for the card. This traffic consumes a great deal of time and badly hurts the card's performance. Memory capacity matters greatly when processing large data sets, which is why both NVIDIA and AMD have launched graphics cards dedicated to general-purpose computing (Tesla and FireStream); a major feature of these compute cards is their larger memory.
Next, we fix the output size at 4096x4096 and vary the maximum number of iterations from 512 to 16384. The results are as follows (unit: milliseconds):
Maximum iterations | CPU time (ms) | GPU time (ms) | Speed ratio (GPU : CPU)
512                | 9363          | 1247          | 7.51
1024               | 18042         | 1513          | 11.92
2048               | 35132         | 2058          | 17.07
4096               | 68398         | 2897          | 23.61
8192               | 135074        | 4347          | 31.07
16384              | 266547        | 7082          | 37.63
Chart of the first five rows:
This time the GPU shows a startling advantage: at 16384 iterations it is 37 times faster than the CPU! Why? Besides the GPU's raw parallel throughput, our CPU algorithm has a problem of its own. Not every point reaches the maximum iteration count; a considerable number of points finish long before it. If we statically distribute the points of the complex plane evenly across threads, some threads finish early while others keep working. If we watch CPU usage during the run, we find that the CPU is not fully utilized at 100% for roughly half of the time:
This effect becomes even more pronounced as the iteration count grows. Our GPU tasks, by contrast, are very fine-grained (each thread computes just one point), and DirectX schedules threads dynamically, so the GPU can launch new threads as soon as computation units fall idle. Our CPU program therefore loses badly here. However, if we rewrite the CPU algorithm to use dynamic work allocation, we can significantly improve CPU utilization and narrow the gap. That task is left to the reader: try to write a Mandelbrot algorithm that keeps the CPU as close to 100% busy as possible.