Using GPU general-purpose parallel computing to draw a Mandelbrot set image


In the previous article, we used a DirectX Compute Shader to run a parallel algorithm on the video card that computes the Mandelbrot set iterations and produces a rather pretty image. What do we actually gain by doing general-purpose computing on a graphics card? That calls for a comparison, so let us first implement the same algorithm on the CPU. For convenience, we wrap it in a class:

class CPUCalc
{
private:
    int m_stride;
    int m_width;
    int m_height;
    float m_realMin;
    float m_imagMin;
    float m_scaleReal;
    float m_scaleImag;
    unsigned char* m_pData;

    void CalculatePoint(unsigned int x, unsigned int y);

public:
    CPUCalc(int stride, int width, int height,
            float rmin, float rmax, float imin, float imax,
            unsigned char* pData)
        : m_stride(stride), m_width(width), m_height(height),
          m_realMin(rmin), m_imagMin(imin),
          m_scaleReal(0), m_scaleImag(0), m_pData(pData)
    {
        m_scaleReal = (rmax - rmin) / width;
        m_scaleImag = (imax - imin) / height;
    }

    void Calculate();
};

The data that lived in the constant buffer in the HLSL code now becomes member variables of this class. Note that the class can compute the Mandelbrot set over a custom rectangle of the complex plane: rmin and rmax give the range on the real axis, while imin and imax give the range on the imaginary axis. These parameters mean exactly the same as in the earlier HLSL code; refer to the previous article if you want to follow that implementation.
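As a concrete illustration, here is a hypothetical usage sketch (it is not part of the original article) that renders the classic full view of the set into a tightly packed 32-bit RGBA buffer, so that stride is simply width * 4:

#include <vector>
#include "CPUCalc.h"

int main()
{
    const int width  = 1024;
    const int height = 1024;
    const int stride = width * 4;                       // bytes per row, tightly packed RGBA
    std::vector<unsigned char> pixels(stride * height);

    // Real axis [-2.5, 1.0] and imaginary axis [-1.75, 1.75] cover the whole set.
    CPUCalc calc(stride, width, height, -2.5f, 1.0f, -1.75f, 1.75f, pixels.data());
    calc.Calculate();

    // pixels now holds the image; write it out with any image library you like.
    return 0;
}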

The following is the implementation of the class. It follows the HLSL version almost line by line; the few HLSL built-ins it relies on are replaced by equivalent C++ implementations:

#include <algorithm>
#include <math.h>
#include "CPUCalc.h"

using std::max;

typedef unsigned int uint;

const uint MAX_ITER = 4096;

struct float2
{
    float x;
    float y;
};

inline float smoothstep(const float minv, const float maxv, const float v)
{
    if (v < minv)
        return 0.0f;
    else if (v > maxv)
        return 1.0f;
    else
        return (v - minv) / (maxv - minv);
}

inline uint ComposeColor(uint index)
{
    if (index == MAX_ITER) return 0xff000000;

    uint red, green, blue;

    float phase = index * 3.0f / MAX_ITER;
    red = (uint)(max(0.0f, phase - 2.0f) * 255.0f);
    green = (uint)(smoothstep(0.0f, 1.0f, phase - 1.3f) * 255.0f);
    blue = (uint)(max(0.0f, 1.0f - abs(phase - 1.0f)) * 255.0f);

    return 0xff000000 | (red << 16) | (green << 8) | blue;
}

void CPUCalc::CalculatePoint(uint x, uint y)
{
    float2 c;
    c.x = m_realMin + (x * m_scaleReal);
    c.y = m_imagMin + ((m_width - y) * m_scaleImag);

    float2 z;
    z.x = 0.0f;
    z.y = 0.0f;

    float temp, lengthSqr;
    uint count = 0;

    do
    {
        temp = z.x * z.x - z.y * z.y + c.x;
        z.y = 2 * z.x * z.y + c.y;
        z.x = temp;

        lengthSqr = z.x * z.x + z.y * z.y;
        count++;
    }
    while ((lengthSqr < 4.0f) && (count < MAX_ITER));

    // write to result
    uint currentIndex = x * 4 + y * m_stride;
    uint& pPoint = *reinterpret_cast<uint*>(m_pData + currentIndex);

    pPoint = ComposeColor(static_cast<uint>(log((float)count) / log((float)MAX_ITER) * MAX_ITER));
}

void CPUCalc::Calculate()
{
    #pragma omp parallel for
    for (int y = 0; y < m_height; y++)
        for (int x = 0; x < m_width; x++)
        {
            CalculatePoint(x, y);
        }
}

Finally, we added a driver for the whole computation: the Calculate() member function. It uses an OpenMP directive (#pragma omp). OpenMP is a C/C++ extension for writing parallel algorithms on shared-memory (unified address space) machines. Here the loop is parallelized with static work allocation: the iterations of the outer for loop are split into fixed chunks, one per thread, before any work starts (see the equivalent explicit form below). For the Mandelbrot set this approach has a drawback, which we will come back to later.
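On most OpenMP implementations, the directive above is equivalent to spelling the static schedule out explicitly. The following is just that spelled-out form of the same loop, not a change to the algorithm:

void CPUCalc::Calculate()
{
    // schedule(static): the rows are divided into contiguous chunks up front,
    // one chunk per thread, regardless of how long each row actually takes.
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < m_height; y++)
        for (int x = 0; x < m_width; x++)
        {
            CalculatePoint(x, y);
        }
}

Remember that OpenMP support usually has to be enabled explicitly at compile time, for example with /openmp in Visual C++ or -fopenmp in GCC and Clang.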

Three main factors affect the program's running time: 1. the size of the output image (controlled by the width and height parameters); 2. the maximum number of iterations (controlled by the constant MAX_ITER); 3. the selected region of the complex plane (controlled by the rmin, rmax, imin, and imax parameters). Because each point of the complex plane requires a different number of iterations, the exact complexity cannot be pinned down; roughly speaking it is O(N) in the number of pixels, with a large constant factor. In this test the complex-plane region is fixed to the real-axis range [-1.101, -1.099] and the imaginary-axis range [2.229i, 2.231i]; its picture is the last group of images in the previous article, the one illustrating different iteration counts. This region requires a considerable amount of computation. We then fix the maximum iteration count and the output size in turn, vary the other parameter, and take multiple measurements to compare CPU and GPU performance.
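The article does not show its measurement harness, so the following is only a minimal sketch of how the CPU timings could be taken, assuming the CPUCalc class above and one of the parameter combinations used in the tests:

#include <chrono>
#include <cstdio>
#include <vector>
#include "CPUCalc.h"

int main()
{
    const int width = 4096, height = 4096, stride = width * 4;
    std::vector<unsigned char> pixels(stride * height);

    // The complex-plane region quoted in the article.
    CPUCalc calc(stride, width, height, -1.101f, -1.099f, 2.229f, 2.231f, pixels.data());

    auto t0 = std::chrono::high_resolution_clock::now();
    calc.Calculate();
    auto t1 = std::chrono::high_resolution_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    printf("CPU time: %lld ms\n", static_cast<long long>(ms));
    return 0;
}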

The CPU used in this test is an Intel Core i7 920: four cores at a stock 2.66 GHz, paired with 6 GB of DDR3-1333 memory. With Hyper-Threading it can run 8 threads simultaneously. The GPU is an AMD ATI Radeon HD 5850: a stock core clock of 725 MHz (overclocked to 775 MHz for this test) and 1 GB of GDDR5 video memory at 1250 MHz. The card has 18 SIMD engines for a total of 1440 stream cores.

First, we fix the maximum number of iterations at 512 and render images of 512x512, 1024x1024, 2048x2048, 4096x4096, 8192x8192, and 16384x16384 pixels in turn. The results are as follows (unit: milliseconds):

Output resolution    CPU time (ms)    GPU time (ms)    Speedup (CPU time / GPU time)
512x512 213 23 9.26
1024x1024 635 83 7.65
2048x2048 2403 312 7.70
4096x4096 9279 1227 7.56
8192x8192 37287 4894 7.61
16384x16384 152015 35793 4.24

[Chart of the first five data points]

The GPU clearly has a huge performance advantage: even a Core i7 running eight threads is not remotely competitive. We can also see that GPU and CPU times grow at a similar rate as the pixel count increases; over a wide range the GPU stays roughly 7.6 times faster than the CPU. At 16384x16384, however, GPU performance suddenly drops and the speedup falls to 4.24. The reason is that video memory is no longer sufficient: a 16384x16384 32-bit image is by itself 1 GB, the card's entire memory (see the quick check below), so the driver has to spill into system memory used as virtual video memory, and that traffic costs a great deal of time and badly hurts GPU performance. Memory capacity matters a lot when processing large data sets, which is one reason both NVIDIA and AMD sell graphics cards dedicated to general-purpose computing (Tesla and FireStream); a major feature of these compute cards is their larger memory.
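A back-of-the-envelope check of that claim (not from the original article), assuming a 32-bit RGBA output buffer:

#include <cstdio>

int main()
{
    const long long width = 16384, height = 16384, bytesPerPixel = 4;  // 32-bit RGBA
    const long long bytes = width * height * bytesPerPixel;            // 1,073,741,824 bytes
    printf("%lld bytes = %.2f GiB\n", bytes, bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;   // prints "1073741824 bytes = 1.00 GiB" -- the card's entire video memory
}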

Next, we fix the output size at 4096x4096 and vary the maximum number of iterations from 512 up to 16384. The results are as follows (unit: milliseconds):

Maximum iterations    CPU time (ms)    GPU time (ms)    Speedup (CPU time / GPU time)
512 9363 1247 7.51
1024 18042 1513 11.92
2048 35132 2058 17.07
4096 68398 2897 23.61
8192 135074 4347 31.07
16384 266547 7082 37.63


[Chart of the first five data points]

This time the GPU's advantage is startling: at 16384 iterations it is 37 times faster than the CPU! Why? Besides the GPU's raw parallel throughput, our CPU algorithm has a problem of its own. Not every point reaches the maximum iteration count; a good number of points escape long before it. If we split the points of the complex plane evenly across threads, some threads finish early while others are still grinding away. Watching CPU usage during the run confirms this: for roughly half the time the CPU is not fully utilized at 100%:

The effect only gets worse as the iteration count rises. On the GPU, by contrast, the work is very fine-grained (each thread handles a single point) and DirectX schedules threads dynamically, so as soon as a compute unit goes idle the GPU hands it a new thread. Our CPU program therefore takes a big hit. If we rewrite the CPU code to allocate work dynamically, we can raise CPU utilization considerably and narrow the gap. That task is left to the reader: try to write a Mandelbrot implementation that keeps the CPU as close to 100% busy as possible; one possible starting point is sketched below.
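As a hint, and only as a sketch rather than the intended solution, OpenMP can already do the dynamic hand-out for us with a schedule clause:

void CPUCalc::Calculate()
{
    // schedule(dynamic, 1) hands out one row at a time to whichever thread is
    // idle, so rows whose points escape quickly do not leave a core waiting
    // for the slow rows that happened to land on another thread.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int y = 0; y < m_height; y++)
        for (int x = 0; x < m_width; x++)
        {
            CalculatePoint(x, y);
        }
}

A finer-grained scheme that hands out individual points would match the GPU's behavior even more closely, at the price of extra scheduling overhead.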
