Due to the needs of work, the author has begun to learn GPU programming, mainly the basic knowledge of CUDA. Having never touched GPU programming before, I am documenting my study of it here. Like-minded readers are welcome to exchange and learn together; my email: caijinping220@gmail.com. I am using the GeForce G 103M graphics card in my old notebook; although this card is very weak compared with the current mainstream series, it is still usable for learning. This series of posts follows my learning process, from simple to complex.

0. Directory

GPU Programming from Beginner to Proficient (I): CUDA Environment Installation
GPU Programming from Beginner to Proficient (II): Running the First Program
GPU Programming from Beginner to Proficient (III): The First GPU Program
GPU Programming from Beginner to Proficient (IV): GPU Program Optimization
GPU Programming from Beginner to Proficient (V): Advanced GPU Program Optimization

1. Parallelizing the Array Sum of Squares
The previous post, GPU Programming from Beginner to Proficient (III): The First GPU Program, described how to use CUDA 5.5 to run a program on the GPU. From running that program we saw that the GPU can indeed be used for computation, but the example did not really exploit the GPU's ability to process in parallel: it used only a single GPU thread.
First, recall the GPU architecture under CUDA 5.5: a computation is organized as a grid, each grid is composed of blocks, and each block is subdivided into threads. The thread is therefore the smallest unit we deal with.
The next example modifies the previous one: it divides the array into groups (one group handled by one thread), each group computes its own partial sum, and the CPU then adds the partial sums together to get the final result. This idea is called reduction. It is similar in spirit to divide and conquer: a large-scale problem is first decomposed into small-scale subproblems, and the solutions of these subproblems are finally combined into the overall solution.
Because the maximum number of threads per block supported by my GPU is 512, i.e. the maxThreadsPerBlock attribute returned by cudaGetDeviceProperties (see GPU Programming from Beginner to Proficient (II): Running the First Program for how to query this property), we use 512 threads to achieve parallel acceleration.
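As a refresher, here is a minimal sketch of querying that property (device 0 is assumed):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // ask the CUDA runtime for the properties of device 0
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    return 0;
}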
Well, now it is time to write the program.

1.1. Modify the Code
First, add macro definitions for the data size and the number of threads at the head of the program:
// ======== define area ========
#define DATA_SIZE 1048576 // 1M
#define THREAD_NUM 512    // number of threads
Here DATA_SIZE represents the number of data elements to process, and THREAD_NUM means we are going to use 512 threads.
Second, modify the kernel function on the GPU side:
// kernel signature and timing lines reconstructed to match the launch
// call and the cycle count used later in this post
__global__ static void squaresSum(int *data, int *sum, clock_t *time)
{
    const int size = DATA_SIZE / THREAD_NUM;
    const int tid = threadIdx.x;
    int tmp_sum = 0;
    clock_t start;
    if (tid == 0) start = clock(); // thread 0 records the start cycle
    // each thread sums a contiguous chunk of `size` elements
    for (int i = tid * size; i < (tid + 1) * size; i++) {
        tmp_sum += data[i] * data[i];
    }
    sum[tid] = tmp_sum;
    if (tid == 0) *time = clock() - start; // elapsed cycles on the GPU
}
The purpose of this kernel is to distribute the input data across 512 threads so that each thread computes a partial sum; the 512 partial sums are written to the sum array, and these 512 values are finally added together on the CPU to obtain the final result.
Note how the data is traversed: each thread is assigned a contiguous chunk, in order, as shown in the following table:
Thread Number | Data Subscript
0 | 0 ~ 2047
... | ...
511 | 1046528 ~ 1048575
Then, modify the main function.
In the main function, you only need to change sum into an array and adjust the way the GPU kernel function is called:
// malloc space for the partial sums in GPU memory
cudaMalloc((void**) &sum, sizeof(int) * THREAD_NUM);

// calculate the squares' sum
squaresSum<<<1, THREAD_NUM, 0>>>(gpuData, sum, time);
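Note that <<<1, THREAD_NUM, 0>>> launches a single block of THREAD_NUM threads with no dynamic shared memory. After the kernel returns, the 512 partial sums still live in GPU memory, so they must be copied back before the CPU can add them up. A minimal sketch (the host array name result is an assumption, chosen to match the summation code below):

int result[THREAD_NUM];
// copy the partial sums from GPU memory back to the host
cudaMemcpy(result, sum, sizeof(int) * THREAD_NUM, cudaMemcpyDeviceToHost);
// release the device buffer once the data is back
cudaFree(sum);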
Finally, add the code that adds up the partial sums on the CPU:
// print result
int tmp_result = 0;
for (int i = 0; i < THREAD_NUM; ++i) {
    tmp_result += result[i];
}
printf("(GPU) sum: %d  time: %ld\n", tmp_result, time_used);
1.2. Compile and Run
After compiling, the results of the run are as follows:
2. Performance Analysis
The modified program runs about 36 times faster than before (see GPU Programming from Beginner to Proficient (III): The First GPU Program for the comparison), which shows the advantage of parallel processing. But think about it: we used 512 threads, so why did performance improve only 36 times rather than 512 times?
The answer lies in the memory access pattern. The memory on the graphics card is DRAM, which is accessed most efficiently in a sequential manner. Within each thread our program does read sequentially, element by element, so why didn't we achieve the expected effect?
Here you also need to consider how threads are executed. As described in GPU Programming from Beginner to Proficient (III): The First GPU Program, when a thread is waiting for memory data, the GPU switches to the next thread. So the actual execution order is similar to thread0 -> thread1 -> ... -> thread511.
This means that reads are contiguous within a single thread, but viewed across the whole execution, the reads are not contiguous (think about it for a moment and you will see why). So the correct approach is interleaved access, as shown in the following table:
Thread Number | Data Subscript
0 | 0, 512, 1024, ...
... | ...
511 | 511, 1023, 1535, ...
According to this principle, modify the kernel function as follows:
for (int i = tid; i < DATA_SIZE; i += THREAD_NUM) {
    tmp_sum += data[i] * data[i];
}
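Put together, the revised kernel might look like this (a sketch; only the loop differs from the earlier version, and the timing lines are kept as before):

__global__ static void squaresSum(int *data, int *sum, clock_t *time)
{
    const int tid = threadIdx.x;
    int tmp_sum = 0;
    clock_t start;
    if (tid == 0) start = clock();
    // stride by THREAD_NUM: at each step the 512 threads touch 512
    // consecutive ints, giving a contiguous (coalesced) access pattern
    for (int i = tid; i < DATA_SIZE; i += THREAD_NUM) {
        tmp_sum += data[i] * data[i];
    }
    sum[tid] = tmp_sum;
    if (tid == 0) *time = clock() - start;
}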
After compiling, the results of the run are as follows:
The modified program is about 13 times faster than before. Clearly, the way memory is read has a large impact on performance.
At this point, the parallelized program is about 493 times faster than the original serial program, so we have basically realized the advantage of the 512 threads.
Let's analyze the performance again:
Clock cycles consumed by the kernel on the GPU: 1595788 cycles
GeForce G 103M clock rate: 1.6 GHz
So the running time on the GPU is: clock cycles / clock rate = 997.3675 us
1M int data amounts to 4 MB, so the achieved GPU memory bandwidth is: data volume / running time = 4.01 GB/s
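A quick sanity check of this arithmetic (a sketch in plain host code; the data volume is rounded to 4 * 10^6 bytes, which reproduces the 4.01 GB/s figure above):

// numbers reported by this run
const double cycles = 1595788.0;             // cycles spent by the kernel
const double clock_rate = 1.6e9;             // GeForce G 103M: 1.6 GHz
const double run_time = cycles / clock_rate; // ~= 997.3675e-6 s
const double bytes = 4.0e6;                  // ~4 MB: 1M ints * 4 bytes each
const double bandwidth = bytes / run_time / 1.0e9; // ~= 4.01 GB/s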
Now compare this with the rated memory bandwidth of my GeForce G 103M: run the bandwidthTest sample under the SDK directory /samples/1_Utilities/bandwidthTest.
The results of the run are as follows:
Comparing with the system's rated parameters, we can see that the program has basically reached the ultimate performance of the hardware.
This post described how to parallelize a computation using threads and how to optimize it through the memory read pattern. Through this program, you can learn the general workflow of using CUDA threads. The next post will further analyze some details of the program that can still be optimized.
Welcome to discuss and learn about GPU programming with me.
caijinping220@gmail.com
http://blog.csdn.net/xsc_c