Rapid Development of CUDA Programs

Drawing on years of CUDA development experience, this article briefly introduces the general steps of developing a CUDA program. The guiding principle is to modify the CPU serial program first and port it to the GPU platform afterwards, doing as much of the GPU-bound work as possible on the CPU side, which reduces the difficulty of development and debugging. This fast and effective approach improves the efficiency of developing CUDA parallel programs and reduces both the development cycle and its difficulty.
(1) CPU Serial Program Analysis
For a CPU serial program, first test for the hotspot functions in the program and then analyze the parallelism of those functions:
A) Hotspot Test
Time the program's modules; the functions that dominate the run time are the hotspots, and they become the key code modules to be ported later. A minimal timing sketch follows.
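The sketch below shows one way to time candidate modules by hand; the stage functions are hypothetical placeholders for the application's real modules, and a profiler such as gprof gives the same information automatically:

    #include <stdio.h>
    #include <time.h>

    void stage_io(void)      { /* ... read input (hypothetical) ... */ }
    void stage_compute(void) { /* ... main computation (hypothetical) ... */ }

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        double t0 = seconds(); stage_io();
        double t1 = seconds(); stage_compute();
        double t2 = seconds();
        printf("io:      %.3f s\n", t1 - t0);
        printf("compute: %.3f s\n", t2 - t1);   /* if this dominates, it is the hotspot */
        return 0;
    }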
B) Parallel Analysis
After the hotspot code has been identified, analyze the algorithm and the data characteristics of the hotspot, and from these determine whether the algorithm can be parallelized and whether it is suitable for fine-grained parallel processing.
C) Determine the Arrays Used by the CUDA Kernels
Based on the analysis of the serial program, determine which modules should be ported to run on the GPU platform and analyze the data used by that code, so as to determine which arrays the CUDA kernels need. For each array, work out the direction of transfer (CPU→GPU or GPU→CPU) and the amount of data moved per transfer, and then design the array's definition and size accordingly; a sketch follows.
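As a sketch of the outcome of this analysis, the allocations and CPU→GPU / GPU→CPU transfers for a hypothetical input array in and output array out might look like this (names and sizes are assumptions for illustration):

    #include <cuda_runtime.h>

    void transfer_plan(const float *h_in, float *h_out, int n) {
        float *d_in = NULL, *d_out = NULL;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_in,  bytes);   // array the kernel reads
        cudaMalloc((void **)&d_out, bytes);   // array the kernel writes

        // CPU -> GPU: `bytes` bytes transferred before each kernel invocation
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

        // ... kernel launch goes here (designed in step (4)) ...

        // GPU -> CPU: `bytes` bytes of results copied back
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_in);
        cudaFree(d_out);
    }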
(2) A CUDA-like Serial CPU Program
CUDA programs are more complex than CPU programs, and when a bug appears, debugging is much harder than on the CPU. To reduce the difficulty and shorten the cycle of CUDA program development, some of the GPU porting work can be done on the CPU platform in advance. This involves the following aspects:
A) Modification to a Parallel Algorithm
In a CPU serial program, some code is parallelizable in theory but, after CPU-oriented optimization, can no longer be parallelized directly; in this case, modify the original program to meet the requirements of the parallel algorithm so that it can run in parallel. Other modules are parallelizable in principle, but their serial algorithm cannot be parallelized directly; for these, a parallel algorithm must be redesigned. A small illustration follows.
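As a small illustration (this moving-sum example is an assumption, not from the original text): a serially optimized loop may carry an accumulator across iterations, creating a loop-carried dependency; recomputing each output independently removes the dependency at the cost of some redundant work, so every iteration can later become one GPU thread:

    // Serially optimized: iteration i depends on the accumulator from i-1.
    void moving_sum_serial(const float *in, float *out, int n, int W) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++) {
            acc += in[i];
            if (i >= W) acc -= in[i - W];
            out[i] = acc;                     // loop-carried dependency
        }
    }

    // Parallel-friendly: each out[i] is computed independently.
    void moving_sum_parallel(const float *in, float *out, int n, int W) {
        for (int i = 0; i < n; i++) {         // iterations are now independent
            float acc = 0.0f;
            int lo = (i - W + 1 > 0) ? (i - W + 1) : 0;
            for (int j = lo; j <= i; j++) acc += in[j];
            out[i] = acc;
        }
    }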
B) Array Modification
The array definitions used in the CPU serial program may not be usable directly in a CUDA kernel, in which case they must be modified. For example, in a C program, a pointer inside a struct must be changed into a standalone pointer/array so the data can be transferred between CPU and GPU. In addition, to obtain coalesced access to global memory, it is sometimes necessary to change the direction in which an array is accessed and therefore its layout (for example, swapping rows and columns). In short, wherever CUDA's array usage differs from that of the CPU serial program, modify the arrays in advance to ease debugging; a sketch of the struct change follows.
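A minimal sketch of the struct modification, assuming a hypothetical Field type from the CPU version:

    // CPU version: a pointer buried in a struct cannot be copied to the GPU
    // with a single cudaMemcpy of the struct (the device cannot follow it).
    typedef struct {
        int    n;
        float *values;
    } Field;

    // CUDA-like version: the data lives in a standalone flat array that can
    // be cudaMalloc'ed and cudaMemcpy'ed directly; the struct keeps metadata only.
    typedef struct {
        int n;
    } FieldInfo;

    float *field_values;   // separate array of FieldInfo.n elements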
Applying these modifications turns the original CPU program into a CUDA-like serial CPU program, which completes much of the preparation for, and greatly eases, the subsequent port to CUDA.
(3) GPU Array Design and Parallel Model Design
GPU array design: determine the size, type, dimension, and related information of each GPU array, and design the communication pattern for the arrays between CPU and GPU.
Parallel model design: choose block and grid dimensions so that the thread layout matches the data characteristics of the algorithm, as in the sketch below.
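A minimal sketch of a common 1-D parallel model, assuming one thread per array element and a block size of 256 (both are assumptions, not requirements):

    #include <cuda_runtime.h>

    int main(void) {
        int n = 1 << 20;                          // hypothetical problem size
        dim3 block(256);                          // threads per block (assumed)
        dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover all n elements
        // kernel<<<grid, block>>>(...);          // the actual launch is designed in step (4)
        (void)grid;                               // silence unused-variable warnings in this sketch
        return 0;
    }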
(4) Basic Version of the CUDA Parallel Program
Based on the array analysis of the original program, port the CPU serial program to the GPU platform and implement the CUDA kernel code from the algorithm and code of the hotspot modules.
A) Design the Call Statement
kernel<<<grid, block>>>(...);   // optionally kernel<<<grid, block, sharedMemBytes, stream>>>(...)
B) Design the CUDA Kernel
Based on the parallel analysis of the algorithm, design the kernel by dividing the computation among the threads, and use synchronization statements (such as __syncthreads()) where needed to keep the kernel logically correct. A minimal sketch follows.
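A minimal basic-version sketch, assuming a hypothetical element-wise kernel (the names and the one-thread-per-element division are assumptions):

    #include <cuda_runtime.h>

    // The simplest task division: each thread computes one output element.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                     // guard: the grid may be larger than n
            c[i] = a[i] + b[i];
        // If threads in a block exchanged data through shared memory here,
        // a __syncthreads() between the writes and the reads would be
        // required for logical correctness.
    }

    // Launched with the grid/block designed in step (3):
    // vec_add<<<grid, block>>>(d_a, d_b, d_c, n);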
(5) Optimized CUDA Parallel Program Version
Starting from the basic version implemented in step (4), apply CUDA optimization techniques to further improve the performance of the parallel program. Optimization mainly covers two aspects: communication optimization and kernel optimization.
A) CPU and GPU Communication Optimization
GPU computing requires data transfers between CPU and GPU, and using communication-optimization techniques well improves the performance of GPU parallel programs; for example, CUDA streams can overlap CPU-GPU transfers with kernel execution, hiding communication time. A sketch follows.
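A minimal two-stream overlap sketch, assuming pinned host memory (allocated with cudaMallocHost), an even n, and a hypothetical kernel process:

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n) { /* hypothetical computation */ }

    // Split the work in half; the copy of one half overlaps the kernel of the other.
    void run_overlapped(float *h_data, int n) {   // h_data allocated with cudaMallocHost
        int half = n / 2;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int k = 0; k < 2; k++) {
            float *h = h_data + k * half;
            float *d = d_data + k * half;
            cudaMemcpyAsync(d, h, half * sizeof(float),
                            cudaMemcpyHostToDevice, s[k]);
            process<<<(half + 255) / 256, 256, 0, s[k]>>>(d, half);
            cudaMemcpyAsync(h, d, half * sizeof(float),
                            cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();                  // wait for both streams

        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFree(d_data);
    }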
B) CUDA Kernel Optimization
Kernel optimization has the greatest effect on performance and mainly involves memory-access optimization and instruction-stream optimization. Memory-access optimization includes coalesced access to global memory, and the use of shared, constant, and texture memory in place of global-memory accesses to improve access speed. Instruction-stream optimization means replacing inefficient instructions with efficient ones, for example CUDA's fast math functions. The sketch below combines two of these techniques.
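A sketch combining shared-memory staging and a fast math intrinsic, assuming a hypothetical 3-point smoothing kernel and a block size of exactly 256:

    #include <cuda_runtime.h>

    __global__ void smooth_exp(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];              // block size 256 is assumed
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Stage this block's elements (plus a one-element halo on each side)
        // into fast shared memory instead of re-reading global memory.
        tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
        if (threadIdx.x == 0)
            tile[0] = (i > 0) ? in[i - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)
            tile[blockDim.x + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
        __syncthreads();                             // all loads finish before any reads

        if (i < n) {
            float avg = (tile[threadIdx.x] + tile[threadIdx.x + 1]
                         + tile[threadIdx.x + 2]) / 3.0f;
            out[i] = __expf(avg);                    // fast intrinsic vs. the slower expf()
        }
    }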
The above is only a rough outline and is inevitably incomplete in places; I hope it proves helpful for your development.